- Big Data: Experiments on "massive" data
- Overview SDSL v2
- Example application for SDSL v2: Top-k frequent document retrieval
Time performance of a single access/rank/select operation depends on
rank_support_v<>
25% overhead on top of bit_vector
bit_vector_il<>
)
"BV access"
is the baseline of accessing a random bit of a bit_vector
.
"BV access"
is the baseline of accessing a random bit of a bit_vector
.
Performance of binary search select (SEL-BS*),
select_support_mcl<>
(SEL-C)
and Vigna's solutions in Sux
SEL-V9 and SEL-VS
Space breakdown for
rrr_vector<
K>
with varying block size.
Runtime for
rrr_vector<
K>
with varying block size.
Time-space trade-offs of different FM-indexes.
Construction of a cst_sct3<>
for WEB-5G (no HP)
Construction of a cst_sct3<>
for WEB-5G (HP)
#include <sdsl/wavelet_trees.hpp>
wt_pc<..>
: Prefix code wavelet tree
(parametrizable with shape, bitvector, rank, selects,
alphabet) using . Shape specializations:wt_blcd<..>
: Balanced-shaped WTwt_huff<..>
: Huffman-shaped WTwt_hutu<..>
: Hu-Tucker-shaped WTwt_int<..>
: Balanced-shaped WT for
large alphabets (parametrizable with bitvector, rank, selects)
wt_rlmn<..>
: Run-length WT (parametrizable with
bitvector, rank, select, and the underlying WT)Note:
wt_pc
and wt_rlmn
can be
parametrize to work on byte or integer alphabets.
wt_blcd
, wt_hutu
and wt_int
are order preserving WTs.
Build a byte-alphabet Hu-Tucker-shaped WT
(order preserving) for a byte sequence
and call lex_count
.
wt_hutu<rrr_vector<63>> wt; construct_im(wt, "情報学 研究所", 1); cout << wt << endl; auto t1 = wt.lex_count(0, wt.size(), 0x80); auto t2 = wt.lex_count(0, wt.size(), 0xbf); cout << "# of chars : " << wt.size() << endl; cout << "# of UTF-8 symbols: " << get<1>(t1)+get<2>(t2) << endl;
Output:
情報学 研究所 # of bytes : 19 # of UTF-8 symbols: 7
Build an integer-alphabet balanced WT (order preserving) for a integer sequence and do a range search (index range = [1..5], value range=[4..8]).
wt_int<rrr_vector<63>> wt; construct_im(wt, "6 1000 1 4 7 3 18 6 3", 'd'); auto res = wt.range_search_2d(1, 5, 4, 18); for ( auto point : res.second ) cout << point << " "; cout << endl;
Output (index/value pairs):
(3,4) (4,7)
#include <sdsl/suffix_arrays.hpp>
Any CSA provides the access method (operator[]
) and
the following members:
psi
,
lf
,
isa
,
bwt
,
text
,
L
,
F
,
C
,
char2comp
,
comp2char
There are three CSA types:
csa_bitcompressed<..> |
Based on SA and inverse SA (ISA) stored in an int_vector<> .
|
csa_sada<..> |
Based on $\Psi$-function. $\Psi$ is stored in an vector. The vector
type is a template parameter and enc_vector per default.
|
csa_wt<..>
|
Based on a WT over the Burrows-Wheeler-Transform (BWT);
aka FM-Index. WT is parametrizable, default WT is
wt_huff .
|
SA and ISA sampling density, alphabet strategy and the SA value sampling strategy can be specified via template arguments.
Build an byte-alphabet CSA in memory
(0-symbol is appended!). Output size,
alphabet size, and extracted text in
[0..csa.size()-1]
.
csa_bitcompressed<> csa; construct_im(csa, "abracadabra", 1); cout << "csa.size(): " << csa.size() << endl; cout << "csa.sigma : " << csa.sigma << endl; cout << csa << endl; cout << extract(csa, 0, csa.size()-1) << endl;
Output:
csa.size(): 12 csa.sigma : 6 11 10 7 0 3 5 8 1 4 6 9 2 abracadabra
#include <sdsl/suffix_trees.hpp>
Any CST contains members csa
and lcp
and provides the following methods:
nodes()
,
root()
,
begin()
,
end()
,
begin_bottom_up()
,
end_bottom_up()
size(v)
,
is_leaf(v)
,
degree(v)
,
depth(v)
,
node_depth(v)
,
edge(v,d)
,
lb(v)
, rb(v)
,
id(v)
, inv_id(i)
,
sn(v)
select_leaf(i)
,
node(lb,rb)
parent(v)
,
sibling(v)
,
lca(v,w)
,
select_child(v,i)
,
child(v,c)
,
sl(v)
,
wl(v,c)
,
leftmost_leaf(v)
,
rightmost_leaf(v)
There are two CSA types:
cst_sct3<..> |
Interval based node representation. |
cst_sada<..> | Node represented as position in balanced parentheses sequence. |
The underlying CSA, LCP array and the balanced parentheses structure can be specified via template arguments.
Use a CSTs to calculate the $k$-th order entropy of a integer sequence.
cst_sct3<csa_wt<wt_int<rrr_vector<>>>> cst; int_vector<> data(100000, 0, 10); for (size_t i=0; i < data.size(); ++i) data[i] = 1 + rand()%1023; construct_im(cst, data); cout << "cst.csa.sigma: " << cst.csa.sigma << endl; for (size_t k=0; k<3; ++k) cout << "H" << k << " : " << get<0>(Hk(cst, k)) << endl;
Output:
cst.csa.sigma: 1024 H0(data) : 9.99038 H1(data) : 6.52515 H2(data) : 0.0940461
Implementation by Culpepper et al. (ESA 2010)
Test case | SADA | GREEDY | QPROBING |
---|---|---|---|
PROTEINS | 870 (15.4) | 217 (3.84) | 217 (3.84) |
Faithful implementation using SDSL
Test case | SADA | GREEDY | QPROBING |
---|---|---|---|
PROTEINS | 148 (2.62) | 162 (2.87) | 162 (2.87) |
ENWIKI-SML | 199 (3.07) | 130 (2.01) | 130 (2.01) |
ENWIKI-BIG | 23,914 (2.80) | 27,043 (3.17) | 27,043 (3.17) |