New baselines for FM-indexes
The Pizza&Chili website provides a set of highly tuned compressed index implementations which can be used through a common interface, along with a text corpus consisting of texts of different sizes originating from several application domains. Ferragina et al. [JEA2008] published a thorough experimental study in which these indexes were evaluated. I was curious how the stated results would change if I reran the experiments of the paper on my machine, and furthermore how their results compare to my own implementations. Therefore, I downloaded the indexes and test cases, built the programs (instructions to compile the indexes on a 64-bit system), and wrote a script which reproduces Table VI of the JEA paper. The table lists the query time and space of the data structures for count queries. I changed the presentation slightly: the LZ-index was not included, but the run-length compressed FM-index implemented by Mäkinen and González and the succinct suffix array which uses the RRR implementation of Claude were added.
Before we come to the results, let us briefly revisit the experimental setup.
The machine was equipped with 8 Intel Xeon E5640 Dual Core Processors which share a 12MB smart cache and 144GB of DDR3 DRAM; only one core was used for the experiments. The operating system was Ubuntu Linux 11.10 and we used gcc version 4.6.1. Where necessary we added the compile option `-m32`, and we added the compile option `-msse4.2` when using sdsl indexes.
Indexes were built and \(k=50,000\) patterns, each of length \(m=20\), were extracted from each test file. Subsequently, each index was queried with the corresponding pattern set and the total query time was divided by \(k\cdot m\) to get the average time spent on matching one character of a pattern.
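The measurement procedure boils down to a loop like the following sketch. This is not the original benchmark script; the `Index` type and its `count()` method are placeholders for whichever implementation is tested, and only the structure of the timing loop reflects the setup described above.

```cpp
// Minimal sketch of the timing loop, not the original benchmark script.
// `Index` and its count() method stand in for whichever index is tested.
#include <chrono>
#include <string>
#include <vector>

template<class Index>
double avg_time_per_char(const Index& idx, const std::vector<std::string>& patterns) {
    using clock = std::chrono::high_resolution_clock;
    volatile size_t checksum = 0;   // keep the compiler from optimizing the calls away
    size_t chars = 0;               // will end up as k*m
    auto start = clock::now();
    for (const auto& p : patterns) {
        checksum = checksum + idx.count(p);   // one count query per pattern
        chars += p.size();
    }
    auto stop = clock::now();
    double usec = std::chrono::duration<double, std::micro>(stop - start).count();
    return usec / chars;            // average time per matched character
}
```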
The tested implementations correspond to the following approaches:
- SSA: The succinct suffix array is just a Huffman-shaped [1] wavelet tree of the Burrows-Wheeler transform (BWT) of the text; count queries are answered by backward search (see the sketch after this list). The space of the structure is about \(n H_0(T)\) bits, where \(T\) denotes the text.
- SSA_RRR: The same as SSA, but the bitvector is replaced by the RRR implementation of Claude, which uses blocks of size \(15\) and a fast decoding method for the block counts [2]. Combining the Huffman-shaped wavelet tree with compressed bitvectors results in an \(H_k\)-compressed scheme.
- CSA: Sadakane's implementation of his compressed suffix array, which is based on the \(\Psi\)-function.
- RLFM: The FM-index based on a run-length encoded wavelet tree [Mäkinen and Navarro, 2004].
- AF-index: The alphabet-friendly FM-index (see [Fer2007]), which partitions the BWT into regions and compresses every region with an SSA. The AF-index uses a partitioning such that the overall space consumption is minimized.
- plain SA: The uncompressed suffix array (see [Manber and Myers 1993] and [Gonnet et al. 1992]); binary search is used to count patterns.
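The FM-index variants above (SSA, SSA_RRR, RLFM, AF-index) all answer a count query by backward search: the pattern is processed right to left, and for each character two rank queries on the wavelet tree of the BWT narrow the current suffix array interval. Here is a minimal, self-contained sketch of the idea; the naive `rank()` scan merely stands in for the wavelet tree, whose job is exactly to answer this query quickly.

```cpp
// Naive sketch of backward search for count(P) on an FM-index.
// A real index answers rank() via a wavelet tree in O(log sigma) time;
// here rank() is a plain scan over the stored BWT to keep the example small.
#include <array>
#include <cstdint>
#include <string>

struct BackwardSearchSketch {
    std::string bwt;                  // BWT of the text (including the sentinel)
    std::array<uint64_t, 256> C{};    // C[c] = #symbols in the text smaller than c

    explicit BackwardSearchSketch(std::string b) : bwt(std::move(b)) {
        std::array<uint64_t, 256> cnt{};
        for (unsigned char c : bwt) ++cnt[c];
        uint64_t sum = 0;
        for (int c = 0; c < 256; ++c) { C[c] = sum; sum += cnt[c]; }
    }
    // rank(c, i): occurrences of c in bwt[0, i) -- naive stand-in for a wavelet tree
    uint64_t rank(unsigned char c, uint64_t i) const {
        uint64_t r = 0;
        for (uint64_t j = 0; j < i; ++j)
            r += (static_cast<unsigned char>(bwt[j]) == c);
        return r;
    }
    // count(P): size of the suffix array interval of P
    uint64_t count(const std::string& P) const {
        uint64_t lb = 0, rb = bwt.size();          // current interval [lb, rb)
        for (auto it = P.rbegin(); it != P.rend() && lb < rb; ++it) {
            unsigned char c = static_cast<unsigned char>(*it);
            lb = C[c] + rank(c, lb);               // two rank queries per character
            rb = C[c] + rank(c, rb);
        }
        return rb - lb;                            // interval size = #occurrences
    }
};
```

For example, with `BackwardSearchSketch idx("ard$rcaaaabb")` (the BWT of `abracadabra$`), `idx.count("abra")` returns 2.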
Here are the results of the experiments (time in \(\mu\)s per pattern character, space as a fraction of the original text size):
| Text | SSA Time | SSA Space | SSA_RRR Time | SSA_RRR Space | CSA Time | CSA Space | RLFM Time | RLFM Space | AF-index Time | AF-index Space | plain SA Time | plain SA Space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 1.300 | 0.83 | 1.776 | 0.47 | 2.648 | 0.29 | 6.108 | 0.65 | 0.840 | 0.54 | 0.436 | 5.00 |
| dna.200MB | 0.548 | 0.42 | 0.812 | 0.43 | 3.132 | 0.46 | 1.216 | 0.75 | 0.768 | 0.48 | 0.388 | 5.00 |
| english.200MB | 1.156 | 0.73 | 1.676 | 0.53 | 2.776 | 0.44 | 1.744 | 0.80 | 1.048 | 0.65 | 0.372 | 5.00 |
| proteins.200MB | 1.024 | 0.69 | 1.668 | 0.69 | 2.672 | 0.67 | 1.704 | 0.89 | 1.168 | 0.82 | 0.368 | 5.00 |
| sources.200MB | 1.356 | 0.85 | 1.960 | 0.54 | 2.748 | 0.38 | 1.864 | 0.74 | 1.156 | 0.73 | 0.364 | 5.00 |
When you compare this with the table in the paper, you can make two observations:
- The runtimes are faster by almost a factor of 2. This is due to the use of recent hardware.
- Surprisingly, the space consumption was larger than in the paper, despite my configuration of a very sparse sampling for the information which is only needed for locate queries (suffix array and inverse suffix array values). After contacting the authors it turned out that a bitvector is used to indicate which samples are stored. This bitvector is neither removed nor compressed, even if there are very few or no samples.
So we are a little bit naughty and just subtract the space needed for this unnecessary part of the data structure and get the following table:
| Text | SSA Time | SSA Space | SSA_RRR Time | SSA_RRR Space | CSA Time | CSA Space | RLFM Time | RLFM Space | AF-index Time | AF-index Space | plain SA Time | plain SA Space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 1.300 | 0.69 | 1.776 | 0.34 | 2.648 | 0.29 | 6.108 | 0.52 | 0.840 | 0.41 | 0.436 | 5.00 |
| dna.200MB | 0.548 | 0.29 | 0.812 | 0.30 | 3.132 | 0.46 | 1.216 | 0.62 | 0.768 | 0.35 | 0.388 | 5.00 |
| english.200MB | 1.156 | 0.60 | 1.676 | 0.40 | 2.776 | 0.44 | 1.744 | 0.67 | 1.048 | 0.52 | 0.372 | 5.00 |
| proteins.200MB | 1.024 | 0.56 | 1.668 | 0.55 | 2.672 | 0.67 | 1.704 | 0.76 | 1.168 | 0.69 | 0.368 | 5.00 |
| sources.200MB | 1.356 | 0.72 | 1.960 | 0.41 | 2.748 | 0.38 | 1.864 | 0.61 | 1.156 | 0.59 | 0.364 | 5.00 |
Let us now take some configurations of sdsl data structures which more or less correspond to the structures selected above, and do exactly the same experiment again. Here are the results:
| Text | FM_HUFF Time | FM_HUFF Space | FM_HUFF_RRR15 Time | FM_HUFF_RRR15 Space | CSA_SADA Time | CSA_SADA Space | FM_RLMN Time | FM_RLMN Space | FM_HUFF_RRR63 Time | FM_HUFF_RRR63 Space |
|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 0.760 | 0.70 | 1.352 | 0.32 | 2.676 | 0.29 | 1.820 | 0.34 | 2.072 | 0.17 |
| dna.200MB | 0.288 | 0.29 | 0.720 | 0.28 | 3.460 | 0.50 | 1.228 | 0.79 | 1.656 | 0.24 |
| english.200MB | 0.708 | 0.61 | 1.480 | 0.38 | 3.108 | 0.44 | 1.892 | 0.69 | 2.640 | 0.27 |
| proteins.200MB | 0.632 | 0.56 | 1.564 | 0.53 | 3.360 | 0.65 | 1.796 | 0.89 | 3.288 | 0.48 |
| sources.200MB | 0.868 | 0.73 | 1.684 | 0.39 | 2.900 | 0.37 | 1.964 | 0.53 | 2.832 | 0.26 |
Implementation details:
- FM_HUFF: Corresponds to this type (don't be scared by the length of the definition): `csa_wt<wt_huff<bit_vector,rank_support_v5<>,tNoSel,tNoSel>,64000,64000000>`. Explanation: We take a compressed suffix array based on a wavelet tree (`csa_wt`). The first template parameter specifies the used wavelet tree; we take a Huffman-shaped one, i.e. `wt_huff`. This wavelet tree uses the 5% overhead rank structure `rank_support_v5` (proposed by Vigna) and a dummy class for the select functionality [3], since we don't need select for counting. We set the sampling rate of suffix array values to \(64000\) and the rate for inverse suffix array values to \(64000000\), so we don't waste space on unnecessary locating information.
- FM_HUFF_RRR15: Corresponds to `csa_wt<wt_huff<rrr_vector<15> >,64000,64000000>`. The only thing we have changed is the bitvector implementation: now we use the sdsl class `rrr_vector` parametrized with block size \(15\) [4]. We don't specify the rank and select support structures, since `wt_huff` takes the default supporting structures [5].
- CSA_SADA: We use Elias-\(\delta\) codes to compress the pairwise differences of the \(\Psi\)-function and sample every 128th value of the \(\Psi\)-function. This translates into the type definition `csa_sada<enc_vector<coder::elias_delta,128>,64000,64000000>`.
- FM_RLMN: Corresponds to `csa_wt<wt_rlmn<>,64000,64000000>`. We use the run-length compressed wavelet tree `wt_rlmn`, which uses the class `sd_vector` as its default compressed bitvector representation.
- FM_HUFF_RRR63: The only difference to FM_HUFF_RRR15 is that we now use a block size of \(63\) for the RRR structure: `csa_wt<wt_huff<rrr_vector<63> >,64000,64000000>`. The decoding of a block is now not done with lookup tables but with the on-the-fly decoding approach proposed in the recent SEA paper by Navarro and Providel [6]. (A usage sketch for this kind of configuration follows the list.)
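For completeness, here is roughly how such a configuration is instantiated and queried. The sketch assumes the sdsl-lite style interface (`construct()`, `sdsl::count()`, `size_in_bytes()`); names and signatures may differ slightly in the sdsl version used for the experiments above.

```cpp
// Sketch: building one of the configurations above and running a count query.
// Assumes the sdsl-lite interface; the exact API of the sdsl version used for
// the experiments may differ.
#include <sdsl/suffix_arrays.hpp>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "usage: " << argv[0] << " text_file pattern\n";
        return 1;
    }
    // FM_HUFF_RRR63-like configuration: Huffman-shaped wavelet tree over
    // rrr_vector<63>, with very sparse (inverse) suffix array sampling.
    typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::rrr_vector<63> >, 64000, 64000000> fm_index;

    fm_index idx;
    sdsl::construct(idx, argv[1], 1);     // build from the byte sequence in text_file
    std::string pattern = argv[2];
    std::cout << "occurrences: "
              << sdsl::count(idx, pattern.begin(), pattern.end()) << "\n";
    std::cout << "index size:  " << sdsl::size_in_bytes(idx) << " bytes\n";
    return 0;
}
```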
Observations
- The FM_HUFF implementation can answer count queries in almost half the time of SSA. This is not surprising, since the sdsl implementation uses the SSE built-in function for popcount on 64-bit words rather than multiple table lookups (see the sketch after this list). The FM_HUFF implementation now even outperforms the plain suffix array on the dna.200MB test case, since only about 4 rank queries have to be answered for each symbol of the count query.
- The FM_HUFF_RRR15 implementation uses slightly less space than SSA_RRR, since we use bit-compression to store integer sequences. The query time is also slightly faster, since we use a more advanced decoding technique [4].
- CSA_SADA and CSA take about the same amount of memory, but the query times of CSA are slightly faster than those of the sdsl implementation. This is due to some overhead produced by the coder class `coder::elias_delta`.
- The query times of FM_RLMN are slightly slower than those of RLFM, except for the first test case of highly compressible XML data. In this case the query time of RLFM is three times slower than that of FM_RLMN, and at the same time the sdsl implementation uses only \(65\%\) of the space of RLFM. This is a result of the sub data structures used. The Pizza&Chili implementation uses two uncompressed bitvectors with constant-time rank and logarithmic-time select, while FM_RLMN uses `sd_vector`, which is a 64-bit implementation of the SD-array of Okanohara and Sadakane. The space consumption of `sd_vector` is better if there are only a few runs in the BWT, and therefore FM_RLMN outperforms RLFM in the test cases dblp.xml.200MB and sources.200MB. Furthermore, the runtime of its rank and select operations is constant, and therefore we don't get a peak for the dblp file.
- The AF-index, which is in theory the most space-efficient index, uses more space than the indexes which use implicit compression boosting like SSA_RRR and FM_HUFF_RRR15. So here is room for more engineering.
- The implicit compression boosting index FM_HUFF_RRR63 uses the least space and its counting time is only slower by a small constant factor compared to more space-greedy solutions.
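To make the popcount remark in the first observation concrete, here is a small comparison of the two approaches on a 64-bit word. The lookup-table version mimics the classic byte-table method, while `__builtin_popcountll` compiles to the POPCNT instruction when `-msse4.2` is given (as for the sdsl indexes). This is an illustrative sketch, not code taken from either library.

```cpp
// Population count on a 64-bit word: table lookup vs. SSE 4.2 built-in.
// Compile with -msse4.2 so that __builtin_popcountll maps to POPCNT.
#include <cstdint>
#include <iostream>

static uint8_t POP8[256];   // 8-bit lookup table, filled once at startup

static void init_table() {
    for (int i = 0; i < 256; ++i) {
        uint8_t c = 0;
        for (int b = i; b; b >>= 1) c += b & 1;   // count the bits of i
        POP8[i] = c;
    }
}

inline unsigned popcount_table(uint64_t x) {
    unsigned res = 0;
    for (int i = 0; i < 8; ++i) {       // eight table lookups per 64-bit word
        res += POP8[x & 0xFF];
        x >>= 8;
    }
    return res;
}

inline unsigned popcount_sse(uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));   // single instruction
}

int main() {
    init_table();
    uint64_t w = 0x0123456789ABCDEFULL;
    std::cout << popcount_table(w) << " " << popcount_sse(w) << "\n";  // both print 32
    return 0;
}
```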
Conclusion
Using the sdsl library has several benefits:
- You can specify an index which outperforms all old indexes in terms of runtime (e.g. FM_HUFF). Even faster configurations are possible (e.g. use `rank_support_v` or `bit_vector_interleaved` with `csa_wt`).
- You can specify an index which outperforms all old indexes in terms of space (e.g. FM_HUFF_RRR63). Even more space-efficient configurations are possible: using `rrr_vector<255>`, for example, reduces the space for dblp.xml.200MB to \(14\%\) and increases the query time to \(13.8\,\mu\)secs. Type definitions for such variants are sketched after this list.
- All sdsl indexes are 64-bit implementations and can therefore be used for texts larger than 512 MiB [7].
- The runtime is predictable, since the select data structures are not implemented using a binary search.
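The two directions mentioned above translate into type definitions like the following. These are illustrative sketches, not benchmarked configurations: the class names follow sdsl-lite and may differ in other sdsl versions, and `bit_vector_interleaved` is omitted here because its exact name varies between versions.

```cpp
// Sketch: a more space-efficient and a faster configuration than those above.
// Class names follow sdsl-lite; other sdsl versions may name them differently.
#include <sdsl/suffix_arrays.hpp>

// Space-optimized: larger RRR blocks (255) shrink the wavelet tree further,
// at the price of slower on-the-fly block decoding.
typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::rrr_vector<255> >, 64000, 64000000>
        fm_huff_rrr255;

// Speed-optimized: plain bitvector with the larger but faster rank_support_v
// (25% overhead) instead of rank_support_v5.
typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::bit_vector, sdsl::rank_support_v<> >,
                     64000, 64000000>
        fm_huff_fast;
```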
What was not considered in this post are indexes which use the fixed block compression boosting proposed by Kärkkäinen and Puglisi at SPIRE 2011. From the data in their paper, these indexes are slightly smaller than SSA_RRR, while their query times are equal to or slightly faster than those of SSA.
- [1] Using a Huffman-shaped wavelet tree was first mentioned in [Mäkinen and Navarro, 2004], page 17.
- [2] The trick is that only 4 bits are required to store the count for one block. Claude decodes two blocks at once with a lookup table when possible. This makes the decoding significantly faster in practice but is not described in his paper.
- [3] `tNoSel` is a typedef for `select_support_dummy`.
- [4] Note that for block size \(15\) we also decode two blocks at once, as in the implementation of Claude, but whenever possible we decode even more blocks by calculating the prefix sum of multiple nibbles using the same technique that Knuth uses to calculate the popcount operation.
- [5] Note: For each bitvector class `X` there exist the typedefs `X::rank_1_type`, `X::rank_0_type`, `X::select_1_type`, and `X::select_0_type`.
- [6] Note: You can use a block size of up to \(256\) in the `rrr_vector`. Our experiments show that the runtime of the on-the-fly decompression depends linearly on the block size.
- [7] The limit is 512 MiB and not 4 GiB, since we have to address bits, not bytes.