New baselines for FM-indexes
The Pizza&Chili website provides a set of highly tuned compressed index implementations which can be used through a common interface, along with a text corpus consisting of texts of different sizes originating from several application domains. Ferragina et al. [JEA2008] published a thorough experimental study in which these indexes were evaluated. I was curious how the stated results would change if I reran the experiments of the paper on my machine, and furthermore how their results compare to my own implementations. Therefore, I downloaded the indexes and test cases, built the programs (instructions to compile the indexes on a 64-bit system), and wrote a script which reproduces Table VI of the JEA paper. The table lists the query time and space of the data structures for count queries. I changed the presentation slightly: the LZ-index was not included, but the run-length compressed FM-index implemented by Mäkinen and González and the succinct suffix array which uses the RRR implementation of Claude were added.
Before we come to the results, let us briefly revisit the experimental setup.
The machine was equipped with 8 Intel Xeon E5640 Dual Core Processors which share a 12MB smart cache and 144GB of DDR3 DRAM; only one core was used for the experiments. The operating system was Ubuntu Linux 11.10 and we used gcc version 4.6.1. Where necessary we added the compile option `-m32`, and we added the compile option `-msse4.2` when using sdsl indexes.
Indexes were built and \(k=50,000\) patterns, each of length \(m=20\), were extracted from each test file. Subsequently, each index was queried with the corresponding pattern set and the total query time was divided by \(k\cdot m\) to get the average time spent on matching one character of a pattern.
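The measurement procedure boils down to a loop like the following sketch. This is not the original benchmark script; the `Index` type and its `count()` method are placeholders for whichever implementation is tested, and only the structure of the timing loop reflects the setup described above.

```cpp
// Minimal sketch of the timing loop, not the original benchmark script.
// `Index` and its count() method stand in for whichever index is tested.
#include <chrono>
#include <string>
#include <vector>

template<class Index>
double avg_time_per_char(const Index& idx, const std::vector<std::string>& patterns) {
    using clock = std::chrono::high_resolution_clock;
    volatile size_t checksum = 0;   // keep the compiler from optimizing the calls away
    size_t chars = 0;               // will end up as k*m
    auto start = clock::now();
    for (const auto& p : patterns) {
        checksum = checksum + idx.count(p);   // one count query per pattern
        chars += p.size();
    }
    auto stop = clock::now();
    double usec = std::chrono::duration<double, std::micro>(stop - start).count();
    return usec / chars;            // average time per matched character
}
```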
The tested implementations correspond to the following approaches:
- SSA: The succinct suffix array is just a Huffman-shaped [1] wavelet tree of the Burrows-Wheeler transform (BWT) of the text; count queries are answered by backward search (see the sketch after this list). The space of the structure is about \(n H_0(T)\) bits, where \(T\) denotes the text.
- SSA_RRR: The same as SSA, but the bitvector is replaced by the RRR implementation of Claude, which uses blocks of size \(15\) and a fast decoding method for the block counts [2]. Combining the Huffman-shaped wavelet tree with compressed bitvectors results in an \(H_k\)-compressed scheme.
- CSA: Sadakane's implementation of his compressed suffix array, which is based on the \(\Psi\)-function.
- RLFM: The FM-index based on a run-length encoded wavelet tree [Mäkinen and Navarro, 2004].
- AF-index: The alphabet-friendly FM-index (see [Fer2007]), which partitions the BWT into regions and compresses every region with an SSA. The AF-index uses a partitioning such that the overall space consumption is minimized.
- plain SA: The uncompressed suffix array (see [Manber and Myers 1993] and [Gonnet et al. 1992]); binary search is used to count patterns.
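The FM-index variants above (SSA, SSA_RRR, RLFM, AF-index) all answer a count query by backward search: the pattern is processed right to left, and for each character two rank queries on the wavelet tree of the BWT narrow the current suffix array interval. Here is a minimal, self-contained sketch of the idea; the naive `rank()` scan merely stands in for the wavelet tree, whose job is exactly to answer this query quickly.

```cpp
// Naive sketch of backward search for count(P) on an FM-index.
// A real index answers rank() via a wavelet tree in O(log sigma) time;
// here rank() is a plain scan over the stored BWT to keep the example small.
#include <array>
#include <cstdint>
#include <string>

struct BackwardSearchSketch {
    std::string bwt;                  // BWT of the text (including the sentinel)
    std::array<uint64_t, 256> C{};    // C[c] = #symbols in the text smaller than c

    explicit BackwardSearchSketch(std::string b) : bwt(std::move(b)) {
        std::array<uint64_t, 256> cnt{};
        for (unsigned char c : bwt) ++cnt[c];
        uint64_t sum = 0;
        for (int c = 0; c < 256; ++c) { C[c] = sum; sum += cnt[c]; }
    }
    // rank(c, i): occurrences of c in bwt[0, i) -- naive stand-in for a wavelet tree
    uint64_t rank(unsigned char c, uint64_t i) const {
        uint64_t r = 0;
        for (uint64_t j = 0; j < i; ++j)
            r += (static_cast<unsigned char>(bwt[j]) == c);
        return r;
    }
    // count(P): size of the suffix array interval of P
    uint64_t count(const std::string& P) const {
        uint64_t lb = 0, rb = bwt.size();          // current interval [lb, rb)
        for (auto it = P.rbegin(); it != P.rend() && lb < rb; ++it) {
            unsigned char c = static_cast<unsigned char>(*it);
            lb = C[c] + rank(c, lb);               // two rank queries per character
            rb = C[c] + rank(c, rb);
        }
        return rb - lb;                            // interval size = #occurrences
    }
};
```

For example, with `BackwardSearchSketch idx("ard$rcaaaabb")` (the BWT of `abracadabra$`), `idx.count("abra")` returns 2.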
Here are the results of the experiments (time in \(\mu\)s per pattern character, space as a fraction of the original text size):
| Text | SSA Time | SSA Space | SSA_RRR Time | SSA_RRR Space | CSA Time | CSA Space | RLFM Time | RLFM Space | AF-index Time | AF-index Space | plain SA Time | plain SA Space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 1.300 | 0.83 | 1.776 | 0.47 | 2.648 | 0.29 | 6.108 | 0.65 | 0.840 | 0.54 | 0.436 | 5.00 |
| dna.200MB | 0.548 | 0.42 | 0.812 | 0.43 | 3.132 | 0.46 | 1.216 | 0.75 | 0.768 | 0.48 | 0.388 | 5.00 |
| english.200MB | 1.156 | 0.73 | 1.676 | 0.53 | 2.776 | 0.44 | 1.744 | 0.80 | 1.048 | 0.65 | 0.372 | 5.00 |
| proteins.200MB | 1.024 | 0.69 | 1.668 | 0.69 | 2.672 | 0.67 | 1.704 | 0.89 | 1.168 | 0.82 | 0.368 | 5.00 |
| sources.200MB | 1.356 | 0.85 | 1.960 | 0.54 | 2.748 | 0.38 | 1.864 | 0.74 | 1.156 | 0.73 | 0.364 | 5.00 |
When you compare this with the table in the paper, you can make two observations:
- The runtimes are faster by almost a factor of 2. This is due to the use of recent hardware.
- Surprisingly, the space consumption was larger than in the paper, despite my configuration of a very sparse sampling for the information which is only needed for locate queries (suffix array and inverse suffix array values). After contacting the authors it turned out that a bitvector is used to indicate which samples are stored. This bitvector is neither removed nor compressed, even if there are very few or no samples.
So we are a little bit naughty and just subtract the space needed for this unnecessary part of the data structure and get the following table:
| Text | SSA Time | SSA Space | SSA_RRR Time | SSA_RRR Space | CSA Time | CSA Space | RLFM Time | RLFM Space | AF-index Time | AF-index Space | plain SA Time | plain SA Space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 1.300 | 0.69 | 1.776 | 0.34 | 2.648 | 0.29 | 6.108 | 0.52 | 0.840 | 0.41 | 0.436 | 5.00 |
| dna.200MB | 0.548 | 0.29 | 0.812 | 0.30 | 3.132 | 0.46 | 1.216 | 0.62 | 0.768 | 0.35 | 0.388 | 5.00 |
| english.200MB | 1.156 | 0.60 | 1.676 | 0.40 | 2.776 | 0.44 | 1.744 | 0.67 | 1.048 | 0.52 | 0.372 | 5.00 |
| proteins.200MB | 1.024 | 0.56 | 1.668 | 0.55 | 2.672 | 0.67 | 1.704 | 0.76 | 1.168 | 0.69 | 0.368 | 5.00 |
| sources.200MB | 1.356 | 0.72 | 1.960 | 0.41 | 2.748 | 0.38 | 1.864 | 0.61 | 1.156 | 0.59 | 0.364 | 5.00 |
Let us now take some configurations of sdsl data structures which more or less correspond to the structures selected above, and do exactly the same experiment again. Here are the results:
| Text | FM_HUFF Time | FM_HUFF Space | FM_HUFF_RRR15 Time | FM_HUFF_RRR15 Space | CSA_SADA Time | CSA_SADA Space | FM_RLMN Time | FM_RLMN Space | FM_HUFF_RRR63 Time | FM_HUFF_RRR63 Space |
|---|---|---|---|---|---|---|---|---|---|---|
| dblp.xml.200MB | 0.760 | 0.70 | 1.352 | 0.32 | 2.676 | 0.29 | 1.820 | 0.34 | 2.072 | 0.17 |
| dna.200MB | 0.288 | 0.29 | 0.720 | 0.28 | 3.460 | 0.50 | 1.228 | 0.79 | 1.656 | 0.24 |
| english.200MB | 0.708 | 0.61 | 1.480 | 0.38 | 3.108 | 0.44 | 1.892 | 0.69 | 2.640 | 0.27 |
| proteins.200MB | 0.632 | 0.56 | 1.564 | 0.53 | 3.360 | 0.65 | 1.796 | 0.89 | 3.288 | 0.48 |
| sources.200MB | 0.868 | 0.73 | 1.684 | 0.39 | 2.900 | 0.37 | 1.964 | 0.53 | 2.832 | 0.26 |
Implementation details:
- FM_HUFF: Corresponds to this type (don't be scared by the length of the definition): `csa_wt<wt_huff<bit_vector,rank_support_v5<>,tNoSel,tNoSel>,64000,64000000>`. Explanation: We take a compressed suffix array based on a wavelet tree (`csa_wt`). The first template parameter specifies the used wavelet tree; we take a Huffman-shaped one, i.e. `wt_huff`. This wavelet tree uses the 5% overhead rank structure `rank_support_v5` (proposed by Vigna) and a dummy class for the select functionality [3], since we don't need select for counting. We set the sampling rate of suffix array values to \(64000\) and the rate for inverse suffix array values to \(64000000\), so we don't waste space on unnecessary locating information.
- FM_HUFF_RRR15: Corresponds to `csa_wt<wt_huff<rrr_vector<15> >,64000,64000000>`. The only thing we have changed is the bitvector implementation: now we use the sdsl class `rrr_vector` parametrized with block size \(15\) [4]. We don't specify the rank and select support structures, since `wt_huff` takes the default supporting structures [5].
- CSA_SADA: We use Elias-\(\delta\) codes to compress the pairwise differences of the \(\Psi\)-function and sample every 128th value of the \(\Psi\)-function. This translates into the type definition `csa_sada<enc_vector<coder::elias_delta,128>,64000,64000000>`.
- FM_RLMN: Corresponds to `csa_wt<wt_rlmn<>,64000,64000000>`. We use the run-length compressed wavelet tree `wt_rlmn`, which uses the class `sd_vector` as its default compressed bitvector representation.
- FM_HUFF_RRR63: The only difference to FM_HUFF_RRR15 is that we now use a block size of \(63\) for the RRR structure: `csa_wt<wt_huff<rrr_vector<63> >,64000,64000000>`. The decoding of a block is now not done with lookup tables but with the on-the-fly decoding approach proposed in the recent SEA paper by Navarro and Providel [6]. (A usage sketch for this kind of configuration follows the list.)
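For completeness, here is roughly how such a configuration is instantiated and queried. The sketch assumes the sdsl-lite style interface (`construct()`, `sdsl::count()`, `size_in_bytes()`); names and signatures may differ slightly in the sdsl version used for the experiments above.

```cpp
// Sketch: building one of the configurations above and running a count query.
// Assumes the sdsl-lite interface; the exact API of the sdsl version used for
// the experiments may differ.
#include <sdsl/suffix_arrays.hpp>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 3) {
        std::cout << "usage: " << argv[0] << " text_file pattern\n";
        return 1;
    }
    // FM_HUFF_RRR63-like configuration: Huffman-shaped wavelet tree over
    // rrr_vector<63>, with very sparse (inverse) suffix array sampling.
    typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::rrr_vector<63> >, 64000, 64000000> fm_index;

    fm_index idx;
    sdsl::construct(idx, argv[1], 1);     // build from the byte sequence in text_file
    std::string pattern = argv[2];
    std::cout << "occurrences: "
              << sdsl::count(idx, pattern.begin(), pattern.end()) << "\n";
    std::cout << "index size:  " << sdsl::size_in_bytes(idx) << " bytes\n";
    return 0;
}
```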
Observations
- The FM_HUFF implementation can answer count queries in almost half the time of SSA. This is not surprising, since the sdsl implementation uses the SSE built-in function for popcount on 64-bit words rather than multiple table lookups (see the sketch after this list). The FM_HUFF implementation now even outperforms the plain suffix array on the dna.200MB test case, since only about 4 rank queries have to be answered for each symbol of the count query.
- The FM_HUFF_RRR15 implementation uses slightly less space than SSA_RRR, since we use bit-compression to store integer sequences. The query time is also slightly faster, since we use a more advanced decoding technique [4].
- CSA_SADA and CSA take about the same amount of memory, but the query times of CSA are slightly faster than those of the sdsl implementation. This is due to some overhead produced by the coder class `coder::elias_delta`.
- The query times of FM_RLMN are slightly slower than those of RLFM, except for the first test case of highly compressible XML data. In this case the query time of RLFM is three times slower than that of FM_RLMN, and at the same time the sdsl implementation uses only \(65\%\) of the space of RLFM. This is a result of the sub data structures used. The Pizza&Chili implementation uses two uncompressed bitvectors with constant-time rank and logarithmic-time select, while FM_RLMN uses `sd_vector`, which is a 64-bit implementation of the SD-array of Okanohara and Sadakane. The space consumption of `sd_vector` is better if there are only a few runs in the BWT, and therefore FM_RLMN outperforms RLFM in the test cases dblp.xml.200MB and sources.200MB. Furthermore, the runtime of its rank and select operations is constant, and therefore we don't get a peak for the dblp file.
- The AF-index, which is in theory the most space-efficient index, uses more space than the indexes which use implicit compression boosting like SSA_RRR and FM_HUFF_RRR15. So here is room for more engineering.
- The implicit compression boosting index FM_HUFF_RRR63 uses the least space and its counting time is only slower by a small constant factor compared to more space-greedy solutions.
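To make the popcount remark in the first observation concrete, here is a small comparison of the two approaches on a 64-bit word. The lookup-table version mimics the classic byte-table method, while `__builtin_popcountll` compiles to the POPCNT instruction when `-msse4.2` is given (as for the sdsl indexes). This is an illustrative sketch, not code taken from either library.

```cpp
// Population count on a 64-bit word: table lookup vs. SSE 4.2 built-in.
// Compile with -msse4.2 so that __builtin_popcountll maps to POPCNT.
#include <cstdint>
#include <iostream>

static uint8_t POP8[256];   // 8-bit lookup table, filled once at startup

static void init_table() {
    for (int i = 0; i < 256; ++i) {
        uint8_t c = 0;
        for (int b = i; b; b >>= 1) c += b & 1;   // count the bits of i
        POP8[i] = c;
    }
}

inline unsigned popcount_table(uint64_t x) {
    unsigned res = 0;
    for (int i = 0; i < 8; ++i) {       // eight table lookups per 64-bit word
        res += POP8[x & 0xFF];
        x >>= 8;
    }
    return res;
}

inline unsigned popcount_sse(uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));   // single instruction
}

int main() {
    init_table();
    uint64_t w = 0x0123456789ABCDEFULL;
    std::cout << popcount_table(w) << " " << popcount_sse(w) << "\n";  // both print 32
    return 0;
}
```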
Conclusion
Using the sdsl library has several benefits:
- You can specify an index which outperforms all old indexes in terms of runtime (e.g. FM_HUFF). Even faster configurations are possible (e.g. use `rank_support_v` or `bit_vector_interleaved` with `csa_wt`).
- You can specify an index which outperforms all old indexes in terms of space (e.g. FM_HUFF_RRR63). Even more space-efficient configurations are possible: using `rrr_vector<255>`, for example, reduces the space for dblp.xml.200MB to \(14\%\) and increases the query time to \(13.8\,\mu\)secs. Type definitions for such variants are sketched after this list.
- All sdsl indexes are 64-bit implementations and can therefore be used for texts larger than 512 MiB [7].
- The runtime is predictable, since the select data structures are not implemented using a binary search.
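The two directions mentioned above translate into type definitions like the following. These are illustrative sketches, not benchmarked configurations: the class names follow sdsl-lite and may differ in other sdsl versions, and `bit_vector_interleaved` is omitted here because its exact name varies between versions.

```cpp
// Sketch: a more space-efficient and a faster configuration than those above.
// Class names follow sdsl-lite; other sdsl versions may name them differently.
#include <sdsl/suffix_arrays.hpp>

// Space-optimized: larger RRR blocks (255) shrink the wavelet tree further,
// at the price of slower on-the-fly block decoding.
typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::rrr_vector<255> >, 64000, 64000000>
        fm_huff_rrr255;

// Speed-optimized: plain bitvector with the larger but faster rank_support_v
// (25% overhead) instead of rank_support_v5.
typedef sdsl::csa_wt<sdsl::wt_huff<sdsl::bit_vector, sdsl::rank_support_v<> >,
                     64000, 64000000>
        fm_huff_fast;
```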
What was not considered in this post are indexes which use the fixed block compression boosting proposed by Kärkkäinen and Puglisi at SPIRE 2011. From the data in their paper, these indexes are slightly smaller than SSA_RRR, while their query times are equal to or slightly faster than those of SSA.
- [1] Using a Huffman-shaped wavelet tree was first mentioned in [Mäkinen and Navarro, 2004], page 17.
- [2] The trick is that only 4 bits are required to store the count for one block. Claude decodes two blocks at once with a lookup table when possible. This makes the decoding significantly faster in practice but is not described in his paper.
- [3] `tNoSel` is a typedef for `select_support_dummy`.
- [4] Note that for block size \(15\) we also decode two blocks at once, as in the implementation of Claude, but whenever possible we decode even more blocks by calculating the prefix sum of multiple nibbles using the same technique that Knuth uses to calculate the popcount operation.
- [5] Note: For each bitvector class `X` there exist the typedefs `X::rank_1_type`, `X::rank_0_type`, `X::select_1_type`, and `X::select_0_type`.
- [6] Note: You can use a block size of up to \(256\) in the `rrr_vector`. Our experiments show that the runtime of the on-the-fly decompression depends linearly on the block size.
- [7] The limit is 512 MiB and not 4 GiB, since we have to address bits, not bytes.