Engineering Small Space Dictionary Matching
© Shoshana (Neuberger) Marcus and Dina Sokol
The dictionary matching problem is to locate occurrences of any pattern among a set of patterns in
a given text. Massive data sets abound and at the same time, there are many settings in which working
space is extremely limited. We developed the first dictionary matching software for the space-constrained
environment whose running time is close to linear. We use the compressed suffix tree as the underlying
data structure of our algorithm. The working space of our algorithm is proportional to the optimal
compression of the dictionary yet the running time is almost linear. We also contribute a succinct tool
for performing constant-time lowest marked ancestor queries on a tree that is succinctly encoded as a
sequence of balanced parentheses, with linear time preprocessing of the tree. This tool should be useful
in many other applications.
Installation Instructions
- Download and install SDSL, the Succinct Data Structures Library from https://github.com/simongog/sdsl
- Extract the files using the Linux command tar -xvzf dictMatchCST.gz
-
Compile the dictMatchCST.cpp file.
Compile program in Linux:
g++ -O3 -DNDEBUG -funroll-loops -I${HOME}/include/ -L${HOME}/lib/ -o dictMatch dictMatchCST.cpp -lsdsl -ldivsufsort -ldivsufsort64
Run program in Linux:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:path_to_libdivsufsort_lib_directory ./dictMatch pats.txt text.txt
The file CST_dt.cpp specifies the data structure for the CST and implementations of compressed suffix array and compressed LCP arrays to use as components.
Input Files
- Dictionary of new-line separated patterns
- Text to search for patterns
NOTE: neither text nor any pattern should contain an end of line character or a #.
These special characters are specified in the file CST_LMA.cpp and are easily modified.