Engineering Small Space Dictionary Matching

© Shoshana (Neuberger) Marcus and Dina Sokol

The dictionary matching problem is to locate occurrences of any pattern among a set of patterns in a given text. Massive data sets abound and at the same time, there are many settings in which working space is extremely limited. We developed the first dictionary matching software for the space-constrained environment whose running time is close to linear. We use the compressed suffix tree as the underlying data structure of our algorithm. The working space of our algorithm is proportional to the optimal compression of the dictionary yet the running time is almost linear. We also contribute a succinct tool for performing constant-time lowest marked ancestor queries on a tree that is succinctly encoded as a sequence of balanced parentheses, with linear time preprocessing of the tree. This tool should be useful in many other applications.

Download here

Installation Instructions

  1. Download and install SDSL, the Succinct Data Structures Library from https://github.com/simongog/sdsl
  2. Extract the files using the Linux command tar -xvzf dictMatchCST.gz
  3. Compile the dictMatchCST.cpp file.
  4. Compile program in Linux:
    g++ -O3 -DNDEBUG -funroll-loops -I${HOME}/include/ -L${HOME}/lib/ -o dictMatch dictMatchCST.cpp -lsdsl -ldivsufsort -ldivsufsort64

    Run program in Linux:
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:path_to_libdivsufsort_lib_directory ./dictMatch pats.txt text.txt

    The file CST_dt.cpp specifies the data structure for the CST and implementations of compressed suffix array and compressed LCP arrays to use as components.

Input Files

  1. Dictionary of new-line separated patterns
  2. Text to search for patterns

NOTE: neither text nor any pattern should contain an end of line character or a #.
These special characters are specified in the file CST_LMA.cpp and are easily modified.