MAG for approximate pattern matching (MAGA)

About

MAGA (MAG for approximate pattern matching) is algorithm designed for approximage pattern matching. The algorithm is based on idea of partitioning into exact pattern matching. The MAGA is based on previous work MAG (http://www.stringology.org/event/2014/p07.html) which was designed for exact multiple pattern matching. MAGA adopts the algorithm to approximate pattern matching problem.

This is a source code of MAGA that was written for research purpose and has a minimal error checking. The code may be not be very readable and comments may not be adequate. There is no warranty, your use of this code is at your own risk.

Requirements

C++ compiler compatible with C++11 standard
Unix based 64-bit (should also work on 32-bit but performance is worse)
Python 3 (for testing)
Docker (optionally)

Compilation

To compile all variants of algorithm simply run:

git clone https://github.com/rsusik/maga maga
cd maga
make all

Usage

The make all compilation generates multiple variants of the algorithm. All of them can be executed from command line as follows:

./program PATTERNS_FILE PATERN_SIZE TEXT_FILE U K Q_SIZE SIGMA

where:

PATTERNS_FILE - location of file with patterns
PATERN_SIZE - pattern size (e.g. 32)
TEXT_FILE - text file
U - AOSO parameter
K - AOSO parameter
Q_SIZE - q-gram size
SIGMA - deprecated (put some constant e.g. 5)

For more details please refer to mentioned paper.

Example:

./approx_mag_l2_k1 /media/disk1/patterns.m16 16 /media/disk1/english.200MB 4 2 6 5

Testing

Scripts

generate_patterns.py - generate patterns
test.py - tests the algorithms

For testing purpose the test.py script can be used.

Note: Patterns, datasets and algorithms default locations can be changed in the script file.

Note: The script (if possible) will automaticaly download datasets from Pizza&Chilli Corpus (english, proteins, sources, xml.dblp, dna of size 50MB, 100MB and 200MB).

Following parameters are supported by test.py:

  -h, --help           show this help message and exit
  -r R, --npatterns R  number of patterns
  -a A, --algorithm A  algorithm[s] to be tested
  -c C, --corpus C     corpus
  -m M, --length M     pattern length[s] (e.g. 8,16,32)
  -u U, --faosou U     FAOSO parameter U
  -k K, --faosok K     FAOSO parameter k
  -q Q, --q-gram Q     q-gram size
  -s S, --sigma S      dest. alph. size

Example:

python3 test.py -a approx_mag_l3_k1 -c english.50MB -r 100 -m 8,16,32,64 -u 8 -k 1,2 -q 2,3,4,6,8 -s 5

Note: If the set of parameters is incorrect (e.g. q > m) the errors will be produced in output (which can be ignored).

Generate patterns

The script generate_patterns.py may be used to generate patterns.

Example:

$ python generate_patterns.py english.10MB 100 8,16,32,64
Patterns generated to file ./patterns.r100/patterns.english.10MB.8.bin
Patterns generated to file ./patterns.r100/patterns.english.10MB.16.bin
Patterns generated to file ./patterns.r100/patterns.english.10MB.32.bin
Patterns generated to file ./patterns.r100/patterns.english.10MB.64.bin

The above command generates 100 patterns of length m={8, 16, 32, 64}.

Note: Please make sure the folders patterns.r[r] (where [r] is the number of patterns) exists (e.g. ./patterns.r100).

Docker

The simplest way you can test the algorithm is by using docker. All you need to do is to:

build the docker image or pull from repository
and run the docker container

Pull or build the image:

Pull from repository steps:

 docker pull rsusik/maga
 docker tag docker.io/rsusik/maga maga

Build steps:

Clone the git repository:

 git clone https://github.com/rsusik/maga maga
 cd maga

Build the image:

 docker build -t maga .

Run test with default parameters:

docker run maga

This will execute test with default parameters and output the results.

Additionally you may provide all the parameters mentioned in section Testing as follows:

docker run maga -a approx_mag_l3_k1,approx_mag_l2_k1 -c english.100MB -r 10 -m 8,16,32,64 -u 8 -k 1,2 -q 2,3,4,6,8 -s 5

Above command supports only a mentioned Pizza&Chilli corpuses but you may provide your own datasets using -v parameter. Assuming you want to execute test for dataset mydataset.5MB that is located in /home/ok/mysets you have to execute following command:

docker run -v /home/ok/mysets:/app/sets maga -a maga -c mydataset.5MB

Citing

@article {S17,
  author  = {R. Susik},
  title   = {Applying a $q$-gram based Multiple String Matching Algorithm for Approximate Matching},
  journal = {Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie \'Srodowiska},
  volume  = {7},
  number  = {3},
  issn    = {2083-0157},
  year    = {2017},
  pages   = {47--50}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
libraries		libraries
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
approx_cf.cpp		approx_cf.cpp
approx_cf.h		approx_cf.h
approx_mag_dyn.cpp		approx_mag_dyn.cpp
approx_mag_dyn.h		approx_mag_dyn.h
approx_shor.h		approx_shor.h
generate_patterns.py		generate_patterns.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libraries

libraries

Dockerfile

Dockerfile

Makefile

Makefile

README.md

README.md

approx_cf.cpp

approx_cf.cpp

approx_cf.h

approx_cf.h

approx_mag_dyn.cpp

approx_mag_dyn.cpp

approx_mag_dyn.h

approx_mag_dyn.h

approx_shor.h

approx_shor.h

generate_patterns.py

generate_patterns.py

test.py

test.py

Repository files navigation

MAG for approximate pattern matching (MAGA)

About

Requirements

Compilation

Usage

Testing

Scripts

Docker

Citing

About

Languages

rsusik/maga

Folders and files

Latest commit

History

Repository files navigation

MAG for approximate pattern matching (MAGA)

About

Requirements

Compilation

Usage

Testing

Scripts

Docker

Citing

About

Topics

Resources

Stars

Watchers

Forks

Languages