Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

any comparation with lucene's Levenshtein Automaton ? #72

Open
byronhe opened this issue Aug 21, 2019 · 3 comments
Open

any comparation with lucene's Levenshtein Automaton ? #72

byronhe opened this issue Aug 21, 2019 · 3 comments

Comments

@byronhe
Copy link

byronhe commented Aug 21, 2019

http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

@wolfgarbe
Copy link
Owner

Something like http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata or https://issues.apache.org/jira/browse/LUCENE-2507 ?

From what I understand from Michael McCandless post:

Prior to 4.0, FuzzyQuery took a brute force approach: it visits every single unique term in the dictionary, computes the Levenshtein distance for it, and accepts the term if the edit distance<=maximum edit distance.

From 4.0 they were pre-calculating a Levenshtein automaton, that accepts only the terms within edit distance N. And then at query time they were intersecting that automaton with all the terms in the dictionary.

This would mean that they save time, because the don't need to calculate the Levenshtein distance for every single term, but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary, whereas SymSpell has to calculate the Levenshtein distance only for 0.0042% to 0.016% of the vocabulary.

It would sure be interesting to extend the benchmark, to see how the two algorithms compare in reality.

@missinglink
Copy link

but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary

This is not the case if the dictionary is also indexed in a trie structure.

In that case, intersecting the automata with the dictionary is a simple recursive algorithm which only follows branches of the dictionary with corresponding transitions in the state machine.

@wolfgarbe
Copy link
Owner

It seems like somebody did indeed benchmark SymSpell vs. Lucene's spelling correction:
image
Source: https://github.com/Shivakumar-Narayanan/Spell-Check
https://docs.google.com/document/d/1QQROh8ndwBHbPwx2t1kKZcquHDDfF0Pz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants