
[Question] About SymSpell model and probabilistic models (Norvig, etc.) #61

loretoparisi opened this issue Mar 14, 2019 · 5 comments


@loretoparisi

I'm currently using both Hunspell and SymSpell as my main spelling-correction systems. They both work fine, and SymSpell works great (quality, performance, etc.). That said, I have a question about Norvig's probabilistic spell checker, which I'll illustrate with a simple case.
In some romanized languages there is no one-to-one relation between a term in the source script and its English (romanized) form. So given the romanization of, say, Hindi, you get several possible English words as output. The typical result of such a system is: 1 (Hindi) word -> N (eng) words.
Deciding which of the N words is best is typically done with algorithms like beam search, Viterbi, etc., but there are plenty of cases where the ambiguity remains.
In the other direction we also have eng (N) -> hi (M), so the mapping is not bijective at all.
Given that a spell checker has knowledge of all (or most of) the words in a language, and supposing I need context (as in this case) to go back from eng (N) -> hi (M), do you think SymSpell or Norvig's probabilistic model could give a valid hint about the M choices (or the N in the opposite direction)? What's your opinion on that?


wolfgarbe commented Mar 14, 2019

Are you referring to the ITRANS scheme of Devanagari transliteration?

Character-based transliteration:
There seem to be some straightforward solutions for resolving the ambiguity of the 1-to-N translation of characters: https://pandey.github.io/posts/transliterate-devanagari-to-latin.html

Word-based transliteration:
Of course it would also be possible to compile a dictionary with all pairs of correct Devanagari-Latin word-wise transliterations. As far as I understand, word-based transliteration involves no ambiguity. The dictionary could either be created manually or compiled automatically by processing a corpus of existing correct transliterations together with the originals.

SymSpell and Norvig's spelling correction are both word-based.
The ambiguity in word-based spelling correction arises because we don't know which errors were introduced into the input term. With the Damerau–Levenshtein edit distance we have a measure of how closely two terms resemble each other, and with the word frequency of the candidate terms we have a measure of the occurrence probability of a specific term in the corpus. Both measures combined help us choose the most likely correction candidate.
This can be extended to consider the context of the other words in a sentence or the whole document.
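A minimal Python sketch of this combined ranking (not SymSpell's actual algorithm, which generates candidates far more efficiently; the frequency counts below are invented):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def best_correction(term, word_frequencies, max_distance=2):
    """Rank candidates by smallest edit distance, breaking ties
    with the highest corpus frequency."""
    scored = [(damerau_levenshtein(term, w), -freq, w)
              for w, freq in word_frequencies.items()]
    scored = [s for s in scored if s[0] <= max_distance]
    return min(scored)[2] if scored else None

frequencies = {"milne": 51_000, "milane": 38_000, "mine": 420_000}  # made up
print(best_correction("miline", frequencies))  # -> "milne" (distance 1, higher count)
```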

Using this model (word frequencies) for transliteration would only make sense if there were ambiguity in word-based transliteration - which I don't see (but I don't know Hindi and Devanagari).


loretoparisi commented Mar 14, 2019

@wolfgarbe Yes, exactly, that is the main issue.
Here is a typical example:

```python
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
```

I'm using indic-trans for this task. The problem raised by native Hindi speakers is that for a word like u'मिलने' you can get u'milne' (meeting) or u'milane' depending on the context. Now, if I use Hunspell I will get suggestions, but of course no probability (term-frequency) model is involved.
My idea (and question to you) is: if I use SymSpell (I have word frequencies from Wikipedia, etc.), would SymSpell then suggest a term like milne or milane based on the previous/next terms, etc.? Of course SymSpell would be the best option, thanks to your implementation :)
Do you think it could work this way?
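For concreteness, a sketch of the setup I have in mind, using the symspellpy Python port (the dictionary entries and counts below are invented):

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# Invented unigram counts standing in for Wikipedia-derived frequencies.
sym_spell.create_dictionary_entry("milne", 51000)
sym_spell.create_dictionary_entry("milane", 38000)

# lookup() orders suggestions by edit distance first, then by frequency;
# the surrounding sentence plays no role in this ranking.
for suggestion in sym_spell.lookup("miline", Verbosity.ALL, max_edit_distance=2):
    print(suggestion.term, suggestion.distance, suggestion.count)
```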

@wolfgarbe

To utilize sentence-wide context to resolve ambiguity, you need n-gram probabilities (co-occurrence probabilities between multiple terms), not the single-word probabilities (word frequencies) used in SymSpell/Norvig. See also Using N-grams to Process Hindi Queries with Transliteration Variations.
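A toy illustration of the n-gram idea in Python (all counts and words invented): the transliteration variant chosen is the one most likely to follow the previous word.

```python
# Hypothetical bigram and unigram counts from a romanized-Hindi corpus.
bigram_counts = {("kal", "milne"): 120, ("kal", "milane"): 15}
unigram_counts = {"kal": 400}

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) estimated from raw counts (no smoothing)."""
    return bigram_counts.get((prev, word), 0) / unigram_counts.get(prev, 1)

candidates = ["milne", "milane"]
prev_word = "kal"
best = max(candidates, key=lambda w: bigram_prob(prev_word, w))
print(best)  # -> "milne", because it co-occurs with "kal" more often
```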

A similar approach uses GloVe word vectors. See A simple spell checker built from word vectors.
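A rough sketch of that word-vector idea, assuming pre-trained vectors in word2vec text format loaded with gensim (the file path, candidates, and context words are placeholders):

```python
# Rank spelling candidates by cosine similarity to the average vector
# of the surrounding context words; assumes the words are in-vocabulary.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

def rank_by_context(candidates, context_words):
    """Score each candidate by cosine similarity to the mean context vector."""
    context = np.mean([vectors[w] for w in context_words if w in vectors], axis=0)
    context /= np.linalg.norm(context)
    scores = {}
    for c in candidates:
        if c in vectors:
            v = vectors[c] / np.linalg.norm(vectors[c])
            scores[c] = float(np.dot(v, context))
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank_by_context(["milne", "milane"], ["kal", "dost", "se"]))
```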

While I'm planning this for a future release, SymSpell currently does not use any sentence- or document-wide context to select the appropriate spelling suggestion.


loretoparisi commented Mar 14, 2019

@wolfgarbe Thanks a lot! I'm using Word2vec for other similar tasks (top nearest words, etc.), and that is a great article, actually one of the first on the spell-checker and word-embedding approach. My concern here is performance, because Word2vec models are typically huge binary files (word vectors of dimension, say, 100-300 for each word/subword), so you can end up with 1 GB to 4 GB files, and applying them could be a problem. As for response times, fastText inference takes around 10-12 ms; I'm not sure how that compares to SymSpell, but my impression is that in those cases the Word2vec approach could slow things down unless you quantize (hence shrink) the models without losing too much accuracy.
It would be awesome to use an embedding approach in SymSpell (CBOW with positional weights, maybe...)!
I will do more research, and if you like I'll report the results back here.

@wolfgarbe

Let me know if you find something interesting. Thanks.
