
Update the logic of misspell identification #44

Open
R1j1t opened this issue Dec 21, 2020 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

R1j1t (Owner) commented Dec 21, 2020

Is your feature request related to a problem? Please describe.
The current logic of misspelling identification relies on vocab.txt from the transformer model. BERT tokenizers break uncommon words into subwords and store those subwords in vocab.txt. Hence the original word might not be present in vocab.txt and gets flagged as misspelt.
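The splitting behaviour described above can be illustrated with a minimal greedy WordPiece sketch. The toy vocabulary below is illustrative only, not the real BERT vocab.txt:

```python
# Toy vocabulary standing in for a real vocab.txt; "##" marks a
# continuation subword, as in BERT's WordPiece scheme.
TOY_VOCAB = {"misspelt", "maj", "##our", "major"}

def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, as WordPiece does."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching piece at all
    return pieces

print(wordpiece("majour", TOY_VOCAB))  # ['maj', '##our']
print("majour" in TOY_VOCAB)           # False -> flagged as misspelt
```

Even though `majour` tokenizes cleanly into known subwords, the whole word is absent from the vocabulary, so a plain membership check flags it.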

Describe the solution you'd like
Still not clear, need to look into some papers on this.

Describe alternatives you've considered
The alternatives I can think of right now are two-fold:

  • ask the user to provide a list of such words and append them to the vocab.txt from the transformers model
  • if the proposed change is a subword (##x), compute the edit distance against the detokenised form of that word joined with the previous token
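The second alternative could be sketched roughly as follows. A hand-written Levenshtein function stands in for an edit-distance dependency, and the token names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def subword_candidate_distance(prev_token, proposed, misspelt_word):
    """If the model proposes a '##x' continuation, compare the
    detokenised form (previous token + continuation) with the word."""
    if proposed.startswith("##"):
        detokenised = prev_token + proposed[2:]
        return levenshtein(detokenised, misspelt_word)
    return levenshtein(proposed, misspelt_word)

print(subword_candidate_distance("maj", "##or", "majour"))  # 1
```

Here the proposed continuation `##or` detokenises with the previous token to `major`, one edit away from `majour`, so it would survive a `max_edit_dist=2` filter.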

Additional context
#30 explosion/spaCy#3994

@R1j1t R1j1t added enhancement New feature or request help wanted Extra attention is needed labels Dec 21, 2020
letconex commented Dec 22, 2020

Concerning the logic: Is this a viable response?

>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.

R1j1t (Owner) commented Dec 23, 2020

That is not the desired response, but it is what the current logic produces. If you want to improve accuracy, please try passing a vocabulary file:

vocab_path (str, optional): Vocabulary file path to be used by the
model. Defaults to "".

This will help the model prevent false positives. Feel free to open a PR with a fix!
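One way to build such a file is to merge domain words into a copy of the model's vocab.txt, keeping the original line order so existing token ids stay intact. This is a hedged sketch; the paths and word list are placeholders:

```python
from pathlib import Path

def extend_vocab(model_vocab: Path, extra_words, out_path: Path) -> int:
    """Copy the model's vocab.txt and append user words that are
    missing, preserving the original line order (token ids)."""
    lines = model_vocab.read_text(encoding="utf-8").splitlines()
    known = set(lines)
    added = [w for w in extra_words if w not in known]
    out_path.write_text("\n".join(lines + added) + "\n", encoding="utf-8")
    return len(added)

# Illustrative usage with placeholder paths and words:
# extend_vocab(Path("vocab.txt"), ["mydomainword"], Path("vocab_ext.txt"))
```

The resulting file could then be supplied through the `vocab_path` option described above.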

kshitij12345 commented
One side effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Otherwise, I am not sure, but I think different languages might require different spell checkers to handle language-specific nuances.

R1j1t (Owner) commented Jan 9, 2021

As mentioned in the README

This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model.

So let's say you want to perform spell correction on a Japanese sentence:

  1. Provide a Japanese spaCy model: this will break the sentence into tokens. As this model is trained on Japanese, it knows the language's nuances (better than an English model would).
  2. Provide a Japanese BERT model (from the transformers models): this will provide candidate words for OOV words. Note that the vocabulary considered here is that of the transformer model, not the spaCy model.

Below is some code contributed to the repo for Japanese language:

import spacy
from contextualSpellCheck import ContextualSpellCheck

nlp = spacy.load("ja_core_news_sm")
checker = ContextualSpellCheck(
    model_name="cl-tohoku/bert-base-japanese-whole-word-masking",
    max_edit_dist=2,
)
nlp.add_pipe(checker)  # spaCy v2-style pipeline registration

doc = nlp("しかし大勢においては、ここような事故はウィキペディアの拡大には影響を及ぼしていない。")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)

contextualSpellCheck examples folder

I hope this answers your question @kshitij12345. Please feel free to share ideas or references if you think I have missed something here!

stale bot commented Feb 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added wontfix This will not be worked on and removed wontfix This will not be worked on labels Feb 8, 2021
stale bot commented Mar 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Mar 11, 2021
@stale stale bot closed this as completed Mar 18, 2021
@R1j1t R1j1t mentioned this issue Mar 31, 2021
@R1j1t R1j1t reopened this Aug 22, 2021
@stale stale bot removed the wontfix This will not be worked on label Aug 22, 2021
@R1j1t R1j1t added this to To do in Code Dashboard Aug 22, 2021
stale bot commented Sep 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix This will not be worked on label Sep 21, 2021
@R1j1t R1j1t removed the wontfix This will not be worked on label Sep 24, 2021
piegu commented Oct 5, 2021

The current logic of misspell identification relies on vocab.txt from the transformer model. For not-so-common words, tokenizers break them into subwords, and hence the entire original word might not be present in vocab.txt.

Hi @R1j1t,

First of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (a transformer model).

As I am searching for this kind of tool, I tested your CSC and can give the following feedback:

  1. Your CSC is a universal spell checker, as it is possible to download a spaCy and BERT model for a language other than English. For example, this is my code for using your CSC for Portuguese in a Colab notebook:
# Installation
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install contextualSpellCheck

# spaCy model in Portuguese
spacy_model = "pt_core_news_md" # 48MB, or "pt_core_news_sm" (20MB), or "pt_core_news_lg"  (577MB)
!python -m spacy download {spacy_model} 

# BERT model in Portuguese
model_name = "neuralmind/bert-base-portuguese-cased" # or "neuralmind/bert-large-portuguese-cased"

# Importation and instantiation of the spaCy model
import spacy
import contextualSpellCheck
nlp = spacy.load(spacy_model)

# Download BERT model and add contextual spellchecker to the spaCy model
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": model_name,
        "max_edit_dist": 2,
    },
);

# Sentence with errors ("milões" instead of "milhões")
sentence = "A receita foi de $ 9,4 milões em comparação com o ano anterior de $ 2,7 milões."

# Get sentence with corrections (if errors found by CSC)
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (True) A receita foi de $ 9,4 milhões em comparação com o ano anterior de $ 2,7 milhões.
  2. Your CSC is a unigram spell checker, as it uses the [MASK] token of a BERT model to replace a so-called misspelt word with a token from the BERT tokenizer vocab (see post). That means your CSC cannot correct a bigram error, for example (see the following example).
sentence = "a horta abdominal" # the correct sentence in Portuguese is "aorta abdominal"
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')

# (False) 
# the CSC did not find corrected words with an edit distance < max_edit_dist
  3. Your CSC is a word replacer: it substitutes non-vocab words with tokens from the BERT tokenizer vocab (if their edit distances are below max_edit_dist). That is the real issue, I think (i.e., using a BERT model). By relying on BERT models, I do not see how your CSC will be able to correct words rather than merely replace them. True, you can pass a very large vocab file that will allow it to detect most misspelt words, but as already said, your CSC will only be able to replace them with one token of the BERT tokenizer vocab (a token is not necessarily a word, since the WordPiece BERT tokenizer uses subwords as tokens). This means a "solution" would be to use fine-tuned BERT models with a gigantic vocabulary (in order to have whole words instead of subwords). Unfortunately, that kind of fine-tuning would require a huge corpus of text. And even then, your CSC would remain a unigram spell checker.

Could you consider exploring another type of transformer model, such as T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder and a GPT-like decoder) allowing input and output sentences of different lengths?

R1j1t (Owner) commented Oct 10, 2021

Hey @piegu, first of all I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help shape the logic! When I started this project, I wanted the library to be generalized across multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics. But lately I have been occupied with my day job, which has limited my contributions to contextualSpellCheck.

Regarding your 2nd point, I will admit it is something I did not know. As pointed out in the comment by sgugger:

For this task, you need to either use a different model (coded yourself as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance, if you want to mask all the tokens corresponding to one word (a technique called whole-word masking), what is typically done in training scripts is to replace all parts of one word by [MASK]. For pseudogene tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].

I would still like to depend on transformer models, as that adds multilingual support. I will try to experiment with your suggestions and also try to think of a solution myself.
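The whole-word-masking idea from the quoted comment can be sketched as follows. This is a toy illustration; the token list is assumed rather than produced by a real tokenizer:

```python
def mask_whole_words(tokens, target_word_index, mask_token="[MASK]"):
    """Group WordPiece tokens into words ('##' marks continuation
    pieces), then replace every piece of the target word with the
    mask token, as whole-word masking does."""
    words = []                      # list of [start, end) piece spans
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1][1] = i + 1    # extend the current word's span
        else:
            words.append([i, i + 1])
    start, end = words[target_word_index]
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]

tokens = ["the", "pseudo", "##gene", "was", "found"]
print(mask_whole_words(tokens, 1))
# ['the', '[MASK]', '[MASK]', 'was', 'found']
```

Masking both pieces lets the model propose a full word in place of the subword pair, rather than predicting each piece independently.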

Hope you like the project. Feel free to contribute!

@R1j1t R1j1t added bug Something isn't working and removed enhancement New feature or request labels Oct 10, 2021
wanglc02 commented Feb 6, 2024

I noticed that part of the logic of misspell_identify is:

        misspell = []
        for token in docCopy:
            if (
                (token.text.lower() not in self.vocab)

Will changing token.text.lower() to token.lemma_.lower() improve accuracy? According to https://spacy.io/api/lemmatizer, "as of v3.0, the Lemmatizer is a standalone pipeline component that can be added to your pipeline, and not a hidden part of the vocab that runs behind the scenes. This makes it easier to customize how lemmas should be assigned in your pipeline." So the __contains__ check against self.vocab will not convert a token to its base form; we have to get the base form via token.lemma_.
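The suggestion can be illustrated with a self-contained toy example. Plain Python stand-ins are used here; a real pipeline would get lemmas from spaCy's token.lemma_ attribute rather than this hand-written lemma table:

```python
# Toy vocabulary and lemma table; in the real component the vocabulary
# comes from the transformer model's vocab.txt and lemmas from spaCy.
VOCAB = {"run", "walk", "mistake"}
LEMMAS = {"running": "run", "walked": "walk", "mistakes": "mistake"}

def is_misspelt_text(token):
    """Current behaviour: surface-form lookup only."""
    return token.lower() not in VOCAB

def is_misspelt_lemma(token):
    """Proposed behaviour: fall back to the token's base form."""
    lemma = LEMMAS.get(token.lower(), token.lower())
    return lemma not in VOCAB

print(is_misspelt_text("running"))   # True  (false positive)
print(is_misspelt_lemma("running"))  # False (lemma 'run' is in vocab)
```

Checking the lemma avoids flagging inflected forms whose base form is in the vocabulary, though genuinely unknown words are still caught.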
