Cleanlab on NER Data #256

Answered by jwmueller
Giriteja asked this question in Q&A
May 11, 2022 · 2 comments · 3 replies

Here's an outline of how the existing Cleanlab v2.0.0 repo can be used for token classification tasks.

Basic Idea: Treat the labels and the model's predictions at each token as if they were labels & predictions for independent training examples (ignoring which document each token/label comes from). Then just run regular cleanlab as if this were a multiclass classification task (with each document broken up into many separate examples, one per token). So when running cleanlab's find_label_issues(labels, pred_probs) and get_label_quality_scores(labels, pred_probs), the labels should be for every token in your entire corpus, and the pred_probs should be the corresponding class-probabilities estimated by your model for each of those tokens.
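
To make this concrete, here is a minimal sketch. The corpus below is simulated with random data purely so the snippet runs on its own; in practice, labels would be your integer-encoded token tags and pred_probs would be the (ideally out-of-sample, e.g. cross-validated) class-probabilities your NER model predicts for each token.

```python
import numpy as np
from cleanlab.filter import find_label_issues
from cleanlab.rank import get_label_quality_scores

rng = np.random.default_rng(0)
num_classes = 3  # e.g. a tiny tag set like O / B-ENT / I-ENT (illustrative only)

# Hypothetical corpus: per-document token labels and per-token predicted
# class-probabilities from the model.
doc_lengths = [40, 55, 30, 75]
per_doc_labels = [rng.integers(0, num_classes, size=n) for n in doc_lengths]
per_doc_pred_probs = []
for doc_labels in per_doc_labels:
    # Simulate mostly-confident predictions that tend to agree with the labels.
    probs = rng.dirichlet(np.ones(num_classes) * 0.5, size=len(doc_labels))
    probs[np.arange(len(doc_labels)), doc_labels] += 2.0
    probs /= probs.sum(axis=1, keepdims=True)
    per_doc_pred_probs.append(probs)

# Flatten tokens across all documents, ignoring document boundaries,
# so each token is treated as its own multiclass example.
labels = np.concatenate(per_doc_labels)
pred_probs = np.vstack(per_doc_pred_probs)

# Boolean mask of tokens whose given label looks likely to be wrong.
issue_mask = find_label_issues(labels, pred_probs)

# Per-token label-quality scores in [0, 1]; lower means more suspect.
scores = get_label_quality_scores(labels, pred_probs)

# Map flat token indices back to (document, token) positions for review.
doc_idx = np.concatenate([np.full(n, i) for i, n in enumerate(doc_lengths)])
tok_idx = np.concatenate([np.arange(n) for n in doc_lengths])
for i in np.flatnonzero(issue_mask):
    print(f"doc {doc_idx[i]}, token {tok_idx[i]}: quality score = {scores[i]:.3f}")
```

The only requirement is that len(labels) == pred_probs.shape[0] == the total number of tokens in the corpus; which document each token came from only matters afterward, when you map the flagged flat indices back to (document, token) positions to inspect.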
