Releases: cleanlab/cleanlab
v2.6.5
What's Changed
- Add end-to-end tests at the end of Datalab quickstart tutorial by @allincowell in #1118
- Centralize existing functionality for constructing and correcting knn graphs in a separate module by @elisno in #1117, #1119, #1129
- Optimize multiannotator.py for performance by @gogetron in #1077
- Optimize value_counts function for performance improvement with missing classes by @gogetron in #1073
- Improve test coverage for setting confident joint in CleanLearning by @elisno in #1123
- Switch from np.isnan to pd.isna for null value check by @gogetron in #1096
- Update pip install instruction in object detection tutorial by @elisno in #1126
- Refine handling of underperforming_group issue type by @gogetron in #1099
- Improve compatibility with sklearn 1.5 by removing the deprecated multi_class argument in LogisticRegression by @elisno in #1124
- Display exact duplicate sets dynamically in tabular tutorial by @nelsonauner in #1128
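The switch from np.isnan to pd.isna matters mostly for non-float data. The snippet below is an illustrative sketch only (the array is made up and the code is not taken from cleanlab): np.isnan raises a TypeError on object-dtype arrays, whereas pd.isna treats both None and NaN as null.

```python
import numpy as np
import pandas as pd

# Hypothetical column with mixed types, as often found in tabular data.
values = np.array([1.5, None, "text", np.nan], dtype=object)

# np.isnan only supports numeric dtypes and fails on object arrays.
try:
    np.isnan(values)
    isnan_supported = True
except TypeError:
    isnan_supported = False

# pd.isna works element-wise on object arrays and treats None and NaN as null.
null_mask = pd.isna(values)
```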
New Contributors
- @allincowell made their first contribution in #1118
- @nelsonauner made their first contribution in #1128
Full Changelog: v2.6.4...v2.6.5
v2.6.4
What's Changed
- Various performance optimizations and test improvements by @gogetron in #1064, #1067, #1079, #1087, #1095, #1106, #1107
- Restructured text and tabular classification tutorials into CleanLear… by @mturk24 in #1066
- user-facing cleanlab.datavaluation module by @coding-famer in #1050
- fix typo in datalab issue types by @coding-famer in #1085
- Add kwargs to functions that call plt.show() by @mturk24 in #1084; by @jwmueller in #1088
- update tutorials by @jwmueller in #1089, #1090, #1091
- Refine type hints by @desboisGIT in #1101; by @elisno in #1086
- Updated datalab issue type description for non iid issue by @mturk24 in #1102
- Remove unsqueeze call in image tutorial by @elisno in #1108
- Temporarily Revert to macOS 12 in CI due to Incompatibility with Python 3.8 and 3.9 by @elisno in #1110
- Fix numerical instability with Euclidean distance metric by @elisno in #1113
- avoid sensitive divisions by @jwmueller in #1114; by @elisno in #1116
- All identical datasets tests by @elisno in #1115
New Contributors
- @gogetron made their first contribution in #1064
- @desboisGIT made their first contribution in #1101
Full Changelog: v2.6.3...v2.6.4
v2.6.3 - Enhanced scores for outliers and near-duplicates
This release is non-breaking when upgrading from v2.6.2.
What's Changed
- Updated image_key documentation by @sanjanag in #1048
- Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in #1056
- Add warning message about TensorFlow compatibility to docs by @elisno in #1057
Full Changelog: v2.6.2...v2.6.3
v2.6.2
v2.6.1 -- Refined Regression Score and Fixes
This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:
- The label quality score in the cleanlab.regression module is improved to be more human-readable.
- This only involves rescaling the scores to display a more human-interpretable range of scores, without affecting how your data points are ranked within a dataset according to these scores.
- Better address some edge-cases in Datalab.get_issues().
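To see why a rescaling can change the displayed range without changing the ranking, consider any strictly increasing transformation of the scores. The sketch below uses a made-up transformation (a square root, chosen purely for illustration and not the one used in cleanlab.regression) on hypothetical scores, and checks that the ordering of data points is unchanged.

```python
import numpy as np

# Hypothetical raw label quality scores; lower means more likely an issue.
raw_scores = np.array([0.002, 0.35, 0.0004, 0.9, 0.07])

# Any strictly increasing map preserves ordering; sqrt is just an example.
rescaled_scores = np.sqrt(raw_scores)

# Ranking from most to least severe is identical before and after rescaling.
ranking_before = np.argsort(raw_scores)
ranking_after = np.argsort(rescaled_scores)
```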
What's Changed
- Readme updates by @jwmueller in #1030, #1031, #1039; @elisno in #1040
- Adjust the range of regression label quality scores by @huiwengoh in #1032
- Misc fixes of get_issues method by @elisno in #1025, #1026, #1028
- Support features as input for data valuation check in Datalab by @elisno in #1023
- Fix/clarify docs by @mturk24 in #1029; @elisno in #1024, #1037
- CI/CD changes by @elisno in #1036
New Contributors
Full Changelog: v2.6.0...v2.6.1
v2.6.0 -- Elevating Data Insights: Comprehensive Issue Checks & Expanded ML Task Compatibility
This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.
Enhancements to Datalab
In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:
- Identify null values in your dataset.
- Detect class_imbalance.
- Highlight an underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others.
See our FAQ for more information on how to provide pre-defined groups for this issue type.
Additionally, Datalab can now optionally:
- Assess the value of data points in your dataset using KNN-Shapley scores as a measure of data_valuation.
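For intuition about what a KNN-Shapley score measures, here is a minimal sketch of the standard closed-form recursion for an unweighted KNN classifier and a single test point. It is not cleanlab's implementation (which may differ in how it builds on a KNN graph and scales the result); the function name and toy data are hypothetical.

```python
import numpy as np

def knn_shapley_single(train_X, train_y, test_x, test_y, k=3):
    """Exact KNN-Shapley values of each training point for one test point."""
    n = len(train_y)
    # Sort training points from nearest to farthest from the test point.
    order = np.argsort(np.linalg.norm(train_X - test_x, axis=1), kind="stable")
    match = (train_y[order] == test_y).astype(float)

    values_sorted = np.zeros(n)
    # Farthest point: base case of the recursion.
    values_sorted[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point back to the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of point i in the sorted order
        values_sorted[i] = values_sorted[i + 1] + (
            (match[i] - match[i + 1]) / k * min(k, rank) / rank
        )

    # Undo the sort so values line up with the original training order.
    values = np.empty(n)
    values[order] = values_sorted
    return values

# Toy data: the last training point sits next to the test point but has a different label.
train_X = np.array([[0.0], [0.1], [0.2], [1.0], [0.05]])
train_y = np.array([0, 0, 0, 1, 1])
values = knn_shapley_single(train_X, train_y, test_x=np.array([0.0]), test_y=0, k=3)
```

In this toy case the nearby point with the disagreeing label receives the lowest (negative) value, which is the kind of signal a data valuation check can surface.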
If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!
Expanded Datalab Support for New ML Tasks
With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the task parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.
from cleanlab import Datalab
lab = Datalab(..., task="regression")
The tasks currently supported are:
- classification (default): Includes all previously supported issue-checking capabilities based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
- regression (new):
- Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
- Find other issues utilizing features or a knn_graph.
- multilabel (new):
- Detect label errors in multilabel classification datasets using pred_probs exclusively. Explore the updated capabilities in our multilabel tutorial.
- Find various other types of issues based on features or a knn_graph.
Improved Object Detection Dataset Exploration
New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our object detection tutorial.
Other Major Improvements
- Rescaled Near Duplicate and Outlier Scores:
- Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
- Consistency in counting label issues:
- cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues().
- Improved handling of non-iid issues:
- The non-iid issue check in Datalab now handles pred_probs as input.
- Better reporting in Datalab:
- Simplified Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
- Enhanced Handling of Binary Classification Tasks:
- Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving the handling of binary classification tasks.
- Experimental Functionality:
- cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.
New Contributors
We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:
- @smttsp made their first contribution in #867
- @abhijitpal1247 made their first contribution in #856
- @01PrathamS made their first contribution in #893
- @mglowacki100 made their first contribution in #796
- @gibsonliketheguitar made their first contribution in #831
- @kylegallatin made their first contribution in #885
- @ryansingman made their first contribution in #919
- @R-Peleg made their first contribution in #948
Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.
Change Log
Significant changes in this release include:
- Update FAQ section in docs by @tataganesh in #869; @elisno in #913
- Improve Object Detection module by @Steven-Yiran in #840, #877; @aditya1503 in #883, #969, #968
- Clearer documentation/tutorials/readme by @jwmueller in #851, #931, #981, #983, #1001, #978, #994, #1010; @01PrathamS in #893; @elisno in #878, #1007, #992, #1015, #1016; @huiwengoh in #984; @sanjanag in #936; @tataganesh in #916; @ulya-tkch in #954;
- CI updates by @aditya1503 in #864; @elisno in #879, #961, #963, #965, #1008, #975, #1011, #1012, #1013, #1014; @jwmueller in #852, #865; @tataganesh in #900; @anishathalye in #956; @sanjanag in #1009
- Docs system updates by @elisno in #880, #881, #958, #959, #960, #964
- Add Null Issue Manager by @abhijitpal1247 in #856; @tataganesh in #927, #917
- Add Data Valuation Issue Manager by @coding-famer in #850, #925
- Extend non-iid issue check to run if only pred_probs are provided by @abhijitpal1247 in #857; @tataganesh in #896, #897
- Add Underperforming Group Issue Manager by @tataganesh in #838, #907; @elisno in #990
- Add Class Imbalance issue type to Datalab defaults by @tataganesh in #912, #933; @jwmueller in #924, #934; @elisno in #940
- Add regression task to Datalab by @mglowacki100 in #796; @elisno in #902
- Add multilabel task to Datalab by @tataganesh in #929
- 702 - Shorten Refs of classes and functions in Docs by @gibsonliketheguitar in #831
- Update near duplicate issues and sets by @ryansingman in #919; @elisno in #8...
v2.5.0 -- All major ML tasks now supported
This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).
New ML tasks supported
Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:
- regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
- object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
- image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.
Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).
If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!
Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor, check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/
Improvements to Datalab
Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.
This release introduces major improvements and new functionalities in Datalab that include the ability to:
- Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
- Detect label issues even without pred_probs from a ML model (you can instead just provide features).
- Flag rare classes in imbalanced classification datasets.
- Audit unlabeled datasets.
Other major improvements
- 50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
- Out-of-Distribution detection based on pred_probs via the GEN algorithm, which is particularly effective for datasets with tons of classes.
- Many of the methods across the package to find label issues now support a low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
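As a rough illustration of the idea behind GEN, the sketch below scores each example by the negative generalized entropy of its largest predicted probabilities. This is a simplified sketch based on our understanding of the published formulation, not cleanlab's code: the function name is hypothetical, the gamma and top_m values are illustrative choices, and cleanlab's implementation may scale or orient its scores differently.

```python
import numpy as np

def gen_ood_scores(pred_probs, gamma=0.1, top_m=100):
    """Negative generalized entropy per row; lower values suggest more OOD-like examples."""
    # Keep only the top_m largest probabilities in each row.
    m = min(top_m, pred_probs.shape[1])
    top_probs = np.sort(pred_probs, axis=1)[:, -m:]
    # Generalized entropy: sum_j p_j^gamma * (1 - p_j)^gamma.
    gen_entropy = np.sum(top_probs**gamma * (1.0 - top_probs) ** gamma, axis=1)
    return -gen_entropy

pred_probs = np.array([
    [0.98, 0.01, 0.01],      # confident prediction
    [1 / 3, 1 / 3, 1 / 3],   # maximally uncertain prediction
])
scores = gen_ood_scores(pred_probs)
```

With these inputs the confident row receives a higher score than the uniform row, so the uncertain example would be ranked as more out-of-distribution.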
New Contributors
Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our github or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:
- @gordon-lim made their first contribution in #746
- @tataganesh made their first contribution in #751
- @vdlad made their first contribution in #677
- @axl1313 made their first contribution in #798
- @coding-famer made their first contribution in #800
Change Log
- New feature: Label error detection in regression datasets by @krmayankb in #572; by @huiwengoh in #830
- New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in #676, #739, #745, #770, #779, #807, #833; by @aditya1503 in #750, #804
- New feature: Label error detection in segmentation datasets by @vdlad in #677; by @ulya-tkch in #754, #756, #759, #772; by @elisno in #775
- New feature: CleanVision to detect low-quality images by @sanjanag in #679, #797
- New image quickstart tutorial that uses Datalab by @sanjanag in #795
- Datalab code refactoring by @elisno in #803, #783, #793, #729
- Include non-IID detection in set of default Datalab issue types by @elisno in #723
- Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in #760
- Add imbalance issue type to Datalab by @tataganesh in #758, #828
- Catch specific exception for knn in Datalab issue managers by @tataganesh in #825
- Make plots smaller for datalab tutorials by @tataganesh in #751
- 50x speedup and other improvements in multiannotator module by @huiwengoh in #821, #784; by @ulya-tkch in #827
- ENH: make clipping unnecessary for entropy by @DerWeh in #703
- Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in #749
- CleanLearning code improvements by @huiwengoh in #724; by @jwmueller in #744
- Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in #761
- Expose low memory option for finding label issues by @tataganesh in #791, #822
- Add GEN OOD-detection algorithm by @coding-famer in #800
- Unify softmax implementations throughout package by @elisno in #826
- Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in #746
- Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in #725; by @jwmueller in #726, #734, #741, #743, #766, #832, #799, #752, #841, #816, #755, #731, #753, #845, #835, #847
- CI and documentation system updates by @anishathalye in #742, #768, #769; by @jwmueller in #837; by @huiwengoh in #788, #757, #738, #794; by @sanjanag in #843; by @ulya-tkch in #777; by @elisno in #802; by @axl1313 in #798
- Improved tests by @huiwengoh in #778, #763
Full Changelog: v2.4.0...v2.5.0
v2.4.0 -- One line of code to detect all sorts of dataset issues
Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.
Introducing Datalab
Now we've added a unified platform called Datalab for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc:
from cleanlab import Datalab
lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report() # summarize the issues found, how severe they are, and other useful info about the dataset
Follow our blog to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue Datalab can detect is provided in this guide, but we recommend first starting with the tutorials which show you how easy it is to run on your own dataset.
Datalab can be used to do things like find label issues with string class labels (whereas the prior find_label_issues() method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab is also using these internally to detect data issues.
Our goal is for Datalab to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some Datalab APIs may change in subsequent package versions -- as noted in the documentation.
You can easily run the issue checks in Datalab together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into Datalab. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.
Revamped Tutorials
We've updated some of our existing tutorials with more interesting datasets and ML models. Regarding the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions to detect issues in these same datasets with Datalab instead (see Datalab Tutorials). This should help existing users quickly ramp up on using Datalab to see how much more powerful this comprehensive data audit can be.
Improvements for Multi-label Classification
To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the cleanlab.multilabel_classification module. So please start there rather than specifying the multi_label=True flag in certain methods outside of this module, as that option will be deprecated in the future.
Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset module.
While moving methods to the cleanlab.multilabel_classification module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.
Backwards incompatible changes
Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway based on some bugs that have been fixed). Here are changes you must make in your code for it to work with newer cleanlab versions:
cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True) → cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)
The multi_label=False/True argument will be removed in the future from the former method.
cleanlab.dataset.find_overlapping_classes(..., multi_label=True) → cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)
The multi_label=False/True argument will be removed in the future from the former method. The returned DataFrame is slightly different; please refer to the new method's documentation.
cleanlab.dataset.overall_label_health_score(..., multi_label=True) → cleanlab.multilabel_classification.dataset.overall_label_health_score(...)
The multi_label=False/True argument will be removed in the future from the former method.
cleanlab.dataset.health_summary(..., multi_label=True) → cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)
The multi_label=False/True argument will be removed in the future from the former method.
There are no other backwards incompatible changes in the package with this release.
Deprecated workflows
We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:
cleanlab.filter.find_label_issues(..., multi_label=True) → cleanlab.multilabel_classification.filter.find_label_issues(...)
The multi_label=False/True argument will be removed in the future from the former method.
from cleanlab.multilabel_classification import get_label_quality_scores → from cleanlab.multilabel_classification.rank import get_label_quality_scores
Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification module.
Change Log
- readme updates by @jwmueller in #659, #660, #713
- CI updates (by @sanjanag in #701; by @huiwengoh in #671; by @elisno in #695, #706)
- Documentation updates (by @jwmueller in #669, #710, #711, #716, #719, #720; by @huiwengoh in #714, #717; by @elisno in #678, #684)
- Documentation: use default rules for shorter, more readable links by @DerWeh in #700
- Added installation instructions for package extras by @sanjanag in #697
- Pass confident joint computed in CleanLearning to filter.find_label_issues by @huiwengoh in #661
- Add Example codeblock to the docstrings of important functions in the dataset module by @Steven-Yiran in #662, #663, #668
- Remove batch size check in label_issues_batched by @huiwengoh in #665
- adding multilabel dataset issue summaries by @aditya1503 in #657
- move int2onehot, onehot2int to top of multilabel tutorial by @jwmueller in #666
- Update softmax to more stable variant by @ulya-tkch in #667
- Revamp text and tabular tutorial by @huiwengoh in #673, #693
- allow for kwargs in token find_label_issues by @jwmueller in #686
- Update numpy.typing import and annotations by @elisno in #688
- Standardize documentation and simplify code for outliers by @DerWeh in #689
- Extract function for computing OOD scores from distances by @elisno in #664
- Introduce Datalab by @elisno in #614
- Introduce NonIID issue type by @jecummin in #614
- Further Datalab updates by @elisno in #680, #683, #687, #690, #691, #699, #705, #709, #712
- Add descriptions of issues that Datalab can detect by @elisno in #682
- Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by @jwmueller in #692
- Improve NonIID issue checks by @elisno in #694, #707
New Contributors
- @Steven-Yiran made th...
v2.3.1 -- Better handling of some edge-cases
This minor release primarily just improves the user experience when encountering various edge-cases in:
- find_label_issues method
- find_overlapping_classes method
- cleanlab.multiannotator module
This release is non-breaking when upgrading from v2.3.0. Two noteworthy updates in the cleanlab.multiannotator module include a:
- better tie-breaking algorithm inside of get_majority_vote_label() to avoid diminishing the frequency of rarer classes (this only plays a role when pred_probs are not provided).
- better user-experience for get_active_learning_scores() to support scoring only unlabeled data or only labeled data. More of the arguments can now be None.
What's Changed
- Readme updates by @jwmueller in #645, #650, #656
- describe activelab in the documentation by @jwmueller in #648
- Added clipping to address issue #639 by @ulya-tkch in #647
- Fix for not specifying labels in find_overlapping_issues by @huiwengoh in #652
- Bug fixes + improvements to multiannotator module by @huiwengoh in #654
- FAQ question/answer on handling label errors in train vs test data by @jwmueller in #655
Full Changelog: v2.3.0...v2.3.1
v2.3.0 -- Extending cleanlab beyond label errors into a complete library for data-centric AI
Cleanlab was originally open-sourced as code to accompany a research paper on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks such as: entity recognition, image/document tagging, data labeled by multiple annotators, etc. While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.
With the newest release, cleanlab v2.3 can now automatically:
- find mislabeled data + train robust models
- detect outliers and out-of-distribution data
- estimate consensus + annotator-quality for multi-annotator datasets
- suggest which data is best to (re)label next
As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc). We have user-friendly 5min tutorials to get started with any of the above objectives and easily improve your data!
We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in scientific papers. Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.
Highlights of what’s new in 2.3.0:
We have added new functionality for active learning and easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory using new methods added to help users with massive datasets. This release is non-breaking when upgrading from v2.2.0 (except for certain methods in cleanlab.experimental that have been moved).
Active Learning with ActiveLab
For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. ActiveLab is a new algorithm invented by our team that automatically answers the question: which new data should I label or which of my current labels should be checked again? ActiveLab is highly practical — it runs quickly and works with: any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).
Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected).
from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
The batch of examples with the lowest scores are those that are most informative to collect an additional label for (scores between labeled vs unlabeled pool are directly comparable). You can either have a new annotator label the batch of examples with lowest scores, or distribute them amongst your previous annotators as is most convenient. ActiveLab is also effective for: standard active learning where you collect at most one label per example (no re-labeling), as well as active label cleaning (with no unlabeled pool) where you only want to re-label examples to ensure 100% correct consensus labels (with the least amount of re-labeling).
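Because the scores from the two pools are directly comparable, choosing the next batch amounts to ranking both pools together and taking the lowest scores. The helper below is a hypothetical sketch (not part of cleanlab) that does this with NumPy, using made-up scores in place of the real output of get_active_learning_scores().

```python
import numpy as np

def select_batch_to_label(scores_labeled_pool, scores_unlabeled_pool, batch_size):
    """Return indices (per pool) of the batch_size examples with the lowest scores."""
    num_labeled = len(scores_labeled_pool)
    # Scores from both pools are directly comparable, so rank them together.
    all_scores = np.concatenate([scores_labeled_pool, scores_unlabeled_pool])
    lowest = np.argsort(all_scores, kind="stable")[:batch_size]
    # Split the combined indices back into their original pools.
    relabel_idx = lowest[lowest < num_labeled]
    new_label_idx = lowest[lowest >= num_labeled] - num_labeled
    return relabel_idx, new_label_idx

# Made-up scores standing in for the output of get_active_learning_scores().
scores_labeled_pool = np.array([0.9, 0.2, 0.6])
scores_unlabeled_pool = np.array([0.1, 0.8, 0.3, 0.7])
relabel_idx, new_label_idx = select_batch_to_label(
    scores_labeled_pool, scores_unlabeled_pool, batch_size=3
)
```

Here relabel_idx points at already-labeled examples worth sending for another annotation, and new_label_idx points at unlabeled examples worth labeling for the first time.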
Get started running ActiveLab with our tutorial notebook from our repo that has many other examples.
KerasWrapper
We've introduced one-line wrappers for TensorFlow/Keras models that enable you to use TensorFlow models within scikit-learn workflows with features like Pipeline, GridSearch and more. Just change one line of code to make your existing TensorFlow/Keras model compatible with scikit-learn’s rich ecosystem! All you have to do is swap out: keras.Model → KerasWrapperModel, or keras.Sequential → KerasWrapperSequential. Imported from cleanlab.models.keras, the wrapper objects have all the same methods of their keras counterparts, plus you can use them with tons of handy scikit-learn methods.
Resources to get started include:
- Blogpost and Jupyter notebook demonstrating how to make a HuggingFace Transformer (BERT model) sklearn-compatible.
- Jupyter notebook showing how to fit these sklearn-compatible models to a Tensorflow Dataset.
- Revamped tutorial on label errors in text classification data, which has been updated to use this new wrapper.
Computational improvements for detecting label issues
Through extensive optimization of our multiprocessing code (thanks to @clu0), find_label_issues has been made ~10x faster on Linux machines that have many CPU cores.
For massive datasets, find_label_issues may require too much memory to run on your machine. We've added new methods in cleanlab.experimental.label_issues_batched that can compute label issues with far less memory via mini-batch estimation. You can use these with billion-scale memmap arrays or Zarr arrays like this:
import zarr
from cleanlab.experimental.label_issues_batched import find_label_issues_batched

labels = zarr.convenience.open("LABELS.zarr", mode="r")
pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=100000)
By choosing a sufficiently small batch_size, you should be able to handle pretty much any dataset (set it as large as your memory will allow for best efficiency). With default arguments, the batched methods closely approximate the results of the option: cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")
This and filter_by="low_normalized_margin" are new find_label_issues() options added in v2.3, which require less computation and still output accurate estimates of the label errors.
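For readers unfamiliar with the two quantities named in these options, the sketch below computes them one mini-batch at a time: self-confidence is the predicted probability of the given label, and the margin is that probability minus the largest probability among the other classes. This is a hypothetical helper for illustration only; it does not reproduce how find_label_issues() or the batched methods decide which examples to flag, and cleanlab may rescale the margin.

```python
import numpy as np

def batched_label_scores(labels, pred_probs, batch_size):
    """Compute self-confidence and margin scores one mini-batch at a time."""
    n = len(labels)
    self_confidence = np.empty(n)
    margin = np.empty(n)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Slicing loads only this batch, which also works for memmap-style arrays.
        probs = np.asarray(pred_probs[start:stop], dtype=float)
        given = np.asarray(labels[start:stop])
        rows = np.arange(stop - start)
        p_given = probs[rows, given]
        # Largest probability among the classes other than the given label.
        others = probs.copy()
        others[rows, given] = -np.inf
        self_confidence[start:stop] = p_given
        margin[start:stop] = p_given - others.max(axis=1)
    return self_confidence, margin

labels = np.array([0, 1, 2, 0])
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.5, 0.3],
])
self_conf, margin = batched_label_scores(labels, pred_probs, batch_size=2)
```

Examples with low self-confidence or a negative margin (the second and fourth rows here) are the ones whose given label disagrees most with the model's prediction.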
Other changes to be aware of
- Like all major ML frameworks, we have dropped support for Python 3.6.
- We have moved some particularly useful models (fasttext, keras) from cleanlab.experimental -> cleanlab.models.
Change Log
- Shorten tutorial titles in docs for readability by @ulya-tkch in #553
- Swap CI workflow to actions by @huiwengoh in #560
- Remove .pylintrc by @elisno in #564
- Tutorial fixes by @huiwengoh in #565
- Fix typo in CONTRIBUTING.md by @ulya-tkch in #566
- Multiannotator Active Learning Support by @huiwengoh in #538
- multiannotator explanation improvements by @jwmueller in #570
- Specify Sphinx to order functions by source code order by @huiwengoh in #571
- Fix example in ema docstring by @elisno in #563, #573
- update paper list and applications beyond label error detection in readme by @jwmueller in #574, #580
- Drop Python 3.6 support (by @jwmueller in #558, #577; by @anishathalye in #562; by @krmayankb in #578; by @sanjanag in #579)
- add maximum line length by @cgnorthcutt in #583
- Update github actions by @ulya-tkch in #589
- Revamp text tutorial by @huiwengoh in #584
- clarify thresholding in issues_from_scores by @jwmueller in #582
- Remove temp scaling from single annotator case by @huiwengoh in #590
- Update docs dependencies by @huiwengoh in #593
- Use euclidean distance for identifying outliers for lower dimensional features by @ulya-tkch in #581
- changing copyright year 2017-2022 to 2017-2023 by @aditya1503 in https://github.com/cleanlab/cleanl...