Add end-to-end tests at the end of Datalab quickstart tutorial #1044

Open
jwmueller opened this issue Mar 8, 2024 · 0 comments

This tutorial has no tests for some reason: https://raw.githubusercontent.com/cleanlab/cleanlab/master/docs/source/tutorials/datalab/datalab_quickstart.ipynb

If you look at the raw version of most of our other tutorials, e.g.:
https://raw.githubusercontent.com/cleanlab/cleanlab/master/docs/source/tutorials/image.ipynb

You'll notice they have a final hidden cell that is full of assert statements:
[Screenshot: the final hidden cell of assert statements in a tutorial notebook]

This hidden cell is essentially an end-to-end test of the code.

Goal: add a similar hidden cell to the Datalab quickstart tutorial.
The hidden cell should have asserts which check that:

  1. the Jaccard similarity between the data points detected as is_label_issue = True and the actual known mislabeled data points in the dataset is > 0.9 (or another sufficiently high threshold)

  2. roc_auc_score(Z, label_quality_scores) > 0.9 (or another sufficiently high threshold), where Z is a ground-truth array with value 1 if the data point is correctly labeled and value 0 if it is truly mislabeled. This assert checks that the label quality scores rank the data appropriately.

  3. the Jaccard similarity between the data points detected as is_XYZ_issue = True and the actual known instances of issue XYZ in the dataset is > 0.9 (or another sufficiently high threshold).

Here XYZ = outlier, near duplicate, etc.
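
The first three checks could be sketched roughly as below. This is only an illustration on hypothetical toy data, not the tutorial's actual arrays: `jaccard_similarity` and `auroc` are made-up helper names, and `auroc` is a small pure-Python stand-in for scikit-learn's `roc_auc_score(y_true, y_score)` so the snippet runs without extra dependencies. In the real hidden cell, the flags and scores would presumably come from `lab.get_issues("label")`, and the ground-truth indices from however the tutorial injects its noisy labels.

```python
def jaccard_similarity(detected, known):
    """Jaccard similarity |A & B| / |A | B| between two collections of indices."""
    detected, known = set(detected), set(known)
    if not detected and not known:
        return 1.0  # both empty: treat as perfect agreement
    return len(detected & known) / len(detected | known)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5).

    Stand-in for sklearn.metrics.roc_auc_score(y_true, y_score).
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("y_true must contain both 0s and 1s")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data standing in for the tutorial's Datalab results.
is_label_issue = [False, True, False, True, False, False]
label_quality_scores = [0.95, 0.10, 0.80, 0.20, 0.90, 0.85]
true_error_idx = {1, 3}  # assumed ground-truth indices of mislabeled points

predicted_idx = [i for i, flag in enumerate(is_label_issue) if flag]
# Z = 1 if the point is correctly labeled, 0 if it is truly mislabeled.
Z = [0 if i in true_error_idx else 1 for i in range(len(is_label_issue))]

label_jaccard = jaccard_similarity(predicted_idx, true_error_idx)
label_auroc = auroc(Z, label_quality_scores)

assert label_jaccard > 0.9, f"label issue Jaccard too low: {label_jaccard}"
assert label_auroc > 0.9, f"label score AUROC too low: {label_auroc}"
```

The same `jaccard_similarity` helper could be reused for check 3 by swapping in the flags for each other issue type (outlier, near duplicate, etc.) and the matching ground-truth indices.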

  4. no other issue types beyond those expected were detected in this tutorial. Make sure this assert is forwards compatible. That is, if we add 3 new issue types to Datalab-defaults in the future, this same assert should be able to catch if any of these newly added issue types is suddenly detected in this tutorial.
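
One way to keep the last check forwards compatible is to list only the issue types that are expected, and fail on anything else with a nonzero count, rather than listing the types that should be absent. The sketch below assumes a plain dict mapping issue type to number of detected issues; the helper name `unexpected_issue_types`, the issue-type names, and the counts are all hypothetical.

```python
def unexpected_issue_types(num_issues_by_type, expected_types):
    """Return the issue types with at least one detected issue that are not expected."""
    return sorted(
        issue_type
        for issue_type, num_issues in num_issues_by_type.items()
        if num_issues > 0 and issue_type not in expected_types
    )


# Hypothetical: the only issue types this tutorial is supposed to surface.
expected_types = {"label", "outlier", "near_duplicate"}

# Hypothetical summary of detected issues (issue type -> number of issues).
summary = {
    "label": 11,
    "outlier": 6,
    "near_duplicate": 4,
    "non_iid": 0,
    "class_imbalance": 0,
}

unexpected = unexpected_issue_types(summary, expected_types)
assert unexpected == [], f"unexpected issue types detected: {unexpected}"

# A type added to the defaults later is caught without editing the assert.
future_summary = dict(summary, brand_new_issue=3)
flagged_in_future = unexpected_issue_types(future_summary, expected_types)
assert flagged_in_future == ["brand_new_issue"]
```

With Datalab, the dict could probably be built from the DataFrame returned by `lab.get_issue_summary()`, e.g. from its `issue_type` and `num_issues` columns; those column names should be checked against the installed cleanlab version.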