Add end-to-end tests at the end of Datalab quickstart tutorial #1044

jwmueller · 2024-03-08T04:21:15Z

You'll notice they have a final hidden cell that is full of assert statements:

This hidden cell is essentially an end-to-end test of the code.

Goal: add a similar hidden cell to the Datalab quickstart tutorial.
The hidden cell should have asserts which check that:

- The Jaccard similarity between the data points flagged with is_label_issue = True and the known mislabeled data points in the dataset is > 0.9 (or another sufficiently high threshold).
- roc_auc_score(Z, label_quality_scores) > 0.9 (or another sufficiently high threshold), where Z is a ground-truth array with value 1 if the data point is correctly labeled and 0 if it is truly mislabeled. This assert checks that the label quality scores rank the data appropriately.
- The Jaccard similarity between the data points flagged with is_XYZ_issue = True and the known instances of issue XYZ in the dataset is > 0.9 (or another sufficiently high threshold). Here XYZ = outlier, near duplicate, etc.
- No issue types beyond those expected were detected in this tutorial. Make sure this assert is forward compatible: if we add 3 new issue types to the Datalab defaults in the future, this same assert should catch any of those newly added issue types suddenly being detected in this tutorial.
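A rough sketch of what these checks could look like, using only plain Python so it runs on its own. All helper names (jaccard_similarity, auroc, unexpected_issue_types) and the toy data are hypothetical, not part of cleanlab. In the actual notebook the flags and scores would presumably come from lab.get_issues() and the per-type counts from lab.get_issue_summary(). The auroc helper computes the same value scikit-learn's roc_auc_score(Z, label_quality_scores) gives for binary labels.

```python
def jaccard_similarity(detected, known):
    """Jaccard similarity |A & B| / |A | B| between two collections of row indices."""
    detected, known = set(detected), set(known)
    union = detected | known
    if not union:
        return 1.0  # nothing detected and nothing expected: perfect agreement
    return len(detected & known) / len(union)


def auroc(is_correct, scores):
    """Probability that a correctly labeled point outscores a mislabeled one (ties count 0.5)."""
    pos = [s for s, z in zip(scores, is_correct) if z == 1]
    neg = [s for s, z in zip(scores, is_correct) if z == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def unexpected_issue_types(num_issues_by_type, expected_types):
    """Issue types with at least one detection that are not in the expected set."""
    return sorted(
        issue_type
        for issue_type, count in num_issues_by_type.items()
        if count > 0 and issue_type not in expected_types
    )


# Hypothetical toy data standing in for the tutorial's real outputs.
is_label_issue = [False, True, False, True, False, False]
label_quality_scores = [0.9, 0.1, 0.8, 0.2, 0.95, 0.7]
known_mislabeled = {1, 3}  # ground-truth indices of mislabeled points
Z = [0 if i in known_mislabeled else 1 for i in range(len(is_label_issue))]

detected = [i for i, flagged in enumerate(is_label_issue) if flagged]
assert jaccard_similarity(detected, known_mislabeled) > 0.9
assert auroc(Z, label_quality_scores) > 0.9

# Counts per issue type, including a type that was checked but found nothing.
num_issues_by_type = {"label": 2, "outlier": 1, "near_duplicate": 0, "non_iid": 0}
expected_types = {"label", "outlier", "near_duplicate"}
assert unexpected_issue_types(num_issues_by_type, expected_types) == []
```

Because the last check loops over whatever issue types appear in the summary instead of a hard-coded list, any issue type added to the defaults later is covered without editing the assert.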

jwmueller added the needs triage label Mar 8, 2024

allincowell mentioned this issue May 4, 2024

[1044] Added end-to-end tests at the end of Quickstart editorial #1118

Merged
