Turn near-duplicate score test into a property-based test #944

elisno · 2024-01-05T12:33:26Z

After updating the near-duplicate scores, a test was added to ensure that near-duplicate examples get worse (lower) scores than non-near-duplicates.
Right now, the test only runs on a small toy dataset. Turning it into a property-based test would be ideal.

So that the test doesn't take too long, it may be a good idea to make a different embedding strategy that:

1. Generates a random array.
2. Selects a handful of random examples to have near-duplicates. For each one, appends a new example that is the original example plus variable noise.
```
# append noisy copies of the selected rows to the existing feature array
x = np.vstack([x, x[sample_ids] + small_random_noise])
```
The embedding strategy should always generate "proper" near-duplicates that will get flagged. We don't want to test cases where there are no near-duplicates.
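
A minimal sketch of the two steps above, using only NumPy. The function name, its parameters, and the default `noise_scale` are hypothetical and not part of cleanlab. Whether the issue manager actually flags the noisy copies depends on its threshold, so the noise scale may need tuning.

```python
import numpy as np


def make_features_with_near_duplicates(
    seed, n_examples=50, n_features=5, n_duplicates=3, noise_scale=1e-6
):
    """Hypothetical helper: random features plus a few noisy copies of existing rows.

    Returns (features, source_ids, duplicate_ids).
    """
    rng = np.random.default_rng(seed)
    # Step 1: generate a random feature array.
    x = rng.normal(size=(n_examples, n_features))
    # Step 2: pick distinct rows that will receive a near-duplicate.
    sample_ids = rng.choice(n_examples, size=n_duplicates, replace=False)
    # Noise is assumed to be tiny compared with distances between random rows.
    small_random_noise = rng.normal(
        scale=noise_scale, size=(n_duplicates, n_features)
    )
    # Append the noisy copies to the existing feature array.
    x = np.vstack([x, x[sample_ids] + small_random_noise])
    duplicate_ids = np.arange(n_examples, n_examples + n_duplicates)
    return x, sample_ids, duplicate_ids
```

Because the generator is fully determined by `seed` and the size arguments, a property-testing library such as Hypothesis could draw those values (for example with `st.integers`) and call this helper inside the test, which keeps each generated case cheap.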

Below is the existing test case for comparison. We want to make sure that the issue scores created by the issue manager don't violate this property.

cleanlab/tests/datalab/issue_manager/test_duplicate.py

Lines 89 to 98 in 23af72a

        def test_scores_of_examples_with_issues_are_smaller_than_those_without(
            self, issue_manager, embeddings
        ):
            # TODO: Turn this into a property-based test
            issue_manager.find_issues(features=embeddings["embedding"])
            is_issue = issue_manager.issues["is_near_duplicate_issue"]
            scores = issue_manager.issues["near_duplicate_score"]
            max_issue_score = np.max(scores[is_issue])
            min_non_issue_score = np.min(scores[~is_issue])
            assert max_issue_score < min_non_issue_score
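
The property itself could be factored out so that it can be checked against many generated inputs. The sketch below is an assumption about how that might look; `issue_scores_are_separated` is a hypothetical name, not an existing cleanlab function. It rejects inputs where one group is empty, since `np.max` and `np.min` raise on empty arrays and the test is only meaningful when near-duplicates are present.

```python
import numpy as np


def issue_scores_are_separated(is_issue, scores):
    """Hypothetical check: every flagged score is below every unflagged score."""
    is_issue = np.asarray(is_issue, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if is_issue.shape != scores.shape:
        raise ValueError("is_issue and scores must have the same shape")
    # Both groups must be non-empty for the comparison to be meaningful.
    if not is_issue.any() or is_issue.all():
        raise ValueError("need at least one flagged and one unflagged example")
    max_issue_score = scores[is_issue].max()
    min_non_issue_score = scores[~is_issue].min()
    return bool(max_issue_score < min_non_issue_score)
```

In a property-based version of the test, the two columns read from `issue_manager.issues` would be passed to this check after calling `find_issues` on each generated feature array.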

Originally posted by @elisno in #943 (comment)

elisno added the help-wanted label Jan 5, 2024
