After updating the near-duplicate scores, a test was added to ensure that near-duplicate examples have worse scores than non-near-duplicates.
Right now, the test only works on a small, toy dataset. Turning it into a property-based test would be ideal.
To keep the test from taking too long, it may be a good idea to write a dedicated embedding strategy that:
- Generates a random feature array.
- Selects a handful of random examples to receive near-duplicates, and appends a noisy copy of each selected example to the array:

  ```python
  # append noisy copies of the selected examples to the existing feature array
  x = np.concatenate([x, x[sample_ids] + small_random_noise], axis=0)
  ```
The embedding strategy should always generate "proper" near-duplicates that will get flagged. We don't want to test cases where there are no near-duplicates.
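A minimal sketch of such a strategy, assuming the property-based test uses Hypothesis (the strategy name, array shapes, noise scale, and return values below are illustrative choices, not existing code in the test suite):

```python
import numpy as np
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays


@st.composite
def embeddings_with_near_duplicates(draw):
    """Generate a feature array that is guaranteed to contain near-duplicate pairs."""
    n_examples = draw(st.integers(min_value=10, max_value=50))
    n_features = draw(st.integers(min_value=2, max_value=10))
    x = draw(
        arrays(
            dtype=np.float64,
            shape=(n_examples, n_features),
            elements=st.floats(min_value=-10, max_value=10, width=64),
        )
    )

    # Pick a handful of examples to duplicate (at least one, so near-duplicates always exist).
    n_duplicated = draw(st.integers(min_value=1, max_value=5))
    sample_ids = draw(
        st.lists(
            st.integers(min_value=0, max_value=n_examples - 1),
            min_size=n_duplicated,
            max_size=n_duplicated,
            unique=True,
        )
    )

    # Append a noisy copy of each selected example to the existing feature array.
    small_random_noise = draw(
        arrays(
            dtype=np.float64,
            shape=(n_duplicated, n_features),
            elements=st.floats(min_value=-1e-6, max_value=1e-6, width=64),
        )
    )
    x = np.concatenate([x, x[sample_ids] + small_random_noise], axis=0)

    # Indices of the injected copies, so the test can check them directly.
    duplicate_ids = np.arange(n_examples, n_examples + n_duplicated)
    return x, sample_ids, duplicate_ids
```

Returning the indices of the injected copies lets the test assert directly that those rows are flagged and scored worse than the rest.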
Below is the existing test case for comparison. We want to make sure that the issue scores created by the issue manager don't violate this property.
(See cleanlab/tests/datalab/issue_manager/test_duplicate.py, lines 89 to 98 at commit 23af72a.)
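A property-based version of that test could then look roughly like the sketch below. The import path, the `Datalab` construction, and the `is_near_duplicate_issue` / `near_duplicate_score` column names are assumptions based on how the existing test exercises `NearDuplicateIssueManager`; they should be matched to whatever the current test file actually uses.

```python
import numpy as np
from hypothesis import given, settings

from cleanlab import Datalab
# Adjust this import to match the one used in test_duplicate.py.
from cleanlab.datalab.issue_manager.duplicate import NearDuplicateIssueManager


@settings(max_examples=25, deadline=None)
@given(embeddings_with_near_duplicates())  # hypothetical strategy sketched above
def test_near_duplicates_score_worse_than_others(generated):
    x, _sample_ids, injected_ids = generated

    # Minimal placeholder dataset; the labels are irrelevant for the near-duplicate check.
    labels = np.zeros(len(x), dtype=int).tolist()
    lab = Datalab(data={"labels": labels}, label_name="labels")

    issue_manager = NearDuplicateIssueManager(datalab=lab)
    issue_manager.find_issues(features=x)

    issues = issue_manager.issues
    flagged = issues["is_near_duplicate_issue"].to_numpy()
    scores = issues["near_duplicate_score"].to_numpy()

    # The injected near-duplicates must be flagged by the issue manager.
    assert flagged[injected_ids].all()

    # Property under test: every flagged (near-duplicate) example gets a worse
    # (lower) score than every example that is not flagged.
    if (~flagged).any():
        assert scores[flagged].max() < scores[~flagged].min()
```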
Originally posted by @elisno in #943 (comment)