Turn near-duplicate score test into a property-based test #944

Open
elisno opened this issue Jan 5, 2024 · 0 comments
Labels
help-wanted We need your help to add this, but it may be more challenging than a "good first issue"

Comments


elisno commented Jan 5, 2024

After updating the near-duplicate scores, a test was added to ensure that near-duplicate examples have worse (lower) scores than non-near-duplicates.
Right now, the test only runs on a small toy dataset. Turning it into a property-based test would be ideal.

To keep the test fast, it may be a good idea to write a separate embedding strategy that:

  1. Generates a random array.
  2. Selects a handful of random examples to have near-duplicates. For each selected example, appends a new example equal to the original plus a small, variable amount of noise:
    # append to existing feature array
    x[sample_ids] + small_random_noise
  3. Always generates "proper" near-duplicates that will get flagged. We don't want to test cases where there are no near-duplicates.
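The three steps above could be sketched with plain NumPy as follows. This is only an illustration under stated assumptions: the helper name `make_features_with_near_duplicates`, its parameters, and the default sizes and noise scale are hypothetical and not part of the project. Whether the appended rows actually get flagged depends on the issue manager's threshold, so `noise_scale` would need to be chosen to match it. In a property-based test, a framework such as Hypothesis could draw `seed` and the sizes instead of hard-coding them.

```python
import numpy as np


def make_features_with_near_duplicates(
    seed: int,
    n_examples: int = 50,
    n_features: int = 4,
    n_duplicates: int = 5,
    noise_scale: float = 1e-6,
):
    """Return (features, sample_ids, duplicate_ids).

    Hypothetical helper: builds a random feature array and appends
    noisy copies of a few randomly chosen rows.
    """
    if not 1 <= n_duplicates <= n_examples:
        # Step 3: never produce a dataset without near-duplicates.
        raise ValueError("n_duplicates must be between 1 and n_examples")
    rng = np.random.default_rng(seed)
    # Step 1: random base array.
    x = rng.normal(size=(n_examples, n_features))
    # Step 2: pick distinct rows and append slightly perturbed copies.
    sample_ids = rng.choice(n_examples, size=n_duplicates, replace=False)
    small_random_noise = rng.uniform(
        -noise_scale, noise_scale, size=(n_duplicates, n_features)
    )
    features = np.vstack([x, x[sample_ids] + small_random_noise])
    # Row indices of the appended near-duplicates.
    duplicate_ids = np.arange(n_examples, n_examples + n_duplicates)
    return features, sample_ids, duplicate_ids
```

Returning `sample_ids` and `duplicate_ids` alongside the features lets a test check that both the original rows and their noisy copies are the ones that end up flagged.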

Below is the existing test case for comparison. We want to make sure that the issue scores created by the issue manager don't violate this property.

def test_scores_of_examples_with_issues_are_smaller_than_those_without(
    self, issue_manager, embeddings
):
    # TODO: Turn this into a property-based test
    issue_manager.find_issues(features=embeddings["embedding"])
    is_issue = issue_manager.issues["is_near_duplicate_issue"]
    scores = issue_manager.issues["near_duplicate_score"]
    max_issue_score = np.max(scores[is_issue])
    min_non_issue_score = np.min(scores[~is_issue])
    assert max_issue_score < min_non_issue_score
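The property checked by the test above can also be written as a standalone predicate, which makes it easier to reuse on generated data. The sketch below is an assumption-laden illustration: the helper name `issue_scores_are_below_non_issue_scores` is hypothetical and not part of the project. It fails loudly when either group is empty, since `np.max` and `np.min` raise `ValueError` on an empty selection. That is one reason generated data should always contain near-duplicates.

```python
import numpy as np


def issue_scores_are_below_non_issue_scores(is_issue, scores) -> bool:
    """Check that every flagged example scores lower than every unflagged one.

    Hypothetical helper for illustration only.
    """
    is_issue = np.asarray(is_issue, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not is_issue.any():
        raise ValueError("no examples are flagged; the property cannot be checked")
    if is_issue.all():
        raise ValueError("all examples are flagged; the property cannot be checked")
    # Same comparison as the existing test: worst flagged score vs. best unflagged score.
    return bool(np.max(scores[is_issue]) < np.min(scores[~is_issue]))
```

Raising on an empty group, rather than returning `True` vacuously, keeps a faulty data generator from silently passing the test.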

Originally posted by @elisno in #943 (comment)

elisno added the help-wanted label on Jan 5, 2024