Optimize value_counts function for performance improvement with missing classes #1073

Merged
merged 4 commits into from
May 21, 2024

gogetron
Contributor

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of internal value_counts function

While working on token_classification with batches, I noticed that the value_counts function ranked around the top 30 functions when profiling. Because np.unique returns a sorted array, we can preallocate the output array and fill it directly instead of building several intermediate lists and converting them into an array; the direct fill is orders of magnitude faster than the list-based construction. With this PR, almost all of the time is spent in np.unique itself. The improvement is only noticeable when not all classes are present (mostly batches with a relatively high number of classes).
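
The preallocation idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `value_counts_prealloc` is hypothetical, it returns a flat 1-D array, and it assumes the labels are integers in `range(num_classes)`.

```python
import numpy as np


def value_counts_prealloc(labels, num_classes):
    """Hypothetical sketch: count integer labels, including absent classes.

    Assumes every label is an integer in range(num_classes).
    """
    labels = np.asarray(labels)
    # np.unique returns the distinct labels in sorted order with their counts.
    unique_classes, counts = np.unique(labels, return_counts=True)
    # Preallocate one slot per class; classes that never occur keep a count of 0.
    total_counts = np.zeros(num_classes, dtype=counts.dtype)
    # Integer labels double as indices, so the counts can be written in one step.
    total_counts[unique_classes] = counts
    return total_counts


rng = np.random.default_rng(0)
batch = rng.integers(250_000, size=10_000)
result = value_counts_prealloc(batch, num_classes=250_000)
```

Writing into a zero-filled array avoids building per-class Python lists, which is where the extra time goes when most classes are missing from a batch.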

To measure memory, I used the memory-profiler library. The benchmarking code is copied below. In addition, I sorted the imports in the modified files.

Code Setup

import numpy as np

from cleanlab.internal.util import value_counts

np.random.seed(0)
%load_ext memory_profiler

# Simulate a batch_size of 10_000 and a total of 250_000 classes
x = np.random.randint(250_000, size=10_000)

# Simulate a large dataset with most classes present.
y = np.random.randint(249_950, size=500_000_000)

Note: when calling value_counts on x, I measured memory only once, after the %timeit run. Measuring it inside every timing loop would print too many statements.
Current version

%timeit value_counts(x, num_classes=250_000)
%memit value_counts(x, num_classes=250_000)
# 48.1 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# peak memory: 4103.93 MiB, increment: 4.66 MiB
%%timeit 
%memit value_counts(y, num_classes=250_000)
# peak memory: 8465.91 MiB, increment: 4385.38 MiB
# peak memory: 8472.33 MiB, increment: 4391.65 MiB
# peak memory: 8372.47 MiB, increment: 4291.54 MiB
# peak memory: 8570.36 MiB, increment: 4489.45 MiB
# peak memory: 8372.46 MiB, increment: 4291.54 MiB
# peak memory: 8582.37 MiB, increment: 4501.44 MiB
# peak memory: 8692.37 MiB, increment: 4611.44 MiB
# peak memory: 8760.37 MiB, increment: 4679.44 MiB
# 1min 5s ± 971 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%timeit value_counts(x, num_classes=250_000)
%memit value_counts(x, num_classes=250_000)
# 16.9 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# peak memory: 4093.53 MiB, increment: 0.00 MiB
%%timeit 
%memit value_counts(y, num_classes=250_000)
# peak memory: 8777.02 MiB, increment: 4712.53 MiB
# peak memory: 8587.01 MiB, increment: 4524.52 MiB
# peak memory: 8433.01 MiB, increment: 4370.52 MiB
# peak memory: 8355.69 MiB, increment: 4293.21 MiB
# peak memory: 8404.86 MiB, increment: 4342.38 MiB
# peak memory: 8789.01 MiB, increment: 4726.52 MiB
# peak memory: 8769.01 MiB, increment: 4706.52 MiB
# peak memory: 8355.70 MiB, increment: 4293.21 MiB
# 28.7 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing

🔍 Testing Done: Existing tests.

codecov bot commented Mar 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.26%. Comparing base (abd0924) to head (0a09fdb).
Report is 15 commits behind head on master.

❗ Current head 0a09fdb differs from pull request most recent head d77f08b. Consider uploading reports for the commit d77f08b to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1073      +/-   ##
==========================================
- Coverage   96.15%   94.26%   -1.90%     
==========================================
  Files          74       74              
  Lines        5850     5857       +7     
  Branches     1044     1046       +2     
==========================================
- Hits         5625     5521     -104     
- Misses        134      254     +120     
+ Partials       91       82       -9     


@jwmueller jwmueller requested a review from elisno March 29, 2024 23:06
Member

@elisno elisno left a comment

LGTM! Great performance improvement!

Made some changes and fixed a bug where missing string classes were improperly handled.

Included more test cases for this utility function.
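
One possible way to handle string classes that are missing from the data is sketched below. This is only an illustration under stated assumptions, not the fix that was merged: the helper `value_counts_with_names` and its `class_names` parameter are hypothetical, `class_names` must be sorted, and every label must appear in it.

```python
import numpy as np


def value_counts_with_names(labels, class_names):
    """Hypothetical helper: count string labels against a known list of classes.

    Assumes class_names is sorted and contains every label that occurs.
    """
    class_names = np.asarray(class_names)
    unique_labels, counts = np.unique(np.asarray(labels), return_counts=True)
    # One slot per known class; classes absent from labels keep a count of 0.
    total_counts = np.zeros(len(class_names), dtype=counts.dtype)
    # Strings cannot index an array directly, so look up each label's position.
    positions = np.searchsorted(class_names, unique_labels)
    total_counts[positions] = counts
    return total_counts


counts = value_counts_with_names(["a", "c", "a"], class_names=["a", "b", "c"])
```

Here the class "b" never occurs in the labels, so its slot stays at zero while "a" and "c" receive their counts.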

@elisno elisno changed the title from "Optimize value_counts when missing classes for performance" to "Optimize value_counts function for performance improvement with missing classes" May 21, 2024
@elisno elisno merged commit c13f32a into cleanlab:master May 21, 2024
19 checks passed
elisno added a commit to elisno/cleanlab that referenced this pull request May 22, 2024
…ng classes (cleanlab#1073)

Co-authored-by: Elías Snorrason <eliassno@gmail.com>