Optimize value_counts function for performance improvement with missing classes #1073

gogetron · 2024-03-29T19:20:39Z

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of internal value_counts function

While working on token_classification, I noticed that with batched inputs the value_counts function ranked around the top 30 when profiling. Because np.unique returns a sorted array, we can preallocate the output array and avoid creating multiple intermediate lists before converting them into an array, which is orders of magnitude faster. With this PR, almost all of the time is spent in np.unique. The improvement is only noticeable when not all classes are present (mostly batches with a relatively high number of classes).

For memory I used the memory-profiler library. The code I used for benchmarking is copied below. In addition I sorted the imports in the modified files.
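To illustrate the idea described above, here is a minimal sketch, not the code from this PR. The function names `value_counts_prealloc` and `value_counts_lists` are hypothetical, and the sketch assumes integer labels in the range `[0, num_classes)`. The list-based variant is only a stand-in for an approach that builds intermediate Python objects; it is not claimed to match the previous implementation.

```python
import numpy as np


def value_counts_prealloc(labels, num_classes):
    """Hypothetical sketch: per-class counts with zeros for missing classes."""
    # np.unique returns the sorted classes that are present and their counts.
    unique_classes, counts = np.unique(labels, return_counts=True)
    if num_classes == len(unique_classes):
        # Every class is present, so the counts are already complete.
        return counts
    # Preallocate zeros, then write the counts at the positions of the
    # classes that are present (assumes integer labels in [0, num_classes)).
    total = np.zeros(num_classes, dtype=counts.dtype)
    total[unique_classes] = counts
    return total


def value_counts_lists(labels, num_classes):
    """Hypothetical baseline that goes through intermediate Python objects."""
    unique_classes, counts = np.unique(labels, return_counts=True)
    present = dict(zip(unique_classes.tolist(), counts.tolist()))
    return np.array([present.get(k, 0) for k in range(num_classes)])


labels = np.array([0, 3, 3, 1, 3])
result = value_counts_prealloc(labels, num_classes=5)
print(result)  # [1 1 0 3 0]
```

Both functions return the same counts; the preallocated version simply skips the per-class Python loop.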

Code Setup

import numpy as np

from cleanlab.internal.util import value_counts

np.random.seed(0)
%load_ext memory_profiler

# Simulate a batch_size of 10_000 and a total of 250_000 classes
x = np.random.randint(250_000, size=10_000)

# Simulate a large dataset with most classes present.
y = np.random.randint(249_950, size=500_000_000)

Note: when calling value_counts on x, I measured memory only once, after the timeit call. Otherwise it would print too many statements.

Current version

%timeit value_counts(x, num_classes=250_000)
%memit value_counts(x, num_classes=250_000)
# 48.1 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# peak memory: 4103.93 MiB, increment: 4.66 MiB

%%timeit 
%memit value_counts(y, num_classes=250_000)
# peak memory: 8465.91 MiB, increment: 4385.38 MiB
# peak memory: 8472.33 MiB, increment: 4391.65 MiB
# peak memory: 8372.47 MiB, increment: 4291.54 MiB
# peak memory: 8570.36 MiB, increment: 4489.45 MiB
# peak memory: 8372.46 MiB, increment: 4291.54 MiB
# peak memory: 8582.37 MiB, increment: 4501.44 MiB
# peak memory: 8692.37 MiB, increment: 4611.44 MiB
# peak memory: 8760.37 MiB, increment: 4679.44 MiB
# 1min 5s ± 971 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%timeit value_counts(x, num_classes=250_000)
%memit value_counts(x, num_classes=250_000)
# 16.9 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# peak memory: 4093.53 MiB, increment: 0.00 MiB

%%timeit 
%memit value_counts(y, num_classes=250_000)
# peak memory: 8777.02 MiB, increment: 4712.53 MiB
# peak memory: 8587.01 MiB, increment: 4524.52 MiB
# peak memory: 8433.01 MiB, increment: 4370.52 MiB
# peak memory: 8355.69 MiB, increment: 4293.21 MiB
# peak memory: 8404.86 MiB, increment: 4342.38 MiB
# peak memory: 8789.01 MiB, increment: 4726.52 MiB
# peak memory: 8769.01 MiB, increment: 4706.52 MiB
# peak memory: 8355.70 MiB, increment: 4293.21 MiB
# 28.7 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
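
For a quick comparison, the speedup implied by the mean timings reported above can be worked out with a few lines of arithmetic. The dictionary keys are just illustrative labels; the numbers are copied from the timeit output in this description, with "1min 5s" taken as 65 seconds.

```python
# Mean time per loop in seconds, copied from the timeit output above:
# (current version, this PR).
timings = {
    "x (10_000 labels)": (48.1e-3, 16.9e-3),
    "y (500_000_000 labels)": (65.0, 28.7),
}

# Speedup factor = time before / time after.
speedups = {
    name: round(before / after, 2) for name, (before, after) in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: about {factor}x faster")
```

Peak memory is essentially unchanged for y in both versions, so the gain shown here is in run time only.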

Testing

🔍 Testing Done: Existing tests.

codecov · 2024-03-29T20:38:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.26%. Comparing base (abd0924) to head (0a09fdb).
Report is 15 commits behind head on master.

❗ Current head 0a09fdb differs from pull request most recent head d77f08b. Consider uploading reports for the commit d77f08b to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1073      +/-   ##
==========================================
- Coverage   96.15%   94.26%   -1.90%     
==========================================
  Files          74       74              
  Lines        5850     5857       +7     
  Branches     1044     1046       +2     
==========================================
- Hits         5625     5521     -104     
- Misses        134      254     +120     
+ Partials       91       82       -9

elisno

LGTM! Great performance improvement!

Made some changes and fixed a bug where missing string classes were improperly handled.

Included more test cases for this utility function.
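
As a rough illustration of why string labels need separate handling, consider the sketch below. It is an assumption about one plausible approach, not a copy of the fix in this PR, and the name `value_counts_any_labels` is hypothetical. Integer labels can index directly into a preallocated array, but string labels cannot (NumPy raises an IndexError), so the sketch places the counts of the present classes first and pads the rest with zeros.

```python
import numpy as np


def value_counts_any_labels(labels, num_classes):
    """Hypothetical sketch handling both integer and string labels."""
    unique_classes, counts = np.unique(labels, return_counts=True)
    if num_classes == len(unique_classes):
        return counts
    total = np.zeros(num_classes, dtype=counts.dtype)
    if np.issubdtype(unique_classes.dtype, np.integer):
        # Integer labels double as positions in the output array.
        total[unique_classes] = counts
    else:
        # Assumption: string labels carry no positional meaning, so the
        # counts of present classes fill the leading slots and the missing
        # classes are represented by trailing zeros.
        total[: len(unique_classes)] = counts
    return total


string_counts = value_counts_any_labels(np.array(["a", "c", "a"]), num_classes=3)
print(string_counts)  # [2 1 0]
```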

gogetron added 2 commits March 29, 2024 17:49

Perf: value_counts when missing classes

434dfb7

Fix: typing error python >=3.9

0a09fdb

jwmueller requested a review from elisno March 29, 2024 23:06

elisno added 2 commits April 11, 2024 14:37

Add test_value_counts_fill_missing_classes to test_util.py

406fb7d

Fix value_counts function to handle missing classes properly

d77f08b

elisno approved these changes Apr 11, 2024

elisno changed the title ~~Optimize value_counts when missing classes for performance~~ Optimize value_counts function for performance improvement with missing classes May 21, 2024

elisno merged commit c13f32a into cleanlab:master May 21, 2024
19 checks passed

elisno added a commit to elisno/cleanlab that referenced this pull request May 22, 2024

Optimize value_counts function for performance improvement with missi…

316debd

…ng classes (cleanlab#1073) Co-authored-by: Elías Snorrason <eliassno@gmail.com>
