Perf: common_label_issues #1069
base: master
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1069      +/-   ##
==========================================
+ Coverage   96.15%   96.17%   +0.02%
==========================================
  Files          74       74
  Lines        5850     5862      +12
  Branches     1044     1047       +3
==========================================
+ Hits         5625     5638      +13
  Misses        134      134
+ Partials       91       90       -1

☔ View full report in Codecov by Sentry.
Hi, thank you for your review. I have applied your suggested modification as it made a lot of sense, and I followed a similar idea as in my other PR #1067. In this function I had to add the batch_size parameter (an illustrative usage sketch follows the benchmark results below).

Code Setup

import numpy as np
from cleanlab.segmentation.summary import common_label_issues
from cleanlab.segmentation.rank import find_label_issues
SIZE = 250
NUM_IMAGES = 1000
NUM_CLASSES = 10
np.random.seed(0)
%load_ext memory_profiler
def generate_image_dataset():
labels = np.random.randint(NUM_CLASSES, size=(NUM_IMAGES, SIZE, SIZE), dtype=int)
pred_probs = np.random.random((NUM_IMAGES, NUM_CLASSES, SIZE, SIZE))
return labels, pred_probs
# Create input data
labels, pred_probs = generate_image_dataset()
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)

Current version

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 8039.15 MiB, increment: 2437.64 MiB
# peak memory: 8039.63 MiB, increment: 2473.30 MiB
# peak memory: 8039.59 MiB, increment: 2473.26 MiB
# peak memory: 8039.46 MiB, increment: 2473.14 MiB
# peak memory: 8039.69 MiB, increment: 2473.37 MiB
# peak memory: 8039.66 MiB, increment: 2473.15 MiB
# peak memory: 8039.67 MiB, increment: 2473.16 MiB
# peak memory: 8039.69 MiB, increment: 2473.17 MiB
# 1min 11s ± 556 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 7278.66 MiB, increment: 1674.31 MiB
# peak memory: 7384.47 MiB, increment: 1818.66 MiB
# peak memory: 7296.44 MiB, increment: 1730.57 MiB
# peak memory: 7296.82 MiB, increment: 1730.94 MiB
# peak memory: 7332.82 MiB, increment: 1766.75 MiB
# peak memory: 7279.39 MiB, increment: 1713.32 MiB
# peak memory: 7388.80 MiB, increment: 1822.73 MiB
# peak memory: 7332.67 MiB, increment: 1766.61 MiB
# 6.93 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
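For reference, an illustrative usage sketch of the new batch_size parameter mentioned above (continuing from the Code Setup; the parameter name comes from this PR, but the value 100 is an arbitrary assumption, not a tested default):

# Illustrative only: batch_size is the parameter added in this PR; 100 is an
# arbitrary value chosen for the example, not a recommended default.
result = common_label_issues(issues, labels, pred_probs, batch_size=100, verbose=False)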
Summary
This PR partially addresses #862
After profiling, the per-element iteration over the array turned out to be the slowest part. I have mostly replaced it with numpy operations so that we no longer loop over each element (a sketch of the idea follows below). In addition, I have replaced the try/except block with an if statement.
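To make the idea concrete, here is a minimal sketch of that kind of vectorization (my own illustration, not the actual implementation), assuming issues and labels have shape (num_images, H, W) and pred_probs has shape (num_images, num_classes, H, W):

import numpy as np

# Count how often each (annotated class, predicted class) pair occurs among the
# flagged pixels, using a single bincount instead of a per-pixel Python loop.
def count_issue_pairs(issues, labels, pred_probs, num_classes):
    flagged = issues.reshape(-1)                                 # boolean mask over all pixels
    given = labels.reshape(-1)[flagged]                          # annotated class of each flagged pixel
    predicted = pred_probs.argmax(axis=1).reshape(-1)[flagged]   # predicted class of each flagged pixel
    pairs = given * num_classes + predicted                      # encode (given, predicted) as one integer
    return np.bincount(pairs, minlength=num_classes**2).reshape(num_classes, num_classes)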
For memory measurements I used the memory-profiler library. The code I used for benchmarking is copied below. In addition, I sorted the imports in the modified files. Note: when benchmarking the current version I removed the tqdm call in the loop.
Code Setup
Current version
This PR
Testing
References
Reviewer Notes
I had to change the tqdm component because we no longer loop over all the labels. I tried to follow the approach used in other files of only displaying progress when verbose is True; however, the bar is now only updated once per unique_label (a sketch of the pattern follows below).
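A minimal sketch of that progress pattern (my own illustration under those assumptions, not the exact PR code):

import numpy as np
from tqdm.auto import tqdm

# The bar is created only when verbose=True and advances once per unique label
# rather than once per pixel.
def per_label_summary(issues, labels, verbose=True):
    unique_labels = np.unique(labels[issues])                    # labels appearing among flagged pixels
    pbar = tqdm(total=len(unique_labels), desc="unique labels") if verbose else None
    counts = {}
    for unique_label in unique_labels:
        mask = issues & (labels == unique_label)                 # flagged pixels annotated with this label
        counts[int(unique_label)] = int(mask.sum())              # placeholder for the real per-label work
        if pbar is not None:
            pbar.update(1)
    if pbar is not None:
        pbar.close()
    return counts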
The new version consumes more memory than the previous version because of the numpy masked arrays (a small illustration of why follows at the end of these notes). It seems that the memory bottleneck is not this function, though, because this is the result I get when calling the find_label_issues function with the same input data:
However, I am open to trying other approaches that reduce memory consumption at the cost of increased execution time.
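As a rough illustration of the masked-array point above (my own sketch, not the PR code): np.ma keeps the original data plus a full-size boolean mask, so masking the label map adds at least one extra byte per pixel to peak memory.

import numpy as np

# Rough illustration only: a masked array over the label map that hides
# non-flagged pixels. np.ma stores the data plus a full-size boolean mask
# (one byte per pixel), which raises peak memory accordingly.
labels = np.random.randint(10, size=(10, 250, 250))
issues = np.random.random((10, 250, 250)) < 0.05                 # stand-in for find_label_issues output
masked_labels = np.ma.masked_array(labels, mask=~issues)         # only flagged pixels remain visible
print(masked_labels.count(), "flagged pixels visible")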