
Perf: common_label_issues #1069

Open · wants to merge 5 commits into master
Conversation

gogetron
Contributor

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of common_label_issues in segmentation/summary.py file.

After profiling, it seems the iteration over the array was the slowest part. I have mostly replaced it with NumPy operations to avoid looping over each element. In addition, I replaced the try/except block with an if statement.
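As an illustrative sketch of the vectorization idea (not the PR's exact code; all variable names here are hypothetical), the per-pixel loop can be replaced by gathering the (given label, predicted label) pairs for all flagged pixels at once and counting them with `np.unique`:

```python
import numpy as np

# Hypothetical sketch: count (given, predicted) label pairs over all flagged
# pixels in one shot, instead of iterating pixel by pixel.
np.random.seed(0)
labels = np.random.randint(3, size=(4, 8, 8))    # (num_images, H, W)
pred_probs = np.random.random((4, 3, 8, 8))      # (num_images, num_classes, H, W)
predicted = pred_probs.argmax(axis=1)            # predicted label per pixel
issues = labels != predicted                     # stand-in boolean issue mask

given_at_issues = labels[issues]                 # given labels at flagged pixels
pred_at_issues = predicted[issues]               # predicted labels at those pixels
# Unique (given, predicted) columns and how often each pair occurs.
pairs, counts = np.unique(
    np.stack([given_at_issues, pred_at_issues]), axis=1, return_counts=True
)
```

The `np.unique(..., axis=1, return_counts=True)` call does in one vectorized pass what a Python loop would do one pixel at a time.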

For memory I used the memory-profiler library. The code I used for benchmarking is copied below. In addition, I sorted the imports in the modified files. Note: for benchmarking the current version, I removed the tqdm call in the loop.

Code Setup

import random

import numpy as np

from cleanlab.segmentation.rank import find_label_issues
from cleanlab.segmentation.summary import common_label_issues

SIZE = 1000
DATASET_EXP_SIZE = 5
np.random.seed(0)
%load_ext memory_profiler

# Copied from the test file with minor changes
def generate_three_image_dataset(bad_index):
    good_gt = np.zeros((SIZE, SIZE))
    good_gt[:SIZE // 2, :] = 1.0
    bad_gt = np.ones((SIZE, SIZE))
    bad_gt[:SIZE // 2, :] = 0.0
    good_pr = np.random.random((2, SIZE, SIZE))
    good_pr[0, :SIZE // 2, :] = good_pr[0, :SIZE // 2, :] / 10
    good_pr[1, SIZE // 2:, :] = good_pr[1, SIZE // 2:, :] / 10

    val = np.binary_repr([4, 2, 1][bad_index], width=3)
    error = [int(case) for case in val]

    labels = []
    pred = []
    for case in val:
        if case == "0":
            labels.append(good_gt)
            pred.append(good_pr)
        else:
            labels.append(bad_gt)
            pred.append(good_pr)

    labels = np.array(labels)
    pred_probs = np.array(pred)
    return labels, pred_probs, error

# Create input data
labels, pred_probs, error = generate_three_image_dataset(random.randint(0, 2))
for _ in range(DATASET_EXP_SIZE):
    labels = np.append(labels, labels, axis=0)
    pred_probs = np.append(pred_probs, pred_probs, axis=0)

labels, pred_probs = labels.astype(int), pred_probs.astype(float)
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)

Current version

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 3478.82 MiB, increment: 887.61 MiB
# peak memory: 3479.37 MiB, increment: 887.70 MiB
# peak memory: 3479.22 MiB, increment: 887.55 MiB
# peak memory: 3479.32 MiB, increment: 887.65 MiB
# peak memory: 3479.29 MiB, increment: 887.61 MiB
# peak memory: 3479.21 MiB, increment: 887.54 MiB
# peak memory: 3479.27 MiB, increment: 887.59 MiB
# peak memory: 3479.25 MiB, increment: 887.57 MiB
# 24.6 s ± 93.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 4158.13 MiB, increment: 1563.78 MiB
# peak memory: 4089.86 MiB, increment: 1475.34 MiB
# peak memory: 4109.87 MiB, increment: 1495.16 MiB
# peak memory: 4109.87 MiB, increment: 1495.16 MiB
# peak memory: 4089.87 MiB, increment: 1475.16 MiB
# peak memory: 4139.87 MiB, increment: 1525.16 MiB
# peak memory: 4243.75 MiB, increment: 1629.05 MiB
# peak memory: 4178.48 MiB, increment: 1563.78 MiB
# 2.34 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing

🔍 Testing Done: Existing tests.

References

Reviewer Notes

💡 Include any specific points for the reviewer to consider during their review.

I had to change the tqdm component because we are no longer looping over all the labels. I tried to follow the approach used in other files of only displaying progress when verbose is True. However, the bar is now only updated once per unique label.
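A minimal sketch of that progress-bar change (names are illustrative, not the actual code in summary.py): the bar wraps the unique labels and therefore advances once per class rather than once per pixel.

```python
import numpy as np
from tqdm.auto import tqdm

# Hypothetical sketch: progress advances once per unique label, and the bar
# is only shown when verbose is True, matching the convention in other files.
unique_labels = np.unique(np.array([2, 0, 1, 1, 0]))
verbose = True

iterator = tqdm(unique_labels) if verbose else unique_labels
processed = []
for label in iterator:
    # vectorized per-class work would go here
    processed.append(int(label))
```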

The new version consumes more memory than the previous one because of the NumPy masked arrays. It seems the memory bottleneck is not this function, because this is the result I get when calling find_label_issues on the same input data:

%%memit
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)
# peak memory: 7640.36 MiB, increment: 5156.72 MiB

However, I am open to trying approaches that reduce memory consumption at the cost of increased execution time.
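To illustrate where the extra memory comes from (a hedged sketch with hypothetical data, not the PR's code): a masked array keeps a full-size mask alive alongside the data, while plain boolean indexing materializes only the selected elements. Both yield the same counts.

```python
import numpy as np

# Hypothetical comparison: masked-array counting vs. boolean indexing.
np.random.seed(0)
labels = np.random.randint(3, size=10_000)
preds = np.random.randint(3, size=10_000)
issue_mask = labels != preds

# Masked-array style: the non-issue pixels are masked out, but the full-size
# data and mask arrays both stay in memory.
masked_labels = np.ma.masked_array(labels, mask=~issue_mask)
count_ma = masked_labels.count()          # number of unmasked (issue) pixels

# Boolean-indexing style: only the issue pixels are materialized.
count_bool = labels[issue_mask].size

assert count_ma == count_bool
```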


codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.17%. Comparing base (abd0924) to head (be261c7).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1069      +/-   ##
==========================================
+ Coverage   96.15%   96.17%   +0.02%     
==========================================
  Files          74       74              
  Lines        5850     5862      +12     
  Branches     1044     1047       +3     
==========================================
+ Hits         5625     5638      +13     
  Misses        134      134              
+ Partials       91       90       -1     


@gogetron
Contributor Author

Hi, thank you for your review. I have applied your suggested modification as it made a lot of sense. I followed a similar idea as in my other PR #1067; in this function I had to add a batch_size parameter.
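As a sketch of the batching idea (hypothetical names and signature; the actual batch_size parameter lives in summary.py), processing the flagged pixels in fixed-size chunks caps peak memory while accumulating the same per-pair counts:

```python
import numpy as np

# Hypothetical sketch: accumulate (given, predicted) pair counts in chunks so
# that only batch_size elements are indexed at a time. batch_size trades
# execution time for peak memory.
def count_issue_pairs(given, predicted, num_classes, batch_size=100_000):
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for start in range(0, given.size, batch_size):
        g = given[start:start + batch_size]
        p = predicted[start:start + batch_size]
        np.add.at(counts, (g, p), 1)   # unbuffered scatter-add of each pair
    return counts
```

A smaller batch_size lowers the size of the temporaries at the cost of more loop iterations, which matches the time/memory trade-off discussed above.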

Code Setup

import numpy as np

from cleanlab.segmentation.rank import find_label_issues
from cleanlab.segmentation.summary import common_label_issues

SIZE = 250
NUM_IMAGES = 1000
NUM_CLASSES = 10
np.random.seed(0)
%load_ext memory_profiler

def generate_image_dataset():
    labels = np.random.randint(NUM_CLASSES, size=(NUM_IMAGES, SIZE, SIZE), dtype=int)
    pred_probs = np.random.random((NUM_IMAGES, NUM_CLASSES, SIZE, SIZE))
    return labels, pred_probs

# Create input data
labels, pred_probs = generate_image_dataset()
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)

Current version

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 8039.15 MiB, increment: 2437.64 MiB
# peak memory: 8039.63 MiB, increment: 2473.30 MiB
# peak memory: 8039.59 MiB, increment: 2473.26 MiB
# peak memory: 8039.46 MiB, increment: 2473.14 MiB
# peak memory: 8039.69 MiB, increment: 2473.37 MiB
# peak memory: 8039.66 MiB, increment: 2473.15 MiB
# peak memory: 8039.67 MiB, increment: 2473.16 MiB
# peak memory: 8039.69 MiB, increment: 2473.17 MiB
# 1min 11s ± 556 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 7278.66 MiB, increment: 1674.31 MiB
# peak memory: 7384.47 MiB, increment: 1818.66 MiB
# peak memory: 7296.44 MiB, increment: 1730.57 MiB
# peak memory: 7296.82 MiB, increment: 1730.94 MiB
# peak memory: 7332.82 MiB, increment: 1766.75 MiB
# peak memory: 7279.39 MiB, increment: 1713.32 MiB
# peak memory: 7388.80 MiB, increment: 1822.73 MiB
# peak memory: 7332.67 MiB, increment: 1766.61 MiB
# 6.93 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@gogetron gogetron requested a review from elisno April 27, 2024 08:03