
Perf: common_label_issues #1069

Open · wants to merge 5 commits into master
Conversation

gogetron
Contributor

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of common_label_issues in segmentation/summary.py file.

After profiling, it seems the iteration over the array was the slowest part. I have mostly replaced it with NumPy operations to avoid looping over each element. In addition, I replaced the try/except block with an if statement.
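As an illustrative sketch of the vectorization idea (not the PR's exact code; all variable names here are hypothetical), the per-pixel loop can be replaced by gathering the (given label, predicted label) pairs for all flagged pixels at once and counting them with `np.unique`:

```python
import numpy as np

# Hypothetical sketch: count (given, predicted) label pairs over all flagged
# pixels in one shot, instead of iterating pixel by pixel.
np.random.seed(0)
labels = np.random.randint(3, size=(4, 8, 8))    # (num_images, H, W)
pred_probs = np.random.random((4, 3, 8, 8))      # (num_images, num_classes, H, W)
predicted = pred_probs.argmax(axis=1)            # predicted label per pixel
issues = labels != predicted                     # stand-in boolean issue mask

given_at_issues = labels[issues]                 # given labels at flagged pixels
pred_at_issues = predicted[issues]               # predicted labels at those pixels
# Unique (given, predicted) columns and how often each pair occurs.
pairs, counts = np.unique(
    np.stack([given_at_issues, pred_at_issues]), axis=1, return_counts=True
)
```

The `np.unique(..., axis=1, return_counts=True)` call does in one vectorized pass what a Python loop would do one pixel at a time.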

For memory I used the memory-profiler library. The code I used for benchmarking is copied below. In addition, I sorted the imports in the modified files. Note: for benchmarking the current version, I removed the tqdm call in the loop.

Code Setup

import random

import numpy as np

from cleanlab.segmentation.rank import find_label_issues
from cleanlab.segmentation.summary import common_label_issues

SIZE = 1000
DATASET_EXP_SIZE = 5
np.random.seed(0)
%load_ext memory_profiler

# Copied from the test file with minor changes
def generate_three_image_dataset(bad_index):
    good_gt = np.zeros((SIZE, SIZE))
    good_gt[:SIZE // 2, :] = 1.0
    bad_gt = np.ones((SIZE, SIZE))
    bad_gt[:SIZE // 2, :] = 0.0
    good_pr = np.random.random((2, SIZE, SIZE))
    good_pr[0, :SIZE // 2, :] = good_pr[0, :SIZE // 2, :] / 10
    good_pr[1, SIZE // 2:, :] = good_pr[1, SIZE // 2:, :] / 10

    val = np.binary_repr([4, 2, 1][bad_index], width=3)
    error = [int(case) for case in val]

    labels = []
    pred = []
    for case in val:
        if case == "0":
            labels.append(good_gt)
            pred.append(good_pr)
        else:
            labels.append(bad_gt)
            pred.append(good_pr)

    labels = np.array(labels)
    pred_probs = np.array(pred)
    return labels, pred_probs, error

# Create input data
labels, pred_probs, error = generate_three_image_dataset(random.randint(0, 2))
for _ in range(DATASET_EXP_SIZE):
    labels = np.append(labels, labels, axis=0)
    pred_probs = np.append(pred_probs, pred_probs, axis=0)

labels, pred_probs = labels.astype(int), pred_probs.astype(float)
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)

Current version

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 3478.82 MiB, increment: 887.61 MiB
# peak memory: 3479.37 MiB, increment: 887.70 MiB
# peak memory: 3479.22 MiB, increment: 887.55 MiB
# peak memory: 3479.32 MiB, increment: 887.65 MiB
# peak memory: 3479.29 MiB, increment: 887.61 MiB
# peak memory: 3479.21 MiB, increment: 887.54 MiB
# peak memory: 3479.27 MiB, increment: 887.59 MiB
# peak memory: 3479.25 MiB, increment: 887.57 MiB
# 24.6 s ± 93.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 4158.13 MiB, increment: 1563.78 MiB
# peak memory: 4089.86 MiB, increment: 1475.34 MiB
# peak memory: 4109.87 MiB, increment: 1495.16 MiB
# peak memory: 4109.87 MiB, increment: 1495.16 MiB
# peak memory: 4089.87 MiB, increment: 1475.16 MiB
# peak memory: 4139.87 MiB, increment: 1525.16 MiB
# peak memory: 4243.75 MiB, increment: 1629.05 MiB
# peak memory: 4178.48 MiB, increment: 1563.78 MiB
# 2.34 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing

🔍 Testing Done: Existing tests.

References

Reviewer Notes

💡 Include any specific points for the reviewer to consider during their review.

I had to change the tqdm component because we are no longer looping over all the labels. I tried to follow the approach used in other files of only displaying progress when verbose is True. However, the bar is now only updated once per unique label.
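A minimal sketch of that progress-bar change (names are illustrative, not the actual code in summary.py): the bar wraps the unique labels and therefore advances once per class rather than once per pixel.

```python
import numpy as np
from tqdm.auto import tqdm

# Hypothetical sketch: progress advances once per unique label, and the bar
# is only shown when verbose is True, matching the convention in other files.
unique_labels = np.unique(np.array([2, 0, 1, 1, 0]))
verbose = True

iterator = tqdm(unique_labels) if verbose else unique_labels
processed = []
for label in iterator:
    # vectorized per-class work would go here
    processed.append(int(label))
```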

The new version consumes more memory than the previous one because of the NumPy masked arrays. It seems the memory bottleneck is not this function, because this is the result I get when calling find_label_issues on the same input data:

%%memit
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)
# peak memory: 7640.36 MiB, increment: 5156.72 MiB

However, I am open to trying approaches that reduce memory consumption at the cost of increased execution time.
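To illustrate where the extra memory comes from (a hedged sketch with hypothetical data, not the PR's code): a masked array keeps a full-size mask alive alongside the data, while plain boolean indexing materializes only the selected elements. Both yield the same counts.

```python
import numpy as np

# Hypothetical comparison: masked-array counting vs. boolean indexing.
np.random.seed(0)
labels = np.random.randint(3, size=10_000)
preds = np.random.randint(3, size=10_000)
issue_mask = labels != preds

# Masked-array style: the non-issue pixels are masked out, but the full-size
# data and mask arrays both stay in memory.
masked_labels = np.ma.masked_array(labels, mask=~issue_mask)
count_ma = masked_labels.count()          # number of unmasked (issue) pixels

# Boolean-indexing style: only the issue pixels are materialized.
count_bool = labels[issue_mask].size

assert count_ma == count_bool
```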


codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.17%. Comparing base (abd0924) to head (be261c7).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1069      +/-   ##
==========================================
+ Coverage   96.15%   96.17%   +0.02%     
==========================================
  Files          74       74              
  Lines        5850     5862      +12     
  Branches     1044     1047       +3     
==========================================
+ Hits         5625     5638      +13     
  Misses        134      134              
+ Partials       91       90       -1     


@gogetron
Contributor Author

Hi, thank you for your review. I have applied your suggested modification as it made a lot of sense. I followed a similar idea as in my other PR #1067; in this function I had to add a batch_size parameter.
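As a sketch of the batching idea (hypothetical names and signature; the actual batch_size parameter lives in summary.py), processing the flagged pixels in fixed-size chunks caps peak memory while accumulating the same per-pair counts:

```python
import numpy as np

# Hypothetical sketch: accumulate (given, predicted) pair counts in chunks so
# that only batch_size elements are indexed at a time. batch_size trades
# execution time for peak memory.
def count_issue_pairs(given, predicted, num_classes, batch_size=100_000):
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for start in range(0, given.size, batch_size):
        g = given[start:start + batch_size]
        p = predicted[start:start + batch_size]
        np.add.at(counts, (g, p), 1)   # unbuffered scatter-add of each pair
    return counts
```

A smaller batch_size lowers the size of the temporaries at the cost of more loop iterations, which matches the time/memory trade-off discussed above.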

Code Setup

import numpy as np

from cleanlab.segmentation.rank import find_label_issues
from cleanlab.segmentation.summary import common_label_issues

SIZE = 250
NUM_IMAGES = 1000
NUM_CLASSES = 10
np.random.seed(0)
%load_ext memory_profiler

def generate_image_dataset():
    labels = np.random.randint(NUM_CLASSES, size=(NUM_IMAGES, SIZE, SIZE), dtype=int)
    pred_probs = np.random.random((NUM_IMAGES, NUM_CLASSES, SIZE, SIZE))
    return labels, pred_probs

# Create input data
labels, pred_probs = generate_image_dataset()
issues = find_label_issues(labels, pred_probs, n_jobs=1, verbose=False)

Current version

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 8039.15 MiB, increment: 2437.64 MiB
# peak memory: 8039.63 MiB, increment: 2473.30 MiB
# peak memory: 8039.59 MiB, increment: 2473.26 MiB
# peak memory: 8039.46 MiB, increment: 2473.14 MiB
# peak memory: 8039.69 MiB, increment: 2473.37 MiB
# peak memory: 8039.66 MiB, increment: 2473.15 MiB
# peak memory: 8039.67 MiB, increment: 2473.16 MiB
# peak memory: 8039.69 MiB, increment: 2473.17 MiB
# 1min 11s ± 556 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit common_label_issues(issues, labels, pred_probs, verbose=False)
# peak memory: 7278.66 MiB, increment: 1674.31 MiB
# peak memory: 7384.47 MiB, increment: 1818.66 MiB
# peak memory: 7296.44 MiB, increment: 1730.57 MiB
# peak memory: 7296.82 MiB, increment: 1730.94 MiB
# peak memory: 7332.82 MiB, increment: 1766.75 MiB
# peak memory: 7279.39 MiB, increment: 1713.32 MiB
# peak memory: 7388.80 MiB, increment: 1822.73 MiB
# peak memory: 7332.67 MiB, increment: 1766.61 MiB
# 6.93 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@gogetron gogetron requested a review from elisno April 27, 2024 08:03