Improve KNN Graph Construction for Handling Exact Duplicates and Numerical Precision #1119

elisno · 2024-05-06T23:13:33Z

Summary

This PR introduces a new argument, correct_exact_duplicates: bool = True, to the function construct_knn_graph_from_features and others. This update ensures that exact duplicates in the feature array are handled correctly during the k-nearest neighbors (KNN) graph construction.

New Functions

correct_knn_graph:
Wrapper around correct_knn_distances_and_indices that corrects a KNN graph based on the feature array.
```
def correct_knn_graph(features: FeatureArray, knn_graph: csr_matrix) -> csr_matrix:
    ...
```

correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace:
Main logic for in-place correction of distances and indices arrays.
```
def correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace(
    distances: np.ndarray,
    indices: np.ndarray,
    exact_duplicate_sets: List[np.ndarray],
) -> None:
    ...
```

correct_knn_distances_and_indices:
Corrects KNN distances and indices with optional exact duplicate sets and warning.
```
def correct_knn_distances_and_indices(
    features: FeatureArray,
    distances: np.ndarray,
    indices: np.ndarray,
    exact_duplicate_sets: Optional[List[np.ndarray]] = None,
    enable_warning: bool = False,
) -> tuple[np.ndarray, np.ndarray]:
    ...
```

Impact

The default behavior of most functions now includes correction for exact duplicates unless a knn: NearestNeighbors object is passed explicitly.
This change affects the outlier detection in cleanlab/outlier.py, where correction is applied manually if no knn object is provided.
The noniid check in Datalab is updated to construct a KNN graph without relying on NearestNeighbors from sklearn.

What this PR does not address:

The correction logic addresses exact duplicates but does not explicitly cover scenarios where there are near-duplicates or small variations in features that might need similar handling.
Performance optimization for large duplicated datasets. Users with large datasets and numerous sets of exact duplicates might experience slower performance due to the iteration across all the different duplicate sets.
The PR adds corrections for exact duplicates during KNN graph construction on training data, but it does not provide flexibility for other k nearest neighbor search libraries. Such graphs should be constructed by the user, and subsequently corrected with the same features.
- The same applies to correcting knn graphs on test data.

Benchmark Results

Two benchmark scenarios were tested:

All-Identical Dataset:
- One unique point duplicated N times.
Copied Dataset:
- Several points duplicated a few times.

The graphs below compare runtime and memory usage for different functions:

Top Graph: Runtime vs. Number of Points
Bottom Graph: Memory Usage vs. Number of Points

For the All-Identical Dataset, the correction function spends its time constructing a small circulant matrix to find the nearest neighbors of the first k+1 elements. All other points just refer to the first k points. The purple line shows how the correction algorithm performs if all the duplicate information is pre-computed and the output can be modified in-place. No knn-graph construction occurs in that function.
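
The circulant layout described above can be sketched as follows. This is a minimal illustration, not the implementation in this PR: the helper name `circulant_neighbor_indices` is hypothetical, and it assumes the first k+1 duplicates are labelled 0..k.

```python
import numpy as np


def circulant_neighbor_indices(k: int) -> np.ndarray:
    """Hypothetical helper: neighbor table for k + 1 mutually identical points.

    Row i lists the other k points in cyclic order, so no row contains i itself.
    """
    rows = np.arange(k + 1)[:, None]       # point ids 0..k as a column vector
    shifts = np.arange(1, k + 1)[None, :]  # cyclic offsets 1..k
    return (rows + shifts) % (k + 1)


table = circulant_neighbor_indices(3)
print(table)
```

The table has (k+1)·k entries regardless of the dataset size, which is why the all-identical case stays cheap: every duplicate beyond the first k+1 can reuse the first k duplicates as its neighbors.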

For the Copied Dataset, there are far more sets to iterate over, which slows down the correction function.
Even so, its runtime should not exceed that of the exhaustive search algorithm by much.

The benchmark code and additional results are provided in the expandable sections.

Benchmark Code for All-Identical Dataset

Code for All-Identical Dataset

from __future__ import annotations
import time
import tracemalloc
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from cleanlab.internal.neighbor.knn_graph import (
    _compute_exact_duplicate_sets,
    correct_knn_distances_and_indices,
    correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace,
    features_to_knn,
    construct_knn_graph_from_index,
    create_knn_graph_and_index,
)

# Define the sizes of feature arrays for the benchmark
feature_sizes = np.logspace(1, 7.0, num=25, base=10, dtype=int)

# Define a function to benchmark the memory and runtime of a given function
def benchmark_function(func, *args, **kwargs):
    # Start tracing allocations, then start the timer
    tracemalloc.start()
    start_time = time.perf_counter()

    # Run the function
    result = func(*args, **kwargs)

    # Stop the timer, then read the peak traced memory
    end_time = time.perf_counter()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Calculate runtime and peak memory usage
    runtime = end_time - start_time
    peak_memory = peak / 10**6  # Convert bytes to MB

    return runtime, peak_memory, result

# Column names for the benchmark results
f_col, N_col, runtime_col, memory_col = 'Function', 'Num points', 'Runtime (s)', 'Memory (MB)'

# Define functions to benchmark
functions = {
    # 'features_to_knn': features_to_knn,
    'construct_knn_graph_from_index_without_correction': construct_knn_graph_from_index,
    'construct_knn_graph_from_index_with_correction': construct_knn_graph_from_index,
    # 'create_knn_graph_and_index': create_knn_graph_and_index,
    'correct_knn_distances_and_indices': correct_knn_distances_and_indices,
    'correct_with_precomputed_exact_duplicate_sets': correct_knn_distances_and_indices,
    'correct_with_precomputed_exact_duplicate_sets_inplace': correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace,
}
results_list_of_dicts = []
# Run the benchmark for each function and feature size
N_max_slow = 20000
for N in tqdm(feature_sizes):
    # features = np.random.rand(N, 10)  # Generate random feature array
    features = np.ones((N, 10))
    if N > N_max_slow:
        k = min(10, N-1)
        distances = np.tile(np.ones(k), (N, 1))
        indices = np.tile(np.arange(k), (N, 1))
        exact_duplicate_sets = [np.arange(N)]
    else:
        knn_graph, knn = create_knn_graph_and_index(features, correct_exact_duplicates=False)
        distances = knn_graph.data.reshape(knn_graph.shape[0], -1)
        indices = knn_graph.indices.reshape(knn_graph.shape[0], -1)
        exact_duplicate_sets = _compute_exact_duplicate_sets(features)
    
    for func_name, func in functions.items():
        if func_name == 'construct_knn_graph_from_index_without_correction':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, knn)
        elif func_name == 'construct_knn_graph_from_index_with_correction':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, knn, correct_exact_duplicates=True)
        elif func_name == 'create_knn_graph_and_index':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features)
        elif func_name == 'correct_knn_distances_and_indices':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features=features, distances=distances, indices=indices)
        elif func_name == 'correct_with_precomputed_exact_duplicate_sets':
            runtime, peak_memory, _ = benchmark_function(func, features=features, distances=distances, indices=indices, exact_duplicate_sets=exact_duplicate_sets)
        elif func_name == 'correct_with_precomputed_exact_duplicate_sets_inplace':
            runtime, peak_memory, _ = benchmark_function(func, distances=distances, indices=indices, exact_duplicate_sets=exact_duplicate_sets)
        else:
            if N > 1000:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features)
        
        
        # Store the results for this function and dataset size
        results_list_of_dicts.append({
            f_col: func_name,
            N_col: N,
            runtime_col: runtime,
            memory_col: peak_memory,
        })

results = pd.DataFrame(results_list_of_dicts)
# Save the results to a CSV file
results.to_csv('benchmark_results.csv', index=False)

# Print the results
print(results)

# Plot the results
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 10))
for func_name in functions.keys():
    subset = results[results[f_col] == func_name]
    axes[0].plot(subset[N_col], subset[runtime_col], label=func_name)
    axes[1].plot(subset[N_col], subset[memory_col], label=func_name)

axes[0].set_xlabel(N_col)
axes[0].set_ylabel(runtime_col)
axes[0].set_title('Runtime vs. Num Points')
axes[0].legend()
axes[0].set_xscale('log')
axes[0].set_yscale('log')

axes[1].set_xlabel(N_col)
axes[1].set_ylabel(memory_col)
axes[1].set_title('Memory Usage vs. Num Points')
axes[1].legend()
axes[1].set_xscale('log')
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

Benchmark Code for Copied Dataset

Code for Copied Dataset

from __future__ import annotations
import time
import tracemalloc
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from cleanlab.internal.neighbor.knn_graph import (
    _compute_exact_duplicate_sets,
    correct_knn_distances_and_indices,
    correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace,
    features_to_knn,
    construct_knn_graph_from_index,
    create_knn_graph_and_index,
)

# Define the sizes of feature arrays for the benchmark
feature_sizes = np.logspace(1, 4.6, num=25, base=10, dtype=int)

# Define a function to benchmark the memory and runtime of a given function
def benchmark_function(func, *args, **kwargs):
    # Start tracing allocations, then start the timer
    tracemalloc.start()
    start_time = time.perf_counter()

    # Run the function
    result = func(*args, **kwargs)

    # Stop the timer, then read the peak traced memory
    end_time = time.perf_counter()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Calculate runtime and peak memory usage
    runtime = end_time - start_time
    peak_memory = peak / 10**6  # Convert bytes to MB

    return runtime, peak_memory, result

# Column names for the benchmark results
f_col, N_col, runtime_col, memory_col = 'Function', 'Num points', 'Runtime (s)', 'Memory (MB)'

# Define functions to benchmark
functions = {
    # 'features_to_knn': features_to_knn,
    'construct_knn_graph_from_index_without_correction': construct_knn_graph_from_index,
    'construct_knn_graph_from_index_with_correction': construct_knn_graph_from_index,
    # 'create_knn_graph_and_index': create_knn_graph_and_index,
    'correct_knn_distances_and_indices': correct_knn_distances_and_indices,
    'correct_with_precomputed_exact_duplicate_sets': correct_knn_distances_and_indices,
    'correct_with_precomputed_exact_duplicate_sets_inplace': correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace,
}
results_list_of_dicts = []
# Run the benchmark for each function and feature size
N_max_slow = 20000
num_copies = 5
for N in tqdm(feature_sizes):
    # Stack num_copies copies of a random array so every point has exact duplicates
    features = np.random.rand(N // num_copies, 10)
    features = np.vstack([features] * num_copies)
    N = features.shape[0]  # Actual number of points after stacking

    knn_graph, knn = create_knn_graph_and_index(features, correct_exact_duplicates=False)
    distances = knn_graph.data.reshape(knn_graph.shape[0], -1)
    indices = knn_graph.indices.reshape(knn_graph.shape[0], -1)
    exact_duplicate_sets = _compute_exact_duplicate_sets(features)
    
    for func_name, func in functions.items():
        if func_name == 'construct_knn_graph_from_index_without_correction':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, knn)
        elif func_name == 'construct_knn_graph_from_index_with_correction':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, knn, correct_exact_duplicates=True)
        elif func_name == 'create_knn_graph_and_index':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features)
        elif func_name == 'correct_knn_distances_and_indices':
            if N > N_max_slow:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features=features, distances=distances, indices=indices)
        elif func_name == 'correct_with_precomputed_exact_duplicate_sets':
            runtime, peak_memory, _ = benchmark_function(func, features=features, distances=distances, indices=indices, exact_duplicate_sets=exact_duplicate_sets)
        elif func_name == 'correct_with_precomputed_exact_duplicate_sets_inplace':
            runtime, peak_memory, _ = benchmark_function(func, distances=distances, indices=indices, exact_duplicate_sets=exact_duplicate_sets)
        else:
            if N > 1000:
                continue
            runtime, peak_memory, _ = benchmark_function(func, features)
        
        
        # Store the results for this function and dataset size
        results_list_of_dicts.append({
            f_col: func_name,
            N_col: N,
            runtime_col: runtime,
            memory_col: peak_memory,
        })

results = pd.DataFrame(results_list_of_dicts)
# Save the results to a CSV file
results.to_csv('benchmark_results_with_dataset_copy.csv', index=False)

# Print the results
print(results)

# Plot the results
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 10))
for func_name in functions.keys():
    subset = results[results[f_col] == func_name]
    axes[0].plot(subset[N_col], subset[runtime_col], label=func_name)
    axes[1].plot(subset[N_col], subset[memory_col], label=func_name)

axes[0].set_xlabel(N_col)
axes[0].set_ylabel(runtime_col)
axes[0].set_title('Runtime vs. Num Points')
axes[0].legend()
axes[0].set_xscale('log')
axes[0].set_yscale('log')

axes[1].set_xlabel(N_col)
axes[1].set_ylabel(memory_col)
axes[1].set_title('Memory Usage vs. Num Points')
axes[1].legend()
axes[1].set_xscale('log')
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

The aim of the function is to eliminate unnecessary memory allocations and reduce runtime by using more efficient numpy operations and manipulating duplicate indices as little as possible. Currently, the correction ends in one of two situations:

1. The number of duplicates equals or exceeds the number of neighbors.
- A single circulant matrix is enough to make sure that each row gets only other duplicates as neighbors. Any other duplicate points can just point to the first k duplicate points as their neighbors. This will basically ALWAYS happen when the dataset consists of all-identical examples that exceed the number of neighbors. (So the circulant matrix should take O(k^2) space, and the first point takes O(k) space, searching through O(N) duplicate points for an all-identical dataset.)
2. The number of duplicates is smaller than the number of neighbors.
- The same circulant matrix can be used to fill out the first few columns of the indices matrix. But before that, we must move all non-duplicate points to the far right. In practice, EVERY POINT has enough non-duplicate points as neighbors to make this work; it's just about ensuring that they are put on the far-right side of the array.
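
The two situations can be sketched for a single duplicate set as below. This is an illustrative reconstruction under stated assumptions, not the code from this PR: the function name `correct_one_duplicate_set_inplace` is hypothetical, each row of `indices` is assumed not to contain its own row id, and distances between exact duplicates are set to exactly 0.

```python
import numpy as np


def correct_one_duplicate_set_inplace(distances, indices, dup):
    """Hypothetical sketch: fix the rows of one set of exact duplicates in place.

    distances, indices: (N, k) arrays of neighbor distances and neighbor ids.
    dup: 1-D integer array with the row ids of mutually identical points.
    """
    k = indices.shape[1]
    m = len(dup)
    if m - 1 >= k:
        # Case 1: enough duplicates to fill every neighbor slot.
        first = dup[: k + 1]
        circ = (np.arange(k + 1)[:, None] + np.arange(1, k + 1)) % (k + 1)
        indices[first] = first[circ]    # the first k+1 duplicates point at each other
        indices[dup[k + 1:]] = dup[:k]  # the rest reuse the first k duplicates
        distances[dup] = 0.0
    else:
        # Case 2: duplicates fill the leftmost m-1 slots; the closest
        # non-duplicate neighbors are kept, in order, on the right.
        n_keep = k - (m - 1)
        for pos, i in enumerate(dup):
            is_non_dup = ~np.isin(indices[i], dup)
            kept_idx = indices[i][is_non_dup][:n_keep]
            kept_dist = distances[i][is_non_dup][:n_keep]
            indices[i] = np.concatenate([np.delete(dup, pos), kept_idx])
            distances[i] = np.concatenate([np.zeros(m - 1), kept_dist])


# Case 1: points 0..3 are identical and k = 2.
dist1 = np.full((5, 2), 0.3)
idx1 = np.array([[4, 1], [4, 0], [4, 3], [4, 2], [0, 1]])
correct_one_duplicate_set_inplace(dist1, idx1, np.array([0, 1, 2, 3]))

# Case 2: points 0 and 1 are identical and k = 3.
dist2 = np.array([[0.5, 0.6, 0.7], [0.5, 0.7, 0.9], [1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
idx2 = np.array([[2, 1, 3], [2, 3, 4], [0, 1, 3], [0, 1, 2], [0, 1, 2]])
correct_one_duplicate_set_inplace(dist2, idx2, np.array([0, 1]))
```

In the second case, row 0 originally ranked its duplicate (point 1) behind a non-duplicate, and row 1 did not list its duplicate at all; after the correction both rows start with the duplicate at distance 0.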

elisno · 2024-05-21T00:26:03Z

@huiwengoh I've addressed all of @jwmueller's comments.

Can you give a review and merge this?

elisno · 2024-05-21T00:43:59Z

I've added some graphs that show the runtime of the knn graph construction in the library.

The main conclusion is that:

As the dataset increases in size, the knn search takes the longest time, and the runtime when correcting for exact duplicates gets amortized.
The green line shows what the additional runtime is when correcting a knn_graph for exact duplicates.
- It involves calling np.unique on a feature array, but also makes some copies of the output arrays distances and indices.
The purple line shows how long it takes to run the core-correction logic.
- For a single exact duplicate set, we have a fully optimized algorithm that runs many orders of magnitude faster than running the actual knn search.
- As the number of exact duplicate sets increases, the runtime seems to approach the runtime complexity of the knn search (at least for ~1000 data points).
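
The `np.unique` step mentioned above can be sketched as follows. This shows the general technique only and is not the body of `_compute_exact_duplicate_sets`; the function name `find_exact_duplicate_sets` is hypothetical.

```python
import numpy as np


def find_exact_duplicate_sets(features: np.ndarray) -> list:
    """Hypothetical sketch: group the row ids of identical feature vectors."""
    _, inverse, counts = np.unique(
        features, axis=0, return_inverse=True, return_counts=True
    )
    inverse = np.ravel(inverse)  # guard against NumPy versions that return a 2-D inverse
    # Keep only the unique rows that occur more than once.
    return [np.flatnonzero(inverse == g) for g in np.flatnonzero(counts > 1)]


features = np.array(
    [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [0.0, 0.0]]
)
duplicate_sets = find_exact_duplicate_sets(features)
print(duplicate_sets)
```

In this sketch each comparison `inverse == g` scans all N rows, so the cost grows with the number of duplicate sets, which is consistent with many small sets being more expensive to handle than one large set.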

elisno · 2024-05-21T21:27:00Z

CI failure is unrelated to this PR.

Scikit-learn 1.5.0 was released just a few hours ago, and it only affects one test case, in cleanlab/tests/test_classification.py (line 785 in ca38929):

cv.fit(X=DATA["X_train"], y=DATA["labels"])

huiwengoh

lgtm! thanks for detailed docstrings and clarifications in the comments above :)

elisno and others added 30 commits May 3, 2024 15:49

Add test for all identical examples in test_regression.py

b9b9274

Add tests for detecting label issues in all identical examples dataset

182f50c

Add tests for detecting label issues in all identical examples dataset

a0c197c

clarify test_all_identical_examples in test_regression.py

fb93c07

remove unused Datalab test fixtures in test_all_identical_examples.py

08e86a6

Improve readability and documentation of test class for all identical…

ec8743e

… examples

Rename variable for number of feature columns (K -> M)

cbd1eab

add more issue types for classification (underperforming groups and c…

e841249

…lass-imbalance)

fix typos

49b17ee

update documentation of TestAllIdenticalExamplesDataset

25c4c83

Add decide_metric function to determine distance metric for neighbor …

747331a

…search

Add NeighborSearch protocol for k-nearest neighbors search

e6f2251

Add types for FeatureArray and Metric in neighbor/types.py

f90641e

rename test file

eae62e8

Add features_to_knn function to build and fit a k-nearest neighbors s…

a40f31d

…earch object from an array of numerical features

export features_to_knn from neighbor submodule

a588774

Add docs for cleanlab.internal.neighbor modules

0604ab3

correct neighbor.py

a0d6579

let search.py only work with NearestNeighbors

638ae31

refactor outlier.py to use knn construction function

a13c1f8

Refactor duplicate.py to use features_to_knn function for constructin…

76d16d2

…g the NearestNeighbors object

Refactor duplicate.py to use the knn_to_knn_graph function defined in…

5b33870

… neighbor.py

remove unused imports in outlier.py

45c4c3f

test knn_to_knn_graph

32d80ff

ignore unused import in __init__.py

a44d367

Refactor regression.rank.py to use features_to_knn function for const…

ceb3744

…ructing the NearestNeighbors object

Fix default value for neighbor_metric in rank.py

1082702

improve clarity of code selecting number of neighbors, add comments

8d662e2

Refactor metric.py to improve clarity of code selecting number of nei…

4395874

…ghbors, add comments Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

rename knn_to_knn_graph to construct_knn_graph_from_index

e2dd0fc

Add docstring as well

elisno added 8 commits May 17, 2024 23:27

move non-duplicate points in tests

b608e95

fix in-place function

84b3e6c

also add docstring for the helper function generating the circulant matrix representing the neighbors of the first k+1 duplicates.

refactor correct_knn_distances_and_indices

e426596

add docstring for in-place function

37fbf3b

add docstring for _compute_exact_duplicate_sets

b9d47bb

add docstring to correct_knn_graph

b404401

remove unused import

b063279

elisno requested a review from huiwengoh May 21, 2024 00:25

jwmueller reviewed May 21, 2024

cleanlab/outlier.py (outdated)

jwmueller reviewed May 21, 2024

cleanlab/internal/neighbor/knn_graph.py (outdated)

jwmueller reviewed May 21, 2024

cleanlab/internal/neighbor/knn_graph.py (outdated)

elisno and others added 2 commits May 21, 2024 20:24

update construct_knn_graph_from_index to accept feature array for exa…

86f021b

…ct duplicate correction

Update cleanlab/outlier.py

85c8d5e

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

elisno mentioned this pull request May 21, 2024

Enhance test coverage for setting Confident Joint in CleanLearning #1123

Merged

jwmueller reviewed May 21, 2024

cleanlab/outlier.py (outdated)

elisno and others added 7 commits May 22, 2024 00:27

Update cleanlab/outlier.py

7648ca9

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

Optimize multiannotator.py for performance (cleanlab#1077)

867f307

Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>

Optimize value_counts function for performance improvement with missi…

316debd

…ng classes (cleanlab#1073) Co-authored-by: Elías Snorrason <eliassno@gmail.com>

Enhance test coverage for setting Confident Joint in CleanLearning (c…

94c3ea6

…leanlab#1123)

Merge branch 'master' into correct-knn-graph

27e7225

rename test_neighbor.py to test_knn_graph.py

20d9687

Revert "rename test_neighbor.py to test_knn_graph.py"

85f6990

This reverts commit 20d9687.

huiwengoh approved these changes May 23, 2024

elisno changed the title ~~Correct knn graph to better handle duplicates and numerical issues~~ Improve KNN Graph Construction for Handling Exact Duplicates and Numerical Precision May 24, 2024

elisno merged commit 25b7aab into cleanlab:master May 24, 2024
19 checks passed

elisno mentioned this pull request May 24, 2024

Remove unnecessary warnings during KNN graph correction #1129

Merged
