Improve KNN Graph Construction for Handling Exact Duplicates and Numerical Precision #1119
…earch object from an array of numerical features
…g the NearestNeighbors object
…ructing the NearestNeighbors object
…ghbors, add comments Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Add docstring as well
The aim of the function is to eliminate unnecessary memory allocations and reduce runtime by using more efficient numpy operations and manipulating duplicate indices as little as possible. Currently, the correction ends in one of two situations:
1. The number of duplicates equals or exceeds the number of neighbors:
- A single circulant matrix is enough to make sure that each row gets only other duplicates as neighbors. Any other duplicate points can just point to the first k duplicate points as their neighbors. This will basically ALWAYS happen when the dataset consists of all-identical examples that exceed the number of neighbors. (So a circulant matrix should take O(k^2) space, and the first point takes O(k) space searching through O(N) duplicate points for an all-identical dataset.)
2. The number of duplicates is smaller than the number of neighbors:
- The same circulant matrix can be used to fill out the first few columns of the indices matrix. But before that, we must move all non-duplicate points to the far right. In practice, EVERY POINT has enough non-duplicate points as neighbors to make this work; it's just about ensuring that they are put on the far-right side of the array.
also add docstring for the helper function generating the circulant matrix representing the neighbors of the first k+1 duplicates.
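The two situations described in the commit message above can be sketched in plain Python. This is an illustrative sketch only, not the code from this PR: the names circulant_neighbors, neighbors_for_large_set and correct_row are hypothetical, only the indices are handled (distances are ignored), and the first case is assumed to need at least k + 1 duplicates so that every point has k other duplicates to point to.

```python
def circulant_neighbors(k):
    """Row i lists the k other members of a group of k + 1 duplicates."""
    size = k + 1
    return [[(i + shift) % size for shift in range(1, size)] for i in range(size)]


def neighbors_for_large_set(duplicate_ids, k):
    """Situation 1 (assumed here to mean more than k duplicates in the set).

    Maps each duplicate's dataset index to k neighbors that are all
    other members of the same duplicate set.
    """
    if len(duplicate_ids) <= k:
        raise ValueError("this sketch needs at least k + 1 duplicates")
    first = duplicate_ids[: k + 1]
    neighbors = {}
    # The first k + 1 duplicates follow the circulant pattern,
    # so no point ever lists itself as a neighbor.
    for point, row in zip(first, circulant_neighbors(k)):
        neighbors[point] = [first[j] for j in row]
    # Every remaining duplicate simply points at the first k duplicates.
    for point in duplicate_ids[k + 1:]:
        neighbors[point] = list(first[:k])
    return neighbors


def correct_row(point, row, duplicate_ids):
    """Situation 2: fewer duplicates than neighbors.

    Other duplicates fill the leftmost columns; the non-duplicate
    neighbors already found keep their order and move to the right.
    Assumes `row` holds enough non-duplicate neighbors to fill the rest.
    """
    duplicates = set(duplicate_ids)
    others = [j for j in duplicate_ids if j != point]
    non_duplicates = [j for j in row if j not in duplicates]
    return (others + non_duplicates)[: len(row)]


print(circulant_neighbors(2))
print(neighbors_for_large_set([4, 7, 9, 11], k=2))
print(correct_row(3, [8, 5, 1, 2], duplicate_ids=[3, 5]))
```

The circulant pattern needs only a (k + 1) by k table regardless of how many duplicates the set contains, which matches the O(k^2) space estimate given above.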
@huiwengoh I've addressed all of @jwmueller's comments. Can you give a review and merge this?
I've added some graphs that show the runtime of the knn graph construction in the library. The main conclusion is that:
…ct duplicate correction
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
CI failure is unrelated to this PR. Scikit-learn 1.5.0 was released just a few hours ago and it only affects one test case in cleanlab/tests/test_classification.py (line 785 in ca38929).
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
…ng classes (cleanlab#1073) Co-authored-by: Elías Snorrason <eliassno@gmail.com>
This reverts commit 20d9687.
lgtm! thanks for detailed docstrings and clarifications in the comments above :)
Summary
This PR introduces a new argument, correct_exact_duplicates: bool = True, to the function construct_knn_graph_from_features and others. This update ensures that exact duplicates in the feature array are handled correctly during the k-nearest neighbors (KNN) graph construction.
New Functions
- correct_knn_graph: Wrapper around correct_knn_distances_and_indices to correct KNN graph based on the feature array.
- correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace: Main logic for in-place correction of distances and indices arrays.
- correct_knn_distances_and_indices: Corrects KNN distances and indices with optional exact duplicate sets and warning.
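The "exact duplicate sets" mentioned above are groups of row indices whose feature vectors are identical. A minimal sketch of how such sets could be collected is shown here, assuming features are given as a sequence of numeric rows. The helper name find_exact_duplicate_sets is hypothetical and is not claimed to match how the PR computes these sets.

```python
from collections import defaultdict


def find_exact_duplicate_sets(features):
    """Group row indices whose feature vectors are exactly equal.

    Returns one list of indices per set with two or more members,
    in order of first appearance. Rows without a duplicate are omitted.
    """
    groups = defaultdict(list)
    for index, row in enumerate(features):
        # Tuples are hashable, so identical rows land in the same bucket.
        groups[tuple(row)].append(index)
    return [ids for ids in groups.values() if len(ids) > 1]


features = [
    [0.0, 1.0],
    [2.0, 3.0],
    [0.0, 1.0],
    [2.0, 3.0],
    [5.0, 5.0],
]
print(find_exact_duplicate_sets(features))
```

Grouping relies on exact equality of the values, so rows that differ only by floating-point rounding would not be placed in the same set.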
Impact
- …knn: NearestNeighbors object is passed explicitly.
- …cleanlab/outliers.py, where correction is applied manually if no knn object is provided.
- The noniid check in Datalab is updated to construct a KNN graph without relying on NearestNeighbors from sklearn.
What this PR does not address:
Benchmark Results
Two benchmark scenarios were tested:
All-Identical Dataset:
Copied Dataset:
The graphs below compare runtime and memory usage for different functions:
For the All-Identical Dataset, the correction function spends its time constructing a small circulant matrix to find the nearest neighbors of the first k+1 elements. All other points just refer to the first k points. The purple line shows how the correction algorithm performs if all the duplicate information is pre-computed and the output can be modified in-place. No knn-graph construction occurs in that function.
For the Copied Dataset, there are far more sets to iterate over, which impacts the performance of the correction function.
Even so, the correction's runtime should not exceed that of the exhaustive search algorithm by much.
The benchmark code and additional results are provided in the expandable sections.
Benchmark Code for All-Identical Dataset
Code for All-Identical Dataset
Benchmark Code for Copied Dataset
Code for Copied Dataset