ENH avoid checking columns where training data is all nan in KNNImputer #29060

Open
wants to merge 2 commits into base: main

xuefeng-xu (Contributor) commented May 21, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In KNNImputer, columns whose training data is all NaN are either dropped or imputed with 0 (depending on keep_empty_features).
Therefore, the check for missing values only needs to cover the valid columns, selected with valid_mask.
This avoids computing pairwise distances when the valid columns of the data contain no missing values.
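
To illustrate the idea described above, here is a minimal NumPy sketch. It is not the actual scikit-learn implementation: only the name valid_mask comes from this PR, and its definition here (columns with at least one observed training value) as well as the other variable names are assumptions made for the example.

```python
import numpy as np

# Toy training data: column 0 is entirely NaN, so it is not a "valid" column.
X_fit = np.array([
    [np.nan, 1.0, 2.0],
    [np.nan, 3.0, 4.0],
    [np.nan, 5.0, 6.0],
])

# Assumption: valid_mask flags columns with at least one observed training value.
valid_mask = ~np.isnan(X_fit).all(axis=0)

# New data whose only missing entry sits in the all-NaN column.
X_new = np.array([[np.nan, 7.0, 8.0]])

# Check over every column: reports a missing value, so distances would be computed.
missing_anywhere = bool(np.isnan(X_new).any())

# Check restricted to valid columns: nothing to impute, so the search can be skipped.
missing_in_valid = bool(np.isnan(X_new[:, valid_mask]).any())

print(valid_mask.tolist(), missing_anywhere, missing_in_valid)
```

In this toy case the unrestricted check would trigger a neighbor search even though no valid column needs imputation, which is the situation the PR targets.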

Any other comments?

This could also save some memory.

github-actions bot commented May 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4f6bed6. Link to the linter CI: here

xuefeng-xu (Contributor, Author) commented

The following script shows the efficiency improvement when one column of the data is all NaN.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
import pandas as pd

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 10
mask = rng.randint(density, size=X.shape) == 0

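# DataFrame.mask returns a new frame with NaN where mask is True; this avoids
# writing through .values, which may be a copy or read-only in newer pandas.
X_na = X.mask(mask)

X_na.iloc[:, 0] = np.nan  # one column is all nan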

This PR

%timeit KNNImputer().fit_transform(X_na)
6.72 s ± 59.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Main

%timeit KNNImputer().fit_transform(X_na)
9.24 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
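
For reference, the two mean timings above can be turned into a speedup factor and a relative runtime reduction. This is plain arithmetic on the reported means, not a new measurement, and the variable names are only illustrative.

```python
main_s = 9.24  # mean seconds per loop on main, as reported above
pr_s = 6.72    # mean seconds per loop with this PR, as reported above

speedup = main_s / pr_s
reduction_pct = (main_s - pr_s) / main_s * 100

print(f"speedup: {speedup:.3f}x, runtime reduction: {reduction_pct:.1f}%")
```

On this benchmark that is a speedup of about 1.4x, or roughly 27% less runtime.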
