-
-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Add writeable parameter to check_array #29018
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.
sklearn/tests/test_common.py
Outdated
_set_checking_parameters(estimator) | ||
|
||
# The following estimators can work inplace only with certain settings | ||
if estimator.__class__.__name__ == "HDBSCAN": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are in our testing code sklearn/tests/test_common.py
can we use isinstance
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the class name because it's what we use in _set_checking_parameters(estimator)
. Otherwise we have to import all all estimators for which we want to set a specific param which feels a bit cumbersome.
I modified the behavior a bit. Previously I tried to change the writeability without copy and only copy if it failed. I felt that it was not a healthy behavior for the user. If a user calls Arguably, we could raise an error in the ambiguous |
I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR: import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False) # dataframe backed by a read-only array
LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation |
When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.
Alright, I'm going to add the new kwarg in estimators. In the mean time, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and update the PR description with corresponding links.
I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6. |
closes #28824
closes #14481
Fixes #28899
This PR proposes to add a
writeable
parameter tocheck_array
. It acts as a toggle: it can be True meaning it's set or None (unset). If True,check_array
will make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.It has the same status as the
dtype
ororder
parameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.Writeable arrays are required in estimators that can perform inplace operations. These estimators have a
copy
orcopy_X
parameter and currently they raise an error ifcopy=False
and X is read-only. This behavior seems expected of rthe basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, in can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step which triggers an (unexpected to me) error, the last one being #28781.This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. And it happens even if the array is aleardy writeable, just trying to set the flag is not allowed. That's what happens in #28899.
I added a common test, which for now fails as expected for most estimators because I haven't applied the
writeable
param to all inplace estimators, so it shows the current behavior. A few of them still pass:writeable
in this PRwriteable
in this PRwriteable
in this PRcopy=False
so needs investigation (Fix Make centering inplace in KernelPCA #29100)copy=False
so needs investigation(I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices which goes against our splitter doc, see doc, so we could deprecate the copy param or leave it as is because we don't want to put effort on this estimator and don't want to break user code)
cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)