Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] remove 'categorical_feature' and 'feature_name' parameters in cv() and train() #6435

Open
Tracked by #6439
jameslamb opened this issue Apr 30, 2024 · 6 comments

Comments

@jameslamb
Copy link
Collaborator

jameslamb commented Apr 30, 2024

Proposal

I'm requesting comment on the following proposal:

  • remove keyword argument categorical_feature from cv() and train() in the R and Python packages
  • remove keyword argument feature_name from cv() and train() in the Python package
  • remove keyword argument colnames from cv() and train() in the R package

And doing all of these only after the packages issuing deprecation warnings for 2-3 releases.

Summary

Both the R and Python packages expose functions cv() (for cross-validation) and train() (for regular entire-dataset training). These functions require a LightGBM Dataset object.

The Dataset object holds attributes categorical_features and feature_names, and allows setting those via constructor keyword arguments and set_{attr}() methods.

Despite that, these cv() and train() functions also take categorical_features and feature_names as keyword arguments.

Python cv()

feature_name: _LGBM_FeatureNameConfiguration = "auto",
categorical_feature: _LGBM_CategoricalFeatureConfiguration = "auto",

Python train()

feature_name: _LGBM_FeatureNameConfiguration = "auto",
categorical_feature: _LGBM_CategoricalFeatureConfiguration = "auto",

R-package cv()

, colnames = NULL
, categorical_feature = NULL

R-package train()

colnames = NULL,
categorical_feature = NULL,

These keyword arguments aren't providing any value, in my opinion. Their values are just forwarded along to calls like this:

train_set._update_params(params)._set_predictor(predictor).set_feature_name(feature_name).set_categorical_feature(
categorical_feature
)

Which at best is redundant with the Dataset class, and at worst could lead to runtime exceptions (if the Dataset has already been constructed).

Motivation

Would simplify the library's interface without any loss of functionality.

If this proposal is accepted, the Dataset class would be the only place that this information is provided to train() and cv().

References

Inspired by this post I noticed on Stack Overflow: https://stackoverflow.com/questions/78383840/in-lightgbm-why-do-the-train-and-the-cv-apis-accept-categorical-feature-argument/78405996#78405996

xgboost does not expose such arguments in train() (code link) or cv() (code link).

These arguments have been part of the API since September 2017: ef77806#diff-9bd633ead0bdfe9540c42a618efd9e559cca16c522ad844a09fcf4ffc7d6e84c.

@borchero
Copy link
Collaborator

borchero commented May 1, 2024

I was not aware that there parameters were even available, I'm very much in favor of the removal across language APIs.

@mayer79
Copy link
Contributor

mayer79 commented May 4, 2024

Very good idea. I have in mind that it is also in the fit() method of the Scikit-learn API. Do we need it there because that is the first time LightGBM sees the data?

@jameslamb
Copy link
Collaborator Author

Thanks for pointing that out @mayer79

Yes, the scikit-learn interface needs these inputs because it takes in the raw data (e.g. numpy, scipy, pandas), not a lightgbm.Dataset. And that it by design, to match the types of inputs estimators in the scikit-learn library accept.

@jameslamb
Copy link
Collaborator Author

While working through this tonight to add deprecation warnings, I found a few more things in lgb.cv() in the R package that I think should be deprecated.

  • passing anything other than a Dataset to data

# If 'data' is not an lgb.Dataset, try to construct one using 'label'
if (!.is_Dataset(x = data)) {
if (is.null(label)) {
stop("'label' must be provided for lgb.cv if 'data' is not an 'lgb.Dataset'")
}
data <- lgb.Dataset(data = data, label = label)
}

  • passing label and weight

, label = NULL
, weight = NULL

{xgboost} does not support passing such things in cv() in its R or Python packages.

LightGBM's python package also does not support those things.

I think restricting cv() to only take in a Dataset will help simplify that interface.

cc @mayer79

@mayer79
Copy link
Contributor

mayer79 commented May 6, 2024

I like. Maybe worth noting that the one and only @david-cortes very recently removed non-dataset input in xgb.cv().

@jameslamb
Copy link
Collaborator Author

Oh nice, thanks @mayer79 ! I hadn't seen dmlc/xgboost#10031.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants