I have a question about a warning message I get when training a LightGBM model with lgbm.train. The warning is:
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
The reason is that I have a column, specified as categorical, that contains the following integers:
[1015, 1033, 1128, 1398, 1541, 1673, 1677]
In the documentation it says:
"All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero."
My values are not particularly large. The "consider using consecutive integers starting from zero" part seems to be only a suggestion. What happens if the values are not consecutive? How does the sparseness affect the performance of LightGBM? Another categorical column of my dataset has the three values
[1, 3, 4]
and this column does not cause the same warning.
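For reference, the renumbering that the warning suggests can be sketched in a few lines of plain Python. This is only an illustration: `renumber_categories` is a hypothetical helper, not part of LightGBM, and the sample column simply reuses the category IDs listed above.

```python
def renumber_categories(values):
    """Map each distinct category value to a consecutive integer code starting at 0."""
    # Sort the distinct values so the mapping is deterministic across runs.
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    codes = [mapping[value] for value in values]
    return codes, mapping


# Hypothetical column built from the category IDs in the question.
column = [1015, 1033, 1128, 1398, 1541, 1673, 1677, 1033, 1015]
codes, mapping = renumber_categories(column)
print(codes)    # [0, 1, 2, 3, 4, 5, 6, 1, 0]
print(mapping)  # original value -> consecutive code
```

The same `mapping` would have to be stored and applied to any data passed to the model at prediction time, otherwise the codes would no longer refer to the same categories.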
mjalse changed the title from "Warning when using sparse categorical values with LightGBM" to "Warning when using sparse categorical values" on Mar 26, 2024
Optimal Split for Categorical Features:
... LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
So if we use a categorical feature with many different sparse values, a large histogram would be generated and it can be memory consuming.
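The sort-then-scan idea described in the quoted documentation can be illustrated with a simplified sketch. This is not LightGBM's actual implementation: the function name `best_categorical_split` is hypothetical, the gain formula is the plain second-order gain without any regularization or constraints, and the hessians are assumed to be positive.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians):
    """Simplified sketch: sort categories by sum_gradient / sum_hessian, then scan."""
    # Build the per-category "histogram" of accumulated gradients and hessians.
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    for cat, grad, hess in zip(categories, gradients, hessians):
        grad_sum[cat] += grad
        hess_sum[cat] += hess

    # Order the categories by their accumulated ratio.
    ordered = sorted(grad_sum, key=lambda cat: grad_sum[cat] / hess_sum[cat])

    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())
    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0

    # Try every split point along the sorted order (assumed, unregularized gain).
    for i, cat in enumerate(ordered[:-1]):
        left_g += grad_sum[cat]
        left_h += hess_sum[cat]
        right_g = total_g - left_g
        right_h = total_h - left_h
        gain = left_g ** 2 / left_h + right_g ** 2 / right_h - total_g ** 2 / total_h
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: i + 1])
    return best_left, best_gain


# Made-up gradients and hessians for three of the category IDs from the question.
cats = [1015, 1015, 1033, 1033, 1128, 1128]
grads = [-2.0, -2.0, 3.0, 3.0, -1.0, -1.0]
hesss = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
left, gain = best_categorical_split(cats, grads, hesss)
print(left, gain)  # {1015, 1128} go left, gain 27.0
```

The sketch uses a dictionary, so only categories that actually occur take up space. The memory concern in the comment above is about the histogram growing large when a feature has many different sparse values.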