FEA Add Information Gain and Information Gain Ratio feature selection functions #28905

Open · wants to merge 51 commits into main

Conversation

@StefanieSenger (Contributor)

Reference Issues/PRs

closes #6534

What does this implement/fix? Explain your changes.

The original 2016 PR intended to add info_gain and info_gain_ratio functions for univariate feature selection. Here, I update it and finish it up. For further information, please refer to the discussion on the old PR.
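
For context, a minimal usage sketch of how the proposed scorers would plug into the existing univariate selection API (the import path and function names are taken from this PR and are not a released interface; the toy data is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply",
    "markets rallied today",
]
y = [0, 0, 1, 1]  # two topical classes

X = CountVectorizer().fit_transform(docs)

# Keep the three features with the highest information gain ...
X_ig = SelectKBest(score_func=info_gain, k=3).fit_transform(X, y)

# ... or score them by information gain ratio instead.
X_igr = SelectKBest(score_func=info_gain_ratio, k=3).fit_transform(X, y)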

StefanieSenger marked this pull request as ready for review on April 29, 2024, 09:23.
@OmarManzoor (Contributor) left a comment:

Thanks for the PR @StefanieSenger. Would it make sense to add a test that compares the transformed X between information gain and information gain ratio, since they should generally be the same?

@StefanieSenger (Contributor, Author):

I have added such a test, @OmarManzoor. Maybe it helps if one day someone works on the if ratio block in _info_gain(), which is in fact the only place where the two functions differ.
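
A rough sketch of what such a comparison test could look like (the function names come from this PR; the use of make_classification and the exact assertion are assumptions for illustration, and the test actually added to the PR may differ):

import numpy as np

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio


def test_info_gain_and_info_gain_ratio_select_same_features():
    # Non-negative, count-like features, as expected by the scorers.
    X, y = make_classification(n_samples=100, n_features=20, random_state=0)
    X = np.abs(X)

    X_ig = SelectKBest(info_gain, k=5).fit_transform(X, y)
    X_igr = SelectKBest(info_gain_ratio, k=5).fit_transform(X, y)

    # For this data both criteria should keep the same columns.
    np.testing.assert_allclose(X_ig, X_igr)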

@OmarManzoor (Contributor) left a comment:

A few minor suggestions, otherwise this looks good. Thanks @StefanieSenger.

sklearn/feature_selection/_univariate_selection.py (three resolved suggestions, outdated)
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@StefanieSenger (Contributor, Author):

Nice, thank you @OmarManzoor

glemaitre self-requested a review on May 21, 2024, 15:28.
@glemaitre (Member) left a comment:

Just a couple of first comments about using scipy instead of our own implementation of the entropy or the KL divergence.

doc/modules/feature_selection.rst (resolved suggestion, outdated)

- |Feature| :func:`~feature_selection.info_gain` and
:func:`~feature_selection.info_gain_ratio` can now be used for
univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Member:

Suggested change:
-  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+  univariate feature selection.
+  :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+  :user:`Stefanie Senger <StefanieSenger>`.

Comment on lines 293 to 296
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Member:

Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html

The base here is set to 2 (I have to check if it makes sense or not).

@StefanieSenger (Contributor, Author):

I have substituted this function with one of the scipy entropy functions (scipy.special.entr()), though I need to admit I don't fully understand it and have just chosen the one that would not raise when running the tests.
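
For anyone verifying the substitution: scipy.special.entr(p) computes -p * log(p) elementwise with the natural logarithm, so dividing by np.log(2) recovers the base-2 helper quoted above. A quick check (a sketch, not code from the PR):

import numpy as np
from scipy.special import entr


def _get_entropy(prob):
    # Helper quoted above: elementwise -p * log2(p), with 0 * log(0) treated as 0.
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)


p = np.array([0.0, 0.1, 0.25, 0.5, 1.0])
with np.errstate(divide="ignore"):
    ours = _get_entropy(p)
np.testing.assert_allclose(ours, entr(p) / np.log(2))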

Comment on lines 301 to 305
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)
Member:

Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr

The difference is that we use log2 instead of the natural log used in the scipy definition. I have to check.

Member:

So I assume that we could use the natural logarithm everywhere, because the result would only differ by a constant multiplier, and since we are only comparing the information scores it should not matter.

@StefanieSenger (Contributor, Author):

Okay, we had talked about this. This time I found that scipy.special.rel_entr() was the one that gave the same results as before.

c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
Member:

To give an example regarding the base, here it would be equivalent to:

c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)

@StefanieSenger (Contributor, Author):

Yes, that worked.
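
For completeness, a quick numerical check of the equivalence discussed above (a sketch, not code from the PR; _a_log_a_div_b is the helper quoted earlier):

import numpy as np
from scipy.special import rel_entr


def _a_log_a_div_b(a, b):
    # Helper quoted above: elementwise a * log2(a / b), with non-finite terms zeroed.
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)


a = np.array([0.0, 0.1, 0.2, 0.4])
b = np.array([0.3, 0.2, 0.2, 0.3])

# rel_entr uses the natural log, so dividing by log(2) gives the base-2 result.
np.testing.assert_allclose(_a_log_a_div_b(a, b), rel_entr(a, b) / np.log(2))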

@@ -0,0 +1,115 @@
"""
Member:

We will probably avoid adding a new example and instead edit an existing one.

Comment on lines 337 to 345
"""Count feature, class, joint and total frequencies

Returns
-------
f_count : array, shape = (n_features,)
c_count : array, shape = (n_classes,)
fc_count : array, shape = (n_features, n_classes)
total: int
"""
Member:

We will need a proper docstring following our new standards.

@StefanieSenger (Contributor, Author):

Even for private functions? I wonder, because many other private functions don't have anything resembling the numpy docstring style, which I think is what you are referring to(?).

I tried to make some improvements. Is there some test I can run to find out whether it is enough? The CI didn't fail because of it.

return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Member:

Since this is called a single time, we should not need a separate function.

@StefanieSenger (Contributor, Author):

I actually like it, because the function name gives meaning to this part of the code and structures _info_gain(). Maybe we should even rename it to avoid the unclear fc in its name. I will make a suggestion with my push.
Could you also imagine keeping this function?
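
For readers following along, a rough sketch of what such a counting helper could compute, based only on the shapes documented in the docstring quoted above (the actual implementation in the PR may differ, and the LabelBinarizer-based construction here is an assumption for illustration):

import numpy as np
from sklearn.preprocessing import LabelBinarizer


def _get_fc_counts(X, y):
    """Count feature, class, joint and total frequencies (illustrative sketch)."""
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        # Binary targets come back as a single column; expand to two classes.
        Y = np.hstack([1 - Y, Y])
    fc_count = np.asarray(X.T @ Y)   # (n_features, n_classes) joint counts
    f_count = fc_count.sum(axis=1)   # (n_features,) per-feature counts
    c_count = fc_count.sum(axis=0)   # (n_classes,) per-class counts
    total = fc_count.sum()           # grand total
    return f_count, c_count, fc_count, total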

with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
Member:

I think the comment only applies to the first case.

@StefanieSenger (Contributor, Author):

True, I will delete it entirely. I think it's not really necessary to have it at all.

c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
Member:

I think I would prefer _info_gain to return this score, have the ratio computation below done in info_gain_ratio, and finally have a function that could be called twice to just perform the reduction.

def _info_gain(X, y):
    # probably the name of the function should be better.
    ...
    return scores, c_prob

def info_gain(X, y, aggregate=np.max):
    return aggregate.reduce(_info_gain(X, y)[0], axis=0)

def info_gain_ratio(X, y, aggregate=np.max):
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= (entropy(c_prob) + entropy(1 - c_prob))
    return aggregate.reduce(scores, axis=0)
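
One detail worth noting about the sketch above: .reduce exists only on NumPy ufuncs, so np.maximum (rather than np.max) would be the natural default for taking the per-feature maximum over classes. A tiny illustration of that reduction step, assuming scores has shape (n_classes, n_features):

import numpy as np

# Hypothetical per-class, per-feature scores: shape (n_classes, n_features).
scores = np.array([[0.2, 0.7, 0.1],
                   [0.5, 0.3, 0.4]])

# np.maximum is a ufunc, so .reduce collapses the class axis.
per_feature = np.maximum.reduce(scores, axis=0)
print(per_feature)  # [0.5 0.7 0.4]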

glemaitre self-requested a review on June 3, 2024, 20:59.
@StefanieSenger (Contributor, Author) left a comment:

Thank you, @glemaitre, for your review and your explanations in the call. I have tried to address what we talked about. I will push the recent changes and try to continue understanding the rest.

StefanieSenger and others added 2 commits June 3, 2024 23:44
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>