FEA Add Information Gain and Information Gain Ratio feature selection functions #28905

Open · wants to merge 51 commits into main

Conversation

@StefanieSenger (Contributor)

Reference Issues/PRs

closes #6534

What does this implement/fix? Explain your changes.

The original 2016 PR intended to add info_gain and info_gain_ratio functions for univariate feature selection. Here, I update it and finish it up. For further information, please refer to the discussion on the old PR.
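
For context, a minimal usage sketch of how the proposed scorers would plug into the existing univariate selection API (the import path and function names are taken from this PR and are not a released interface; the toy data is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply",
    "markets rallied today",
]
y = [0, 0, 1, 1]  # two topical classes

X = CountVectorizer().fit_transform(docs)

# Keep the three features with the highest information gain ...
X_ig = SelectKBest(score_func=info_gain, k=3).fit_transform(X, y)

# ... or score them by information gain ratio instead.
X_igr = SelectKBest(score_func=info_gain_ratio, k=3).fit_transform(X, y)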

StefanieSenger marked this pull request as ready for review on April 29, 2024, 09:23.
@OmarManzoor (Contributor) left a comment:

Thanks for the PR @StefanieSenger. Would it make sense to add a test that compares the transformed X between information gain and information gain ratio, since they should generally be the same?

@StefanieSenger (Contributor, Author):

I have added such a test, @OmarManzoor. Maybe it helps if one day someone works on the if ratio block in _info_gain(), which is in fact the only place where the two functions differ.
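
A rough sketch of what such a comparison test could look like (the function names come from this PR; the use of make_classification and the exact assertion are assumptions for illustration, and the test actually added to the PR may differ):

import numpy as np

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio


def test_info_gain_and_info_gain_ratio_select_same_features():
    # Non-negative, count-like features, as expected by the scorers.
    X, y = make_classification(n_samples=100, n_features=20, random_state=0)
    X = np.abs(X)

    X_ig = SelectKBest(info_gain, k=5).fit_transform(X, y)
    X_igr = SelectKBest(info_gain_ratio, k=5).fit_transform(X, y)

    # For this data both criteria should keep the same columns.
    np.testing.assert_allclose(X_ig, X_igr)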

@OmarManzoor (Contributor) left a comment:

A few minor suggestions, otherwise this looks good. Thanks @StefanieSenger.

sklearn/feature_selection/_univariate_selection.py (three resolved suggestions, outdated)
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@StefanieSenger (Contributor, Author):

Nice, thank you @OmarManzoor

glemaitre self-requested a review on May 21, 2024, 15:28.
@glemaitre (Member) left a comment:

Just a couple of first comments about using scipy instead of our own implementation of the entropy or the KL divergence.

doc/modules/feature_selection.rst (resolved suggestion, outdated)

- |Feature| :func:`~feature_selection.info_gain` and
:func:`~feature_selection.info_gain_ratio` can now be used for
univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Member:

Suggested change:
-  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+  univariate feature selection.
+  :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+  :user:`Stefanie Senger <StefanieSenger>`.

Comment on lines 293 to 296
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Member:

Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html

The base here is set to 2 (I have to check if it makes sense or not).

@StefanieSenger (Contributor, Author):

I have substituted this function with one of the scipy entropy functions (scipy.special.entr()), though I need to admit I don't fully understand it and have just chosen the one that would not raise when running the tests.
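
For anyone verifying the substitution: scipy.special.entr(p) computes -p * log(p) elementwise with the natural logarithm, so dividing by np.log(2) recovers the base-2 helper quoted above. A quick check (a sketch, not code from the PR):

import numpy as np
from scipy.special import entr


def _get_entropy(prob):
    # Helper quoted above: elementwise -p * log2(p), with 0 * log(0) treated as 0.
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)


p = np.array([0.0, 0.1, 0.25, 0.5, 1.0])
with np.errstate(divide="ignore"):
    ours = _get_entropy(p)
np.testing.assert_allclose(ours, entr(p) / np.log(2))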

Comment on lines 301 to 305
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)
Member:

Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr

The difference is that we use log2 instead of the natural log used in the scipy definition. I have to check.

Member:

So I assume that we could use the natural logarithm everywhere, because the result would only differ by a constant multiplier, and since we are only comparing the information scores it should not matter.

@StefanieSenger (Contributor, Author):

Okay, we had talked about this. This time I found that scipy.special.rel_entr() was the one that gave the same results as before.

c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
Member:

To give an example regarding the base, here it would be equivalent to:

c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)

@StefanieSenger (Contributor, Author):

Yes, that worked.
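
For completeness, a quick numerical check of the equivalence discussed above (a sketch, not code from the PR; _a_log_a_div_b is the helper quoted earlier):

import numpy as np
from scipy.special import rel_entr


def _a_log_a_div_b(a, b):
    # Helper quoted above: elementwise a * log2(a / b), with non-finite terms zeroed.
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)


a = np.array([0.0, 0.1, 0.2, 0.4])
b = np.array([0.3, 0.2, 0.2, 0.3])

# rel_entr uses the natural log, so dividing by log(2) gives the base-2 result.
np.testing.assert_allclose(_a_log_a_div_b(a, b), rel_entr(a, b) / np.log(2))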

@@ -0,0 +1,115 @@
"""
Member:

We will probably avoid adding a new example and instead edit an existing one.

Comment on lines 337 to 345
"""Count feature, class, joint and total frequencies

Returns
-------
f_count : array, shape = (n_features,)
c_count : array, shape = (n_classes,)
fc_count : array, shape = (n_features, n_classes)
total: int
"""
Member:

We will need a proper docstring following our new standards.

@StefanieSenger (Contributor, Author):

Even for private functions? I wonder, because many other private functions don't have anything resembling the numpy docstring style, which I think is what you are referring to(?).

I tried to make some improvements. Is there some test I can run to find out whether it is enough? The CI didn't fail because of it.

return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Member:

Since this is called a single time, we should not need a separate function.

@StefanieSenger (Contributor, Author):

I actually like it, because the function name gives meaning to this part of the code and structures _info_gain(). Maybe we should even rename it to avoid the unclear fc in its name. I will make a suggestion with my push.
Could you also imagine keeping this function?
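
For readers following along, a rough sketch of what such a counting helper could compute, based only on the shapes documented in the docstring quoted above (the actual implementation in the PR may differ, and the LabelBinarizer-based construction here is an assumption for illustration):

import numpy as np
from sklearn.preprocessing import LabelBinarizer


def _get_fc_counts(X, y):
    """Count feature, class, joint and total frequencies (illustrative sketch)."""
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        # Binary targets come back as a single column; expand to two classes.
        Y = np.hstack([1 - Y, Y])
    fc_count = np.asarray(X.T @ Y)   # (n_features, n_classes) joint counts
    f_count = fc_count.sum(axis=1)   # (n_features,) per-feature counts
    c_count = fc_count.sum(axis=0)   # (n_classes,) per-class counts
    total = fc_count.sum()           # grand total
    return f_count, c_count, fc_count, total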

with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
Member:

I think the comment only applies to the first case.

@StefanieSenger (Contributor, Author):

True, I will delete it entirely. I think it's not really necessary to have it at all.

c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
Member:

I think I would prefer _info_gain to return this score, have the ratio computation below done in info_gain_ratio, and finally have a function that could be called twice to just perform the reduction.

def _info_gain(X, y):
    # probably the name of the function should be better.
    ...
    return scores, c_prob

def info_gain(X, y, aggregate=np.max):
    return aggregate.reduce(_info_gain(X, y)[0], axis=0)

def info_gain_ratio(X, y, aggregate=np.max):
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= (entropy(c_prob) + entropy(1 - c_prob))
    return aggregate.reduce(scores, axis=0)
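
One detail worth noting about the sketch above: .reduce exists only on NumPy ufuncs, so np.maximum (rather than np.max) would be the natural default for taking the per-feature maximum over classes. A tiny illustration of that reduction step, assuming scores has shape (n_classes, n_features):

import numpy as np

# Hypothetical per-class, per-feature scores: shape (n_classes, n_features).
scores = np.array([[0.2, 0.7, 0.1],
                   [0.5, 0.3, 0.4]])

# np.maximum is a ufunc, so .reduce collapses the class axis.
per_feature = np.maximum.reduce(scores, axis=0)
print(per_feature)  # [0.5 0.7 0.4]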

glemaitre self-requested a review on June 3, 2024, 20:59.
@StefanieSenger (Contributor, Author) left a comment:

Thank you, @glemaitre, for your review and your explanations in the call. I have tried to address what we talked about. I will push the recent changes and try to continue understanding the rest.

StefanieSenger and others added 2 commits June 3, 2024 23:44
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>