
mlxtend fpgrowth and association rules with the existence of missing values #1004

zazass8 opened this issue Dec 18, 2022 · 4 comments
zazass8 commented Dec 18, 2022

Describe the workflow you want to enable

I am trying to implement the fpgrowth algorithm on a dataset that I am working on. I know for a fact that fpgrowth only accepts input in binary format (e.g., True/False or 1s/0s). My dataset is in binary format, but it has some missing values. I can't really drop the rows with missing values, because that would also drop useful binary data points from the rest of those rows.

Describe your proposed solution

What I thought of doing is to compute the support and confidence metrics based on the frequency of the items that exist in the binary map, while also ignoring the missing values. I believe that is doable, and I found how it's done in this paper (https://weber.itn.liu.se/~aidvi/courses/06/dm/papers/AssociationRules/RAR(98).pdf). The formulas for support and confidence are quite different: mainly, they subtract the count of the missing values from the denominator of the support/confidence. The paper also introduces a new metric known as 'representativity', an additional criterion that helps determine which association rules are the strongest. A sketch of the representativity idea follows below.
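To make the representativity idea concrete, here is a tiny sketch of how I picture computing it per item (toy data of my own; rep(X) = 1 − |Dis(X)| / |B| is my reading of the paper, so please double-check it against the text):

```python
import numpy as np
import pandas as pd

# toy one-hot data: 1 = present, 0 = absent, NaN = missing
df = pd.DataFrame({"bread": [1, 0, np.nan, 1],
                   "milk":  [1, 1, 0, np.nan]})

# representativity per item: fraction of transactions in which the
# item's value is NOT missing, i.e. rep(X) = 1 - |Dis(X)| / |B|
rep = 1 - df.isna().sum() / len(df)
print(rep)  # bread: 0.75, milk: 0.75
```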

Describe alternatives you've considered, if relevant

I tried to make a few edits to the open-source code you have on GitHub, but I got lost while reading it; to be honest, it is quite abstract code, and it uses a lot of recursion as well.

If you could help me make these edits, I would very much appreciate it.

Additional context

zuari1993 (Contributor) commented

Hi,
It would be great if you could explain the exact formula mentioned in the paper and the exact changes that need to be made.
I would be more than happy to take this up and make the changes.

I just don't have the time to read through the paper and identify the needed changes myself. Thanks.

zazass8 (Author) commented Dec 20, 2022

Okay, so the old rule of support was given as the proportion of transactions containing an item out of all transactions. Now it is given as

    support(X) = |B_X| / (|B| - |Dis(X)|)

where |B_X| is the count of transactions containing item X, |B| is the count of transactions in the whole database, and |Dis(X)| is the count of null values in the column of item X.

For confidence, given a rule X -> Y, the old definition was the proportion of the count of transactions containing both X and Y over the count of transactions containing X. The new definition is given as

    confidence(X -> Y) = |B_XY| / (|B_X| - |Dis(Y) ∩ B_X|)

where |B_XY| is the count of transactions containing both X and Y, |B_X| is the count of transactions containing X, and |Dis(Y) ∩ B_X| is the count of transactions where the value of item Y is null AND item X is present.
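To sanity-check the two formulas on a toy example (numbers of my own, not from the paper):

```python
import numpy as np

# toy database: columns X, Y; 1 = present, 0 = absent, NaN = missing
B = np.array([
    [1.0, 1.0],
    [1.0, np.nan],
    [0.0, 1.0],
    [np.nan, 1.0],
    [1.0, 0.0],
])
n = B.shape[0]                                           # |B| = 5

# support(X) = |B_X| / (|B| - |Dis(X)|)
count_X = (B[:, 0] == 1).sum()                           # |B_X| = 3
dis_X = np.isnan(B[:, 0]).sum()                          # |Dis(X)| = 1
support_X = count_X / (n - dis_X)                        # 3 / 4 = 0.75

# confidence(X -> Y) = |B_XY| / (|B_X| - |Dis(Y) ∩ B_X|)
count_XY = ((B[:, 0] == 1) & (B[:, 1] == 1)).sum()       # |B_XY| = 1
dis_Y_in_X = ((B[:, 0] == 1) & np.isnan(B[:, 1])).sum()  # = 1
confidence_XY = count_XY / (count_X - dis_Y_in_X)        # 1 / 2 = 0.5
```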

What I have done so far:

1. fpcommon.py:

In the valid_input_check(df) function, I commented out everything from the line that defines all_bools down to the end of the function. That block is the one that checks for strictly boolean/binary input, so commenting it out lets the API accept DataFrames with NaN values as well.
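Instead of deleting the check outright, a relaxed version could look roughly like this (function and variable names are mine, not mlxtend's exact internals):

```python
import numpy as np

def valid_input_check_relaxed(df):
    # allow 0/1 (or True/False) plus NaN, instead of rejecting NaN outright
    values = df.values.astype(float)
    allowed = np.isin(values, (0, 1)) | np.isnan(values)
    if not allowed.all():
        raise ValueError(
            "The allowed values for a DataFrame are 0, 1, True, False, and NaN."
        )
```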

In the setup_fptree(df, min_support) function, I added code right above the line item_support = item_support.reshape(-1) (code screenshot omitted). It creates a copy of the original array in which the null values are replaced with 1 and everything else with NaN, to make it easier to compute the support for each item. Keep in mind that this is per item, and that I return the disabled array in the last line of this function.
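Roughly, the addition I am describing looks like this (a self-contained sketch of the idea, not the exact diff):

```python
import numpy as np
import pandas as pd

# toy stand-in for the df passed to setup_fptree
df = pd.DataFrame({"bread": [1, 0, np.nan], "milk": [1, 1, 0]})

X = df.values.astype(float)
# "disabled": 1 where the original entry is missing, NaN elsewhere
disabled = np.where(np.isnan(X), 1, np.nan)

# per-item adjusted support: |B_x| / (|B| - |Dis(x)|)
num_rows = X.shape[0]
dis_counts = np.nansum(disabled, axis=0)   # |Dis(x)| per column
item_counts = np.nansum(X, axis=0)         # |B_x| per column, NaNs ignored
item_support = item_counts / (num_rows - dis_counts)
```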

In the generate_itemsets(generator, disabled, num_itemsets, colname_map) function, I take the disabled array from before as an additional argument, and I made some modifications in the for loop as well (code screenshot omitted). This function generates the itemsets that exceed the minimum support threshold AFTER the FP-tree is constructed. My code says that if a set consists of a single item, its support is computed as in the setup_fptree function; otherwise, if the set has more than one item, it keeps a count of the transactions in which at least one of the set's items is null, and deducts that count from the denominator to obtain the final support.
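For the multi-item case, the adjustment I am describing could be factored out like this (same assumptions as the sketch above):

```python
import numpy as np

def itemset_support(X, disabled, item_idx):
    """Adjusted support for the itemset given by column indices item_idx.

    X        : (n, m) array of 0/1 values with NaN for missing entries
    disabled : (n, m) array with 1 where X is missing, NaN elsewhere
    """
    n = X.shape[0]
    # transactions that contain every item in the set
    together = np.all(X[:, item_idx] == 1, axis=1).sum()
    # transactions where at least one item of the set is missing
    any_missing = np.any(disabled[:, item_idx] == 1, axis=1).sum()
    return together / (n - any_missing)
```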

2. fpgrowth.py:

Here I just pass the disabled array as an additional argument to the setup_fptree and generate_itemsets calls (code screenshot omitted).

3. association_rules.py:

In this file I haven't really done much, because I wasn't sure it was correct enough. All I did was edit the definitions of confidence and lift (code screenshot omitted): I included the disabled counts of the consequent C and the antecedent A as arguments and subtract them from the denominators. For lift, as you know, the formula is similar to the one for confidence, except that it is additionally divided by |B_Y|. For the new definition, I thought of correcting that term with |Dis(X) ∩ B_Y|, which is the counterpart of the |Dis(Y) ∩ B_X| term in the confidence definition.
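One possible reading of that in code (my interpretation of the correction terms, not the paper's exact formulation):

```python
def adj_confidence(n_ac, n_a, dis_c_in_a):
    # |B_XY| / (|B_X| - |Dis(Y) ∩ B_X|)
    return n_ac / (n_a - dis_c_in_a)

def adj_lift(n_ac, n_a, n_c, dis_c_in_a, dis_a_in_c):
    # confidence denominator corrected by |Dis(Y) ∩ B_X|, and the extra
    # |B_Y| factor corrected by |Dis(X) ∩ B_Y| (normalization omitted)
    return adj_confidence(n_ac, n_a, dis_c_in_a) / (n_c - dis_a_in_c)
```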

But I wasn't really sure how to do it.

The very last thing I did was to add a dis variable, which I haven't really defined anywhere yet, within the final for loops. It is meant to hold the disabled counts for each case, but I still wasn't sure how to define it. I then append it to the rule_supports lists accordingly (code screenshots omitted) at the very end.

All of the above is code that I haven't really tested yet, which means it may be wrong in some parts. It's genuinely confusing to work through, especially while also accounting for the existence of missing values.

But the task sounds simple: apply fpgrowth while also excluding the count of missing values.

It gets very complicated when itemsets with multiple items are generated. And we have to apply it in all cases, because the algorithm is supposed to filter out itemsets whose support is lower than the given threshold. The same goes when we generate the rules.

Anyway, I believe we can get this through. Let me know if you have any further questions, and I really appreciate your time and effort in helping me with this.

rasbt (Owner) commented Dec 20, 2022

I can see that dealing with itemsets that have missing values may be a common problem, and I am open to modifications. I currently don't have time to look at the paper in detail, but based on your description, it sounds like the general fpgrowth algorithm remains the same, and the change is primarily in how the existing metrics are computed, plus the new "representativity" metric?

Before getting to that, I think a key consideration is also how to represent missing values. Given that the current version works with Bool data types, we probably have to change certain things. I don't think that pandas' regular (NumPy-backed) Bool dtype supports NaN values, so maybe we would have to change to an integer representation, which could have performance implications.

I think the first step here is to think about what the input datatype and interface would look like. Tricky 🤔
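For illustration, two candidate encodings side by side (just a sketch; whether either plays well with the current internals is exactly the open question):

```python
import numpy as np
import pandas as pd

# (a) pandas' nullable "boolean" extension dtype, which can hold pd.NA
df_nullable = pd.DataFrame(
    {"bread": [True, False, pd.NA], "milk": [True, True, False]},
    dtype="boolean",
)

# (b) a numeric encoding where NaN marks missing values; NaN forces the
# columns to float64, which is where the performance implications come in
df_numeric = pd.DataFrame(
    {"bread": [1.0, 0.0, np.nan], "milk": [1.0, 1.0, 0.0]}
)
```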

zazass8 (Author) commented Dec 20, 2022

Yes, exactly: the algorithm remains the same except for the modification of how the metrics are computed. And yes, it includes the representativity rule as well.

As I mentioned above, I commented out some lines in the valid_input_check(df) function, where it checks whether the data in the df are of boolean or binary type. I believe that commenting that out should be fine? But as the author of the code you know this better, and maybe your suggestion is better.
