Deal with class imbalance in blocked cross-validation #262

Open
AndrewAnnex opened this issue May 25, 2020 · 5 comments
Labels
enhancement (Idea or request for a new feature), question (Further information is requested)

Comments

@AndrewAnnex

Description of the desired feature

I am using GemPy to build geologic models of multiple geologic layers simultaneously. In Verde, it seems that points are only ever treated as belonging to one surface and one class to predict, but in GemPy I have multiple layers. Additionally, there needs to be a way to ensure that every class is present in the training dataset; otherwise the model will not be able to predict that class. That functionality is already present in sklearn's StratifiedKFold, but it has no spatial blocking.

example image of issue:
image

In the map view on the left, the red dots are the test data and the blue dots are the training data. In the histogram on the right, the test data is orange. There are 22 classes, and around classes 14/15 it is clear that every sample of that class falls in the test dataset only.
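The symptom described above can be checked programmatically. The following is a minimal sketch, not part of Verde or GemPy: the function name `classes_missing_from_train` and the toy labels are hypothetical, and a split is assumed to be given as a list of training indices.

```python
def classes_missing_from_train(labels, train_idx):
    """Return the sorted classes that never appear among the training samples.

    Hypothetical helper for illustration only; `labels` holds one class per
    point and `train_idx` holds the positions used for training.
    """
    all_classes = set(labels)
    train_classes = {labels[i] for i in train_idx}
    return sorted(all_classes - train_classes)


# Toy data (made up): class 2 ends up entirely in the test set.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
train_idx = [0, 1, 2, 3, 6, 7]
missing = classes_missing_from_train(labels, train_idx)
print("classes absent from training:", missing)
```

Running a check like this on every fold makes it easy to flag splits in which a model would never see one of the classes.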

Are you willing to help implement and maintain this feature? Yes/No

Yes and no. I can dig into the code to see how difficult this is, but I think I would need a deep understanding of the paper referenced, and the changes needed to make this happen would diverge from that implementation enough to require an entirely new function. I also have my own ideas for how to make this work that I could try out and contribute back, but they won't be peer reviewed.

@welcome

welcome bot commented May 25, 2020

👋 Thanks for opening your first issue here! Please make sure you filled out the template with as much detail as possible.

You might also want to take a look at our Contributing Guide and Code of Conduct.

@AndrewAnnex
Author

Note: as a double check, I made the orange bars in the histogram transparent and fixed the bins to be consistent, and it is indeed still an issue.

@leouieda leouieda added the enhancement Idea or request for a new feature label Jun 4, 2020
@leouieda
Member

leouieda commented Jun 4, 2020

@AndrewAnnex thanks for posting this 👍 Let me see if I understand your problem. What are you trying to predict exactly? Is it the spatial distribution of 2 or more classes/categories? If so, then #261 and #268 will be of interest to you. We can't do it properly just yet since Verde only does regression type models. But #268 would solve this.

The fold imbalance part is another issue. I'm not entirely sure how StratifiedKFold works. It might be a bit tricky to make it work in a blocked version but if you have any ideas I'm more than happy to take a look. It's fine if it's not published. If you come up with something interesting and want to publish it you would have the bonus of already having peer reviewed code to go with it 🙂

@jessepisel this seems like it's something you might be interested in (or know how to proceed).

@leouieda
Member

leouieda commented Jun 4, 2020

Also, check out #254, which adds the BlockKFold class. I imagine a BlockStratifiedKFold would look similar in many respects.

@leouieda leouieda changed the title from "add group/class balencing to block shuffle split, dealing with class imbalance" to "Deal with class imbalance in blocked cross-validation" Jun 4, 2020
@AndrewAnnex
Author

AndrewAnnex commented Jun 4, 2020

@leouieda I am trying to predict the elevation of a given surface layer. Essentially: given an x, y, z position, which stratigraphic surface is present at that position? This is broadly similar to producing a 3D spline interpolation of a surface for a single layer. I have multiple layers, so I currently use the GemPy project because it is one of the few open-source geomodeling packages available. Looking at #268, it is essentially a model XYZ -> C, where C is the target or prediction to be made, and there are N possible categories in C. Otherwise, it seems that the first case in #261 is basically what I am doing.

As I understand how BlockKFold works, you define spatially disjoint boxes so that, when the data for a fold is split into test and train sets, no mixing is guaranteed, along with some criterion that keeps the test blocks spatially distributed so they are not all in one corner or another. For a stratified block fold, I would imagine that the blocks would need to be balanced so that there is an equal proportion of each class either in the test/train set as a whole or in each spatial block (that seems harder).
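To make the "balanced as a whole" variant concrete, here is a minimal sketch under stated assumptions. It is not Verde's BlockKFold and not a published method: `block_ids` and `stratified_block_folds` are hypothetical names, blocks are plain square grid cells, and whole blocks are assigned to folds greedily so that each class is spread across folds.

```python
from collections import Counter, defaultdict


def block_ids(xs, ys, spacing):
    """Label each point with the (column, row) of the square block it falls in."""
    x0, y0 = min(xs), min(ys)
    return [(int((x - x0) // spacing), int((y - y0) // spacing))
            for x, y in zip(xs, ys)]


def stratified_block_folds(blocks, labels, n_folds):
    """Assign whole blocks to folds, greedily spreading each class across folds.

    Returns one fold number per point. Points that share a block always share
    a fold, so a block is never split between train and test.
    """
    per_block = defaultdict(Counter)
    for block, label in zip(blocks, labels):
        per_block[block][label] += 1

    fold_counts = [Counter() for _ in range(n_folds)]
    fold_of_block = {}
    # Place the most populated blocks first; ties are broken by block id.
    order = sorted(per_block, key=lambda b: (-sum(per_block[b].values()), b))
    for block in order:
        counts = per_block[block]

        def overlap(fold):
            # Large when the fold already holds many points of this block's classes.
            return sum(fold_counts[fold][c] * n for c, n in counts.items())

        # Prefer the fold with least class overlap, then the smallest fold.
        best = min(range(n_folds),
                   key=lambda f: (overlap(f), sum(fold_counts[f].values()), f))
        fold_counts[best].update(counts)
        fold_of_block[block] = best
    return [fold_of_block[b] for b in blocks]


# Toy data (made up): three classes scattered over a 4 x 2 grid of blocks.
xs = [0.0, 0.4, 1.2, 1.6, 2.3, 2.7, 3.1, 3.8, 0.2, 1.5, 2.5, 3.5]
ys = [0.0, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 0.4, 1.2, 1.3, 1.1, 1.4]
labels = ["a", "a", "b", "b", "a", "a", "b", "b", "c", "c", "c", "c"]

blocks = block_ids(xs, ys, spacing=1.0)
folds = stratified_block_folds(blocks, labels, n_folds=2)
for k in range(2):
    print(k, Counter(l for l, f in zip(labels, folds) if f == k))
```

The greedy rule only balances classes across folds as a whole. Balancing inside every individual block, the harder case mentioned above, is not attempted here.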

My idea, although it is just a hunch at the moment, would be to use space-filling curves (like a Hilbert curve) to provide a one-dimensional index. That index could be used to produce another categorical or ordinal column through which the data could be spatially stratified; a conventional multi-label stratification could then be performed using built-in methods in sklearn. Space-filling curves can be tuned to create a desired number of uniform "blocks" (to n it is a quad tree like structure...), and there are a few to choose from that have different properties.
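A rough sketch of the space-filling-curve idea, under stated assumptions: `hilbert_index` is the standard iterative grid-to-curve conversion for a Hilbert curve on an n x n grid (n a power of two), while `hilbert_blocks`, its parameters, and the toy coordinates are hypothetical choices made for illustration only.

```python
def hilbert_index(n, x, y):
    """Distance along a Hilbert curve of cell (x, y) on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_blocks(xs, ys, order, n_blocks):
    """Give each point a block label from its position along the curve.

    Hypothetical helper: coordinates are rescaled onto a 2**order grid and the
    curve is cut into n_blocks equal-length segments.
    """
    n = 2 ** order
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid dividing by zero
    y_span = (max(ys) - y_min) or 1.0
    labels = []
    for x, y in zip(xs, ys):
        col = min(int((x - x_min) / x_span * n), n - 1)
        row = min(int((y - y_min) / y_span * n), n - 1)
        labels.append(hilbert_index(n, col, row) * n_blocks // (n * n))
    return labels


# Toy coordinates (made up): the four corners of a square.
xs = [0.0, 3.0, 0.0, 3.0]
ys = [0.0, 3.0, 3.0, 0.0]
corner_blocks = hilbert_blocks(xs, ys, order=2, n_blocks=4)
print(corner_blocks)
```

Because consecutive curve positions are always neighbouring cells, each block label covers a spatially contiguous patch. One possible next step, assuming a recent scikit-learn, would be to pass these labels as `groups` to `StratifiedGroupKFold`, though whether that matches the stratification intended here is untested.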

I think it could potentially work, given enough data points, to first perform the block K-fold and then, for each fold, subsample the test/train data to equalize the counts of each class. But it depends on what block K-fold is really doing, as it sounds like it tries to equalize the counts of data for each block?
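The subsampling step could look something like the following minimal sketch. It assumes the fold indices have already been produced by some blocked split; `equalize_class_counts` is a hypothetical helper, and it samples without replacement, so no points are duplicated or interpolated.

```python
import random
from collections import Counter, defaultdict


def equalize_class_counts(indices, labels, seed=0):
    """Randomly drop points so every class keeps as many as the rarest class.

    Hypothetical helper: `indices` are the positions in one side of a split
    (train or test) and `labels` holds one class per point in the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    n_keep = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_keep))  # sampling without replacement
    return sorted(kept)


# Toy data (made up): the training side has 5 "a", 3 "b" and 2 "c" points.
labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
train_idx = list(range(len(labels)))
balanced = equalize_class_counts(train_idx, labels)
print(Counter(labels[i] for i in balanced))
```

The cost of this approach is that data is discarded down to the size of the rarest class, so it only makes sense when there are enough points per class to begin with.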

There is also the imblearn package, which implements a number of undersampling techniques as well as oversampling methods like SMOTE. However, those methods rely on either some form of interpolation or sampling with replacement, which I think is undesirable for my use case.

@leouieda leouieda added the question Further information is requested label Oct 3, 2021