Optimal set sizes for ResamplingHoldout #493
Open
Hi, I noticed that ResamplingHoldout can sometimes return train/test set sizes that are sub-optimal in the sense of the binomial likelihood. For example, for a set of size N = 2 (trivial, I know) and a train ratio of 0.7, the most likely train set size is 2 (with 0 test observations), which has probability 0.49.
However, the current mlr3 master gives me 1 train and 1 test sample, which has probability 0.42 — that is, a sub-optimal split. I added this simple example as a test case.
I know this is not an issue for big/real data, but it is an easy fix to make the code optimal: just use the formula for the mode of the binomial distribution, https://en.wikipedia.org/wiki/Binomial_distribution#Mode
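The idea can be sketched as follows (in Python for illustration only — mlr3 itself is R, and this is not the actual patch): the mode of Binomial(n, p) is floor((n + 1) * p), which for n = 2 and p = 0.7 gives a train set size of 2, whereas simple rounding of n * p gives 1.

```python
from math import comb, floor

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_mode(n: int, p: float) -> int:
    """Mode of Binomial(n, p): floor((n + 1) * p).

    Note: if (n + 1) * p is an integer, both (n + 1) * p and
    (n + 1) * p - 1 are modes; this returns the larger one.
    Assumes 0 <= p < 1.
    """
    return floor((n + 1) * p)

n, ratio = 2, 0.7

# Mode-based train set size: floor(3 * 0.7) = 2 (probability ~0.49)
mode_size = binom_mode(n, ratio)

# Rounding-based size: round(2 * 0.7) = 1 (probability ~0.42)
rounded_size = round(n * ratio)

assert binom_pmf(mode_size, n, ratio) > binom_pmf(rounded_size, n, ratio)
```

So the mode-based size strictly dominates the rounded size in likelihood for this example, matching the probabilities quoted above.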