Optimal set sizes for ResamplingHoldout #493
Open
Hi, I noticed that ResamplingHoldout can sometimes return train/test set sizes that are sub-optimal in the sense of the binomial likelihood. For example, for a set of size N = 2 (trivial, I know) and a train ratio of 0.7, the most likely train set size is 2 (with 0 test observations), which has probability 0.49.
However, the current mlr3 master gives me 1 train and 1 test sample, which has probability 0.42 — that is, a sub-optimal split. I added this simple example as a test case.
I know this is not an issue for big/real data, but it is an easy fix to make the code optimal: just use the formula for the mode of the binomial distribution, https://en.wikipedia.org/wiki/Binomial_distribution#Mode
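The idea can be sketched as follows (in Python for illustration only — mlr3 itself is R, and this is not the actual patch): the mode of Binomial(n, p) is floor((n + 1) * p), which for n = 2 and p = 0.7 gives a train set size of 2, whereas simple rounding of n * p gives 1.

```python
from math import comb, floor

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_mode(n: int, p: float) -> int:
    """Mode of Binomial(n, p): floor((n + 1) * p).

    Note: if (n + 1) * p is an integer, both (n + 1) * p and
    (n + 1) * p - 1 are modes; this returns the larger one.
    Assumes 0 <= p < 1.
    """
    return floor((n + 1) * p)

n, ratio = 2, 0.7

# Mode-based train set size: floor(3 * 0.7) = 2 (probability ~0.49)
mode_size = binom_mode(n, ratio)

# Rounding-based size: round(2 * 0.7) = 1 (probability ~0.42)
rounded_size = round(n * ratio)

assert binom_pmf(mode_size, n, ratio) > binom_pmf(rounded_size, n, ratio)
```

So the mode-based size strictly dominates the rounded size in likelihood for this example, matching the probabilities quoted above.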