resample() does not set data_prototype (and task_prototype), which some learners rely on #987

Open
zecojls opened this issue Dec 22, 2023 · 6 comments

zecojls commented Dec 22, 2023

Hi, I'm using mlr3 on a Kaggle kernel and ran into a problem with the resample() function. The error message points to data.table column selection and future.apply.

I can currently use mlr3 v0.16.1 and the latest release of mlr3extralearners, but only by preventing data.table and future.apply from being upgraded by default (both packages depend on them).

Reproducible code:

# Install packages
install.packages("skimr")
install.packages("Cubist")
install.packages("mlr3verse")
remotes::install_github("mlr-org/mlr3extralearners@*release")
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
>
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
>
> Installing package into ‘/usr/local/lib/R/site-library’
> (as ‘lib’ is unspecified)
> 
> Downloading GitHub repo mlr-org/mlr3extralearners@v0.7.1
>
> data.table   (1.14.8 -> 1.14.10) [CRAN]
> future       (1.33.0 -> 1.33.1 ) [CRAN]
> future.apply (1.11.0 -> 1.11.1 ) [CRAN]
> mlr3         (0.17.0 -> 0.17.1 ) [CRAN]
> Installing 4 packages: data.table, future, future.apply, mlr3
# Modeling
library("mlr3")
library("mlr3extralearners")  # provides regr.randomForest
task = tsk("boston_housing")
task$select(c("age", "b", "chas")) 
learner = lrn("regr.randomForest", importance = "mse")
learner$train(task)
cv.results <- resample(task, learner, rsmp("cv", folds = 10))
> INFO  [15:56:18.968] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 1/10)
> INFO  [15:56:19.261] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 2/10)
> INFO  [15:56:19.501] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 3/10)
> INFO  [15:56:20.041] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 4/10)
> INFO  [15:56:20.261] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 5/10)
> INFO  [15:56:20.778] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 6/10)
> INFO  [15:56:20.985] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 7/10)
> INFO  [15:56:21.201] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 8/10)
> INFO  [15:56:21.427] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 9/10)
> INFO  [15:56:21.643] [mlr3] Applying learner 'regr.randomForest' on task 'boston_housing' (iter 10/10)
> Error in eval(predvars, data, env): object 'age' not found
> Traceback:
> 
> 1. resample(task, learner, rsmp("cv", folds = 10))
> 2. future_map(n, workhorse, iteration = seq_len(n), learner = grid$learner, 
>  .     mode = grid$mode, MoreArgs = list(task = task, resampling = resampling, 
>  .         store_models = store_models, lgr_threshold = lgr_threshold, 
>  .         pb = pb))
> 3. future.apply::future_mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, 
>  .     USE.NAMES = FALSE, future.globals = FALSE, future.packages = "mlr3", 
>  .     future.seed = TRUE, future.scheduling = scheduling, future.chunk.size = chunk_size, 
>  .     future.stdout = stdout)
> 4. future_xapply(FUN = FUN, nX = nX, chunk_args = dots, MoreArgs = MoreArgs, 
>  .     get_chunk = function(X, chunk) lapply(X, FUN = `chunkWith[[`, 
>  .         chunk), expr = expr, envir = envir, future.envir = future.envir, 
>  .     future.globals = future.globals, future.packages = future.packages, 
>  .     future.scheduling = future.scheduling, future.chunk.size = future.chunk.size, 
>  .     future.stdout = future.stdout, future.conditions = future.conditions, 
>  .     future.seed = future.seed, future.label = future.label, fcn_name = fcn_name, 
>  .     args_name = args_name, debug = debug)
> 5. value(fs)
> 6. value.list(fs)
> 7. resolve(y, result = TRUE, stdout = stdout, signal = signal, force = TRUE)
> 8. resolve.list(y, result = TRUE, stdout = stdout, signal = signal, 
>  .     force = TRUE)
> 9. signalConditionsASAP(obj, resignal = FALSE, pos = ii)
> 10. signalConditions(obj, exclude = getOption("future.relay.immediate", 
>   .     "immediateCondition"), resignal = resignal, ...)

Session info:

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ranger_0.14.1           Cubist_0.4.2.1          lattice_0.22-5         
 [4] mlr3extralearners_0.7.1 mlr3_0.16.1             data.table_1.14.8      
 [7] future.apply_1.11.0     future_1.33.0           skimr_2.1.5            
[10] ggridges_0.5.4          lubridate_1.9.3         forcats_1.0.0          
[13] stringr_1.5.1           dplyr_1.1.4             purrr_1.0.2            
[16] readr_2.1.4             tidyr_1.3.0             tibble_3.2.1           
[19] ggplot2_3.4.4           tidyverse_2.0.0         bigrquery_1.4.2        
[22] httr_1.4.7             

loaded via a namespace (and not attached):
 [1] bit64_4.0.5          jsonlite_1.8.8       assertthat_0.2.1    
 [4] lgr_0.4.4            mlr3misc_0.13.0      remotes_2.4.2.1     
 [7] globals_0.16.2       pillar_1.9.0         backports_1.4.1     
[10] glue_1.6.2           uuid_1.1-1           digest_0.6.33       
[13] checkmate_2.3.1      colorspace_2.1-0     Matrix_1.6-4        
[16] plyr_1.8.9           htmltools_0.5.7      pkgconfig_2.0.3     
[19] listenv_0.9.0        scales_1.3.0         processx_3.8.2      
[22] tzdb_0.4.0           timechange_0.2.0     generics_0.1.3      
[25] withr_2.5.2          repr_1.1.6.9000      cli_3.6.1           
[28] paradox_0.11.1       magrittr_2.0.3       crayon_1.5.2        
[31] evaluate_0.23        ps_1.7.5             fs_1.6.3            
[34] fansi_1.0.5          parallelly_1.36.0    pkgbuild_1.4.2      
[37] palmerpenguins_0.1.1 tools_4.0.5          prettyunits_1.2.0   
[40] hms_1.1.3            gargle_1.5.2         lifecycle_1.0.4     
[43] munsell_0.5.0        callr_3.7.3          compiler_4.0.5      
[46] rlang_1.1.2          grid_4.0.5           pbdZMQ_0.3-10       
[49] IRkernel_1.3.2.9000  base64enc_0.1-3      gtable_0.3.4        
[52] codetools_0.2-18     DBI_1.1.3            curl_5.1.0          
[55] reshape2_1.4.4       R6_2.5.1             knitr_1.45          
[58] fastmap_1.1.1        bit_4.0.5            utf8_1.2.4          
[61] rprojroot_2.0.4      desc_1.4.2           stringi_1.8.2       
[64] parallel_4.0.5       IRdisplay_1.1.0.9000 Rcpp_1.0.11         
[67] vctrs_0.6.5          dbplyr_2.4.0         tidyselect_1.2.0    
[70] xfun_0.41  

@be-marc be-marc self-assigned this Dec 22, 2023
Member

be-marc commented Dec 22, 2023

Hey, sorry, I can't reproduce the issue. I created a clean environment with renv.

renv::init(bare = TRUE)
renv::install(c("mlr3@0.17.1", "mlr-org/mlr3extralearners@*release", "randomForest"))

Your code runs without any problems.

task = tsk("boston_housing")
task$select(c("age", "b", "chas")) 
learner = lrn("regr.randomForest", importance = "mse")
learner$train(task)
rr = resample(task, learner, rsmp("cv", folds = 10))

Session info.

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] mlr3extralearners_0.7.1 mlr3_0.17.1            

loaded via a namespace (and not attached):
 [1] digest_0.6.33        backports_1.4.1      R6_2.5.1             codetools_0.2-19     randomForest_4.7-1.1 lgr_0.4.4            parallel_4.3.1       RhpcBLASctl_0.23-42  palmerpenguins_0.1.1
[10] mlr3misc_0.13.0      parallelly_1.36.0    pak_0.7.1            future_1.33.1        renv_1.0.3           data.table_1.14.10   compiler_4.3.1       paradox_0.11.1       globals_0.16.2   

Author

zecojls commented Dec 22, 2023

My Kaggle kernel comes with R 4.0 and Ubuntu 20.04 installed by default. I'm not sure if I can change that. What do you recommend?

Member

be-marc commented Dec 23, 2023

I can confirm that there is a bug on Kaggle. It is caused neither by the subsetting of the task nor by the task itself. The error does not occur with regr.rpart, but it does occur with regr.randomForest and regr.ranger. I cannot reproduce the bug on my local machine or in a rocker image with R 4.0.5. The error looks as if mlr3 is not passing data to the predict function of the upstream packages, but such an error would definitely have been noticed in our unit tests. This is quite tricky, since we can't easily debug on Kaggle.

Member

be-marc commented Dec 23, 2023

Collaborator

mb706 commented Mar 26, 2024

I believe the issue is this line in the randomForest learner:

https://github.com/mlr-org/mlr3extralearners/blob/5e291e0062347d24a263505e882dd9f409cb04ef/R/learner_randomForest_regr_randomForest.R#L113

This executes

task$data(cols = intersect(names(learner$state$data_prototype),
  task$feature_names))

When I stop here, learner$state$data_prototype is NULL (this is the bug, see below). In modern R versions, intersect() then also returns NULL, so the call becomes task$data(cols = NULL) and all columns are returned.

However, in older R versions, intersect(NULL, <character>) is not NULL but character(0). This leads to task$data(cols = character(0)) being called, so ordered_features() in the line linked above returns a 0-column data.table.
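To illustrate why a 0-column table would produce the reported `object 'age' not found` error, here is a minimal base-R sketch. It uses `lm()` as a stand-in for the formula-based predict step (an assumption; the actual learner calls randomForest), and the toy data values are invented:

```r
# Toy training data; values are invented for illustration only.
train <- data.frame(
  age  = c(65, 78, 61, 45, 54, 58),
  medv = c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7)
)

# lm() stands in for a formula-based learner (assumption, not the real learner).
fit <- lm(medv ~ age, data = train)

# Simulate task$data(cols = character(0)): rows are kept, all columns dropped.
newdata <- train[1:3, character(0), drop = FALSE]

# predict() cannot find `age` in newdata and signals an error,
# which is captured here as a string instead of stopping the script.
msg <- tryCatch(
  {
    predict(fit, newdata = newdata)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(dim(newdata))
print(msg)
```

The point of the sketch is only that the rows survive while the feature columns vanish, so the failure surfaces at predict time rather than when the data is selected.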

I don't know exactly when this new behaviour of intersect() was introduced. It appears to come from this diff, and this entry in the R 4.2.0 NEWS sounds like a match:

The set utility functions, notably intersect() have been tweaked to be more consistent and symmetric in their two set arguments, also preserving a common mode.

... although the timing does not seem to match exactly. I think it changed somewhere between 4.1.2 and 4.2.0, but I have not checked.
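A quick way to see which behaviour a given installation has is to run the call directly. This is a small sketch using only base R; because the result type depends on the R version, it prints the result without assuming either outcome:

```r
# What does intersect() return when its first argument is NULL?
# The type of the result depends on the R version, so it is only printed.
feature_names <- c("age", "b", "chas")
cols <- intersect(NULL, feature_names)

cat("R version:    ", as.character(getRversion()), "\n")
cat("is.null(cols):", is.null(cols), "\n")
cat("class(cols):  ", class(cols), "\n")
cat("length(cols): ", length(cols), "\n")
```

In both cases the result has length 0. What differs is whether it is NULL or an empty character vector, and that difference is what matters for a `cols` argument where NULL means "all columns".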


Now to the bug in our code: I assume the problem is that resample() no longer sets data_prototype since this patch. resample() does not call the learner's train(), so data_prototype is never set.

@mb706 mb706 changed the title Latest mlr3 with resample issue linked to data.table and future.apply resample() does not set data_prototype (and task_prototype), which some learners rely on Mar 26, 2024
Collaborator

mb706 commented Mar 26, 2024

(Setting data_prototype in resampling may currently be unnecessary, since the task remains the same, but this may change with the new holdout task feature that may be introduced. We should also make sure other places handle a NULL data_prototype correctly.)
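One way to make the column selection robust against a NULL prototype, independent of the R version, could look like the following sketch. The helper `select_prototype_cols()` is hypothetical and not part of mlr3 or mlr3extralearners. Falling back to all feature names when the prototype is missing is an assumption about the desired behaviour:

```r
# Hypothetical helper, not part of mlr3 or mlr3extralearners.
# Returns the feature names to select, in the order given by the prototype.
# Assumption: if no prototype was stored, fall back to all feature names.
select_prototype_cols <- function(prototype, feature_names) {
  if (is.null(prototype)) {
    return(feature_names)
  }
  intersect(names(prototype), feature_names)
}

features <- c("age", "b", "chas")

# No prototype stored (the situation described in this issue).
cols_without_prototype <- select_prototype_cols(NULL, features)

# Prototype with a different column order and an extra column.
proto <- data.frame(chas = integer(0), age = numeric(0), extra = numeric(0))
cols_with_prototype <- select_prototype_cols(proto, features)

print(cols_without_prototype)
print(cols_with_prototype)
```

The explicit `is.null()` check avoids relying on what `intersect()` returns for a NULL argument, which is the part that differs between R versions.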
