[R-package] lightgbm::lgb.model.dt.tree() error caused by lgb.dump() error with large models #6380

Open
p-schaefer opened this issue Mar 22, 2024 · 8 comments

Comments

@p-schaefer

Description

When a model or data set reaches a certain level of complexity, lgb.dump() fails in R with: Error: R character strings are limited to 2^31-1 bytes. Because lgb.model.dt.tree() relies on lgb.dump(), it fails in the same way.

Reproducible example

library(dplyr)
library(lightgbm)
library(nycflights13)

dt<-nycflights13::flights %>%
  mutate(origin=factor(origin),
         dest=factor(dest),
         carrier=factor(carrier)
  ) %>%
  select(-tailnum,-time_hour)

spt1<-round(nrow(dt)*(3/4))
spt2<-round(nrow(dt)*(1/4))
train<-head(dt,spt1)
test<-tail(dt,spt2)

dtrain <- lgb.Dataset(as.matrix(train[,colnames(train)!="arr_delay"]),
                      categorical_feature = c("origin","dest","carrier"),
                      label = train[,colnames(train)=="arr_delay"][[1]])

params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
  , num_threads = 2L
  , max_cat_threshold = 2L
)

model <- lgb.train(
  params = params
  , data = dtrain
  , nrounds = 1000000L
)

json_model <- lightgbm::lgb.dump(model) # This will cause the error

A potential solution may be to dump the data directly to a temporary file, then stream it back in from that file, for example with jsonlite::stream_in() (https://rdrr.io/cran/jsonlite/man/stream_in.html).

Environment info

Session info:
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2024-03-22
 rstudio  2023.12.1+402 Ocean Storm (server)
 pandoc   3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 base64enc        0.1-3      2015-07-28 [3] CRAN (R 4.0.2)
 bslib            0.6.1      2023-11-28 [2] CRAN (R 4.3.2)
 cachem           1.0.8      2023-05-01 [2] CRAN (R 4.3.0)
 callr            3.7.5      2024-02-19 [2] CRAN (R 4.3.2)
 cellranger       1.1.0      2016-07-27 [3] CRAN (R 4.0.1)
 class            7.3-22     2023-05-03 [4] CRAN (R 4.3.1)
 classInt         0.4-10     2023-09-05 [2] CRAN (R 4.3.1)
 cli              3.6.2      2023-12-11 [2] CRAN (R 4.3.2)
 codetools        0.2-19     2023-02-01 [4] CRAN (R 4.2.2)
 colorspace       2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 cowplot          1.1.3      2024-01-22 [2] CRAN (R 4.3.2)
 crosstalk        1.2.1      2023-11-23 [2] CRAN (R 4.3.2)
 DALEX            2.4.3      2023-01-15 [2] CRAN (R 4.2.3)
 data.table       1.15.2     2024-02-29 [2] CRAN (R 4.3.2)
 datamods         1.4.5      2024-02-28 [2] CRAN (R 4.3.2)
 DBI              1.2.2      2024-02-16 [2] CRAN (R 4.3.2)
 dbplyr           2.5.0      2024-03-19 [2] CRAN (R 4.3.3)
 digest           0.6.35     2024-03-11 [2] CRAN (R 4.3.3)
 dplyr          * 1.1.4      2023-11-17 [2] CRAN (R 4.3.2)
 e1071            1.7-14     2023-12-06 [2] CRAN (R 4.3.2)
 ellipsis         0.3.2      2021-04-29 [3] CRAN (R 4.1.1)
 esquisse         1.2.0      2024-01-10 [2] CRAN (R 4.3.2)
 evaluate         0.23       2023-11-01 [2] CRAN (R 4.3.1)
 extrafont        0.19       2023-01-18 [1] CRAN (R 4.3.0)
 extrafontdb      1.0        2012-06-11 [1] CRAN (R 4.3.0)
 fansi            1.0.6      2023-12-08 [2] CRAN (R 4.3.2)
 fastmap          1.1.1      2023-02-24 [3] CRAN (R 4.2.2)
 forcats        * 1.0.0      2023-01-29 [3] CRAN (R 4.2.2)
 fs               1.6.3      2023-07-20 [3] CRAN (R 4.3.1)
 generics         0.1.3      2022-07-05 [2] CRAN (R 4.2.3)
 ggiraph          0.8.9      2024-02-24 [2] CRAN (R 4.3.2)
 ggiraphExtra     0.3.0      2020-10-06 [2] CRAN (R 4.3.2)
 ggplot2        * 3.5.0      2024-02-23 [2] CRAN (R 4.3.2)
 ggrepel          0.9.5      2024-01-10 [2] CRAN (R 4.3.2)
 glue             1.7.0      2024-01-09 [2] CRAN (R 4.3.2)
 gridExtra        2.3        2017-09-09 [2] CRAN (R 4.2.3)
 gtable           0.3.4      2023-08-21 [2] CRAN (R 4.3.1)
 hardhat          1.3.1      2024-02-02 [2] CRAN (R 4.3.2)
 here             1.0.1      2020-12-13 [2] CRAN (R 4.2.3)
 hms              1.1.3      2023-03-21 [3] CRAN (R 4.2.3)
 htmltools        0.5.7      2023-11-03 [2] CRAN (R 4.3.1)
 htmlwidgets      1.6.4      2023-12-06 [2] CRAN (R 4.3.2)
 httpuv           1.6.14     2024-01-26 [2] CRAN (R 4.3.2)
 httr             1.4.7      2023-08-15 [2] CRAN (R 4.3.1)
 iBreakDown       2.1.2      2023-12-01 [2] CRAN (R 4.3.2)
 insight          0.19.9     2024-03-15 [2] CRAN (R 4.3.3)
 jquerylib        0.1.4      2021-04-26 [3] CRAN (R 4.1.2)
 jsonlite         1.8.8      2023-12-04 [1] CRAN (R 4.3.2)
 KernSmooth       2.23-22    2023-07-10 [4] CRAN (R 4.3.1)
 knitr            1.45       2023-10-30 [2] CRAN (R 4.3.1)
 later            1.3.2      2023-12-06 [2] CRAN (R 4.3.2)
 lattice          0.22-5     2023-10-24 [4] CRAN (R 4.3.1)
 lazyeval         0.2.2      2019-03-15 [2] CRAN (R 4.2.3)
 leafem           0.2.3      2023-09-17 [2] CRAN (R 4.3.2)
 leaflet          2.2.1      2023-11-13 [2] CRAN (R 4.3.1)
 lifecycle        1.0.4      2023-11-07 [2] CRAN (R 4.3.1)
 lightgbm       * 4.3.0      2024-01-18 [2] CRAN (R 4.3.2)
 lubridate      * 1.9.3      2023-09-27 [2] CRAN (R 4.3.1)
 magrittr         2.0.3      2022-03-30 [2] CRAN (R 4.2.3)
 mapview          2.11.2     2023-10-13 [2] CRAN (R 4.3.1)
 MASS             7.3-60.0.1 2024-01-13 [4] CRAN (R 4.3.2)
 Matrix           1.6-5      2024-01-11 [2] CRAN (R 4.3.2)
 mgcv             1.9-1      2023-12-21 [4] CRAN (R 4.3.2)
 mime             0.12       2021-09-28 [3] CRAN (R 4.2.0)
 munsell          0.5.0      2018-06-12 [2] CRAN (R 4.2.3)
 mycor            0.1.1      2018-04-10 [2] CRAN (R 4.3.2)
 NADA             1.6-1.1    2020-03-22 [2] CRAN (R 4.3.1)
 nlme             3.1-163    2023-08-09 [4] CRAN (R 4.3.1)
 nycflights13   * 1.0.2      2021-04-12 [1] CRAN (R 4.3.3)
 openxlsx         4.2.5.2    2023-02-06 [2] CRAN (R 4.2.3)
 parsnip          1.2.0      2024-02-16 [2] CRAN (R 4.3.2)
 phosphoricons    0.2.0      2023-05-17 [2] CRAN (R 4.3.1)
 pillar           1.9.0      2023-03-22 [2] CRAN (R 4.2.3)
 pkgconfig        2.0.3      2019-09-22 [2] CRAN (R 4.2.3)
 plotly           4.10.4     2024-01-13 [2] CRAN (R 4.3.2)
 plyr             1.8.9      2023-10-02 [2] CRAN (R 4.3.1)
 png              0.1-8      2022-11-29 [2] CRAN (R 4.2.3)
 pool             1.0.3      2024-02-14 [2] CRAN (R 4.3.2)
 ppcor            1.1        2015-12-03 [2] CRAN (R 4.3.2)
 processx         3.8.4      2024-03-16 [2] CRAN (R 4.3.3)
 promises         1.2.1      2023-08-10 [2] CRAN (R 4.3.1)
 proxy            0.4-27     2022-06-09 [2] CRAN (R 4.2.3)
 ps               1.7.6      2024-01-18 [2] CRAN (R 4.3.2)
 purrr          * 1.0.2      2023-08-10 [2] CRAN (R 4.3.1)
 R6               2.5.1      2021-08-19 [2] CRAN (R 4.2.3)
 raster           3.6-26     2023-10-14 [2] CRAN (R 4.3.1)
 RColorBrewer     1.1-3      2022-04-03 [2] CRAN (R 4.2.3)
 Rcpp             1.0.12     2024-01-09 [1] CRAN (R 4.3.2)
 reactable        0.4.4      2023-03-12 [2] CRAN (R 4.3.1)
 readr          * 2.1.5      2024-01-10 [2] CRAN (R 4.3.2)
 readxl           1.4.3      2023-07-06 [1] CRAN (R 4.3.0)
 reprex           2.1.0      2024-01-11 [2] CRAN (R 4.3.2)
 reshape2         1.4.4      2020-04-09 [2] CRAN (R 4.3.1)
 rio              1.0.1      2023-09-19 [2] CRAN (R 4.3.1)
 rlang            1.1.3      2024-01-10 [2] CRAN (R 4.3.2)
 rmarkdown        2.26       2024-03-05 [2] CRAN (R 4.3.2)
 rpivotTable      0.3.0      2018-01-30 [2] CRAN (R 4.3.1)
 rprojroot        2.0.4      2023-11-05 [2] CRAN (R 4.3.1)
 rstudioapi       0.15.0     2023-07-07 [1] CRAN (R 4.3.0)
 Rttf2pt1         1.3.12     2023-01-22 [1] CRAN (R 4.3.0)
 sass             0.4.9      2024-03-15 [2] CRAN (R 4.3.3)
 satellite        1.0.5      2024-02-10 [2] CRAN (R 4.3.2)
 scales           1.3.0      2023-11-28 [2] CRAN (R 4.3.2)
 sessioninfo      1.2.2      2021-12-06 [2] CRAN (R 4.2.3)
 sf               1.0-15     2023-12-18 [2] CRAN (R 4.3.2)
 shiny            1.8.0      2023-11-17 [2] CRAN (R 4.3.2)
 shinybusy        0.3.3      2024-03-09 [2] CRAN (R 4.3.3)
 shinyWidgets     0.8.3      2024-03-21 [2] CRAN (R 4.3.3)
 sjlabelled       1.2.0      2022-04-10 [2] CRAN (R 4.3.2)
 sjmisc           2.8.9      2021-12-03 [2] CRAN (R 4.3.2)
 sp               2.1-3      2024-01-30 [2] CRAN (R 4.3.2)
 stringi          1.8.3      2023-12-11 [2] CRAN (R 4.3.2)
 stringr        * 1.5.1      2023-11-14 [2] CRAN (R 4.3.2)
 survival         3.5-8      2024-02-14 [4] CRAN (R 4.3.3)
 systemfonts      1.0.6      2024-03-07 [2] CRAN (R 4.3.3)
 tibble         * 3.2.1      2023-03-20 [2] CRAN (R 4.2.3)
 tidyr          * 1.3.1      2024-01-24 [2] CRAN (R 4.3.2)
 tidyselect       1.2.1      2024-03-11 [2] CRAN (R 4.3.3)
 tidyverse      * 2.0.0      2023-02-22 [2] CRAN (R 4.2.3)
 timechange       0.3.0      2024-01-18 [2] CRAN (R 4.3.2)
 treeshap         0.3.1      2024-01-22 [2] CRAN (R 4.3.2)
 tzdb             0.4.0      2023-05-12 [3] CRAN (R 4.3.0)
 units            0.8-5      2023-11-28 [2] CRAN (R 4.3.2)
 utf8             1.2.4      2023-10-22 [2] CRAN (R 4.3.1)
 uuid             1.2-0      2024-01-14 [2] CRAN (R 4.3.2)
 vctrs            0.6.5      2023-12-01 [1] CRAN (R 4.3.2)
 viridisLite      0.4.2      2023-05-02 [2] CRAN (R 4.3.0)
 withr            3.0.0      2024-01-16 [2] CRAN (R 4.3.2)
 writexl          1.5.0      2024-02-09 [2] CRAN (R 4.3.2)
 xfun             0.42       2024-02-08 [2] CRAN (R 4.3.2)
 xgboost          1.7.7.1    2024-01-25 [2] CRAN (R 4.3.2)
 xtable           1.8-4      2019-04-21 [2] CRAN (R 4.2.3)
 yaml             2.3.8      2023-12-11 [2] CRAN (R 4.3.2)
 zip              2.3.1      2024-01-27 [2] CRAN (R 4.3.2)

 [1] /home/pschaefer/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Additional Comments

@jameslamb
Collaborator

Thanks very much for the excellent write-up!!

I do wish this had been posted in the existing discussion we were having on the exact same topic at #6288, to not split the conversation. But now that we have this issue with a reproducible example, I'll close that one and we can focus here.

> dump the data directly to a temporary file

Interesting idea! I hope we can avoid touching the filesystem to support larger models, if possible, since that can introduce its own set of problems (permissions errors, space issues, files being left behind, etc.).

Some combination of that and these other ideas might help here:

  • not using JSON in the middle of this operation (it's quite costly memory-wise because the keys that become column headers are repeated many times)
  • providing an entrypoint in LightGBM's C API with an iterator over chunks of the data instead of trying to dump it into an R string all at once
  • arranging the data into array format on the C/C++ side and creating an R data frame there

Are you interested in working on this? If not, no worries... we appreciate the thorough write-up and you can subscribe to this issue to be notified if / when someone addresses it.

@mayer79
Contributor

mayer79 commented Mar 22, 2024

@jameslamb Maybe we can dump/parse single trees (or m=NULL trees) instead of the full model.

@p-schaefer
Author

Thanks everyone. Apologies for opening a new issue on this in so many places. It seems like there are some potential solutions on the table. Unfortunately, I'm not very familiar with C/C++, so I'm afraid I would be of little help there. If there is anything I can help with on the R or Python side, I'd be happy to. I could be mistaken, but it seems to me that a lot of this is handled on the C side.

I think @mayer79's suggestion would have utility in a lot of places, but in terms of computation time and overhead, either not using JSON or arranging the data into array format on the C/C++ side would probably be optimal.

@jameslamb
Collaborator

> Maybe we can dump/parse single trees (or m=NULL trees) instead of the full model.

Yep! This is one specific version of the more general statement I made, "an iterator over chunks of the data".

Looking through the C API... the underlying API for dumping to JSON actually already supports iterating over ranges of trees 😁

LightGBM/src/c_api.cpp

Lines 2687 to 2689 in 28536a0

int LGBM_BoosterDumpModel(BoosterHandle handle,
                          int start_iteration,
                          int num_iteration,

So I think we can probably do this with 0 API changes.
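For illustration only, here is a minimal R-side sketch of the "iterate over ranges of trees" idea. It is not the implementation that was eventually proposed. It assumes that lgb.dump() accepts 1-based start_iteration and num_iteration arguments (as in lightgbm 4.x) and that each chunk's JSON stays below the 2^31-1 byte string limit. The helper names iteration_chunks() and dump_in_chunks() are hypothetical and not part of lightgbm.

```r
# Hypothetical helper (not part of lightgbm): split iterations 1..n_iter
# into consecutive ranges of at most chunk_size iterations each.
iteration_chunks <- function(n_iter, chunk_size) {
  starts <- seq.int(1L, n_iter, by = chunk_size)
  data.frame(
    start_iteration = starts,
    num_iteration = pmin(chunk_size, n_iter - starts + 1L)
  )
}

# Hypothetical wrapper (not part of lightgbm): dump and parse the model one
# range of iterations at a time, so that no single R string has to hold the
# whole model. Assumes lgb.dump() takes a 1-based start_iteration.
dump_in_chunks <- function(model, chunk_size = 1000L) {
  chunks <- iteration_chunks(model$current_iter(), chunk_size)
  lapply(seq_len(nrow(chunks)), function(i) {
    json_chunk <- lightgbm::lgb.dump(
      model,
      num_iteration = chunks$num_iteration[i],
      start_iteration = chunks$start_iteration[i]
    )
    # Keep only the per-tree structures; model-level metadata is repeated
    # in every chunk and is dropped here.
    jsonlite::fromJSON(json_chunk, simplifyVector = FALSE)$tree_info
  })
}
```

With the model from the reproducible example, dump_in_chunks(model, chunk_size = 1000L) would return a list with one element per chunk, each holding the parsed tree_info for that range. A suitable chunk size depends on how large individual trees are, so the value above is only a placeholder.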

@mayer79
Contributor

mayer79 commented Mar 31, 2024

@jameslamb: Very neat! I can work on this after #6364 is merged.

@jameslamb
Collaborator

Great, thank you! I'd like to merge #6364 soon, but we're blocked until I can get some help with #6316 (comment).

@p-schaefer
Author

Has there been any progress on this? Is there anything I can do to help move this along?

@jameslamb
Collaborator

It's being worked on in #6397; you can subscribe there.

3 participants