ggplot2::ggsave() error with pointblank::scan_data() #515

Open
2 tasks done
Thiyaghessan opened this issue Feb 8, 2024 · 5 comments

Comments

@Thiyaghessan

Thiyaghessan commented Feb 8, 2024

Prework

Description

Error triggered when executing scan_data() on a data.table object with 34 rows and 268 columns:

Error in `ggplot2::ggsave()`:
! Dimensions exceed 50 inches (`height` and `width` are specified in inches not pixels).
ℹ If you're sure you want a plot that big, use `limitsize = FALSE`.
Run `rlang::last_trace()` to see where the error occurred.

rlang::last_trace() Output:

Backtrace:
     ▆
  1. ├─pointblank::scan_data(charities_pc_dat[[1]])
  2. │ └─pointblank:::build_table_scan_page(...)
  3. │   └─sections %>% ...
  4. └─base::lapply(...)
  5.   └─pointblank (local) FUN(X[[i]], ...)
  6.     └─pointblank:::probe_interactions_assemble(data = data, lang = lang)
  7.       ├─base::suppressWarnings(probe_interactions(data = data))
  8.       │ └─base::withCallingHandlers(...)
  9.       └─pointblank:::probe_interactions(data = data)
 10.         └─ggplot2::ggsave(...)

Reproducible example

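# Download the CSV that triggers the error
URL <- "https://nccsdata.s3.amazonaws.com/harmonized/core/CORE-2009-501C3-CHARITIES-PC-HRMN.csv"

data <- data.table::fread(URL)

# Fails inside ggplot2::ggsave() when the table has many columns
pointblank::scan_data(data)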

Expected result

scan_data() should have returned the HTML output.

Session info

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

@yjunechoe
Collaborator

You get that error because the data has 34 rows and 268 columns, and some of the plots automatically generated by scan_data() map those 268 columns to the axes or facets - this results in an incredibly large/dense plot which causes the error you see in ggsave().

If you only want the sections of scan_data() that do not produce plots, I think scan_data(data, sections = "OVS") would do it. If you do want plots for correlations, missing values, etc., I'd look into whether you can collapse or pivot-longer some of the columns that you have.
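A minimal sketch of that workaround follows. It assumes the section codes are O (overview), V (variables), I (interactions), C (correlations), M (missing values) and S (sample), and that the plot-heavy sections are I, C and M. The helper names `choose_sections()` and `scan_wide()` and the threshold of 10 columns are hypothetical, not part of pointblank.

```r
# Hypothetical helper: choose which scan_data() sections to render
# based on the number of columns. The threshold is an assumption,
# not a pointblank default.
choose_sections <- function(n_cols, max_plot_cols = 10L) {
  if (n_cols > max_plot_cols) {
    "OVS"     # skip the plot-heavy sections for wide tables
  } else {
    "OVICMS"  # all sections
  }
}

# Hypothetical wrapper around pointblank::scan_data().
# Requires the pointblank package to be installed.
scan_wide <- function(tbl, max_plot_cols = 10L) {
  pointblank::scan_data(
    tbl,
    sections = choose_sections(ncol(tbl), max_plot_cols)
  )
}

# Usage (not run here):
# report <- scan_wide(data)
```

With a 268-column table this would request only the "OVS" sections, so the report should render without reaching the ggsave() call shown in the backtrace.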

@rich-iannone
Member

I think it might be good here to introduce a limit (maybe 10?) on the number of columns used in these parts of the scan data report. This will at least make the function work w/o failing on the default options. On top of this, it would be useful to have a columns arg of some sort that allows the user to choose what's being used in these reporting parts.

The eventual goal, I think, is to have these report sections become a bit more scalable with larger amounts of data (perhaps using gt to arrange things, so you'd get scrolling and not these very tiny subplots).
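Until something like this exists in the package, the same idea can be approximated on the user side by trimming the table before scanning it. The sketch below is only an illustration: `limit_columns()` is a hypothetical helper, and its `columns` and `max_cols` arguments mirror the proposal above rather than any existing pointblank API.

```r
# Hypothetical helper: keep a user-chosen set of columns, capped at
# `max_cols`. Not part of pointblank.
limit_columns <- function(data, columns = NULL, max_cols = 10L) {
  # Convert first so that data.table inputs are subset like data frames
  data <- as.data.frame(data)

  if (is.null(columns)) {
    columns <- names(data)
  }

  unknown <- setdiff(columns, names(data))
  if (length(unknown) > 0) {
    stop("Unknown columns: ", paste(unknown, collapse = ", "))
  }

  # Keep at most `max_cols` columns, in the order requested
  keep <- utils::head(columns, max_cols)
  data[, keep, drop = FALSE]
}

# Usage (not run here; requires pointblank):
# pointblank::scan_data(limit_columns(data, max_cols = 10))
```

Scanning a few slices of columns in turn is one way to cover a wide table, at the cost of losing correlations between columns that land in different slices.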

@rich-iannone rich-iannone added this to the v0.12.0 milestone Feb 20, 2024
@rich-iannone rich-iannone changed the title ggplot2::ggsave() error with pointblank::scan_data() ggplot2::ggsave() error with pointblank::scan_data() Feb 20, 2024
@rich-iannone rich-iannone modified the milestones: v0.12.0, v0.13.0 Feb 27, 2024
@SpikyClip

Anyone find any workaround for this, or a way to supply the limitsize arg?

I think it would be really useful for a tool like scan_data() to be able to handle more columns, as those are the scenarios where it's useful to have a script break down which columns contain useful/sparse information, so you can then subset datasets.

@yjunechoe
Collaborator

@SpikyClip Thanks for this perspective - it's helpful to know that scan_data() is useful for determining the importance of variables prior to subsetting.

As Rich mentioned above, scan_data() will need some rework to accommodate larger data frames because currently some sections of the report (like the matrix plot) do not scale well with many columns. For now, scan_data(data, sections = "OVS") is a workaround that renders only the sections of the report that easily handle many columns.

We could patch in a workaround for letting users supply the limitsize argument, but the fundamental challenge seems to run deeper. (Toggling limitsize off would make the error go away, but your report could end up as a huge self-contained HTML file of hundreds of megabytes.)

Happy to hear any suggestions on this!

@SpikyClip

> We could patch in a workaround for letting users supply the limitsize argument, but the fundamental challenge seems to run deeper. (Toggling limitsize off would make the error go away, but your report could end up as a huge self-contained HTML file of hundreds of megabytes.)

I see, that makes sense. What if, past a certain arbitrary number of columns, it renders the plot data as a filterable data table rather than a plot? That would still allow the user to view the key information if necessary (e.g. which columns have the most NA values, which columns correlate the most), and it would hopefully avoid the ggsave error. I don't think(?) the underlying tables would take up too much space, but it may require testing.
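As a rough illustration of what such tables could contain, here is a base R sketch that summarises missing values per column and ranks the strongest pairwise correlations. The function names `missing_summary()` and `top_correlations()` are hypothetical, and this is not how pointblank builds its report sections; it only shows that the key information fits in small data frames.

```r
# Hypothetical helper: one row per column, most-missing columns first.
missing_summary <- function(data) {
  data <- as.data.frame(data)
  n_missing <- vapply(data, function(x) sum(is.na(x)), integer(1))
  out <- data.frame(
    column = names(data),
    n_missing = unname(n_missing),
    prop_missing = unname(n_missing) / nrow(data)
  )
  out <- out[order(-out$n_missing), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hypothetical helper: the `n` numeric column pairs with the largest
# absolute Pearson correlation, one row per pair.
top_correlations <- function(data, n = 10L) {
  data <- as.data.frame(data)
  numeric_cols <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  cors <- stats::cor(numeric_cols, use = "pairwise.complete.obs")

  # Use the upper triangle so each pair appears once
  idx <- which(upper.tri(cors), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cors)[idx[, "row"]],
    var2 = colnames(cors)[idx[, "col"]],
    correlation = cors[idx]
  )
  out <- out[order(-abs(out$correlation)), , drop = FALSE]
  rownames(out) <- NULL
  utils::head(out, n)
}

# Small made-up example table
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  c = c(4, NA, 1, 3)
)
missing_summary(df)
top_correlations(df, n = 2)
```

For the 268-column table in this issue, the missing-value summary would have 268 rows, and the correlation table is capped at `n` rows regardless of how many pairs exist.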
