
Adding a small section in the RDM chapter with tips for working with large simulation data #3652

EstherPlomp opened this issue May 14, 2024

Summary

I recently had a question from a researcher about this, and it appears that there is not really any information about this in other resources (or at least not obviously :)).

What needs to be done?

  • Add some of the tips collected
  • Are there additional resources I missed?

Who can help?

  • anyone interested in this topic!

Danielle Gawehns suggested reaching out to Anna Lohmann, who works on replications of simulation studies, and to Judith and Sanne from this event: https://www.linkedin.com/posts/sannejwwillems_just-10-days-left-register-here-https-activity-7114549108219985920-Z9v5


Information/tips already gathered:

Some case studies:

Tips from other researchers:

  • Use a workflow manager package (like signac) to consistently match run inputs to outputs. @srtee
  • Use workflow management systems, like AiiDA or pyiron. Peter Kraus
  • Use good directory management to separate input scripts, "raw" simulation outputs, and post-processing analysis, so that the scripts and analysis can be version-controlled with git. (Large raw files can be excluded via .gitignore to save time.) @srtee
  • You could check out DataLad, though it may not be worthwhile for every project. @srtee
  • Most importantly, communicate the simulation settings accurately in the publication so that other groups can reliably compare their own future simulations. @srtee
  • One can share the simulation code & parameters needed to regenerate the simulated data. Share data/code via Zenodo. @gedankenstuecke
  • Set a seed, document parameters/commands, and save output and intermediate data as far as is reasonable given the storage/computation effort. Also document all the parameterisations you discarded, and why, so that others don't make the same mistakes. From: https://scholar.social/@nuest@mstdn.social/112438157242036468
  • Are ensembles of simulations relevant? If so, one could document statistics, too. From: https://scholar.social/@nuest@mstdn.social/112438157242036468
  • Document the hardware used to run the code and some information about how much computing power/time was needed to create the data, so that future users can decide whether they want to rerun the code. From https://scholar.social/@jcolomb@nerdculture.de/112440215170281529 @jcolomb
  • Ideally, also provide a sample configuration that runs in a short time with a sample dataset and sample result/output data, so that others can compare and play around. Daniel Nüst
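The directory-management tip might look like this in practice (a purely illustrative layout; the directory names and the `*.h5` pattern are assumptions, not a prescribed convention):

```
project/
├── input/         # simulation input scripts (version-controlled)
├── raw/           # large "raw" simulation outputs (excluded via .gitignore)
├── analysis/      # post-processing scripts and notebooks (version-controlled)
└── .gitignore     # contains e.g. "raw/" and "*.h5" to keep large files out of git
```

This way `git status` stays fast and the repository only tracks the small, human-written files.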
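The "set a seed, document parameters" tip can be sketched in a few lines of Python. This is a minimal illustration with a toy stand-in simulation (`run_simulation`, `run_record.json`, and the parameter names are all hypothetical):

```python
import json
import random

def run_simulation(params, seed):
    """Toy stand-in for a real simulation: draws n_samples pseudo-random numbers."""
    rng = random.Random(seed)  # seeded generator -> reproducible output
    return [rng.random() for _ in range(params["n_samples"])]

params = {"n_samples": 5, "model": "toy"}
seed = 42

# Record everything needed to regenerate the data alongside the output itself.
record = {
    "seed": seed,
    "parameters": params,
    "output": run_simulation(params, seed),
}

with open("run_record.json", "w") as fh:
    json.dump(record, fh, indent=2)
```

Anyone with `run_record.json` can rerun the simulation with the same seed and parameters and check that they get identical output.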
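The core idea behind "match run inputs to outputs" can be shown without any particular workflow manager. The sketch below mimics (in spirit, not API) what tools like signac do: derive a stable directory name from the input parameters and store those parameters next to the outputs. All names here (`job_dir`, `workspace`, `statepoint.json`) are illustrative:

```python
import hashlib
import json
import os

def job_dir(statepoint, workspace="workspace"):
    """Map a simulation's input parameters to a stable output directory."""
    key = json.dumps(statepoint, sort_keys=True)          # canonical form of the inputs
    digest = hashlib.sha1(key.encode()).hexdigest()[:12]  # stable short identifier
    path = os.path.join(workspace, digest)
    os.makedirs(path, exist_ok=True)
    # Keep the parameters next to the outputs so they can never be separated.
    with open(os.path.join(path, "statepoint.json"), "w") as fh:
        fh.write(key)
    return path
```

Because the key is canonicalised, the same parameters always map to the same directory, regardless of the order in which they were specified.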
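Documenting the hardware and the compute time can also be partly automated. A minimal sketch using only the Python standard library (the record's field names are an assumption, not a standard):

```python
import json
import platform
import time

start = time.perf_counter()
# ... the actual simulation would run here ...
elapsed = time.perf_counter() - start

# Minimal provenance record: what ran where, and how long it took.
provenance = {
    "python_version": platform.python_version(),
    "machine": platform.machine(),
    "platform": platform.platform(),
    "wall_time_seconds": round(elapsed, 3),
}
print(json.dumps(provenance, indent=2))
```

Saving this record next to the output data lets future users judge whether rerunning the code is feasible for them.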

And this may be an interesting read: https://doi.org/10.1007/978-3-030-14401-2_12
