
Adding a small section in the RDM chapter with tips for working with large simulation data #3652

EstherPlomp opened this issue May 14, 2024

Summary

I recently had a question from a researcher about this, and it appears that there is not really any information about this in other resources (or at least not obviously :)).

What needs to be done?

  • Add some of the tips collected
  • Are there additional resources I missed?

Who can help?

  • anyone interested in this topic!

Danielle Gawehns suggested reaching out to Anna Lohmann, who works on replications of simulation studies, and to Judith and Sanne from this event: https://www.linkedin.com/posts/sannejwwillems_just-10-days-left-register-here-https-activity-7114549108219985920-Z9v5


Information/tips already gathered:

Some case studies:

Tips from other researchers:

  • Use a workflow manager package (like signac) to consistently match run inputs to outputs. @srtee
  • Use workflow management systems, like AiiDA or pyiron. Peter Kraus
  • Use good directory management to separate input scripts, "raw" simulation outputs, and post-processing analysis, so that the scripts and analysis can be version-controlled with git. (Large raw files can be excluded via .gitignore to save time.) @srtee
  • You could check out DataLad, though it may not be worthwhile for every project. @srtee
  • Most importantly, communicate the simulation settings accurately in the publication so that other groups can reliably compare their own future simulations. @srtee
  • One can share the simulation code & parameters needed to regenerate the simulated data. Share data/code via Zenodo. @gedankenstuecke
  • Set a seed, document parameters/commands, and save output and intermediate data as far as is reasonable given the storage/computation effort. Also document all the parameterisations you discarded, and why, so that others don't make the same mistakes. From: https://scholar.social/@nuest@mstdn.social/112438157242036468
  • Are ensembles of simulations relevant? If so, one could document statistics, too. From: https://scholar.social/@nuest@mstdn.social/112438157242036468
  • Document the hardware used to run the code and some information about how much computing power/time was needed to create the data, so that future users can decide whether they want to rerun the code. From https://scholar.social/@jcolomb@nerdculture.de/112440215170281529 @jcolomb
  • Ideally, also provide a sample configuration that runs in a short time with a sample dataset and sample result/output data, so that others can compare and play around. Daniel Nüst
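The directory-management tip might look like this in practice (a purely illustrative layout; the directory names and the `*.h5` pattern are assumptions, not a prescribed convention):

```
project/
├── input/         # simulation input scripts (version-controlled)
├── raw/           # large "raw" simulation outputs (excluded via .gitignore)
├── analysis/      # post-processing scripts and notebooks (version-controlled)
└── .gitignore     # contains e.g. "raw/" and "*.h5" to keep large files out of git
```

This way `git status` stays fast and the repository only tracks the small, human-written files.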
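The "set a seed, document parameters" tip can be sketched in a few lines of Python. This is a minimal illustration with a toy stand-in simulation (`run_simulation`, `run_record.json`, and the parameter names are all hypothetical):

```python
import json
import random

def run_simulation(params, seed):
    """Toy stand-in for a real simulation: draws n_samples pseudo-random numbers."""
    rng = random.Random(seed)  # seeded generator -> reproducible output
    return [rng.random() for _ in range(params["n_samples"])]

params = {"n_samples": 5, "model": "toy"}
seed = 42

# Record everything needed to regenerate the data alongside the output itself.
record = {
    "seed": seed,
    "parameters": params,
    "output": run_simulation(params, seed),
}

with open("run_record.json", "w") as fh:
    json.dump(record, fh, indent=2)
```

Anyone with `run_record.json` can rerun the simulation with the same seed and parameters and check that they get identical output.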
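The core idea behind "match run inputs to outputs" can be shown without any particular workflow manager. The sketch below mimics (in spirit, not API) what tools like signac do: derive a stable directory name from the input parameters and store those parameters next to the outputs. All names here (`job_dir`, `workspace`, `statepoint.json`) are illustrative:

```python
import hashlib
import json
import os

def job_dir(statepoint, workspace="workspace"):
    """Map a simulation's input parameters to a stable output directory."""
    key = json.dumps(statepoint, sort_keys=True)          # canonical form of the inputs
    digest = hashlib.sha1(key.encode()).hexdigest()[:12]  # stable short identifier
    path = os.path.join(workspace, digest)
    os.makedirs(path, exist_ok=True)
    # Keep the parameters next to the outputs so they can never be separated.
    with open(os.path.join(path, "statepoint.json"), "w") as fh:
        fh.write(key)
    return path
```

Because the key is canonicalised, the same parameters always map to the same directory, regardless of the order in which they were specified.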
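Documenting the hardware and the compute time can also be partly automated. A minimal sketch using only the Python standard library (the record's field names are an assumption, not a standard):

```python
import json
import platform
import time

start = time.perf_counter()
# ... the actual simulation would run here ...
elapsed = time.perf_counter() - start

# Minimal provenance record: what ran where, and how long it took.
provenance = {
    "python_version": platform.python_version(),
    "machine": platform.machine(),
    "platform": platform.platform(),
    "wall_time_seconds": round(elapsed, 3),
}
print(json.dumps(provenance, indent=2))
```

Saving this record next to the output data lets future users judge whether rerunning the code is feasible for them.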

And this may be an interesting read: https://doi.org/10.1007/978-3-030-14401-2_12
