Reducing the data footprint for large simulations #1141

Open
charliejharrison opened this issue Aug 6, 2021 · 2 comments

Comments

@charliejharrison
Collaborator

We would like to explore what happens to the model over large numbers of generations, spanning upwards of 10^6 cells. To make this feasible, we would need to drastically reduce the amount of data saved to disk by removing most of the tables and analysis runs. Data serialisation currently involves a number of classes (Listeners, Loggers, Processes) and happens in a few places, and I'm not sure whether the model relies on any of these for calculations or whether they can be safely excised.

Can data serialisation be made more configurable?

@tahorst
Member

tahorst commented Aug 11, 2021

This would be a great feature to add! I'm copying my original response to your email below for completeness on this thread (I hope it was enough to get you going) and listing some proposals on how to address this in a better way moving forward.

Current way of doing it manually:
I think if you comment out any listeners listed here that you don't want, they will not be written to disk. Some processes may depend on listener data while sims are running, and you won't be able to remove those listeners (we try to limit this as much as possible, so that processes and states are the only things needed during simulation and listeners are only the interface with disk), but this should be the simplest way of doing it. Logging to disk is set up in the simulation.py code here. This function is what actually sets the files that will be written (you can see that internal_states like BulkMolecules and all listeners are here), and I think it writes to disk with this function call after every timestep. If you don't want to save bulk molecules in your sims, you will need to remove them from createTables.

Proposal for making more configurable:

  • accept a config file that replaces the configuration in simulation.py. We can have a default config file that behaves the same way but optionally accept a different file for workflows and manual scripts to reduce the number of listeners used
  • remove simulation dependence on listeners - there are a few places in sims that read values from listeners but these could be made into states so that processes and states communicate during simulation and listeners are a one way route to disk (CellProperties state #512, Redundant volume/concentration calculations #80)
  • specify the listeners that get written to disk in the config file and pass it to this function instead of defaulting to the classes that are chained together
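
A minimal sketch of the config-file idea in the list above, assuming a JSON config with a `listeners` key. Everything here is hypothetical: `load_listener_names`, `DEFAULT_LISTENERS`, `REQUIRED_LISTENERS`, and the `ListenerA`/`ListenerB` placeholders are illustrative names, not part of the existing code. The required set stands in for whichever listeners the sim still reads from or writes to until they are turned into states.

```python
import json

# Placeholder default set; the real list is defined in simulation.py.
DEFAULT_LISTENERS = ["Mass", "EvaluationTime", "ListenerA", "ListenerB"]

# Assumption: listeners the simulation itself still depends on at runtime.
REQUIRED_LISTENERS = {"Mass", "EvaluationTime"}


def load_listener_names(config_text=None):
    """Return the listener names to instantiate, in the default order.

    With no config the default behavior is unchanged. With a config,
    only the requested listeners plus the required ones are kept.
    """
    if config_text is None:
        return list(DEFAULT_LISTENERS)

    config = json.loads(config_text)
    requested = set(config.get("listeners", DEFAULT_LISTENERS))

    # Fail early on typos rather than silently dropping a listener.
    unknown = requested - set(DEFAULT_LISTENERS)
    if unknown:
        raise ValueError(f"unknown listeners: {sorted(unknown)}")

    keep = requested | REQUIRED_LISTENERS
    return [name for name in DEFAULT_LISTENERS if name in keep]
```

A workflow or manual script could then pass a short config such as `{"listeners": ["ListenerA"]}` to shrink the output, while a missing config keeps today's behavior.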

I would love to get your thoughts on this @1fish2 (once you're back from your roadtrip!)

@1fish2
Contributor

1fish2 commented Aug 21, 2021

It looks like we could turn off all the output tables by setting the option logToDisk to False when constructing the simulation and simulationDaughter Firetasks. (wholecell/sim/simulation.py copies this option into the Simulation._logToDisk attribute -- it's tricky). All the listeners would still run in memory but none of the listeners, internal_states, or external_states would write to disk.

While you're at it, set logToShell to False to turn off most of the console output and tweak divide_cell.divide_cell() to only write one of the two daughter cell inherited state files.

Removing classes from _listenerClasses would save additional in-memory work but as @tahorst noted, the code expects to

  • read from the "Mass" listener: readFromListener("Mass", "cellMass"), self.listeners['Mass'].volume
  • write to the "EvaluationTime" listener: self.listeners["EvaluationTime"]
  • write to "Main" per special-casing in Disk; not even listed in _listenerClasses.

Yes, it'd be cleaner to turn those listeners into states.

To make this configurable, @tahorst's idea of removing listeners from _listenerClasses sounds good for all but "Mass", "EvaluationTime", and "Main". To keep specific listeners (including those 3) and internal/external states from writing to disk, Disk.createTables() could filter them out, again with special-casing for "Main".
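
A rough sketch of that filtering step, written as a standalone function rather than the actual Disk.createTables() code. The name `tables_to_create`, its arguments, and the `skip` option are hypothetical, and the sketch assumes one reading of the "Main" special case, namely that "Main" is added outside the listener list and is always written.

```python
def tables_to_create(listeners, internal_states, external_states, skip=()):
    """Return the names of the tables that should be created on disk.

    Each argument is an iterable of names. `skip` holds the names that
    should still run in memory but never get a table.
    """
    skip = set(skip)

    # Assumption: "Main" is special-cased and always written.
    names = ["Main"]

    for group in (internal_states, external_states, listeners):
        names.extend(
            name for name in group
            if name not in skip and name != "Main"
        )
    return names


# Example: keep the Mass listener running but stop saving BulkMolecules.
tables = tables_to_create(
    listeners=["Mass", "EvaluationTime"],
    internal_states=["BulkMolecules"],
    external_states=["Environment"],
    skip={"BulkMolecules"},
)
```

Filtering at table creation leaves the listeners themselves untouched, so code that reads "Mass" or writes "EvaluationTime" keeps working.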

BTW, all this is easy in vivarium-ecoli: Just configure Store variables to have _emit = False or let them default to False per the Vivarium framework.
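
For illustration, a schema in that style might look like the plain dict below. The port and variable names are made up, and `emitted_paths` is a hypothetical helper written for this sketch (not a Vivarium function) that lists which variables would be emitted when a missing `_emit` is treated as False.

```python
# Hypothetical schema: only cell_mass is marked for emission.
ports_schema = {
    "listeners": {
        "mass": {
            "cell_mass": {"_default": 0.0, "_emit": True},
            "volume": {"_default": 0.0, "_emit": False},
        },
    },
    "bulk": {"counts": {"_default": 0}},  # no _emit key, so not emitted
}


def emitted_paths(schema, prefix=()):
    """Collect the paths of all variables whose schema sets _emit to True."""
    paths = []
    for key, value in schema.items():
        # Skip schema keys such as _default and _emit themselves.
        if key.startswith("_") or not isinstance(value, dict):
            continue
        path = prefix + (key,)
        if value.get("_emit", False):
            paths.append(path)
        paths.extend(emitted_paths(value, path))
    return paths
```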
