Removing directories of generations that initialized but terminated immediately #1044

Open
bettybliu opened this issue Feb 17, 2021 · 6 comments

Comments

@bettybliu
Contributor

When a simulation breaks right after a new generation is initialized, it leaves behind an out/[simOutDir]/[variant]/[seed]/generation_######/ directory containing a 000000/simOut/shell.log output that looks like this:
[screenshot of the shell.log output omitted]

When runscripts/manual/analysisMultigen.py or runscripts/manual/analysisCohort.py is run on a directory containing this "phantom generation" directory, the analysis fails with the following error message:

Tue Feb 16 02:32:30 2021: --- Starting AnalysisCohortTask ---
Tue Feb 16 02:32:30 2021: Running proteinCopyNumberDistribution.py
Traceback (most recent call last):
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/wholecell/fireworks/firetasks/analysisBase.py", line 179, in run_task
    mod.Plot.main(*args)
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/models/ecoli/analysis/analysisPlot.py", line 120, in main
    instance.plot(inputDir, plotOutDir, plotOutFileName, simDataFile,
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/models/ecoli/analysis/analysisPlot.py", line 110, in plot
    self.do_plot(inputDir, plotOutDir, plotOutFileName, simDataFile,
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/models/ecoli/analysis/cohort/proteinCopyNumberDistribution.py", line 75, in do_plot
    idx_timepoint = np.random.randint(0, high=len(time))
TypeError: len() of unsized object
Completed analysis in 0:00:48.803789 with an exception in:
        proteinCopyNumberDistribution.py
Traceback (most recent call last):
  File "runscripts/manual/analysisCohort.py", line 47, in <module>
    analysis.cli()
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/wholecell/utils/scriptBase.py", line 589, in cli
    self.run(args)
  File "runscripts/manual/analysisCohort.py", line 42, in run
    task.run_task({})
  File "/Users/betty/Dropbox/Academics/PhD-Y1/Covert-Lab/wcEcoli/wholecell/fireworks/firetasks/analysisBase.py", line 202, in run_task
    raise RuntimeError('Error in analysis plot(s): {}'.format(', '.join(exceptionFileList)))
RuntimeError: Error in analysis plot(s): proteinCopyNumberDistribution.py

This could be fixed easily by removing the "phantom generation" directory, but for larger runs, one would need to go into each seed and check the last generation's shell.log. It would be nice to have a way to automatically delete these "phantom generation" directories or exclude them from analysis.

@tahorst
Member

tahorst commented Feb 17, 2021

Is there only a shell.log file there? I would expect tables and attributes would have been written in simOut/ as well unless the error occurred before any listeners wrote data.

This can also be an issue when running with fw_queue.py, where the directory structure is created prior to running simulations. If there are 2 generations to be run but the sim fails in the first generation, then variant, cohort, and multigen analyses usually don't work because of the empty 2nd generation directory.

We could check whether Daughter1_inherited_state.cPickle and Daughter2_inherited_state.cPickle exist in simOut/ as a proxy for a completed generation, or write a status update file ('initializing', 'running', 'failed', 'completed') to simOut/ so we know whether a simulation has started or completed. AnalysisPaths could store the status of each sim along with its path, and get_cells() could take an argument to filter by status, removing any sims that haven't written data. This seems more robust than needing to run a script to remove empty directories and less likely to result in accidental data loss.
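A rough sketch of what that filtering could look like (the read_status helper, the status.txt file name, and the completed_cells function are all hypothetical, not existing wcEcoli API):

import os

def read_status(sim_out_dir):
    # Hypothetical helper: read the status update file written to simOut/
    # ('initializing', 'running', 'failed', 'completed'); directories that
    # never got far enough to write one count as 'unknown'.
    status_path = os.path.join(sim_out_dir, 'status.txt')
    if not os.path.exists(status_path):
        return 'unknown'
    with open(status_path) as f:
        return f.read().strip()

def completed_cells(cell_paths):
    # Keep only cells whose simOut/ recorded a 'completed' status, dropping
    # "phantom generation" directories that never finished (or never started).
    return [path for path in cell_paths
            if read_status(os.path.join(path, 'simOut')) == 'completed']

get_cells() could apply a filter like completed_cells() only when the new argument asks for it, so plots that want partial generations could still get them.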

@bettybliu
Contributor Author

Yes, there are tables and attributes as well, not just the shell.log file. I agree that registering the simulation state and removing those sims from the analysis paths would be more robust than simply deleting the directories.

Using Daughter1_inherited_state.cPickle as a proxy for completed generations works, but registering the status of "failed" runs may be trickier (i.e. those that ran for a while and then stopped; I don't have enough backend knowledge to tell what would be a good way to record this during the run). Also, some may choose to include those partial generations in their analysis while others won't. I'd say the status update file would probably work best.
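For reference, the cPickle proxy check itself would be simple, something like this (a sketch; the generation_completed name is made up):

import os

def generation_completed(sim_out_dir):
    # A generation that divided successfully writes inherited state for both
    # daughters into simOut/, so the presence of these files can stand in for
    # a 'completed' status.
    return all(
        os.path.exists(os.path.join(sim_out_dir, name))
        for name in ('Daughter1_inherited_state.cPickle',
                     'Daughter2_inherited_state.cPickle'))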

@tahorst
Member

tahorst commented Feb 17, 2021

Yes, I agree that including partial generations (or even simulations that haven't started) in analysis will be important for some analysis plots, especially for debugging issues.

I think the best way to handle the status would be to create a function in Simulation that writes whatever status is passed to it to the status file (the sim's simOut/ is given by self._outputDir). Then you can call that function at different points in that file to reflect the statuses mentioned above.
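Something along these lines, as a sketch only (the method name and the status.txt file name are made up; self._outputDir is the sim's simOut/ as noted above):

import os

def _write_status(self, status):
    # Hypothetical Simulation method: record the current lifecycle status
    # ('initializing', 'running', 'failed', 'completed') in simOut/ so that
    # analysis code can later tell whether this generation ever finished.
    with open(os.path.join(self._outputDir, 'status.txt'), 'w') as f:
        f.write(status + '\n')

It could then be called with 'initializing' before setup, 'running' once time steps start, and 'completed' or 'failed' on the way out.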

@1fish2
Contributor

1fish2 commented Feb 17, 2021

Each sim firetask already returns a success/failure status code. Simulation or SimulationTask (actually, every firetask runner) should write that info to a file however the task was run (manual script, fw_queue Fireworks on Sherlock, or wcm.py Fireworks on GCloud).

Currently, when running via Fireworks on GCloud, DockerTask writes the status, the shell output, and other info to a logs/*.log file, as it does for all firetasks. Unify that log with the shell.log?

Also when running via Fireworks, the status gets stored in the launchpad DB.

I like the idea of AnalysisPaths optionally (on by default?) skipping failed sims. That'd make analyses less fragile.

BTW, Python 3.5 added a handy function os.scandir() that'd make AnalysisPaths simpler and faster. It enumerates a directory like os.listdir() but instead of returning just the local entry names, it iterates DirEntry entries which you can query for .is_dir(), .name, .path, etc. Here's an example to get a set of subdirs of directory d:

{entry.name for entry in os.scandir(d) if entry.is_dir()}

(That's going into diff_simouts.py because it assumed all simOut/* entries except *.cPickle are directories containing tables, which is not the case for shell.log.)
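For example, the fix there just needs to keep the directory entries and skip plain files (a sketch, not the actual diff_simouts.py code):

import os

def table_dir_names(sim_out_dir):
    # Only the table subdirectories of simOut/, ignoring plain files such as
    # shell.log and the *.cPickle files.
    return {entry.name for entry in os.scandir(sim_out_dir) if entry.is_dir()}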

@tahorst
Member

tahorst commented Feb 18, 2021

Each sim firetask already returns a success/failure status code. Simulation or SimulationTask (actually, every firetask runner) should write that info to a file however the task was run (manual script, fw_queue Fireworks on Sherlock, or wcm.py Fireworks on GCloud).

Is it Fireworks that writes the code to a file? Where does it end up? It still might be better handled in simulation.py if we want simulation-specific statuses (e.g. initializing, running, completed, no division, etc.) instead of firetask statuses.

Currently, when running via Fireworks on GCloud, DockerTask writes the status, the shell output, and other info to a logs/*.log file, as it does for all firetasks. Unify that log with the shell.log?

It could be good to unify with shell.log. It would also be good to save any error messages that pop up during simulation since shell.log is only the simulation output and not a complete redirect of the stdout/stderr stream.

@1fish2
Contributor

1fish2 commented Feb 18, 2021

Is that fireworks that writes the code to file? Where does it end up?

After the Firetask raises an exception or returns normally, Fireworks records info in the launchpad DB, including moving the task state to COMPLETED or FIZZLED. In GCloud runs, DockerTask records how the task ended in its log file. Otherwise, it's not getting saved.

It still might be better handled in simulation.py if we want simulation-specific statuses (e.g. initializing, running, completed, no division, etc.) instead of firetask statuses.

Good idea. It could write structured text to the console (and thus to shell.log and the DockerTask log) and/or write the status to a separate file.

It could be good to unify [the GCloud DockerTask log file] with shell.log. It would also be good to save any error messages that pop up during simulation since shell.log is only the simulation output and not a complete redirect of the stdout/stderr stream.

It's OK, if suboptimal, for shell.log to duplicate part of the DockerTask log. The former is created in all Simulation runs; the latter is created for all firetasks running in GCloud.

(BTW fw_queue shouldn't need to create the output directories now that the WCM firetasks do that in case they're running in GCloud.)

The DockerTask log does (I think) include stderr as well as stdout, also some preamble and conclusion. Example:

20210217.195240 DockerTask parca

{'_fw_name': '{{borealis.docker_task.DockerTask}}',
 'command': ['python',
             '-u',
             '-m',
             'wholecell.fireworks.runTask',
             'ParcaTask',
             '{"ribosome_fitting": true, "rnapoly_fitting": true, "cpus": 1, '
             '"variable_elongation_transcription": false, '
             '"variable_elongation_translation": false, "debug": false, '
             '"output_directory": "/wcEcoli/out/wf/kb/"}'],
 'image': 'gcr.io/allen-discovery-center-mcovert/jerry-wcm-code',
 'inputs': [],
 'internal_prefix': '/wcEcoli/out/wf/',
 'name': 'parca',
 'outputs': ['>>/wcEcoli/out/wf/logs/parca.log', '/wcEcoli/out/wf/kb/'],
 'storage_prefix': 'sisyphus-jerry/WCM/20210217.112538__numpy_wheel/',
 'timeout': 3600}

Docker Image ID: sha256:71c0466fe4928c373dab6f741af403be603c3874a089e088a938220d35c2cbb8

--------------------------------------------------------------------------------
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
runTask ParcaTask('ribosome_fitting': True, 'rnapoly_fitting': True, 'cpus': 1, 'variable_elongation_transcription': False, 'variable_elongation_
translation': False, 'debug': False, 'output_directory': '/wcEcoli/out/wf/kb/')
Wed Feb 17 19:53:26 2021: Instantiating raw_data
Wed Feb 17 19:53:31 2021: Saving raw_data
Wed Feb 17 19:53:31 2021: Calculating sim_data parameters
Ran initialize in 28 s
Ran input_adjustments in 0 s
Fitting RNA synthesis probabilities.
Max derivative (counts) = 0.000000
Max derivative (concentration) = 332047597.307941
Running non-linear optimization
Convergence (Jacobian) = 0% (<K> = 2475.04571)
Convergence (Jacobian_aux) = 96% (<K> = 0.98532)
Ran basal_specs in 210 s
Fitting RNA synthesis probabilities.
Fitting RNA synthesis probabilities.
Fitting RNA synthesis probabilities.
Fitting RNA synthesis probabilities.
...
Fitting condition no_oxygen
Fitting condition succinate
Fitting condition with_aa
Ran fit_condition in 973 s
Fitting promoter binding
--------------------------------------------------------------------------------

FAILED TASK: parca, elapsed 0:21:41 of timeout parameter 1:00:00 ['Docker process exit code 137 (SIGKILL)']

The last line could be better worded. The payload firetask inside Docker was not killed by the timeout. It wasn't killed due to Out Of Memory unless the Docker code changed how it indicates that status. I do not know what killed it!
