Parca for operons #1123

Open
1fish2 opened this issue Jul 20, 2021 · 3 comments
Comments

@1fish2
Contributor

1fish2 commented Jul 20, 2021

Design sketch after brainstorming with @ggsun:

  • (If I got this right) add a Parca option to select between (P) polycistronic and (M) monocistronic operons. The downstream simulation code would implement the operons as specified in sim_data.
    • A third choice would emit (B) both cases in order to run sims and compare their results as "variants."
    • Q. Would the default be (P)?
    • Q. How much would we use (B)?
    • Q. Are we expecting more Parca variations?
  • Add a PM() variant function that selects the (P) or (M) sim_data. The default case PM(0) would do the same thing as Wildtype() while PM(1) would select the other sim_data.
  • Add mechanism into apply_variant() to compose PM() with any existing variant function, indexing a composite variant as composed_variant_index * 2 + PM_index. (The fw_queue, wcm, and manual runscripts require that we always supply a contiguous range of variant indexes. We could change that if needed.)
    • Alternatively, always feed PM(0) into any existing variant and hand-code composite functions as needed.
  • Put the (P) and (M) Parca output files (simData.cPickle, rawData.cPickle, validationData.cPickle, save-intermediates, ...) into separate kb/ subdirectories.
    • The (B) case can run the two Parcas in parallel. Use our nestable multiprocessing pool in manual/runParca and separate Firetasks in Fireworks workflows.
    • PM(index) would copy one or the other sim_data to the variant's kb/ directory, and also copy rawData.cPickle and validationData.cPickle (or create symlinks?). Make the analysis code read those copies.
    • Alternative to separate sim_data files: Put all the (one or two) generated sim_data trees into one pickle file, say in a dict, and do likewise for the other output files. Make the variant functions write one sim_data object to each variant kb/ directory. This simplifies apply_variant() a little, maintains a single kb/ directory, allows sharing identical leaf nodes (ndarrays etc.) between the two sim_data trees, and could save some duplicate computation. But @ggsun points out that the two cases diverge pretty early in the Parca workflow because the operon structure affects the transcription probabilities and their modulation by transcription factors, which take up the bulk of the Parca calculations. So this sounds like more development work and less runtime parallelism.
  • Use separate KmcountsCached cache files for the two cases.
    • Or put the cache's checksum in its filename so every distinct case would get a distinct cache file. The files would accumulate until make clean.
    • Q. The KmcountsCached checksum compares Kmcounts.shape and sum(abs(R_aux(KmcountsCached))). Would a CRC checksum be more selective than sum(abs())?
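
The composite indexing proposed above (composed_variant_index * 2 + PM_index) can be sketched as follows. This is only a minimal illustration, assuming exactly two PM choices; `compose_index` and `split_index` are hypothetical helper names, not existing functions in the codebase.

```python
from typing import Tuple

PM_CHOICES = 2  # assumption: PM_index 0 and 1 select the two sim_data choices


def compose_index(composed_variant_index: int, pm_index: int) -> int:
    """Pack an existing variant's index and a PM index into one composite index."""
    if not 0 <= pm_index < PM_CHOICES:
        raise ValueError(f"pm_index must be in range({PM_CHOICES})")
    return composed_variant_index * PM_CHOICES + pm_index


def split_index(composite_index: int) -> Tuple[int, int]:
    """Recover (composed_variant_index, pm_index) from a composite index."""
    return divmod(composite_index, PM_CHOICES)


# A contiguous range of composite indexes covers both PM cases of each variant.
pairs = [split_index(i) for i in range(4)]
```

With this encoding, a contiguous range of composite indexes still enumerates every (variant, PM) pair, which fits the runscripts' requirement for contiguous variant indexes.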
@ggsun
Contributor

ggsun commented Jul 21, 2021

Thanks for all the discussions and for writing this up, Jerry! To propose my opinions on some of the questions here:

Q. Would the default be (P)?

My hope is that eventually the default option would be (P), though in the early stages of this work the default would be set to (M) to minimize interference with other people's work. How quickly we can switch over to (P) would depend on how disruptive adding operon structures would be to how the model runs in general.

Q. How much would we use (B)?

I'd say its uses are limited to a specific instance where we need to compare the outputs of the simulation with/without operons. I don't see this being used often once we finish up a publication on the operon integration and actually move over to default (P), except for specific debugging purposes.

Q. Are we expecting more Parca variations?

Maybe. It's hard to anticipate what changes we would be bringing to the model in the future. If the necessary parameter changes for a variant require that we go all the way back to raw_data and rerun the parca, we would need such a variant to change those parameters.

Q. The KmcountsCached checksum compares Kmcounts.shape and sum(abs(R_aux(KmcountsCached))). Would a CRC checksum be more selective than sum(abs())?

Yes, that would probably be a better way to do this, though in this specific case Kmcounts.shape will be already different so it would be moot.
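
To illustrate the selectivity question: a sum of absolute values cannot see a sign flip, while a CRC over the raw bytes can. Below is a minimal sketch that uses plain Python lists as stand-ins for the array being checksummed; the values and the helper names `sum_abs` and `crc` are made up for illustration.

```python
import struct
import zlib


def sum_abs(values):
    """Sign-insensitive summary, analogous to sum(abs(...))."""
    return sum(abs(v) for v in values)


def crc(values):
    """CRC-32 over the values packed as little-endian float64 bytes."""
    return zlib.crc32(struct.pack(f"<{len(values)}d", *values))


a = [1.0, 2.0, 3.0]
b = [1.0, -2.0, 3.0]  # same length, one sign flipped (a single-bit change)

same_sum = sum_abs(a) == sum_abs(b)  # True: the sum cannot tell them apart
same_crc = crc(a) == crc(b)          # False: CRC-32 detects any single-bit change
```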

@tahorst
Member

tahorst commented Jul 21, 2021

Lots of great discussion points here! Thanks for detailing everything, Jerry! I agree with Gwanggyu's responses. Overall, I think improving the variant approach (adding a parca variant level and composing simulation variants) would be useful as long as it doesn't unnecessarily complicate the workflow and runscripts. A few more points about those are in the responses below.

Q. Are we expecting more Parca variations?

I would expect that a framework for parca variants would be useful moving forward. We already have some parca options like variable elongation and capacity fitting that could be analyzed as variants. A parca variant framework could also be helpful for the work I added in #1108 if we consider different raw_data inputs and modified sim_data outputs as a variant in order to simplify the directory structure with each iteration and make it easier to apply simulation variants on top.

Add mechanism into apply_variant() to compose PM() with any existing variant function, indexing a composite variant as composed_variant_index * 2 + PM_index

There would be a lot of utility in allowing variant composition in general. I think in particular, combining the condition variant with other variants would be useful right now.

Or put the cache's checksum in its filename so every distinct case would get a distinct cache file. The files would accumulate until make clean.

I like this idea! Oftentimes, I will be working on multiple branches that each have to rerun the nonlinear optimization when I switch between them. This could be avoided if they each had their own cache files.

@1fish2
Contributor Author

1fish2 commented Jul 21, 2021

Since more Kmcounts caches would help right away and it's an independent step for operons, I'll start there.

Any objection to renaming it from fixtures/endo_km/km3.cPickle to, say, cache/parca-km-<code>.cPickle, where <code> is some checksum info? It could be the sum() to limited digits of precision, or the shape, or a CRC, or some combination of those.
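
As a rough sketch of one way the <code> portion could be derived, the snippet below combines the array's shape with a CRC of its contents. The function name `km_cache_path`, the default `cache` directory argument, and the use of plain lists instead of ndarrays are assumptions for illustration, not the actual implementation.

```python
import os
import struct
import zlib


def km_cache_path(values, shape, cache_dir="cache"):
    """Build a cache filename whose <code> depends on the data's shape and contents."""
    payload = struct.pack(f"<{len(shape)}q", *shape)     # shape as int64s
    payload += struct.pack(f"<{len(values)}d", *values)  # values as float64s
    code = zlib.crc32(payload)                           # unsigned 32-bit integer
    return os.path.join(cache_dir, f"parca-km-{code}.cPickle")


path_a = km_cache_path([1.0, 2.0, 3.0, 4.0], shape=(2, 2))
path_b = km_cache_path([1.0, -2.0, 3.0, 4.0], shape=(2, 2))  # one sign flipped
```

Each distinct input then maps to its own file name, so switching between cases reuses an existing cache file instead of overwriting a shared one.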

So variant types and indexes that set different Parca options would help more generally, as would composing variants.

1fish2 added a commit that referenced this issue Jul 22, 2021
Put a checksum into the `KmcountsCached` cache filename so different cases get independent cache files, e.g. when switching git branches, Parca options during parameter optimization, or mono/polycistronic operons.

This renames the cache file from `fixtures/endo_km/km3.cPickle` to `parca-km-1918837868.cPickle`, for instance.

Q. Does anyone prefer the "fixtures" directory name?

The cache files `cache/parca-km-*.cPickle` will accumulate until `make clean`.

Does this succeed in distinguishing current cases?

We could make this more sensitive by checksumming more inputs or less picky by rounding `Kmcounts.astype(np.float16)`.

See #1123
1fish2 added a commit that referenced this issue Jul 27, 2021