Negative counts from StochasticSystem #39

Open
tahorst opened this issue Jun 20, 2019 · 5 comments

Comments

@tahorst
Member

tahorst commented Jun 20, 2019

I've noticed that under certain conditions, StochasticSystem.evolve() can produce negative counts, which should not be possible if I'm not mistaken. I've pushed a branch (neg-counts) with a script that reproduces this behavior - running python neg_counts.py evolves the system until negative counts are reached. There are also very few reactions occurring even over an extremely long time, which might be a related issue.
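
For anyone poking at this, a minimal guard of the kind the script checks for might look like the sketch below; the function name and the assumption that evolve() ultimately yields an updated counts vector are mine, not the library's actual API.

```python
import numpy as np

def assert_nonnegative(counts):
    # Counts from a stochastic simulation should never go negative;
    # report the offending species indices if they do.
    counts = np.asarray(counts)
    negative = np.flatnonzero(counts < 0)
    if negative.size:
        raise ValueError(
            'negative counts at species indices {}: {}'.format(
                negative.tolist(), counts[negative].tolist()))
```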

@jmason42
Contributor

jmason42 commented Jun 20, 2019

I haven't run the system but I did inspect the data file - guessing this is adapted from the complexation stoichiometry. Barring an implementation issue, I'd guess we're looking at a numerical problem, probably downstream of the fact that there are very large numbers of reactants in some reactions - for example, there's one with 130 reactants, 120 of which are the same species. Depending on the implementation, "n choose 120" could easily overflow (e.g. if you try to compute 120 factorial). Just a guess at this point, I'll try to take a look. Thanks for the example code.
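
To illustrate that guess (not necessarily what the library does internally; the copy number of 1000 is invented): a factorial-based binomial coefficient in double precision goes to NaN long before the coefficient itself is out of range, while a log-space computation stays finite.

```python
import numpy as np
from scipy.special import factorial, gammaln

n, k = 1000, 120  # illustrative copy number and stoichiometry

# Naive double-precision factorials: factorial(1000) overflows to inf,
# and inf / inf then propagates as NaN.
naive = factorial(n) / (factorial(k) * factorial(n - k))
print(naive)  # nan

# The same coefficient computed in log space is about 1e158, which is
# comfortably within the range of a double.
print(np.exp(gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)))
```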


Going to edit with findings until I have a better idea of what's happening. The first clue is that the 'time' is NaN immediately following the first step. That alone should be caught and an error thrown early.

Additionally, the only reaction that occurs is the first reaction, over and over until the system fails.

Hmm, 'choose' seems OK. It spits out what look like reasonable numbers. They're strictly positive, which I think is because anything that would be zero gets filtered before the call to 'choose'.

Yes, so many of these reaction propensities are extreme. E.g. at the final step, the propensity for reaction 84 is 1.59e91 reactions per second. That's not even the largest, and there is at least one reaction with a NaN propensity. These problems exist from the first simulated step.

@jmason42
Contributor

There are several issues coming together at once here:

  1. Extreme numbers of reactants participating in reactions (several types, several copies of a single type).
  2. Large numbers of molecules.
  3. A single, large stochastic rate constant being utilized for every reaction.

So, effectively, you have several big numbers (sometimes very big) multiplied together, resulting in some NaNs. The exact consequence of these NaNs isn't really worth tracking down but you can imagine how NaN propagation throws a wrench into the Gillespie SSA update procedure.
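
As a made-up illustration of the scale involved (counts, stoichiometries, and the rate constant here are invented, not taken from the data file): a propensity of the form rate * prod_i C(count_i, stoich_i) overflows a double even when no single factor looks alarming.

```python
import numpy as np
from scipy.special import comb

rate = 1e5                     # large stochastic rate constant
counts = np.full(10, 10000)    # ten reactant species, 10,000 copies each
stoich = np.full(10, 12)       # each consumed 12 at a time

# Each factor is "only" ~1e39, but the product of ten of them times the
# rate overflows double precision and becomes inf; one more operation
# like inf - inf or 0 * inf downstream yields NaN and the SSA update
# falls apart.
factors = comb(counts, stoich)
propensity = rate * np.prod(factors)
print(factors[0], propensity)  # ~2e39  inf
```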

The remediation isn't clear. Obviously this package needs some error checking, but this is more of a garbage-in, garbage-out issue. Ideally I'd recommend the following:

  • Discard any complexation reactions where the end product has no meaningful impact in the simulation. This should reduce the problem considerably.
  • Pick a representative rate constant that is physically reasonable for the modeled space. This means it should be backed by existing experimental data, and adjusted each time step for the volume of the cell (see the sketch after this list; reference Find complexation rates wcEcoli#431 (comment)). Note that this representative rate constant only makes sense if all reactions are dimeric, thus...
  • Expand all (n>2)-meric complexation reactions into dimeric reactions. This could and should be done computationally.
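
For the volume adjustment in the second point, the standard Gillespie conversion for a bimolecular reaction is c = k / (N_A * V); a minimal sketch, where the deterministic rate constant and the cell volume are placeholders rather than measured values:

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def stochastic_rate_bimolecular(k_det, volume_liters, same_species=False):
    """Convert a deterministic second-order rate constant (1/(M*s)) into
    a stochastic rate constant (1/s per reactant pair) for a given volume.
    For 2A -> products the conversion carries an extra factor of 2."""
    c = k_det / (AVOGADRO * volume_liters)
    return 2.0 * c if same_species else c

# Example: a 1e6 1/(M*s) association rate in an E. coli-scale volume of
# about 1e-15 L (both numbers are placeholders).
print(stochastic_rate_bimolecular(1e6, 1e-15))  # ~1.7e-3 per pair per second
```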

Any other approach I can think of would be inferior to simply reverting complexation to my old procedure.

@prismofeverything
Member

Interesting, so we have found the practical bounds of this algorithm. Thanks for the analysis @jmason42! From looking at it, it seems:

  • Changing the rates doesn't seem to affect the problem, as the overflow happens before the rates are applied. That said, using the right rates would be beneficial in general and you are correct to point out that we never fully resolved Find complexation rates wcEcoli#431.
  • Increasing the number of simultaneous reactants has a greater impact than simply increasing the counts: the per-species combination counts are multiplied together, so the propensity grows exponentially in the number of reactant slots, while it grows only linearly in the count of a species consumed once (see the sketch after this list).
  • It doesn't matter which implementation we use. I got a different failure from GillespieReference, the original Python implementation (an overflow warning followed by a ValueError).
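
A small illustration of that scaling, with an invented copy number: holding the count fixed and adding identical reactant slots blows up much faster than raising the count with a single slot.

```python
from scipy.special import comb

n = 1000  # copies of one species (illustrative)

# More copies of the same reactant consumed per reaction: ~n^k growth.
for k in (1, 2, 5, 10, 20):
    print('k = {:2d}  combinations ~ {:.3g}'.format(k, comb(n, k)))

# More molecules at a fixed stoichiometry of one: only linear growth.
for count in (1000, 2000, 4000, 8000):
    print('n = {}  combinations = {:.0f}'.format(count, comb(count, 1)))
```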

Having the code detect these conditions is desirable and can be added without too much trouble, but as @jmason42 points out, it does not solve the problem. Reducing the number of simultaneous reactants would do it ("dimerization" of the stoichiometry), and the stoichiometry could be decomposed internally to this library so that it stays transparent to the end user. With that we would increase the effective bounds of where this algorithm is applicable, though by how much would need to be tested.
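
The detection itself could be as simple as the following sketch (illustrative only, not the library's existing API): fail as soon as any propensity is NaN, infinite, or negative instead of letting the SSA sample nonsense time steps.

```python
import numpy as np

def check_propensities(propensities):
    # Guard against the failure mode seen here: NaN or inf propensities
    # silently corrupting the Gillespie update.
    propensities = np.asarray(propensities, dtype=float)
    bad = ~np.isfinite(propensities) | (propensities < 0)
    if bad.any():
        raise ValueError(
            'invalid propensities at reaction indices {}'.format(
                np.flatnonzero(bad).tolist()))
```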

@prismofeverything
Member

Expand all (n>2)-meric complexation reactions into dimeric reactions. This could and should be done computationally.

@jmason42 Did you have an implementation in mind here? The main choices I see are either a tree form, where you split an aggregate reaction into successive "halves" and complexation works through recursive pairing, or the simple linear approach, where you track each N-mer and apply one monomer at a time. The tree would be logarithmic in depth if this needs to support reactions with 1000s of simultaneous reactants. Either way it could all be done in the Python layer before passing the stoichiometry into the obsidian layer, and we would have to reassemble the original shape of the counts vector once the results come back from evolve.
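
For concreteness, a rough sketch of the tree option: recursively pair halves of the subunit list into intermediates (naming intermediates by concatenation is purely for illustration, and repeated copies of a species would just appear as repeated names).

```python
def dimerize_tree(subunits):
    """Recursively split a subunit list into halves and emit dimeric
    assembly reactions as (left, right, product) tuples."""
    if len(subunits) == 1:
        return subunits[0], []
    mid = len(subunits) // 2
    left, left_rxns = dimerize_tree(subunits[:mid])
    right, right_rxns = dimerize_tree(subunits[mid:])
    product = left + right
    return product, left_rxns + right_rxns + [(left, right, product)]

final, reactions = dimerize_tree(list('ABCDEFGH'))
for left, right, product in reactions:
    print('{} + {} -> {}'.format(left, right, product))
# One 8-mer reaction becomes 7 dimeric reactions with 6 intermediate
# species, and the assembly depth is log2(8) = 3 steps.
```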

@jmason42
Contributor

Yes, I see this as an issue external to this library. My basic idea would be to totally reformulate the system such that the single-monomer-addition intermediates are real species, tracked by the simulation. This is fine in the small cases, e.g. A + B + C -> ABC would require introducing and tracking the AB, BC, and AC species (and replacing the modeled reaction with A + B -> AB, B + C -> BC, A + C -> AC, AB + C -> ABC, BC + A -> ABC, AC + B -> ABC).
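
A quick way to enumerate that expansion (and to see how fast it blows up) is to generate every single-monomer-addition step over all sub-complexes; this assumes distinct subunits and uses concatenated names purely for illustration.

```python
from itertools import combinations

def expand_all_pathways(subunits):
    # Every sub-complex of size >= 2 can be formed by adding one monomer
    # to the sub-complex containing its other subunits.
    reactions = set()
    for size in range(2, len(subunits) + 1):
        for subset in combinations(subunits, size):
            for monomer in subset:
                rest = ''.join(s for s in subset if s != monomer)
                # Canonical ordering so A + B and B + A are one reaction.
                reactants = tuple(sorted([rest, monomer]))
                reactions.add((reactants, ''.join(subset)))
    return sorted(reactions)

for (first, second), product in expand_all_pathways('ABC'):
    print('{} + {} -> {}'.format(first, second, product))
# Prints the six reactions listed above for A + B + C -> ABC.

# For a 10-mer of distinct subunits there are already
# 2**10 - 10 - 2 = 1012 intermediate sub-complexes to track.
print(2 ** 10 - 10 - 2)
```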

Easy to expand this computationally, but the number of possibilities grows catastrophically in many cases. I think a 10-mer is probably around the biggest reaction you'd be comfortable decomposing, hence my comment about disabling all the complexation that has no impact in the model. This still leaves important things like ribosomes, which have dozens of subunits.

So, running with the ribosome idea - your first option is to model their formation somewhere else, using something like the old complexation procedure. Reasonable if it's not likely to be important for the rest of the model, though unsatisfying because you start having several different realizations of similar physical processes.

Let's say that's not acceptable, and furthermore, the explicit representation of all different subunit combinations is computationally infeasible. At this point you can apply some physics/biology, and start culling down the number of possible assembly pathways to a handful. After all, it's not easy for a bunch of protein/RNA blobs to all be in contact with each other at once. This would reduce the problem considerably but I think there are some profound data issues here - @mialydefelice would know better. Even if the data is available it would be quite a chore to implement all the particular reactions.

Stepping back, you could keep all the possible assembly pathways, but rethink the underlying representation. Maybe there's a new "partial complex" unique molecule that could be used to represent those non-functional intermediates - this way you don't need a new entry in bulk molecules for all combinatorial possibilities. Furthermore you could consider collapsing reactions; e.g. maybe you believe that AB + C -> ABC and B + C -> BC are equivalent in terms of kinetics, so there's just a generic *B + C -> *BC reaction you need to update. (This is a similar idea to what we talked about with @heejochoi and the stochastic modeling of polypeptide elongation.)
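
One way to picture the "partial complex" idea (a sketch only; none of these names or structures match an existing wcEcoli data structure): a single unique-molecule type whose state is the multiset of subunits incorporated so far, so combinatorial intermediates never need their own bulk-molecule entries, and a generic "add a monomer" update stands in for many collapsed reactions.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class PartialComplex:
    # The complete complex this intermediate is building toward, plus the
    # subunits incorporated so far (a multiset).
    target: str
    subunits: Counter = field(default_factory=Counter)

    def add(self, subunit, count=1):
        # One generic update covers what would otherwise be a separate
        # reaction per combinatorial intermediate.
        self.subunits[subunit] += count

partial_30s = PartialComplex(target='30S ribosomal subunit')
partial_30s.add('16S rRNA')
partial_30s.add('S4')
print(partial_30s.subunits)
```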

So, those are some approaches. I don't really have a clear idea of what approach would be best, but practically speaking it would be easiest to pick one strict assembly pathway (even if the biological evidence is weak to nonexistent). That would keep the number of intermediate species linear w.r.t. the number of subunits. More explicit, detailed modeling can come later. The one major downside I see to this is that going from A to AB, to ABC, to ABCD, etc. would be slower than going from {A, C} to {AB, CD}, to ABCD - if you follow. Again, troublesome for the larger cases, but without more data on what those actually are I can't recommend a single appropriate course of action. E.g. there's no sense wracking our brains trying to accommodate special cases like flagella formation.

That was a big pile of information - in summary:

  • Trim out unneeded complexes. Hopefully this eliminates most bad cases.
  • For (2 < n < ~10)-mers, expand the reactions and add the intermediates to bulk molecules.
  • For (n >= ~10)-mers, do one of the following.
    • Model using the old complexation process (warning: brings back old timescale issues).
    • Simplify the complexation pathway to a tractable number of intermediates (based on biological evidence, or arbitrarily; e.g. start with the highest molecular weight species and progressively add smaller ones).
    • Represent intermediate species as elements of some new unique molecule type (and possibly collapse down the possible reactions, too).

One more thing I neglected to mention: if you are modeling intermediates, you probably need to model reverse reactions, too. Otherwise you'll risk silly things like an accumulation of AB and BC that can't be taken apart and reassembled to make the terminal ABC complex. We usually think of intermediates as being unstable/highly dynamic, so in most cases, reversibility makes a lot of sense.
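
Putting the single-strict-pathway suggestion and this reversibility point together, a pathway generator might look like the sketch below; the subunit ordering, naming, and the reversibility flag are illustrative assumptions.

```python
def linear_assembly_reactions(subunits, reversible=True):
    """Generate one strict assembly pathway that adds a single subunit at
    a time, optionally emitting the reverse (disassembly) step as well."""
    reactions = []
    partial = subunits[0]
    for subunit in subunits[1:]:
        product = partial + subunit
        reactions.append('{} + {} -> {}'.format(partial, subunit, product))
        if reversible:
            reactions.append('{} -> {} + {}'.format(product, partial, subunit))
        partial = product
    return reactions

# Subunits might be ordered by decreasing molecular weight, as suggested
# in the summary above; here the order is just alphabetical.
for reaction in linear_assembly_reactions(['A', 'B', 'C', 'D']):
    print(reaction)
```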
