
Optimize GeneralBindingModel #90

Open
jchodera opened this issue Apr 23, 2017 · 19 comments

Comments

@jchodera
Member

From some profiling tests, it looks like root solving in the GeneralBindingModel is the dominant cost of the new autoprotocol-based assaytools changes, with roughly 76% of total sampling time spent in calls to scipy.optimize.root here.

I imagine we could greatly speed this up by rewriting the target function using some optimization scheme.

@jchodera
Member Author

@pgrinaway @gregoryross @maxentile : If you had any advice on how to optimize the target function with minimal effort, I'd be very appreciative of suggestions!

@pgrinaway
Member

Taking a look at this now.

@jchodera
Member Author

Thanks!

I realized that even just encoding the quantities as numpy arrays outside of the target function will likely lead to a big speedup, but I was wondering if there was some sort of numba or other thing that might trivially speed this up.
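As a rough illustration of that idea, here's a minimal sketch of hoisting the array conversion out of the closure so the root finder's target only touches preallocated numpy arrays. The stoichiometry matrix `S` and log equilibrium constants here are made up for the example, not the actual GeneralBindingModel internals:

```python
import numpy as np

def make_target(stoichiometry, log_K):
    # Convert inputs to numpy arrays ONCE, outside the target function.
    # stoichiometry: (n_reactions, n_species); log_K: (n_reactions,)
    S = np.asarray(stoichiometry, dtype=float)
    logK = np.asarray(log_K, dtype=float)

    def target(log_c):
        # Residual of the log-form equilibrium conditions: S @ log_c - log_K.
        # Only array ops here, so every root-finder call stays cheap.
        return S.dot(log_c) - logK

    return target

# Hypothetical two-reaction, two-species system:
target = make_target([[1.0, 1.0], [1.0, -1.0]], [0.5, -0.5])
residual = target(np.array([0.25, 0.25]))
```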

@pgrinaway
Member

Ok, the most obvious way of optimizing this would be to implement it in Cython, but there are a few tricky points:

  • The code appends to lists--we wouldn't want to do that in Cython, since that would require manipulating Python objects (as opposed to numpy arrays which are well-represented in C).

  • The code uses dictionaries heavily. This would also diminish performance under Cython, because again it would just be calling the Python interpreter.

  • That function would have to be removed as a closure, and accept all of the data that it needs (with types fully specified) as parameters. This is not very difficult.

It's not substantially easier with numba--with both numba and Cython you'll be able to take this code and directly compile it (I think--numba might not be happy with the closure, I've never tried it), but you probably won't see much performance increase.

So, it would require a bit of re-thinking how the function works (which might even in pure python/numpy make the code faster). I can take a stab at this, if you want.
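For instance (purely illustrative, not the assaytools code), replacing an append-in-a-loop pattern with a vectorized computation is exactly the kind of restructuring that pays off even in pure Python/numpy, before Cython enters the picture:

```python
import numpy as np

def residuals_append(values):
    # Append-to-list pattern: forces Python-object manipulation per element,
    # which Cython/numba cannot compile away.
    out = []
    for v in values:
        out.append(v * v - 1.0)
    return np.array(out)

def residuals_vectorized(values):
    # Same computation on a numpy array: one C-level loop, no Python objects.
    v = np.asarray(values, dtype=float)
    return v * v - 1.0
```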

@pgrinaway
Member

I don't think numba would be trivial (to get a speedup), but it's easy to try.

@jchodera
Member Author

I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.

We can still have the binding model function accept and return dicts, but it can do some preprocessing so that the target function runs as fast as possible on numpy arrays.

@maxentile
Member

Also took a look!

Yeah, numba doesn't know what to do with dicts (code using dictionaries will run in "object mode" / as slow as interpreted code), so those will need to be flattened out into arrays beforehand for numba to have any effect...

@jchodera : Could you point to a few input instances to optimize for? On the input in the doctest, the PyCharm profiler says the most time-consuming functions are in BLAS, so I think loop-jitting might not produce a substantial speed-up on that instance. However, if there are many more reactions, species, or conservation_equations, the interpreter overhead would be more substantial.
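To make the dict issue concrete, here's a hypothetical flattening step (species names and stoichiometries invented for the example) that converts dict-based model input into the dense arrays a jitted kernel would need:

```python
import numpy as np

def flatten_model(concentrations, reactions):
    # Map {species_name: concentration} plus per-reaction stoichiometry
    # dicts into a concentration vector and a dense stoichiometry matrix.
    species = sorted(concentrations)                  # fixed species ordering
    index = {name: i for i, name in enumerate(species)}
    c = np.array([concentrations[s] for s in species])
    S = np.zeros((len(reactions), len(species)))
    for r, stoich in enumerate(reactions):
        for name, nu in stoich.items():
            S[r, index[name]] = nu
    return c, S

# Hypothetical one-reaction system P + L <=> PL:
c, S = flatten_model({'L': 1e-6, 'P': 1e-6, 'PL': 0.0},
                     [{'P': -1, 'L': -1, 'PL': +1}])
```

All the dict handling happens once up front; only `c` and `S` would be passed to the compiled target.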

@jchodera
Member Author

Here is the description of the calculation in equations, for reference:

http://assaytools.readthedocs.io/en/latest/theory.html#general-binding-model

@jchodera
Member Author

Whoops, let me post the branch name and info on how to run the example shortly. Sorry about that!

I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?

@pgrinaway
Member

> I think the key to speeding things up is to speed up just the target function that gets called many times by the root finder. It accepts a numpy array and returns two numpy arrays. I should be able to move some logic outside the target function and do something to speed up the computation inside the target.

Right, this is what I meant. If we deal with dicts and resizing lists inside the target function, it's going to be pretty difficult to optimize.

@pgrinaway
Member

> I do realize we don't want to have any dict processing inside the target function. I can do preprocessing outside the target and feed numpy arrays in. But is that the best I can do?

Once you've done that, you can give numba a try. But my experience with numba is that you end up having to write essentially Cythonic/C-style code to get the substantial speed-up: see what we ended up with in perses. All types are fully specified, and the numba compiler is even instructed to fail if it can't avoid using the interpreter. I spent a bit of time gradually optimizing those routines, and ended up with what you see in that link.

Same with Cython--you can see what I did in yank. I also gradually made the code more and more C-like to get maximum performance.
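To illustrate what that C-style endpoint looks like, here's a toy Newton kernel (not the assaytools target function): explicit loops over scalars, no dicts or list appends, so decorating it with numba's `@njit` in nopython mode (or adding Cython type declarations) would be the only remaining change.

```python
import numpy as np

def newton_sqrt(ks, tol=1e-12, maxiter=50):
    # Solve x*x = k for each k by Newton iteration. Scalar loops only:
    # this is the style numba's nopython mode (which refuses to fall back
    # to the interpreter) compiles to fast machine code.
    out = np.empty_like(ks)
    for i in range(ks.shape[0]):
        x = max(ks[i], 1.0)
        for _ in range(maxiter):
            dx = (x * x - ks[i]) / (2.0 * x)
            x -= dx
            if abs(dx) < tol:
                break
        out[i] = x
    return out
```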

@pgrinaway
Member

At first glance it might not work, but you might be able to use numexpr somewhere. I suspect that might take as much effort as refactoring for Cython, though.

@jchodera
Member Author

Thanks! This is super helpful!

@jchodera
Member Author

No need to spend time on this, but since someone asked: check out the optimizer-speedup branch (which prints some output about how much time is spent in the fzero() call) and run examples/autoprotocol/single-wavelength-probe-assay.py to try it out.

@maxentile may be totally right that the time-consuming part is not the target function computation. I hadn't even considered that!

I had tried adding an LRU cache to store the most recent 128 or 1024 evaluations, but that did not seem to improve performance.
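For reference, here's roughly what such a cache looks like (a generic sketch, not the exact code that was tried): since numpy arrays aren't hashable, the array has to be keyed by its bytes, and successive root-finder iterates almost never repeat bit-for-bit, which would explain why the cache didn't help.

```python
import numpy as np
from functools import lru_cache

def cached_target(f, maxsize=128):
    # Wrap a target function with an LRU cache keyed on the array's raw
    # bytes (arrays themselves are unhashable).
    @lru_cache(maxsize=maxsize)
    def _cached(key, shape):
        x = np.frombuffer(key, dtype=float).reshape(shape)
        return f(x)

    def wrapper(x):
        x = np.ascontiguousarray(x, dtype=float)
        return _cached(x.tobytes(), x.shape)

    return wrapper, _cached

f, raw = cached_target(lambda x: x.sum())
f(np.array([1.0, 2.0]))
f(np.array([1.0, 2.0]))  # identical input: served from the cache
```

The cache only fires on bit-identical inputs, so a root finder probing slightly different points each iteration gets essentially all misses.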

@pgrinaway
Member

Ok, I've profiled that script, and the top two functions that are causing slowness are:

  1. values from WeakValueDictionary

  2. ftarget (this is actually 47% of the total time, though its "own" time is 8.5%). One of the big consumers there is new_method, which is something deep in PyMC.

I will post the profiling results in full later, but this does suggest that ftarget could stand to be optimized.

@pgrinaway
Member

Ah, new_method is a function inside the function create_uni_method in CommonDeterministics. Not sure why it gets called every time ftarget does.

@pgrinaway
Member

Sorry for the spam, but it seems to me like it is bits of PyMC that are quite slow, and that the ftarget function itself might not be so bad.

@jchodera
Member Author

That's really odd. The ftarget function should be called many times per call to equilibrium_concentrations (which is called once inside a couple of deterministics). pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the ftarget function.

I wish I knew more about how pymc works, since this suggests we could speed things up by unwrapping the input pymc variables once and feeding them to a pre-constructed function that deals only in numpy floats, doing the optimization with pre-computed matrices.

In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.

@pgrinaway
Member

> The ftarget function should be called many times per call to equilibrium_concentrations

It is called several times more than equilibrium_concentrations, you're correct.

> pymc may be doing some weird wrapping of things, or perhaps a weird unwrapping of things where pymc variables are being substituted for floats deep inside the ftarget function.

I suspect this is the case. It seems like some sort of pymc variable is being instantiated each call, which, according to profiling, is extremely expensive.

> In other news, trying to figure out how to recode this all for tensorflow or edward is also not proving easy, since tensorflow compute graphs can't have nodes that are iterative operations like function solvers.

Right, that's unfortunate. What about just coding it up in regular Python with numpy? It would probably be relatively fast, pythonic, nicely extensible using just Python, etc., and if some further optimization is necessary, numpy plays very nicely with Cython.
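As a minimal sketch of that suggestion (using the simple two-component case P + L ⇌ PL rather than the general model), a plain-numpy Newton iteration is only a few lines and can be checked against the closed-form quadratic solution:

```python
import numpy as np

def complex_concentration(Ptot, Ltot, Kd, tol=1e-14):
    # Newton iteration for x = [PL] satisfying (Ptot - x)(Ltot - x) = Kd * x,
    # i.e. mass-action equilibrium for P + L <=> PL with totals conserved.
    x = 0.0
    for _ in range(100):
        f = (Ptot - x) * (Ltot - x) - Kd * x
        df = -(Ptot - x) - (Ltot - x) - Kd
        step = f / df
        x -= step
        if abs(step) < tol:
            break
    return x

def complex_closed_form(Ptot, Ltot, Kd):
    # Physically meaningful (smaller) root of the same quadratic.
    b = Ptot + Ltot + Kd
    return (b - np.sqrt(b * b - 4.0 * Ptot * Ltot)) / 2.0
```

For the general model the closed form disappears, but the Newton loop generalizes, and a pure-numpy version like this stays easy to hand off to Cython later.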
