
New class to help design experiments based on assaytools MCMC data #80

Open · wants to merge 13 commits into master from experiment_design
Conversation

gregoryross (Collaborator)

A class and function have been written to estimate the coefficient of variation, variance, and bias of fluorescence experiments at different assay parameters, such as the protein and ligand concentrations.

This PR includes a Jupyter notebook to demonstrate the functionality of these tools on assaytools PyMC data for the p38-Bosutinib complex. It has been opened so we can discuss what other functionality we need, and whether there are any glaring errors. The files currently live in a separate folder, but they can be moved to more appropriate locations before this PR is merged.

To do

  • Probably change the model for the detector noise (currently Gaussian noise)
  • Unit tests (contingent on having a small PyMC data file)
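For reference, the three summary statistics this class targets can be computed from bootstrap estimates of DeltaG in a few lines of numpy. This is an illustrative sketch, not the PR's actual API; `summarize_estimates` is a hypothetical helper:

```python
import numpy as np

def summarize_estimates(estimates, true_value):
    """Summarize bootstrap estimates of a parameter (e.g. DeltaG in kT).

    Returns the coefficient of variation (as a fraction), the variance,
    and the bias relative to the known true value.
    """
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean()
    cv = estimates.std(ddof=1) / abs(mean)   # coefficient of variation
    variance = estimates.var(ddof=1)
    bias = mean - true_value
    return cv, variance, bias

# Example: synthetic estimates scattered around a true DeltaG of -12 kT
rng = np.random.default_rng(0)
samples = -12.0 + rng.normal(0.0, 0.5, size=1000)
cv, variance, bias = summarize_estimates(samples, true_value=-12.0)
```

Note that the CV here is a fraction; multiplying by 100 to report percent should happen exactly once (relevant to the factor-of-100 issue discussed later in this thread).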

…nt of variation and total error of future experiments as a function of protein concentration.
… concentrations. Cleaned up functions in AssaySimulator.
…ance, and bias for different experimental parameters. The jupyter notebook was updated to reflect this. Two superfluous files were removed.
@gregoryross (Collaborator, Author)

Tagging @sonyahanson, @MehtapIsik, and @jchodera.

@sonyahanson (Contributor)

Thanks! Will take a look!

@jchodera (Member) commented Apr 7, 2017

Whoa, the CV of the binding free energy is 1000-9000%?! This sounds horribly wrong.

@jchodera (Member) commented Apr 7, 2017

And the relative error in the binding free energy varies from 80-180%?

This all sounds suspiciously wrong. Errors this high would mean that the approach has almost no value.

@jchodera (Member) left a review comment

I've gone through assay_simulator.py.

Currently, it seems that the only source of uncertainty in your bootstrap samples is the initial guess for DeltaG, and that the least-squares optimization results in something like a CV of 9000%, which suggests to me that the least-squares fit is so unstable as to be unusable.

There may be some other unit errors or bugs, though I didn't spot any in my pass through.

It would probably be useful to plot what is going on with the simulated fluorescence and resulting least-squares fit to see why it is getting stuck so far from the solution.

We should really be sure to incorporate the sources of uncertainty we think are important into this model. I was hoping that you could use the same pymc model to simulate the sources of uncertainty rather than have to recode everything---I think there's a way to do this---but if we do implement everything from scratch as is done here, we would want to incorporate:

  • protein concentration (something like ~10% uncertain)
  • ligand dispensed concentrations (~8% per well)
  • measurement noise and the noise floor (or background fluorescence), since this determines minimum detectable quantity of complex
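A rough sketch of how those three sources of uncertainty might enter a from-scratch simulator. All function names and magnitudes here are illustrative assumptions, not the PR's code; lognormal errors are used for the concentrations since they must remain positive:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_well_concentrations(p_stated, l_stated,
                                 p_rel_error=0.10, l_rel_error=0.08):
    """Draw 'true' concentrations given stated ones, with lognormal error.

    ~10% relative uncertainty in protein concentration, and ~8% per-well
    uncertainty in the dispensed ligand concentrations.
    """
    p_true = p_stated * rng.lognormal(mean=0.0, sigma=p_rel_error)
    l_true = l_stated * rng.lognormal(mean=0.0, sigma=l_rel_error,
                                      size=np.size(l_stated))
    return p_true, l_true

def add_measurement_noise(fluorescence, sigma=10.0, floor=50.0):
    """Add Gaussian detector noise, clamped at a background noise floor."""
    noisy = fluorescence + rng.normal(0.0, sigma, size=np.shape(fluorescence))
    return np.maximum(noisy, floor)

l_stated = np.logspace(-8, -4, 12)            # ligand titration, M
p_true, l_true = simulate_well_concentrations(0.5e-6, l_stated)
signal = add_measurement_noise(1e9 * l_true)  # toy fluorescence response
```

The noise floor is what sets the minimum detectable quantity of complex, so leaving it out would make low-concentration designs look better than they are.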

if noisy:
    Fmodel += np.random.normal(loc=0.0, scale=self.sigma, size=len(self.l_total))
else:
    Fmodel = self.F_PL * PL + self.F_L * L_free + self.F_P * P_free + self.F_buffer * self.path_length
@jchodera (Member):

Is this missing self.F_plate?

@gregoryross (Collaborator, Author):

It is, adding it in now.

"""
The sum of squares between model fluorescence and the target
"""
model = self.simulate_fluorescence(DeltaG, p_total, noisy=False)
@jchodera (Member):

Shouldn't we be adding measurement noise here if we want to quantify the CV?

@gregoryross (Collaborator, Author):

This is the loss function that is minimized during the fitting, so the prediction (called model here) must be a deterministic function of the parameters for gradient descent. The measurement noise is added a few lines above this:

target = self.simulate_fluorescence(p_total)

The target fluorescence has the noise added by default. As this is not clear, I'll change this to

target = self.simulate_fluorescence(p_total, noisy=True)

simulator = AssaySimulator(pymc_data=pymc_data, l_total=l_total, sample_index=ind, p_total=p_total[p], **kwargs)
# Draw fluorescence data with different values of random noise
for j in range(nsamples):
    fit = simulator.fit_deltaG()
@jchodera (Member):

Since fit_deltaG currently doesn't add any noise, won't this omit the uncertainty in the measurements?

return np.sum((model - target)**2)

# Start the initial guess within about 10% of the "true" value
guess = self.DeltaG + np.random.normal(loc=0, scale=0.1 * np.abs(self.DeltaG))
@jchodera (Member):

It looks like this is the only source of uncertainty currently being added to your model---the initial starting guess for the least-squares fit---since noisy=False in sum_of_squares above. Is that right?

@gregoryross (Collaborator, Author):

As described above, noise is added when the fluorescence data is simulated. https://github.com/choderalab/assaytools/blob/experiment_design/experimental_design/assay_simulator.py#L178-L179

As this isn't clear, I'll improve the documentation.

@jchodera (Member) commented Apr 9, 2017

I'm wondering if there might be a way to replace AssaySimulator with a much simpler class.

Consider the following scheme:

  • Generate a pymc model with pymcmodels.make_model
  • Run the normal MCMC sampling with pymcmodels.run_mcmc(pymc_model)
  • Replace any stochastic variables in the resulting model that we want to fix (such as the true thermodynamic parameters and fluorescence yields) with point estimates by setting their observed attribute to True. The simplest thing to do is to fix them at their final values by only changing observed (possibly after a final MLE estimate), but we could replace the values with averages over the MCMC run if we want.
  • Change the observed fluorescence random variables' observed attribute from True to False so that these will be sampled as random variables
  • Run the MCMC simulation again to generate realizations of the experiment using exactly the same model
  • Fit these realizations with MCMC

We can then change the relevant design parameters---the stated protein concentration, the assay volume/area, and the ligand concentrations---and see how this affects things.

Does that sound like it would be a viable approach? I haven't had a chance to play with this myself yet, but it seems like this kind of imputation is possible with pymc2.
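For intuition, here is a self-contained toy version of that simulate-then-refit loop, with a scipy least-squares refit standing in for the MCMC fit and a hand-written two-component binding model in place of the pymc model. Every parameter value below is made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def complex_conc(p_tot, l_tot, delta_g):
    """Two-component binding: [PL] from totals and DeltaG (kT, 1 M standard state)."""
    kd = np.exp(delta_g)                       # Kd in M
    b = p_tot + l_tot + kd
    return 0.5 * (b - np.sqrt(b * b - 4.0 * p_tot * l_tot))

def fluorescence(p_tot, l_tot, delta_g, f_pl=1e9, f_l=1e5, background=100.0):
    """Toy fluorescence model: bright complex, dim free ligand, flat background."""
    pl = complex_conc(p_tot, l_tot, delta_g)
    return f_pl * pl + f_l * (l_tot - pl) + background

rng = np.random.default_rng(2)
l_tot = np.logspace(-9, -5, 16)                # ligand titration, M
p_tot, true_dg, sigma = 1e-6, np.log(1e-7), 5.0  # Kd = 100 nM, detector noise

fits = []
for _ in range(50):
    # One "realization" of the experiment: deterministic model + detector noise
    target = fluorescence(p_tot, l_tot, true_dg) + rng.normal(0.0, sigma, l_tot.size)
    loss = lambda dg: np.sum((fluorescence(p_tot, l_tot, dg) - target) ** 2)
    fits.append(minimize_scalar(loss, bounds=(-30.0, -5.0), method="bounded").x)

fits = np.array(fits)
cv = fits.std(ddof=1) / abs(fits.mean())
```

The spread of fits across realizations gives the CV of the fitted DeltaG at these design parameters; re-using the actual pymc model, as proposed above, would avoid re-coding the fluorescence model and keep the two in sync.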

-improving docstrings and a few comments
-removing extra factor of 100 from CV estimation
-added background fluorescence to data simulation when inner_filter=False
@gregoryross (Collaborator, Author)

Thanks a lot for going through this @jchodera.

The CV predictions had an extra factor of 100 applied to them when plotting (I'd recently changed where I converted the fractions to percent). This has now been removed such that the minimum CV has a value of ~30%.

With regards to some of your comments, currently the only source of noise comes from fluorescence noise parametrized by either sigma_top or sigma_bottom. Every time AssaySimulator.fit_deltaG() is called, a target fluorescence profile is generated, with added noise, and the free energy is estimated using least-squares minimization.

I like your suggestion of using pymc itself for generating the data. The biggest pro would be that if the fluorescence model is changed, the experimental-design part would not require any modifications. The reason why I opted to write a separate class to estimate errors is that one doesn't need to run any more MCMC after the assaytools parameters have been estimated: the class uses the samples that have already been generated. The class itself is very simple, and is really just an interface to read the parameters from pymc and perform least-squares regression.

As for your suggestion about allowing the protein and ligand concentrations to change for each fit, I can easily add that in now if you and @sonyahanson would like.

@sonyahanson (Contributor)

Looks good to me. It'd be nice to be able to get the actual delG predictions from predict_assay_error as well as the error, with the ultimate goal of making a figure like the one attached in the original comment.

gregoryross and others added 5 commits April 12, 2017 14:32
…tionality of AssaySimulator was slightly changed so that instead of fitting a single binding free energy, multiple estimates of the free energy are returned.
merging master into experiment_design
@sonyahanson (Contributor)

Added changes that allow this to work with the new logNormal wrapper. Also added a section to the notebook to see how the CV's dependence on [P] varies with deltaG.

@sonyahanson (Contributor)

Three things:

  • In the new notebook, the CV/error plots in cells 13 and 14 don't show the trend seen in the notebook below, where deltaG is fixed. @gregoryross mentioned this might be because the error estimate is not compatible with the new lognormal formatting.
  • Tried to run this on a different dataset with a higher-affinity ligand, making the last set of plots for DeltaG's of -18, -17, -16, -15, and -14. At the lower affinities this modeling finished in about two hours; for the higher affinities, I let it run for 24 hours and it never finished.
  • It would be good to be able to get out the predicted total fluorescence as a function of ligand for at least some of the simulated datasets. This is probably pretty simple, but noting it here.

@jchodera (Member)

@sonyahanson : I don't understand the description of the problems you list above, so come find me in meatspace if I can help.

…initial guess starts at the true value. In addition, the gradient tolerance of the BFGS optimizer was reduced for speed.
…tted to is now output. Notebook updated accordingly.
@sonyahanson (Contributor) commented Jun 5, 2017

Thanks, Greg! All three points addressed with the last two commits. Currently running a notebook to test on a higher affinity dataset.
