
How to use these datasets? #3

Open
kyleabeauchamp opened this issue Nov 26, 2013 · 9 comments

@kyleabeauchamp (Collaborator)

So it seems like for most of these datasets, there's no "right" answer, at least when compared to analytical test cases. That brings up the question of how we can use these tests in an automated test framework.

The second issue I'm seeing is that these tests essentially involve running Python scripts with ~1000 lines of IO, preprocessing, analysis, and output. Those scripts will not be easy to integrate into an automated test framework.

@kyleabeauchamp (Collaborator, Author)

I guess the first thing we should do is figure out how to port the scripts to pymbar 2.0. The easiest way may be for me to write a pymbar 1.0 compatibility object that exactly reproduces the API of pymbar 1.0, but calls pymbar 2.0 code under the hood.

@kyleabeauchamp (Collaborator, Author)

For example, there's the issue of u_kln versus u_kn. It would take considerable time to rewrite all the scripts here to reshape the data into the new format, so a compatibility layer might be key.
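As a concrete illustration, the conversion such a compatibility layer would need is small. This is a hypothetical sketch, not the actual pymbar code; the helper name and the exact index convention of u_kln are assumptions:

```python
import numpy as np

def u_kln_to_u_kn(u_kln, N_k):
    """Flatten a (K, K, N_max) u_kln array into the (K, N_total) u_kn layout.

    Assumed convention: u_kln[k, l, n] is the reduced energy of snapshot n
    drawn from state k, evaluated at state l; N_k[k] counts snapshots from k.
    """
    K = len(N_k)
    u_kn = np.zeros((K, int(np.sum(N_k))))
    start = 0
    for k in range(K):
        n = int(N_k[k])
        # snapshots from state k, evaluated at every state: shape (K, n)
        u_kn[:, start:start + n] = u_kln[k, :, :n]
        start += n
    return u_kn
```

Columns of u_kn are simply the snapshots from each state concatenated in state order, so scripts that build u_kln could keep doing so and convert once before calling the new API.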

@kyleabeauchamp (Collaborator, Author)

I also think we might want to consider looking for simpler test cases where there are unambiguous right answers, either analytical or numerical.

@jchodera (Member)

I would prefer our approach to be:

  • Analyze a dataset to see the problem someone is describing
  • Figure out how to recapitulate that problem in a synthetic dataset
  • Add that synthetic dataset to our tests

As a minimal alternative, we can just make sure the code runs on these datasets, but that is a very low bar.

@jchodera (Member)

Is @mrshirts subscribed here?

@kyleabeauchamp (Collaborator, Author)

Yes

@kyleabeauchamp (Collaborator, Author)

I agree with the synthetic dataset stuff. IMHO I'm just overwhelmed by the idea of us maintaining thousands of lines of user-contributed code as part of our testing protocol.

@jchodera (Member)

On Nov 26, 2013, at 5:05 PM, kyleabeauchamp notifications@github.com wrote:

> I agree with the synthetic dataset stuff, though. IMHO I'm just overwhelmed by the idea of us maintaining thousands of lines of user-contributed code as part of our testing protocol.

I agree completely. There's no way we can possibly do that.

There may still be a few large datasets that we would like the code to work on, or at least give consistent answers on, such as the large trypsin datasets that Michael has generated. But this seems like a lower priority than testing systems with analytical results.

I still need to code up some analytically tractable systems for binding affinity calculations. Those could be included in our tests as well if we feel we need more diversity than just harmonic oscillators.

John

@mrshirts (Collaborator)

Hi, all-

Busy all day with classes and meetings! I'm adding these datasets because they represent hard cases and/or interesting applications that use a lot of data.

In all cases, there is a currently working script that can be run to produce output. So at a high level, one just needs a script that calls those scripts and inspects the output -- the only customizable things are the filenames and the names of the output files. These are not going to be things that are used in nightly regression tests, or even downloaded by most users.

I don't think we want or need to maintain these things, other than perhaps altering the call to pymbar (and I'm happy to do that as long as they are working). They do represent hard problems that we'd like to manage, though. For example, the gas-properties case is a memory hog, and we'd love to reduce that. The 8proteins case is one where the free energy range requires that the weights be stored in the log domain, because otherwise you get exp(large negative number) * exp(large positive number) = 0, since exp(large negative number) = 0 to machine precision.
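The underflow Michael describes is easy to reproduce. A minimal numpy sketch, with illustrative magnitudes chosen for the example (not taken from the 8proteins data):

```python
import numpy as np

# log-weights around -800 underflow to exactly 0.0 when exponentiated in
# double precision, so the naive linear-domain product loses the answer
# even though the true result (~exp(-495)) is perfectly representable.
log_w = np.array([-800.0, -790.0])   # log(weights)
log_a = np.array([ 300.0,  295.0])   # log(values)

naive = np.sum(np.exp(log_w) * np.exp(log_a))   # exp(-800) == 0.0, so this is 0.0

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

stable = np.exp(logsumexp(log_w + log_a))       # small but nonzero and finite
```

Adding the logs first and exponentiating once, after subtracting the maximum, is exactly the "store the weights in the log domain" strategy.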

Going back to a question that Kyle asked earlier: I suspect that in the iterative cases, we can probably do the solutions in the exponential domain, and then store in the log domain (though this needs to be tested). So when computing an expectation we would do:

A = \sum_n exp(log W_n + log A_n)

where W_n is the mixture-distribution weight of sample n.

This would incur the cost of the exponentials each time, but at least it's not an iterative cost.

If both the log and exponential versions are stored, one could test at runtime which version to use, provided that test is fast enough. I've defaulted to storing only the log version, but keeping both may not be that costly.

Free energies of unsampled states would be

f_new = -log \sum_n exp(log W_n - u_new,n)

where, in the expectation formula above, A_n has first been transformed to always be greater than 1 (so that log A_n is well defined).
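Both formulas can be sketched with the same stable logsumexp primitive. This is a hypothetical illustration with random data, assuming the log-weights are normalized so that the W_n sum to one:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

rng = np.random.default_rng(0)
N = 100
log_W = rng.normal(size=N)
log_W -= logsumexp(log_W)            # normalize so sum_n W_n == 1

# Expectation <A> = sum_n exp(log W_n + log A_n). A_n may be negative, so
# shift it to be > 1 (making log A_n well defined), then undo the shift:
# since the weights sum to one, <A - c> = <A> - c for any constant c.
A = rng.normal(size=N)
shift = A.min() - 1.0                # guarantees A - shift >= 1
expectation = np.exp(logsumexp(log_W + np.log(A - shift))) + shift

# Free energy of an unsampled state: f_new = -log sum_n exp(log W_n - u_new,n)
u_new = rng.normal(size=N)
f_new = -logsumexp(log_W - u_new)
```

For well-scaled inputs like these, both quantities agree with the naive linear-domain sums; the log-domain form only starts to matter when the exponents are extreme, as in the 8proteins case.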

Note that if we keep a legacy routine (of any flavor) that does everything
in the log domain, we can always test new extreme cases easily.

