Use Pandas readers #2

Open
kyleabeauchamp opened this issue Nov 26, 2013 · 11 comments

@kyleabeauchamp
Collaborator

So one thing I've noticed is that we have lots of undocumented text files and lots of functions to read and write them.

IMHO, we should pick a set of standard formats (tab-delimited or CSV) and use the pandas readers and writers for these files. For binary files, we can use the pandas HDF5 interface.

For the tab-delimited or CSV files, we should also pick a convention for the column names.

If we standardize on something like this, I think it will make our lives much easier in terms of automating tests and not forgetting what calculations we actually have.

PS: for now, please just continue dropping in whatever data you have. Once the data is all included, then we can think about how we might streamline the IO and file formats.
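For illustration, here is a minimal sketch of the convention I have in mind (the column names and data below are invented, and I use an in-memory buffer instead of a real file):

```python
import io
import pandas as pd

# Hypothetical results table: column names and values are made up.
df = pd.DataFrame({"state": [0, 1, 2], "f_k": [0.0, 1.3, 2.7]})

# Write and read back with the standard pandas tab-delimited convention.
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
buf.seek(0)
round_trip = pd.read_csv(buf, sep="\t")

# For binary files, the same DataFrame could go through pandas' HDF5 layer
# (df.to_hdf / pd.read_hdf), which needs the optional PyTables dependency.
```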

@mrshirts
Collaborator

Hmm. I don't know that I want to be forcing other people to use a new set of tools to write scripts using pymbar. A lot of these tools have different file types, because they were generated by different people for different purposes. Pymbar should be as easy to integrate with other tools as possible with their existing data sets, which I think probably does mean some example interpreters for different file types.

GROMACS has a (now essentially fixed) file type for free energy output, and it's not going to be changing formats to CSV. Writing an additional tool to convert it into an intermediate file format seems like overkill.

I certainly think that good documentation for existing tools is a good idea. But I don't know that we can go all Procrustes on input data.

I'm all in favor of writing a Pandas tool that interfaces into pymbar for people who want to go that route, though.
Obviously, within ourselves we can enforce conventions for new tests.

@kyleabeauchamp
Collaborator Author

The point is that Pandas provides general tools for parsing almost arbitrary delimited text formats. We should be using those tools, not re-inventing them.
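For example, a whitespace-delimited file with comment lines needs no custom parser at all (the file contents here are invented for illustration):

```python
import io
import pandas as pd

# A made-up whitespace-delimited file with a comment line and a header row.
raw = io.StringIO("# generated by some tool\ntime  energy\n0.0   -1.5\n1.0   -1.7\n")

# sep=r"\s+" handles arbitrary runs of whitespace; comment="#" skips comments.
df = pd.read_csv(raw, sep=r"\s+", comment="#")
print(df["energy"].tolist())  # -> [-1.5, -1.7]
```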

@kyleabeauchamp
Collaborator Author

The other advantage is that you get back a DataFrame object that is essentially a 2D numpy array but with column and row names attached. It's just a way for us to make sure data + metadata stay together as much as possible.
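Concretely (toy data, column names invented):

```python
import numpy as np
import pandas as pd

# A plain 2D array gains row and column labels when wrapped in a DataFrame,
# so the metadata travels with the data.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(a, columns=["u_kn", "N_k"], index=["state0", "state1"])

assert df.values.shape == (2, 2)       # still a 2D numpy array underneath
assert df.loc["state1", "N_k"] == 4.0  # but addressable by name
```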

@kyleabeauchamp
Collaborator Author

My thinking is that we should do:

Text File -> Pandas DataFrame -> Numpy Array -> pymbar.

This sort of process could be a big help for keeping track of things. The internals of pymbar will never have to know the difference; this is just a way for us to manage the data before and after pymbar analysis.

I'm not suggesting that we require people to do this. I just think it's a way to give people a tool that has "batteries included", rather than something that requires writing lots of boilerplate code to do routine analyses.
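The pipeline could look like this (the file layout and the call into pymbar are illustrative assumptions, not an actual API):

```python
import io
import pandas as pd

# Invented whitespace-delimited input: one column of reduced potentials
# per thermodynamic state, one row per sample.
text = io.StringIO("state0 state1\n-1.0 -0.5\n-1.2 -0.4\n-0.9 -0.6\n")

df = pd.read_csv(text, sep=r"\s+")  # Text file -> pandas DataFrame
u_kn = df.values.T                  # DataFrame -> numpy array, shape (K, N)
# mbar = pymbar.MBAR(u_kn, N_k)     # numpy array -> pymbar (hypothetical call)
```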

@mrshirts
Collaborator

As long as we are not requiring people to use them. I also want to avoid forcing people who know Python but not pandas to figure out something new. At least at this point, the most important goal for improving usage is to lower barriers to entry for other people.


@kyleabeauchamp
Collaborator Author

Agreed.

@kyleabeauchamp
Collaborator Author

Are the files in 8proteins "standard" Gromacs files of some sort?

@mrshirts
Collaborator

No, not at all. It's a set of files that a collaborator wrote. I just took their script and tweaked it.


@kyleabeauchamp
Collaborator Author

For example:

import pandas as pd
x = pd.read_csv("./gas-properties/50.0/mbar_results/MBAR_results.dat", delim_whitespace=True)

x

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46 entries, 0 to 45
Data columns (total 29 columns):
Temperature(K)           46  non-null values
Pressure(MPa)            46  non-null values
Hconf(kcal/mol)          46  non-null values
dHconf                   46  non-null values
Volume(A3)               46  non-null values
dV                       46  non-null values
V^2(A6)                  46  non-null values
dV^2                     46  non-null values
V*Hconf(A3*kcal/mol)     46  non-null values
dV*Hconf                 46  non-null values
Uconf(kcal/mol)          46  non-null values
dUconf                   46  non-null values
Uconf*Hconf(kcal/mol)    46  non-null values
dUconf*Hconf             46  non-null values
rho(kg/m3)               46  non-null values
drho                     46  non-null values
aP(1/K)                  46  non-null values
daP                      46  non-null values
kT(1/MPa)                46  non-null values
dkT                      46  non-null values
Cp_id(J/molK)            46  non-null values
Cp(J/molK)               46  non-null values
dCp                      46  non-null values
Cv(J/molK)               46  non-null values
dCv                      46  non-null values
uJT(K/MPa)               46  non-null values
duJT                     46  non-null values
SS(m/s)                  46  non-null values
dSS                      46  non-null values
dtypes: float64(29)



In [7]: x["SS(m/s)"]
Out[7]: 
0     892.330148
1     876.261102
2     864.882443
3     856.366720
4     844.736289
5     832.189177
6     824.279528
7     817.706078
8     809.641205
9     803.053928
10    799.857741
11    799.185156
12    797.863023
13    793.377782
14    786.916835
15    781.211851
16    777.222163
17    774.391374
18    773.078627
19    774.093310
20    776.128907
21    776.474675
22    774.163902
23    770.738474
24    768.267688
25    767.634485
26    768.315224
27    769.099101
28    769.051967
29    768.217491
30    767.575929
31    768.227600
32    770.461313
33    773.455152
34    775.802562
35    776.478374
36    775.513970
37    773.889497
38    772.856485
39    773.285955
40    775.382017
41    778.713344
42    782.425943
43    785.553992
44    787.403894
45    787.916714
Name: SS(m/s), dtype: float64

@jchodera
Member

Have been at Merck all day so couldn't write back earlier.

I think "pymbar" should be just the minimal library and tests. This is the package we will have "yank" require and auto-install if necessary. Let's not use pandas here, since it isn't needed.

"pymbar-examples" should contain reasonably sized datasets and file readers adapted to these datasets. Here, we can use pandas.

I think a good approach would be to have some analysis helper classes that simplify the analysis of, for example, replica exchange simulations. There are many complicated steps here that don't need to be duplicated in each driver script. This could probably go into the "pymbar-examples" directory and use pandas.
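To make the idea concrete, a helper class along these lines might look like the sketch below. The class name, methods, and data layout are invented for illustration; this is not an existing pymbar API.

```python
import numpy as np

class ReplicaExchangeAnalysis:
    """Hypothetical helper: collects reduced potentials from a
    replica-exchange run and hands back pymbar-ready arrays."""

    def __init__(self):
        self._u_rows = []

    def add_sample(self, u_row):
        # u_row: reduced potential of one sample evaluated at every state.
        self._u_rows.append(np.asarray(u_row, dtype=float))

    def to_u_kn(self):
        # Shape (K, N): K thermodynamic states, N pooled samples.
        return np.vstack(self._u_rows).T

# Toy usage with invented numbers.
ana = ReplicaExchangeAnalysis()
ana.add_sample([-1.0, -0.5])
ana.add_sample([-1.1, -0.4])
assert ana.to_u_kn().shape == (2, 2)
```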

@kyleabeauchamp
Collaborator Author

Yeah, that's consistent with what I'm thinking.
