Use Pandas readers #2
Hmm. I don't know that I want to be forcing other people to use a new set of tools to write scripts using pymbar. A lot of these tools have different file types, because they were generated by different people for different purposes. Pymbar should be as easy to integrate with other tools as possible with their existing data sets, which I think probably does mean some example interpreters for different file types.

GROMACS has a (now essentially fixed) file type for free energy output, and it's not going to be changing formats to CSV. Writing an additional tool to convert it into an intermediate file format seems like overkill. I certainly think that good documentation for existing tools is a good idea. But I don't know that we can go all Procrustes on input data. I'm all in favor of writing a Pandas tool that interfaces into pymbar for people who want to go that route, though.
The point is that Pandas provides general tools for parsing almost arbitrary delimited text formats. We should be using those tools, not re-inventing them.
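As a concrete sketch of that point: `pandas.read_csv` can already handle whitespace-delimited files with comment lines, which covers many existing free-energy output formats without any custom parser. The file layout and column names below are invented for illustration, not an actual GROMACS specification.

```python
import io
import pandas as pd

# A hypothetical whitespace-delimited free-energy output file:
# '#' comment lines followed by numeric columns.
text = """\
# some free-energy code wrote this header
0.0   1.25
0.2   1.31
0.4   1.18
"""

df = pd.read_csv(
    io.StringIO(text),       # any file path or file-like object works here
    sep=r"\s+",              # split on arbitrary runs of whitespace
    comment="#",             # skip comment lines
    names=["time", "dhdl"],  # attach explicit column names
)
print(df.shape)  # (3, 2)
```

The same call, with different `sep`/`comment`/`names` arguments, adapts to most of the delimited formats people already have, which is the "don't re-invent parsers" argument in code form.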
The other advantage is that you get back a DataFrame object that is essentially a 2D numpy array but with column and row names attached. It's just a way for us to make sure data + metadata stay together as much as possible.
My thinking is that we should do: Text File -> Pandas DataFrame -> Numpy Array -> pymbar. This sort of process could be a big help for keeping track of things. The internals of pymbar will never have to know the difference; this is just a way for us to handle the data before and after pymbar analysis. I'm not suggesting that we require people to do this. I just think it's a way to give people a "batteries included" tool, rather than something that requires writing lots of boilerplate code for routine analyses.
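The proposed pipeline can be sketched in a few lines. The file layout and column names here are invented for illustration; the pymbar call at the end is shown only as a comment, since only the bare array would ever reach it.

```python
import io
import pandas as pd

# Text File -> Pandas DataFrame -> Numpy Array -> pymbar (sketch).
# Hypothetical layout: one sample per row, one reduced-potential
# column per thermodynamic state.
text = "time,u_0,u_1\n0.0,1.2,2.4\n0.2,1.3,2.5\n0.4,1.1,2.2\n"

df = pd.read_csv(io.StringIO(text), index_col="time")  # labels ride along
u_kn = df.to_numpy().T  # plain (K states, N samples) array for numerics
print(u_kn.shape)       # (2, 3)

# pymbar itself only ever sees the bare array, e.g. (not run here):
#   from pymbar import MBAR
#   mbar = MBAR(u_kn, N_k)
```

The DataFrame step is where column/row names keep data and metadata together; the `.to_numpy()` step is where pymbar's internals stop having to know the difference.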
As long as we are not requiring people to use them. I also want to avoid
Agreed.
Are the files in 8proteins "standard" Gromacs files of some sort? |
No, not at all. It's a set of files that a collaborator wrote. I just
For example:
Have been at Merck all day so couldn't write back earlier.

I think "pymbar" should be just the minimal library and tests. This is the package we will have "yank" require and auto-install if necessary. Let's not use pandas here, since it isn't needed.

"pymbar-examples" should contain reasonably sized datasets and file readers adapted to these datasets. Here, we can use pandas. I think a good approach would be to have some analysis helper classes that simplify the analysis of, for example, replica exchange simulations. There are many complicated steps here that don't need to be duplicated in each driver script. This could probably go into the "pymbar-examples" directory and use pandas.
Yeah, that's consistent with what I'm thinking.
So one thing I've noticed is that we have lots of undocumented text files and lots of functions to read and write them.
IMHO, we should pick a set of standard formats (tab-delimited or CSV) and use the Pandas readers and writers for these files. For binary files, we can use the Pandas HDF interface.
For the tab or CSV files, we should pick a convention for dealing with the column names, as well.
If we standardize on something like this, I think it will make our lives much easier in terms of automating tests and not forgetting what calculations we actually have.
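One possible shape for such a standardized file, sketched with the Pandas CSV writer and reader. The column names ("state", "reduced_potential") are invented for illustration, not an agreed pymbar convention.

```python
import io
import pandas as pd

# Illustrative standardized layout: one row per sample, named columns.
df = pd.DataFrame({
    "state": [0, 0, 1, 1],
    "reduced_potential": [1.2, 1.3, 2.1, 2.2],
})

buf = io.StringIO()
df.to_csv(buf, index=False)  # pass sep="\t" for tab-delimited output instead
round_trip = pd.read_csv(io.StringIO(buf.getvalue()))
print(round_trip.equals(df))  # True: data and column names survive the trip
```

Because the column names travel inside the file, a test harness can read any such file back and know what it contains without consulting a separate key, which is the automation benefit mentioned above.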
PS: for now, please just continue dropping in whatever data you have. Once the data is all included, then we can think about how we might streamline the IO and file formats.