Handling multi-PDB files #50

Open
rasbt opened this issue Apr 4, 2018 · 0 comments

rasbt commented Apr 4, 2018

I am cross-posting a discussion from the mailing list with regard to multi-PDB files containing MODEL & ENDMDL tags, which are currently not handled by BioPandas.

However, it should definitely be handled one way or another. Currently, I don't have a clear idea of the best way to handle it and would welcome any thoughts and feedback (let me cross-post that on the GitHub issue tracker -- it may be better to continue the discussion about potential ways to implement it there).

I think one of the problems with the DataFrame format is that having all models in one DataFrame would probably lead to a lot of weird -- or unexpected -- results, so it would probably be best to separate the structures one way or the other ...

  1. One option would be to provide a utility function (analogous to the split_multimol2 function, http://rasbt.github.io/biopandas/tutorials/Working_with_MOL2_Structures_in_DataFrames/#parsing-multi-mol2-files) that generates multiple PandasPdb objects from such a file. I.e., it would simply be a list

    pdbs = [pdb_1, pdb_2, .... pdb_n]

which would preserve the current functionality of the library without any backwards-incompatible changes. This would also make it easier to use the multiprocessing library to analyze multiple PandasPdb objects in parallel.

  2. Right now, the PandasPdb objects have a dictionary containing multiple DataFrames:

    dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

For multi-PDB files, the dictionary could be expanded to

    dict_keys(['ATOM_1', 'HETATM_1', 'ANISOU_1', 'OTHERS_1', 'ATOM_2', 'HETATM_2', 'ANISOU_2', 'OTHERS_2', ...])
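
To make the expanded-dictionary idea concrete, here is a minimal sketch of how per-model dictionaries could be flattened into suffixed keys. The function name `flatten_models` is hypothetical and not part of BioPandas, and plain strings stand in for the DataFrames that the real dictionary would hold.

```python
def flatten_models(per_model_records):
    """Merge per-model dictionaries into one dictionary with suffixed keys.

    `per_model_records` holds one dict per model, each keyed by record
    type ('ATOM', 'HETATM', 'ANISOU', 'OTHERS').
    """
    flat = {}
    # Models are numbered from 1, matching the 'ATOM_1', 'ATOM_2', ... scheme.
    for model_number, records in enumerate(per_model_records, start=1):
        for record_type, value in records.items():
            flat[f"{record_type}_{model_number}"] = value
    return flat


# Placeholder strings stand in for the per-model DataFrames (assumption).
models = [
    {"ATOM": "atoms of model 1", "HETATM": "hetatms of model 1"},
    {"ATOM": "atoms of model 2", "HETATM": "hetatms of model 2"},
]
flat = flatten_models(models)
print(list(flat))  # ['ATOM_1', 'HETATM_1', 'ATOM_2', 'HETATM_2']
```

Note that with this layout, any code that looks up the plain key 'ATOM' would no longer find it for multi-model files.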

I strongly favor option 1, though; however, I would love to hear feedback on this and am open to other suggestions!
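
As a rough illustration of the first option, splitting could work along these lines. This is only a sketch: `split_multimodel_pdb` is a hypothetical name, not an existing BioPandas function, and it returns plain lists of lines rather than PandasPdb objects. It also assumes that records outside a MODEL ... ENDMDL block (e.g., HEADER) can simply be skipped, which a real implementation may not want.

```python
def split_multimodel_pdb(pdb_text):
    """Return one list of lines per MODEL ... ENDMDL block.

    Lines outside any block (e.g. HEADER, END) are ignored in this sketch.
    """
    models = []
    current = None  # lines of the model being read; None while outside a block
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # the PDB record name sits in columns 1-6
        if record == "MODEL":
            current = []
        elif record == "ENDMDL":
            if current is not None:
                models.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return models


# Tiny made-up two-model file, for illustration only.
example = """\
HEADER    EXAMPLE
MODEL        1
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ENDMDL
MODEL        2
ATOM      1  N   MET A   1      27.100  24.900   2.200  1.00  9.67           N
ENDMDL
END
"""
models = split_multimodel_pdb(example)
print(len(models))  # 2
```

Because each chunk is independent of the others, the resulting list could be handed to something like `multiprocessing.Pool.map` for parallel analysis.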

In any case, an error (or at least a warning) should be raised if MODEL & ENDMDL tags are found in a PDB file when the current read_pdb method is used, so that this doesn't lead to unexpected behavior.
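
Such a check could look roughly like the following. It is a sketch under assumptions: `warn_if_multimodel` is a hypothetical helper that works on a list of lines, and where exactly it would be called inside read_pdb is left open.

```python
import warnings


def warn_if_multimodel(pdb_lines):
    """Emit a warning and return True if any line is a MODEL or ENDMDL record."""
    for line in pdb_lines:
        # The PDB record name sits in the first six columns of each line.
        if line[:6].strip() in ("MODEL", "ENDMDL"):
            warnings.warn(
                "MODEL/ENDMDL records found; this file contains multiple "
                "models, which are not handled separately.",
                UserWarning,
            )
            return True
    return False


# Capture the warning so the example shows what was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = warn_if_multimodel(["MODEL        1", "ATOM      1  N   MET A   1", "ENDMDL"])
print(found, len(caught))  # True 1
```

Raising an exception instead of calling `warnings.warn` would be the stricter variant.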
