Improve PDB formatting with incomplete Monomer info #7286

Open
wants to merge 4 commits into
base: master
@fwaibl fwaibl commented Mar 21, 2024

Introduction

By default, the PDB writer uses the residue name UNL, and sets very reasonable atom names:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles(r"ClCl")
>>> print(Chem.MolToPDBBlock(mol))
HETATM    1 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL2  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

We can also set the residue info manually. If this information is complete, everything works:

>>> at = mol.GetAtomWithIdx(0)
>>> res_inf = Chem.AtomPDBResidueInfo()
>>> res_inf.SetName("CLA ")
>>> res_inf.SetResidueName("CL2")
>>> res_inf.SetIsHeteroAtom(True)
>>> res_inf.SetResidueNumber(1)
>>> at.SetMonomerInfo(res_inf)
>>> print(Chem.MolToPDBBlock(mol))
HETATM    1 CLA  CL2     1       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

However, if the residue info is incomplete (e.g., we set the residue name but not the atom name), the PDB output is invalid. Most importantly, the line is too short, so all later columns are shifted out of place:

>>> mol = Chem.MolFromSmiles(r"ClCl")
>>> at = mol.GetAtomWithIdx(0)
>>> res_inf = Chem.AtomPDBResidueInfo()
>>> res_inf.SetResidueName("CL2")
>>> at.SetMonomerInfo(res_inf)
>>> print(Chem.MolToPDBBlock(mol))
ATOM      1  CL2     0       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

This happens if either the atom name or the residue name is missing. IMO, the more important case is the missing atom name, since enumerating the atom names is more difficult than setting a dummy residue name.
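To make the column shift concrete, here is a minimal pure-Python sketch (no RDKit involved). The column ranges follow the fixed-width ATOM/HETATM record layout of the PDB format; `build_line` and `split_columns` are hypothetical helpers written only for this illustration, and `build_line` deliberately concatenates the atom name without enforcing a width, which is an assumption about how the problem arises rather than a copy of the writer's code.

```python
# Fixed column ranges (0-based, end-exclusive) of a PDB ATOM/HETATM record.
COLUMNS = {
    "record": (0, 6),
    "serial": (6, 11),
    "atom_name": (12, 16),
    "res_name": (17, 20),
    "res_seq": (22, 26),
    "x": (30, 38),
    "y": (38, 46),
    "z": (46, 54),
    "element": (76, 78),
}


def split_columns(line):
    """Slice a record into its fields the way a fixed-column reader would."""
    return {key: line[start:end] for key, (start, end) in COLUMNS.items()}


def build_line(record, serial, atom_name, res_name, res_seq, element):
    """Concatenate the fields WITHOUT enforcing a width on atom_name."""
    return (
        f"{record:<6}{serial:>5} {atom_name}"
        f" {res_name} "            # altLoc blank, residue name, blank
        f" {res_seq:>4}    "       # chain blank, resSeq, iCode + 3 blanks
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
        f"{'':10}{element:>2}  "   # unused columns, element, charge blanks
    )


good = build_line("HETATM", 1, "CL1 ", "UNL", 1, "CL")
bad = build_line("ATOM", 1, "", "CL2", 0, "CL")  # atom name missing

for label, line in (("good", good), ("bad", bad)):
    fields = split_columns(line)
    print(label, len(line), repr(fields["atom_name"]), repr(fields["x"]))
```

With the empty atom name, the record is four characters short, so a fixed-column reader picks up part of the residue name as the atom name and a fragment of two neighbouring numbers as the x coordinate.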

Changes

  • This PR changes the code so that default atom names are generated if they are not present in the AtomPDBResidueInfo object. The same code is used as in the case without any residue info, and I refactored that code into a new function GetDefaultAtomNumber.
  • I also added a std::setw(4) before the atom name and a std::setw(3) before the residue name, so that a name that is too short is padded with leading spaces. (See below: I should still add a unit test for that.) Again, this avoids incorrect column alignment. If the user specifies a short name like "C1", it will still not be aligned properly within the column, but the result will be parsed correctly by PyMOL. (PDB specifies that the alignment of atom names depends on the length of the element symbol. To do that perfectly, we would have to parse the user-specified atom name.)
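The two changes can be sketched in Python as follows. This is an illustration only, not the C++ code in the PR: `default_atom_names` and `pad_field` are hypothetical names, and the naming scheme (upper-cased element symbol plus a per-element counter) is an assumption inferred from the `CL1`/`CL2` output shown in the introduction. `str.rjust` is used as a stand-in for `std::setw`, which by default pads on the left and never truncates.

```python
from collections import Counter


def default_atom_names(symbols):
    """Assumed default scheme: upper-cased element symbol + running count."""
    counts = Counter()
    names = []
    for symbol in symbols:
        counts[symbol] += 1
        names.append(f"{symbol.upper()}{counts[symbol]}")
    return names


def pad_field(text, width):
    """Pad to a minimum width with leading spaces, like std::setw(width).

    Text longer than `width` is returned unchanged, so an over-long name
    still produces an invalid record.
    """
    return text.rjust(width)


print(default_atom_names(["Cl", "Cl"]))
print(repr(pad_field("C1", 4)))
print(repr(pad_field("TOOLONG", 4)))
```

The last call shows the remaining limitation mentioned in the commit message: padding fixes names that are too short, but not names that are too long.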

To do / Questions

  • I have not yet written a unit test for the case where the atom name is set but the residue name is not. I specified the minimum width in the code, but have not yet tested whether it works. I will add another commit for that.
  • I added a function called GetDefaultAtomNumber, since that code is now used in two different places. I only declared and defined this function in the PDBWriter.cpp file, since I don't believe that it can be useful anywhere else, and since there is no corresponding .h file. Is that ok?
  • What do you think of the alignment issue? I believe that adding spaces is an improvement, but it is still not perfect. However, the user can simply supply their atom names with the correct alignment.
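For reference, the element-dependent alignment could look roughly like the sketch below. It follows the PDB convention that the element symbol is right-justified in columns 13-14, so names of atoms with a one-letter element start in column 14 unless the name needs all four characters. `align_atom_name` is a hypothetical helper, not part of this PR, and it takes the element symbol as an argument instead of parsing it from the name.

```python
def align_atom_name(name, element):
    """Return the atom-name field for PDB columns 13-16.

    Two-letter elements and four-character names start in column 13;
    shorter names of one-letter elements start in column 14.
    """
    name = name.strip()
    if len(name) >= 4 or len(element) == 2:
        return name.ljust(4)
    return (" " + name).ljust(4)


# The same name is placed differently depending on the element:
print(repr(align_atom_name("CA", "C")))   # alpha carbon
print(repr(align_atom_name("CA", "Ca")))  # calcium
```

This is why the user-supplied name alone is ambiguous: "CA" can be an alpha carbon or a calcium, and only the element symbol tells the writer which column to start in.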

What do you think of this? Thanks in advance for your help :-)

Franz Waibl added 4 commits March 19, 2024 14:44
* additionally, sets the width of the atom name to 4. If the name is set
  to something shorter than 4 letters, fill up with whitespace. This
  avoids bad column alignment. If the name is set to something with more
  than 4 characters, the PDB will still be invalid.