Improve PDB formatting with incomplete Monomer info #7286

fwaibl · 2024-03-21T16:12:09Z

Introduction

By default, the PDB writer uses the residue name UNL and generates reasonable atom names (element symbol plus a running number):

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles(r"ClCl")
>>> print(Chem.MolToPDBBlock(mol))
HETATM    1 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL2  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

We can also set the residue info manually. If this information is complete, everything works:

>>> at = mol.GetAtomWithIdx(0)
>>> res_inf = Chem.AtomPDBResidueInfo()
>>> res_inf.SetName("CLA ")
>>> res_inf.SetResidueName("CL2")
>>> res_inf.SetIsHeteroAtom(True)
>>> res_inf.SetResidueNumber(1)
>>> at.SetMonomerInfo(res_inf)
>>> print(Chem.MolToPDBBlock(mol))
HETATM    1 CLA  CL2     1       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

However, if the residue info is incomplete (e.g., we set the residue name but not the atom name), the PDB output is invalid. Most importantly, the line is too short, so all later columns are shifted out of place:

>>> mol = Chem.MolFromSmiles(r"ClCl")
>>> at = mol.GetAtomWithIdx(0)
>>> res_inf = Chem.AtomPDBResidueInfo()
>>> res_inf.SetResidueName("CL2")
>>> at.SetMonomerInfo(res_inf)
>>> print(Chem.MolToPDBBlock(mol))
ATOM      1  CL2     0       0.000   0.000   0.000  1.00  0.00          CL  
HETATM    2 CL1  UNL     1       0.000   0.000   0.000  1.00  0.00          CL  
CONECT    1    2
END

This happens if either the atom name or the residue name is missing. But IMO, the more important case is the missing atom name, since enumerating the atom names is more difficult than setting a dummy residue name.
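
To make the column shift concrete, here is a minimal pure-Python sketch (not RDKit code). The helper `atom_record_prefix` is hypothetical: it only concatenates the first fields of a record the way a writer without padding would, using the fixed-width PDB column layout.

```python
def atom_record_prefix(serial, atom_name, res_name, res_seq):
    """Hypothetical helper: first 26 columns of a HETATM record, with no
    padding applied to the atom name or the residue name."""
    # Columns 1-6: record name, 7-11: serial, 12: blank,
    # 13-16: atom name, 17: altLoc (blank here), 18-20: residue name,
    # 21: blank, 22: chain ID (blank here), 23-26: residue number.
    return f"HETATM{serial:>5} {atom_name} {res_name}  {res_seq:>4}"


complete = atom_record_prefix(1, "CLA ", "CL2", 1)
missing = atom_record_prefix(1, "", "CL2", 1)

# A reader that slices by column expects the atom name in columns 13-16
# (0-based slice 12:16) and the residue name in columns 18-20 (slice 17:20).
print(repr(complete[12:16]), repr(complete[17:20]))  # 'CLA ' 'CL2'
print(repr(missing[12:16]), repr(missing[17:20]))    # ' CL2' '   '
```

With the empty atom name, the record is four characters short, so a column-based reader picks up the residue name as the atom name and finds the residue name field blank.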

Changes

This PR changes the code so that default atom names are generated when they are missing from the AtomPDBResidueInfo object. The same code is used as in the case without any residue info; I refactored it into a new function, GetDefaultAtomNumber.
I also added a std::setw(4) before the atom name and a std::setw(3) before the residue name, so that a name that is too short is padded with leading spaces. (See below: I should still add a unit test for that.) Again, this avoids incorrect column alignment. If the user specifies a short name like "C1", it will still not be aligned properly within the column, but the result will be parsed correctly by PyMOL. (PDB specifies that the alignment of atom names depends on the length of the element symbol. To do that perfectly, we would have to parse the user-specified atom name.)
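
The fallback and padding described above can be sketched in Python as follows. This is an illustration, not the C++ code from the PR: `default_atom_names` and `pad_field` are hypothetical names, the naming scheme (upper-cased element symbol plus a per-element counter) is inferred from the default output shown in the introduction, and the handling of counters above 99 is an assumption.

```python
from collections import defaultdict


def default_atom_names(symbols):
    """Hypothetical sketch: build 4-character default atom names."""
    counts = defaultdict(int)
    names = []
    for symbol in symbols:
        counts[symbol] += 1
        n = counts[symbol]
        # Element symbol right-justified in two columns, so "C" becomes " C".
        element = symbol.upper()[:2].rjust(2)
        # Assumption: counters that do not fit in two digits are left blank.
        number = str(n) if n < 100 else ""
        names.append(element + number.ljust(2))
    return names


def pad_field(text, width):
    """Mimic std::setw(width) with default right adjustment:
    pad short text on the left, never truncate long text."""
    return text.rjust(width)


print(default_atom_names(["Cl", "Cl", "C"]))  # ['CL1 ', 'CL2 ', ' C1 ']
print(repr(pad_field("C1", 4)))               # '  C1'
print(repr(pad_field("TOOLONG", 4)))          # 'TOOLONG'
```

Because this kind of padding never truncates, a name longer than the field width still breaks the column layout.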

To do / Questions

I have not yet written a unit test for the case where the atom name is set but the residue name is not. I set the minimum field widths in the code, but have not yet tested that they work. I will add another commit for that.
I added a function called GetDefaultAtomNumber, since that code is now used in two different places. I only declared and defined this function in the PDBWriter.cpp file, since I don't believe that it can be useful anywhere else, and since there is no corresponding .h file. Is that ok?
What do you think of the alignment issue? I believe that adding spaces is an improvement, but it is still not perfect. However, the user can simply supply their atom names with the correct alignment.
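
On the alignment question, a caller could pre-align names before passing them to SetName. The sketch below uses a hypothetical helper, `align_atom_name`, based on the PDB convention that names for one-letter elements start in column 14 and names for two-letter elements start in column 13. It assumes the caller already knows the element symbol and does not try to parse it from the name.

```python
def align_atom_name(name, element):
    """Hypothetical helper: return a 4-column atom name field.

    name    -- atom name without padding, e.g. "CA" or "HD11"
    element -- element symbol of the atom, e.g. "C" or "Ca"
    """
    name = name.strip()
    # Two-letter elements and full 4-character names start in the first column.
    if len(name) >= 4 or len(element) == 2:
        return name.ljust(4)
    # One-letter elements leave the first column blank.
    return (" " + name).ljust(4)


print(repr(align_atom_name("CA", "C")))    # ' CA '  (alpha carbon)
print(repr(align_atom_name("CA", "Ca")))   # 'CA  '  (calcium)
print(repr(align_atom_name("HD11", "H")))  # 'HD11'
```

Names longer than four characters are returned unchanged, so they would still need to be shortened by the caller.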

What do you think of this? Thanks in advance for your help :-)

* additionally, sets the width of the atom name to 4. If the name is set to something shorter than 4 letters, fill up with whitespace. This avoids bad column alignment. If the name is set to something with more than 4 characters, the PDB will still be invalid.

Franz Waibl added 4 commits March 19, 2024 14:44

Merge branch 'master' of github.com:fwaibl/rdkit

ad2ce4c

Add a test for missing residue name and short atom names

03d1414

Fix atom name alignment in unit test

ab2fcb9
