RMSD calculation for whole PDBs #69

LivC193 · 2020-10-09T15:06:36Z

Describe the workflow you want to enable

From what I can understand PandasPDB.rmsd can calculate the rmsd only if both dataframes have the same length. However, if I want to compare 2 PDBs (one chain each) where the target protein (Uniprot ID) is the same (for example 2vua vs 2vu9) I can't because although the protein is the same one has a different purification tag and thus I cannot use the rsmd function

Describe your proposed solution

it should be pretty easy to calculate de identity between 2 structures and select only those residues. This can ofc be done manually by each user, but I think it would be a great improvement.

rasbt · 2020-10-10T19:29:08Z

Hi there,
yes, you are right, right now it only computes the RMSD if the two amino acid sequences are the same. Like you suggested, a function that can handle this automatically by figuring out the relevant residues for certain cases could be coded, but I haven't got the time to implement this yet. it would be a good case for a future enhancement though.

wjs20 · 2022-02-13T10:43:55Z

I've had a go at adding a fix for this but I hit a bit of a brick wall.

My solution

Extract amino acid sequence string from pdbs to be compared.
Use difflib to find indices of matching blocks (i.e. excluding tags)
Use indices to determine set of residue numbers + insertion codes that are shared between the two structures.
Filter rows of both pdb dataframes to contain only shared residues and compute the rmsd on this subset.

Problem

Even after subsetting the longer pdb df to contain only residues present in the shorter, the 0th axis of the dataframes are still not equal. I found this was because although the subsetted dfs contain the same amino acid sequences, a small number of residues within the structure differ in their atom composition (how.. they are the same amino acid!?

I have included my python code and some intermediate outputs below so you can check it yourself.

from biopandas.pdb import PandasPdb
from biopandas.pdb.engines import amino3to1dict
from pathlib import Path
import difflib
from itertools import chain, groupby

# !mkdir data
# !curl -O https://files.rcsb.org/download/2VU9.pdb && mv 2VU9.pdb data/
# !curl -O https://files.rcsb.org/download/2VUA.pdb && mv 2VUA.pdb data/

DATA_PATH = Path('../data')
pdb_paths = !ls {DATA_PATH}/*

pdb1 = PandasPdb().read_pdb(pdb_paths[0])
pdb2 = PandasPdb().read_pdb(pdb_paths[1])

def amino3to1(pdbdf):
    tmp = pdbdf.copy()

    indices, residues = [], []
    for k, g in groupby(tmp.itertuples(), key = lambda x: x.residue_number):
        first_atom = list(g)[0]
        indices.append(first_atom.Index)
        residues.append(amino3to1dict[first_atom.residue_name])

    tmp = (
        tmp.loc[indices, ['chain_id', 'residue_number', 'insertion']]
        .assign(residue_name = residues)
        .reset_index()
    )

    return tmp

def locate_matching_blocks(seq1, seq2):
    matcher = difflib.SequenceMatcher(None, seq1, seq2)
    blocks = matcher.get_matching_blocks()[:-1]
    indices_1 = list(chain.from_iterable([
        range(block.a, block.a + block.size)
        for block in blocks
    ]))

    indices_2 = list(chain.from_iterable([
        range(block.b, block.b + block.size)
        for block in blocks
    ]))
    
    return indices_1, indices_2

seq_df1 = amino3to1(pdb1.df['ATOM'])
seq_df2 = amino3to1(pdb2.df['ATOM'])

seq1 = ''.join(seq_df1.residue_name.tolist())
seq2 = ''.join(seq_df2.residue_name.tolist())

# residue numbers 355, 356, 357 should be missing in sequence 1 (non matching tag)
seq_df1_indices_to_compare, seq_df2_indices_to_compare = locate_matching_blocks(seq1, seq2)

def add_residue_number_insertion_col(df):
    return (df.assign(
        residue_number_insertion = df.residue_number.astype(str).str.cat(df.insertion, sep = '_'))
       )

def subset_residue_number_insertions(df, subset):
    return df.loc[df.residue_number_insertion.isin(subset)]

seq1_residue_number_insertions = set(
    seq_df1.iloc[seq_df1_indices_to_compare]
    .pipe(add_residue_number_insertion_col)
    .residue_number_insertion
)

seq2_residue_number_insertions = set(
    seq_df2.iloc[seq_df2_indices_to_compare]
    .pipe(add_residue_number_insertion_col)
    .residue_number_insertion
)

subsetted_df1 = pdb1.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq1_residue_number_insertions)

subsetted_df2 = pdb2.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq2_residue_number_insertions)

def join_seq(df): return ''.join(amino3to1(df).residue_name)

# amino acid sequences are exactly the same
join_seq(subsetted_df1) == join_seq(subsetted_df2)
True

# shapes still unequal
print(subsetted_df1.shape, subsetted_df2.shape)
(3509, 22) (3514, 22)

If we collect the residue numbers mapped to their atom lists for each subsetted, dataframe you can see that although the residues match, some residue numbers are composed of different atoms in each dataframe.

atom_groups1 = [
    (k, [o.atom_name for o in list(g)] )
    for k, g in groupby(subsetted_df1.itertuples(), key = lambda x: x.residue_number_insertion)
]
atom_groups2 = [
    (k, [o.atom_name for o in list(g)] )
    for k, g in groupby(subsetted_df2.itertuples(), key = lambda x: x.residue_number_insertion)
]

[(g1,g2) for g1,g2 in zip(atom_groups1, atom_groups2) if g1[1] != g2[1]]

[(('873_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2']),
  ('873_', ['CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2'])),
 (('899_', ['N', 'N', 'CA', 'CA', 'C', 'C', 'O', 'O', 'CB', 'CB', 'CG', 'CG', 'OD1', 'OD1', 'ND2', 'ND2']),
  ('899_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'OD1', 'ND2', 'CA', 'CB', 'CG', 'OD1', 'ND2'])),
 (('925_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'OE1', 'OE2']),
  ('925_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2']))
...
]

There are 8 mismatches in the pair given in the description of the issue. Not sure what the best way to proceed is (or why this is happening..)

@rasbt any thoughts?

a-r-j · 2022-03-11T00:44:58Z

Hey @wjs20 I’m pretty certain this will be caused by altlocs/insertions (read more) You’ll likely have to filter them out.

rasbt added the enhancement label Oct 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RMSD calculation for whole PDBs #69

RMSD calculation for whole PDBs #69

LivC193 commented Oct 9, 2020

rasbt commented Oct 10, 2020

wjs20 commented Feb 13, 2022

a-r-j commented Mar 11, 2022

RMSD calculation for whole PDBs #69

RMSD calculation for whole PDBs #69

Comments

LivC193 commented Oct 9, 2020

Describe the workflow you want to enable

Describe your proposed solution

rasbt commented Oct 10, 2020

wjs20 commented Feb 13, 2022

My solution

Problem

a-r-j commented Mar 11, 2022