Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RMSD calculation for whole PDBs #69

Open
LivC193 opened this issue Oct 9, 2020 · 3 comments
Open

RMSD calculation for whole PDBs #69

LivC193 opened this issue Oct 9, 2020 · 3 comments

Comments

@LivC193
Copy link

LivC193 commented Oct 9, 2020

Describe the workflow you want to enable

From what I can understand PandasPDB.rmsd can calculate the rmsd only if both dataframes have the same length. However, if I want to compare 2 PDBs (one chain each) where the target protein (Uniprot ID) is the same (for example 2vua vs 2vu9) I can't because although the protein is the same one has a different purification tag and thus I cannot use the rsmd function

Describe your proposed solution

it should be pretty easy to calculate de identity between 2 structures and select only those residues. This can ofc be done manually by each user, but I think it would be a great improvement.

@rasbt
Copy link
Member

rasbt commented Oct 10, 2020

Hi there,
yes, you are right, right now it only computes the RMSD if the two amino acid sequences are the same. Like you suggested, a function that can handle this automatically by figuring out the relevant residues for certain cases could be coded, but I haven't got the time to implement this yet. it would be a good case for a future enhancement though.

@wjs20
Copy link

wjs20 commented Feb 13, 2022

I've had a go at adding a fix for this but I hit a bit of a brick wall.

My solution

  1. Extract amino acid sequence string from pdbs to be compared.
  2. Use difflib to find indices of matching blocks (i.e. excluding tags)
  3. Use indices to determine set of residue numbers + insertion codes that are shared between the two structures.
  4. Filter rows of both pdb dataframes to contain only shared residues and compute the rmsd on this subset.

Problem

Even after subsetting the longer pdb df to contain only residues present in the shorter, the 0th axis of the dataframes are still not equal. I found this was because although the subsetted dfs contain the same amino acid sequences, a small number of residues within the structure differ in their atom composition (how.. they are the same amino acid!?

I have included my python code and some intermediate outputs below so you can check it yourself.

from biopandas.pdb import PandasPdb
from biopandas.pdb.engines import amino3to1dict
from pathlib import Path
import difflib
from itertools import chain, groupby

# !mkdir data
# !curl -O https://files.rcsb.org/download/2VU9.pdb && mv 2VU9.pdb data/
# !curl -O https://files.rcsb.org/download/2VUA.pdb && mv 2VUA.pdb data/

DATA_PATH = Path('../data')
pdb_paths = !ls {DATA_PATH}/*

pdb1 = PandasPdb().read_pdb(pdb_paths[0])
pdb2 = PandasPdb().read_pdb(pdb_paths[1])

def amino3to1(pdbdf):
    tmp = pdbdf.copy()

    indices, residues = [], []
    for k, g in groupby(tmp.itertuples(), key = lambda x: x.residue_number):
        first_atom = list(g)[0]
        indices.append(first_atom.Index)
        residues.append(amino3to1dict[first_atom.residue_name])

    tmp = (
        tmp.loc[indices, ['chain_id', 'residue_number', 'insertion']]
        .assign(residue_name = residues)
        .reset_index()
    )

    return tmp

def locate_matching_blocks(seq1, seq2):
    matcher = difflib.SequenceMatcher(None, seq1, seq2)
    blocks = matcher.get_matching_blocks()[:-1]
    indices_1 = list(chain.from_iterable([
        range(block.a, block.a + block.size)
        for block in blocks
    ]))

    indices_2 = list(chain.from_iterable([
        range(block.b, block.b + block.size)
        for block in blocks
    ]))
    
    return indices_1, indices_2

seq_df1 = amino3to1(pdb1.df['ATOM'])
seq_df2 = amino3to1(pdb2.df['ATOM'])

seq1 = ''.join(seq_df1.residue_name.tolist())
seq2 = ''.join(seq_df2.residue_name.tolist())

# residue numbers 355, 356, 357 should be missing in sequence 1 (non matching tag)
seq_df1_indices_to_compare, seq_df2_indices_to_compare = locate_matching_blocks(seq1, seq2)

def add_residue_number_insertion_col(df):
    return (df.assign(
        residue_number_insertion = df.residue_number.astype(str).str.cat(df.insertion, sep = '_'))
       )

def subset_residue_number_insertions(df, subset):
    return df.loc[df.residue_number_insertion.isin(subset)]

seq1_residue_number_insertions = set(
    seq_df1.iloc[seq_df1_indices_to_compare]
    .pipe(add_residue_number_insertion_col)
    .residue_number_insertion
)

seq2_residue_number_insertions = set(
    seq_df2.iloc[seq_df2_indices_to_compare]
    .pipe(add_residue_number_insertion_col)
    .residue_number_insertion
)

subsetted_df1 = pdb1.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq1_residue_number_insertions)

subsetted_df2 = pdb2.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq2_residue_number_insertions)

def join_seq(df): return ''.join(amino3to1(df).residue_name)

# amino acid sequences are exactly the same
join_seq(subsetted_df1) == join_seq(subsetted_df2)
True

# shapes still unequal
print(subsetted_df1.shape, subsetted_df2.shape)
(3509, 22) (3514, 22)

If we collect the residue numbers mapped to their atom lists for each subsetted, dataframe you can see that although the residues match, some residue numbers are composed of different atoms in each dataframe.

atom_groups1 = [
    (k, [o.atom_name for o in list(g)] )
    for k, g in groupby(subsetted_df1.itertuples(), key = lambda x: x.residue_number_insertion)
]
atom_groups2 = [
    (k, [o.atom_name for o in list(g)] )
    for k, g in groupby(subsetted_df2.itertuples(), key = lambda x: x.residue_number_insertion)
]

[(g1,g2) for g1,g2 in zip(atom_groups1, atom_groups2) if g1[1] != g2[1]]

[(('873_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2']),
  ('873_', ['CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2'])),
 (('899_', ['N', 'N', 'CA', 'CA', 'C', 'C', 'O', 'O', 'CB', 'CB', 'CG', 'CG', 'OD1', 'OD1', 'ND2', 'ND2']),
  ('899_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'OD1', 'ND2', 'CA', 'CB', 'CG', 'OD1', 'ND2'])),
 (('925_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'OE1', 'OE2']),
  ('925_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2']))
...
]

There are 8 mismatches in the pair given in the description of the issue. Not sure what the best way to proceed is (or why this is happening..)

@rasbt any thoughts?

@a-r-j
Copy link
Contributor

a-r-j commented Mar 11, 2022

Hey @wjs20 I’m pretty certain this will be caused by altlocs/insertions (read more) You’ll likely have to filter them out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants