Renumbering residues #51

Open
ajasja opened this issue May 25, 2018 · 9 comments

Comments

@ajasja

ajasja commented May 25, 2018

Hi! Nice library, has a lot of potential.

Is there a way to renumber residues?
Renumbering atoms seems trivial (just assign a range to the atom_number), however renumbering residues would probably require some heavy duty group_by magic and could be built in.
(renumbering atoms could also be built in:)
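
For reference, a minimal sketch of the atom renumbering mentioned above. The toy DataFrame is an assumption standing in for ppdb.df['ATOM']; only the atom_number column name is taken from BioPandas.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM']; serial numbers have gaps on purpose.
atoms = pd.DataFrame({
    "atom_number": [5, 6, 9, 12],
    "atom_name": ["N", "CA", "C", "O"],
})

# Assign consecutive serial numbers starting at 1, in the current row order.
atoms["atom_number"] = range(1, len(atoms) + 1)
```

This only renumbers by row position, so any sorting or filtering should happen before the assignment.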

@rasbt
Member

rasbt commented May 25, 2018

Hi,

I just checked, and it seems I haven't implemented anything like that.

There are some easy ways to do that using pandas base functionality. E.g., to decrease the residue numbers by 1 you can simply do

ppdb.df['ATOM']['residue_number'] -= 1

However, if the residue numbers are not in order, or if there are gaps, like (1, 2, 10, 20), which you want to renumber to (1, 2, 3, 4), you would have to do it differently. E.g.,

you could first get all the unique residue numbers in the order they appear:

from collections import OrderedDict

ordered_unique_elements = \
    list(OrderedDict.fromkeys(ppdb.df['ATOM']['residue_number']))

and then map from the old residue numbers to the new, contiguous residue numbers:

mapping_dict = {ordered_unique_elements[i]: i+1 
                for i in range(0, len(ordered_unique_elements))}

ppdb.df['ATOM']['residue_number'] = \
    ppdb.df['ATOM']['residue_number'].map(mapping_dict)

I could actually add that as a method to BioPandas, or maybe just explain it in documentation. What do you think?
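
As a possible alternative to the OrderedDict mapping, the same contiguous renumbering can be sketched with pandas' own pd.factorize, which assigns integer codes in order of first appearance. The toy DataFrame below is an assumption standing in for ppdb.df['ATOM'].

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] with gaps in the residue numbering.
atoms = pd.DataFrame({"residue_number": [1, 1, 2, 10, 10, 20]})

# factorize returns 0-based codes in order of first appearance;
# adding 1 gives contiguous residue numbers starting at 1.
codes, _ = pd.factorize(atoms["residue_number"])
atoms["residue_number"] = codes + 1
```

Note that both approaches look at residue_number alone, so residues that share a number across different chains would end up with the same new number.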

@ajasja
Author

ajasja commented May 26, 2018

Wow, that is some elegant python code!
I would recommend adding a renumber_residues method to BioPandas. I would expect this to be a common enough operation. For completeness I would also add a renumber_atoms method.

Do you think both methods should handle renumbering the ANISOU records at the same time? Otherwise the records might go out of sync.
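
One way to keep the two record types in sync is to build the old-to-new mappings once from the ATOM records and apply them to both tables. This is only a sketch under assumptions: the toy DataFrames stand in for ppdb.df['ATOM'] and ppdb.df['ANISOU'], and build_mapping is a hypothetical helper, not a BioPandas function.

```python
import pandas as pd

# Toy stand-ins for ppdb.df['ATOM'] and ppdb.df['ANISOU'].
atom = pd.DataFrame({"atom_number": [10, 11, 12], "residue_number": [5, 5, 9]})
anisou = pd.DataFrame({"atom_number": [10, 12], "residue_number": [5, 9]})


def build_mapping(values):
    """Map each distinct value to 1, 2, 3, ... in order of first appearance."""
    return {old: new for new, old in enumerate(dict.fromkeys(values), start=1)}


# Build both mappings from the ATOM table only ...
atom_map = build_mapping(atom["atom_number"])
res_map = build_mapping(atom["residue_number"])

# ... and apply the same mappings to ATOM and ANISOU so they stay consistent.
for table in (atom, anisou):
    table["atom_number"] = table["atom_number"].map(atom_map)
    table["residue_number"] = table["residue_number"].map(res_map)
```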

What I'm trying to achieve is to split a PDB by chains, reorder the chains and combine them in a different PDB. I've also looked at pdbtools; however, that is more command-line based and I'd like to do this in Python code.
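
A rough sketch of that split, reorder and combine workflow using plain pandas. The toy DataFrame stands in for ppdb.df['ATOM'], and new_order is a hypothetical target chain order chosen for illustration.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] with three chains.
atoms = pd.DataFrame({
    "chain_id": ["A", "A", "B", "B", "C"],
    "residue_number": [1, 2, 1, 2, 1],
    "atom_number": [1, 2, 3, 4, 5],
})

new_order = ["C", "A", "B"]  # hypothetical target chain order

# Split into one DataFrame per chain, keeping the original row order.
chains = {cid: grp for cid, grp in atoms.groupby("chain_id", sort=False)}

# Recombine in the new order and renumber the atoms consecutively.
reordered = pd.concat([chains[cid] for cid in new_order], ignore_index=True)
reordered["atom_number"] = range(1, len(reordered) + 1)
```

Residue numbers are left untouched here; they could be renumbered per chain afterwards if needed.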

@wojdyr
Contributor

wojdyr commented May 31, 2018

@rasbt when you re-number sequentially you should also consider insertion codes (i.e. get all unique residue numbers + icodes, assign new numbers and remove the insertion codes).

@rasbt
Member

rasbt commented May 31, 2018

Good point. Yeah, with the renumbering, there are so many things to consider, all of which are pretty use-case specific. (Probably why I haven't made such a function/method in the past).

I am still thinking whether a standardized renumbering method should be added vs extending the documentation with easy-to-follow examples that give people more flexibility ...

@drewaight

I would second a renumbering function, especially for antibody sequences. The insertion codes make this pretty difficult.

@rasbt
Member

rasbt commented Jan 18, 2020

Sounds good, I agree. I am currently caught up with a pretty long to-do list of other things (and the semester is going to start Tue), so I am not sure yet when I will get to this. If someone wants to take a crack at it, I welcome PRs.

@drewaight

Insertion codes were never much of an issue for me until I started working with antibodies, where they are everything (different programs use different numbering; it's a nightmare!). Anyway, with the help of Stack Overflow I was able to figure this out (https://stackoverflow.com/questions/59804249/mapping-tuple-dictionary-to-multiple-columns-of-a-dataframe). I will make a PR when I'm less embarrassed by my brute-force methods and ugly code. For now, here are my notes.

ppdb.amino3to1 will 'cut out' duplicate residue numbers with insertions. Your sequence needs to be free of insertion codes (unique 'residue_number') for the sequence to be returned correctly. For an antibody complex, for instance, I split the heavy and light chain sequences from ppdb.df['ATOM'] into separate dataframes and renumbered them sequentially without insertion codes using the following function (inspired by Sebastian's code above):

def seq_order(df):
    from collections import OrderedDict
    df['residue_insertion'] = df['residue_number'].astype(str)+df['insertion'].astype(str)
    ordered_seq = list(OrderedDict.fromkeys(df['residue_insertion']))
    seq_dict = {ordered_seq[i]: i+1 for i in range(0, len(ordered_seq))}
    df['residue_insertion'] = df['residue_insertion'].map(seq_dict)
    df['residue_number'] = df['residue_insertion']
    df.drop(['residue_insertion'], axis=1, inplace = True)
    df['insertion'] = None
    return df

I added the renumbered heavy and light chain dataframes back into ppdb.df['ATOM'] to run ppdb.amino3to1() (I think this function only works on a PandasPdb object and not on subset dataframes).

I got my renumbering script (Anarci) to output a dataframe with the columns 'residue_number', 'insertion', 'new_res' and 'new_ins':

   residue_number           insertion        new_res      new_ins
0               2                                1         
1               3                                2        
2               3                 A              3        
3               5                                4              A

Then I did a left merge back into the corresponding heavy or light chain dataframe (I'm still a little unsure how this works), dropped the original residue numbers and renamed the new ones. Finally, I merged the whole thing back into the PandasPdb and wrote it out.
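
The merge step described above could look roughly like the sketch below. It is an illustration under assumptions: chain is a toy stand-in for one heavy or light chain subset of ppdb.df['ATOM'], numbering reuses the four example rows from the table above, and empty strings are assumed for missing insertion codes.

```python
import pandas as pd

# Toy stand-in for one chain taken from ppdb.df['ATOM'].
chain = pd.DataFrame({
    "residue_number": [2, 3, 3, 3, 5],
    "insertion": ["", "", "A", "A", ""],
    "atom_name": ["CA", "CA", "N", "CA", "CA"],
})

# Per-residue lookup table: old (number, insertion) -> new (number, insertion).
numbering = pd.DataFrame({
    "residue_number": [2, 3, 3, 5],
    "insertion": ["", "", "A", ""],
    "new_res": [1, 2, 3, 4],
    "new_ins": ["", "", "", "A"],
})

# A left merge keeps every atom row and attaches the new numbering
# wherever the (residue_number, insertion) pair matches.
merged = chain.merge(numbering, on=["residue_number", "insertion"], how="left")

# Drop the old columns and promote the new ones to the standard names.
renumbered = (
    merged.drop(columns=["residue_number", "insertion"])
          .rename(columns={"new_res": "residue_number", "new_ins": "insertion"})
)
```

Merging on both columns at once is what keeps residues such as 3 and 3A apart; merging on residue_number alone would duplicate rows.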

I'm sure there's a more elegant way, but please give me a break, I'm a crystallographer. I love BioPandas, by the way. I hope this helps anyone struggling with the same issue.

@luwei0917

def seq_order(df):
    from collections import OrderedDict
    df['residue_insertion'] = df['residue_number'].astype(str)+df['insertion'].fillna('')
    ordered_seq = list(OrderedDict.fromkeys(df['residue_insertion']))
    seq_dict = {ordered_seq[i]: i+1 for i in range(0, len(ordered_seq))}
    df['residue_insertion'] = df['residue_insertion'].map(seq_dict)
    df['residue_number'] = df['residue_insertion']
    df.drop(['residue_insertion'], axis=1, inplace = True)

    return df

is slightly better in my opinion.

@johnnytam100

Same request for such a built-in feature.


6 participants