Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chromosome order #146

Open
Phlya opened this issue Jan 8, 2021 · 10 comments
Open

chromosome order #146

Phlya opened this issue Jan 8, 2021 · 10 comments

Comments

@Phlya
Copy link

Phlya commented Jan 8, 2021

Is there a way to specify chromosome order when installing a genome? E.g., when downloading a human genome I want the chromosomes to be sorted in the "normal" way aschr1, chr2, ... chr10, ... chr22, chrX, chrY, chrM. But currently they are sorted as chr1, chr10, ..., chr2, ..., chr22, chr3, ... chr9, chrM, chrX, chrY. Is it just the same order as in UCSC (I downloaded from there)?

@siebrenf
Copy link
Member

siebrenf commented Jan 8, 2021

genomepy does not (currently) support this. This snippet should do the trick:

from pyfaidx import Fasta
import re 

oldfa = '/path/to/genome.fa'
newfa = '/path/to/sorted_genome.fa'

def sorted_nicely( l ): 
    """ 
    Sort the given iterable in the way that humans expect.
    source: https://stackoverflow.com/questions/2669059/how-to-sort-alpha-numeric-set-in-python
    """ 
    convert = lambda text: int(text) if text.isdigit() else text 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

fa = Fasta(oldfa)
sorted_chr = sorted_nicely(fa.keys())
with open(newfa) as sorted_fa:
    for chr in sorted_chr:
        sorted_fa.write('>'+chr+'\n')
        sorted_fa.write(fa[chr][:])

@Phlya
Copy link
Author

Phlya commented Jan 8, 2021

Thank you. Is there a way to perform this, but still use genomepy to manage the new sorted genome?

@siebrenf
Copy link
Member

siebrenf commented Jan 8, 2021

Assuming that with manage, you mean also applies this sorting to the support files (fa.sizes, gaps.bed & fa.fai). If so, you can run this after the previous code:

from genomepy import Genome
import os

g = Genome(oldfa)
os.unlink(g.index_file)
os.unlink(g.sizes_file)
os.unlink(g.gaps_file)

g = Genome(newfa)

if oldfa and newfa were the same it should just overwrite the old stuff (check if this does what you intended though)

@Phlya
Copy link
Author

Phlya commented Jan 8, 2021

OK, thank you!
Would this feature be something you might be interested in? I might try to make a PR to add it then. The ordering of chromosomes is pretty important, and non-trivial in some organisms.

@siebrenf
Copy link
Member

Thank you for the offer, but I don't think this is a desired feature. Chromosomes are sorted by size, which all tools I know can use, and some tools might not work otherwise (due to assumptions). Natural sorting would also fail for certain genomes (such as those that use roman numerals).

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

Are chromosomes sorted by size? That's not what I experienced, for me they were sorted alphabetically, as I indicated in the first post.
I'm not suggesting necessarily natural sorting, but just arbitrary sorting. Natural sorting can be an option, but also the actual desired order of chromosomes could be provided.

@siebrenf
Copy link
Member

My bad, you're right about the default order.
Inserting a sorting function into genomepy (as user) would however be equally complex to running the sorting afterward. And since we can't add sorting methods for every genome, I feel this step is best left to the user.

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

Doesn't need to be a function, just the actual list of chromosome names in the right order would work.

@simonvh
Copy link
Member

simonvh commented Jan 11, 2021

Hi @Phlya, I think we're hesitant to add functionality to genomepy where the use-case is not clear as this is functionality that needs to be tested and maintained down the line and adds additional complexity to the command line tool.
However, maybe some further information would help. Can you provide a clear use-case for this, i.e. an example where the order of the chromosomes in a genome FASTA need to be changed? I would expect most tools to expect lexicographical order, but maybe I'm wrong here.

@Phlya
Copy link
Author

Phlya commented Jan 11, 2021

No worries, I understand if this functionality might be not the most important.

My main use-case is that I want to use genomepy in a pipeline to manage the genome reference. Different tools and steps rely on a chromsizes file, and that needs to align with how the data is processed, including the order of chromosomes, and some tools also need the genome sequence. I need to investigate if the order of chromosomes in the fasta file needs to match the order in the chromsizes file, but even if it's not strictly required for the tools I am using now, I am not keen on the idea of having a chromsizes file with a different order than the fasta - I feel that's a dangerous combination that can lead to errors in some cases.

And of course I'd rather have my chromosomes sorted in the logical order (which matches the natural sorting for human, but could be arbitrarily different in other organisms).

I hope this explains my use case and gives you some context. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants