Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyfastx sometimes gives an erroneously shifted subsequence from hg19 reference #11

Open
hartonen opened this issue Mar 11, 2020 · 5 comments

Comments

@hartonen
Copy link

Hi!

I have tried for a few days now to understand why I sometimes get different subsequences than using 'bedtools getfasta' from hg19 reference. I was wondering if you could elaborate the indexing used by pyfastx in case I am making some error. I demonstrate this by fetching these two subsequences from chromosome 20:

seq1: start=60757997, end=60758117
seq2: start=60757999, end=60758119

both sequences are 120bp long and seq2 is shifted downstream by 2bp from seq1, meaning they should overlap with 118bp.

Using bedtools getfasta, I get the expected result:

$ cat test.bed 
chr20   60757997        60758117
chr20   60757999        60758119
$ bedtools getfasta -fi /home/work/genomes/hg19/hg19.fa -bed test.bed -fo bedtools.fasta
$ cat bedtools.fasta 
>chr20:60757997-60758117
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20:60757999-60758119
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

However, when trying to fetch the same sequences with pyfastx, I get a seq2 that is shifted by 3bp instead of 2bp. The commands issued in ipython:

In [11]: genome = pyfastx.Fasta('/home/work/genomes/hg19/hg19.fa')

In [12]: str(genome["chr20"][60757997:60758118])
Out[12]: 'GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA'

In [13]: str(genome["chr20"][60757999:60758120])
Out[13]: 'GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG'

So seq1 is identical to what bedtools returns but for some reason seq2 starts from different position even though the start position is defined to be the same as in bedtools example.

I also replicated the test with third tool, seqkit, and it agrees with bedtools:

$ seqkit grep -p 'chr20' /home/work/genomes/hg19/hg19.fa | seqkit subseq --bed test.bed
[INFO] read BED file ...
[INFO] 2 BED features loaded
>chr20_60757998-60758117:. 
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGC
CCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20_60758000-60758119:. 
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCC
CGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

I have also deleted both the .fai and .fxi index files and rebuilt them but the issue persists. Can you explain why I get the different sequence than with bedtools? And why sometimes I get the same sequence as with other tools?

Let me know if I was unclear or some information is missing from my report.

Best regards,
Tuomo

@lmdu
Copy link
Owner

lmdu commented Mar 12, 2020

Hi Tuomo,

The position you used to slice sequence is not correct. In bed file, the start position is 0-based, and the end position is 1-based. This means that (60757997, 60758117) in bed file is corresponding to subsequence from 60757998 to 60758117. In pyfastx, Sequence object is just like a Python string. The sliced position is 0-based and not include the end position. So, the sliced position genome["chr20"][60757997:60758118] is corresponding to subsequence from 60757998 to 60758118.

You can just use genome['chr20'][60757997:60758117] and genome['chr20'][60757999:60758119] to get right sequence.

@hartonen
Copy link
Author

Hi and thank you for a quick reply!

I see that I was maybe a bit unclear in explaining the source of my confusion so I will try again. So I want to fetch the following 120bp long sequences:

seq1:

>chr20:60757997-60758117
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA

seq2:

>chr20:60757999-60758119
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

I know that I can get the exact same sequence seq1 with pyfastx with:

In [17]: str(genome['chr20'][60757997:60758118])
Out[17]: 'GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA'

This is the same 120bp long sequence than my seq1. Now my seq2 is 2bp upstream of seq1, so I should get that exact same sequence by adding 2bp to end and start coordinates right? But instead of getting seq2, I get a sequence that is 3bp upstream of seq1:

In [19]: str(genome['chr20'][60757999:60758120])
Out[19]: 'GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG'

Actually, I cannot find any coordinate that would return me a sequence that would start with CGGCGTG... as the seq2 I have defined above. It seems that for some reason pyfastx skips the C that is on 3rd position of seq1:

In [21]: start0 = 60757997
    ...: end0 = 60758118
    ...: for i in range(0,4):
    ...:     print('start='+str(start0+i)+', end='+str(end0+i))
    ...:     print(str(genome['chr20'][start0+i:end0+i]))
    ...:     
start=60757997, end=60758118
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
start=60757998, end=60758119
TCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAG
start=60757999, end=60758120
GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG
start=60758000, end=60758121
GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTGC

To me it seems that sometimes increasing the start coordinate by 1 shifts the sequence sometimes by 0bp, sometimes by 1bp and sometimes by 2bp. Also the last of these sequences is 121bp long and the others 120bp long even though the difference between the start and end coordinates is always the same.

So to me this is weird:

  • if I add 1 to start coordinate 60757997, sequence shifts by 1
  • if I add 1 to start coordinate 60757998, sequence shifts by 2
  • if I add 1 to start coordinate 60757999, sequence shifts by 0

Also I don't understand why changing the end position changes the start position of the sequence:

In [23]: str(genome['chr20'][60758000:60758120])
Out[23]: 'GCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG'

In [24]: str(genome['chr20'][60758000:60758121])
Out[24]: 'GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTGC'

Hopefully I was able to better elaborate my points of confusion.

Best regards,
Tuomo

@lmdu
Copy link
Owner

lmdu commented Mar 13, 2020

Thank you for reporting issue. It is a strange bug. Could you send me your Python version and operating system. I have test it on ubuntu with Python 3.6.9 and it woks fine.

@hartonen
Copy link
Author

Hi!

So the OS:

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

and Python:

In [26]: import sys

In [27]: print(sys.version)
3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]

And I am using the standard hg19 reference genome that can be downloaded from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz

Best regards,
Tuomo

@lmdu
Copy link
Owner

lmdu commented Mar 14, 2020

Hi Tuomo,

I am so sorry. I did my best to test your code on Ubuntu and Windows with Python3.6 and Python3.7. I really didn't meet the bug you described. But I just released a new version. In this version, I have fixed error start postion for reading sequence caused by fseek when processing large file. I am not sure whether this version can solve your problem. But you might try it. By the way, which tool you have used to install pyfastx, pip or conda? I recommended you try the precompiled version in pypi. If it still doesn't work, please let me know. Thank you for that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants