Description in FASTA title prevents correct reformatting #17

Open
wbazant opened this issue Feb 23, 2021 · 0 comments

wbazant commented Feb 23, 2021

I have a file with reads formatted like:

@SRR3132037.1 1 length=142
AAGGCGGCTGTTGTTATGTTATCGGCTCCATGCGAAACAGCAAAATTATTACAGCCCAGCGCATGTTACTGGTTCGCCTGTACCAGTTAGCAGAAAGGCGCGACCGTGTAACAGACGTACCGCGCCCCATCGCTATTTATCT
+SRR3132037.1 1 length=142
/EEEEEEAEE/EEAEAEEEEEEEEEEE/EEEEEEEE6EEEEEEEEAEEEEEAEEAAA/AEEEEEAEEEEEEE/EE/EEAAE/EEEEEEEEAEE//EEE<EEEA/<E</<E/<EAEEAAA<EEEEEEA6EEEEA<E/EEEE<E

This is a valid file according to https://en.wikipedia.org/wiki/FASTQ_format, and it is the default output format of fastq-dump.

It gets reformatted like this:

head -n4 reordered_c1gxnbm__reformatted_identifierslaj96u_u_reads reordered_6p35sxlp_reformatted_identifiersz4yibe76_reads_R
==> reordered_c1gxnbm__reformatted_identifierslaj96u_u_reads <==
@SRR3132037.1.11length=142
AAGGCGGCTGTTGTTATGTTATCGGCTCCATGCGAAACAGCAAAATTATTACAGCCCAGCGCATGTTACTGGTTCGCCTGTACCAGTTAGCAGAAAGGCGCGACCGTGTAACAGACGTACCGCGCCCCATCGCTATTTATCT
+SRR3132037.1.1 1 length=142
/EEEEEEAEE/EEAEAEEEEEEEEEEE/EEEEEEEE6EEEEEEEEAEEEEEAEEAAA/AEEEEEAEEEEEEE/EE/EEAAE/EEEEEEEEAEE//EEE<EEEA/<E</<E/<EAEEAAA<EEEEEEA6EEEEA<E/EEEE<E

==> reordered_6p35sxlp_reformatted_identifiersz4yibe76_reads_R <==
@SRR3132037.1.21length=142
AACCAGTTTGAATTTGGCATTTTCAACGACCGCACCGACAACGGCATCGCCTTTCAAATCTGCCGGGCAGTCTTTTAAATCGGCAATGGACTGGTGATCCATAATAATATGTACGCCTTTATCCCGCGCCGCACCTAATCCC
+SRR3132037.1.2 1 length=142
/A/EEEEAEEEEEEEEEEEEEEAEEEEAAEEAEE<6E<EEEAEEEEEEAEAEEEAEAEEEEEAE/AEAEE<EE/EEEAEE/6EEA<EEEE//EEAAEA/E//EEA/A<EA/E/EEEAAE/AE</<6<<6AAE/EAAE<<A66

I found this issue when running a fork under #16, which assumes the mates of reads are labelled. I think it is still an issue otherwise, though not a very severe one: kneaddata_bowtie2_discordant_pairs would, I guess, fail to match SRR3132037.1.11length=142 with SRR3132037.1.21length=142, and as a consequence put them in the wrong file, so the --strict option there wouldn't work correctly.
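
To illustrate the guess above, here is a minimal sketch of mate matching by identifier. The helper mate_key is hypothetical and is not taken from the kneaddata code; it only assumes that mates are paired by stripping a trailing ".1"/".2" (or "/1"/"/2") label. Under that assumption the fused identifiers never pair up, while clean ones do.

```python
import re

# Hypothetical helper, NOT kneaddata's implementation: build a pairing key
# by removing a trailing mate label such as ".1", ".2", "/1" or "/2".
_MATE_SUFFIX = re.compile(r"[./][12]$")

def mate_key(identifier):
    """Return the identifier without its trailing mate label, if any."""
    return _MATE_SUFFIX.sub("", identifier)

# Identifiers as they appear in the reformatted files above.
fused_r1 = "SRR3132037.1.11length=142"
fused_r2 = "SRR3132037.1.21length=142"

# Identifiers as they would look with the description removed.
clean_r1 = "SRR3132037.1.1"
clean_r2 = "SRR3132037.1.2"

# The fused identifiers do not end in a mate label, so their keys differ.
print(mate_key(fused_r1) == mate_key(fused_r2))
# The clean identifiers reduce to the same key.
print(mate_key(clean_r1) == mate_key(clean_r2))
```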

I have a workaround for the issue on my end, but if you like, I am happy to fix it in the codebase. I see that utilities.get_reformatted_identifiers does some reformatting. I would change it to remove the description from the title line, and also to write the "+" line as just "+".
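
Roughly, the change I have in mind looks like the sketch below. The function name normalize_fastq_record and its signature are made up for illustration; this is not the current body of utilities.get_reformatted_identifiers, just the behaviour I would aim for, assuming each record arrives as four lines and the mate number is known.

```python
def normalize_fastq_record(record, mate):
    """Return a 4-line FASTQ record with a bare, mate-labelled title.

    Hypothetical sketch: `record` is a sequence of the four lines of one
    read (title, sequence, "+" line, quality) and `mate` is 1 or 2.
    """
    title, sequence, _plus_line, quality = record
    if not title.startswith("@"):
        raise ValueError("FASTQ title line must start with '@'")
    fields = title[1:].split()
    if not fields:
        raise ValueError("FASTQ title line has no identifier")
    read_id = fields[0]  # everything after the first whitespace is description
    # Label the mate on the bare identifier and reduce the "+" line to "+".
    return ["@{}.{}".format(read_id, mate), sequence, "+", quality]


# Title lines from the example above; sequence and quality are cut to the
# first 8 characters to keep the example short.
record = [
    "@SRR3132037.1 1 length=142",
    "AAGGCGGC",
    "+SRR3132037.1 1 length=142",
    "/EEEEEEA",
]
print("\n".join(normalize_fastq_record(record, 1)))
```

With this, the forward and reverse titles would come out as @SRR3132037.1.1 and @SRR3132037.1.2, which differ only in the mate label.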

wbazant added a commit to VEuPathDB/humann-nextflow that referenced this issue Feb 23, 2021