Empty contig files despite stats looking OK #478

Closed
alexbacita opened this issue May 14, 2024 · 11 comments

Comments

@alexbacita

Hi,

I'm running the following line: abyss-pe name=test_pe k=96 B=2G in='out_R1_001.fastq out_R2_001.fastq'

Despite getting stats on contigs:

n n:500 L50 min N75 N50 N25 E-size max sum name
381190 760 203 500 712 1090 1912 1837 9667 783583 test_pe-unitigs.fa
381122 714 165 500 729 1212 2365 2264 12653 790301 test_pe-contigs.fa
381107 705 156 500 729 1222 2468 2625 15574 790239 test_pe-scaffolds.fa

The following files are empty: test_pe-unitigs.fa, test_pe-contigs.fa and test_pe-scaffolds.fa, although I do have some sequences in the numbered iteration files (test-1.fa, test-2.fa and so on).

Could you please let me know how to interpret this and how to troubleshoot it?

Many thanks
Alex

@alexbacita
Author

zsh:test:1: unknown condition: -le
abyss-pe (ABySS) 2.3.7
Written by Shaun Jackman and Anthony Raymond.

Copyright 2012 Canada's Michael Smith Genome Science Centre

Description: Ubuntu 22.04.4 LTS

    | MergeContigs   -k96 -o test_pe-8.fa - test_pe-7.dot test_pe-7.path

The minimum coverage of single-end contigs is 1.08333.
The minimum coverage of merged contigs is 3.89583.
Consider increasing the coverage threshold parameter, c, to 3.89583.
ln -sf test_pe-8.fa test_pe-scaffolds.fa
PathOverlap --overlap -k96 --dot test_pe-7.dot test_pe-7.path >test_pe-8.dot
ln -sf test_pe-8.dot test_pe-scaffolds.dot
abyss-fac test_pe-unitigs.fa test_pe-contigs.fa test_pe-scaffolds.fa |tee test_pe-stats.tab
n n:500 L50 min N75 N50 N25 E-size max sum name
381190 760 203 500 712 1090 1912 1837 9667 783583 test_pe-unitigs.fa
381122 714 165 500 729 1212 2365 2264 12653 790301 test_pe-contigs.fa
381107 705 156 500 729 1222 2468 2625 15574 790239 test_pe-scaffolds.fa
time user=3.00s system=0.79s elapsed=9.69s cpu=39% memory=3 job=abyss-fac test_pe-unitigs.fa test_pe-contigs.fa test_pe-scaffolds.fa
time user=0.00s system=0.00s elapsed=9.70s cpu=0% memory=1 job=
ln -sf test_pe-stats.tab test_pe-stats
tr '\t' , <test_pe-stats.tab >test_pe-stats.csv
abyss-tabtomd test_pe-stats.tab >test_pe-stats.md

@lcoombe
Member

lcoombe commented May 14, 2024

Hi @alexbacita,

Are you seeing the files as 'empty' based on looking at the file size, or inspecting the contents?
Those files (test_pe-unitigs.fa; test_pe-contigs.fa; test_pe-scaffolds.fa) are soft-links to various numbered (-3, -6, -8) fasta files.

If you run abyss-fac test_pe-scaffolds.fa, for example, you should see the same statistics as you are showing above.

Thank you for your interest in ABySS!
Lauren
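A minimal sketch of the point about soft links, not taken from the thread: it builds a throwaway directory with a small FASTA file and a link to it, then shows that reading through the link reaches the real sequences even though the link itself is only a few bytes. The `demo-*` file names and the two toy sequences are hypothetical stand-ins for `test_pe-8.fa` and `test_pe-scaffolds.fa`.

```shell
#!/bin/sh
# Demo in a throwaway directory: a symlink to a FASTA file looks tiny by
# size, yet reads through to the full file. File names are hypothetical.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in for a numbered assembly file such as test_pe-8.fa.
printf '>0\nACGTACGTAC\n>1\nGGGTTTAAAC\n' > "$tmp/demo-8.fa"
# The same kind of link that abyss-pe creates with `ln -sf`.
ln -sf demo-8.fa "$tmp/demo-scaffolds.fa"

# Where the link points, and what is reachable through it.
link_target=$(readlink "$tmp/demo-scaffolds.fa")
real_bytes=$(wc -c < "$tmp/demo-scaffolds.fa")
n_seqs=$(grep -c '^>' "$tmp/demo-scaffolds.fa")

echo "link points to: $link_target"
echo "bytes read through the link: $real_bytes"
echo "sequences: $n_seqs"
```

On a real run, `ls -l test_pe-scaffolds.fa` should show the link target, and `ls -lL test_pe-scaffolds.fa` the size of the file it points to.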

@alexbacita
Author

Hi @lcoombe,

Thanks for the reply! I was just looking at size indeed. Ideally I would like to generate a consensus read to understand my sample sequence. What file should I use for this purpose? What is the distinction between the -3 -6 -8 files?

Many thanks,
Alex

@lcoombe
Member

lcoombe commented May 15, 2024

Hi Alex,

The names associated with each of those soft links refer to progressive stages of the ABySS assembly - unitigs, contigs and scaffolds. We generally consider the scaffolds (-8.fa) to be the final file of the assembly.

You could think of it like this - the contigs are generated using the assembly graph plus your input data to make more joins between your unitigs, and the scaffolds are generated using the same information with slightly different algorithms to make more joins between your contigs.

Hope that makes sense!
Lauren

@alexbacita
Author

Thanks Lauren! That makes sense. Could you please advise on how to obtain one single/continuous consensus sequence? Can this be done in ABySS, or do I need to use different software? Apologies for my ignorance, I'm new to de novo assembly.

@lcoombe
Member

lcoombe commented May 15, 2024

Hi Alex,

I'm not really sure what you mean by 'one single/continuous consensus sequence'. It would help to know more about what you are trying to assemble and what your input data is.

ABySS will assemble your input reads, but whether you should expect the full sequence in one piece or in multiple pieces really depends on what you are trying to assemble. Even if your target region is assembled in a single piece, there will likely be more than one sequence in the final assembly. Also, ABySS will generally not output multiple assemblies/copies of the same region, so there wouldn't be any sense in computing a 'consensus' of multiple sequences after assembly.

@alexbacita
Author

alexbacita commented May 15, 2024

Thanks Lauren,

I have a synthetic oligo that has been amplified with phi29 RCA to understand the bias of the enzyme.

When aligning against the reference sequence, 95% of reads map in the unamplified sample but only 2% of reads map in the RCA product. So I'm trying to understand the nature of this product by doing de novo assembly.

Good to know that ABySS does not generate overlapping copies of the same region! If I understand correctly, I still need to join the scaffolds?

@lcoombe
Member

lcoombe commented May 15, 2024

Hi Alex,

So if I'm understanding correctly, you're just trying to get an idea of what the reads are?

If you know what you are looking for in the target, you could do an alignment to the potential region of interest to see which contigs map there. If you don't really know what you're looking for at all, you could BLAST the assembled pieces.

It is possible that the region you are talking about is in one piece, or in multiple pieces - you won't really know until you do the analysis. If your intention is to get a single piece for a particular region, it can be worth doing k/kc sweeps, which can impact the contiguity of your assembly.
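A k/kc sweep of the kind mentioned above could be sketched as follows. This is a dry run under stated assumptions: the `k` and `kc` values are illustrative only, `sweep_commands` is a hypothetical helper, and the read paths assume the FASTQ files from the original command sit in the current directory (with no spaces in its path). The script prints one command per combination and assembles nothing on its own.

```shell
#!/bin/sh
# Dry-run sketch of a k/kc parameter sweep. It only prints one command per
# combination; nothing is assembled until the output is piped to `sh`.
set -eu

# Illustrative values, not recommendations from the thread.
K_VALUES="64 80 96"
KC_VALUES="2 3"
# Absolute paths, because each run executes inside its own directory.
READS="$PWD/out_R1_001.fastq $PWD/out_R2_001.fastq"

sweep_commands() {
    for kc in $KC_VALUES; do
        for k in $K_VALUES; do
            dir="k${k}-kc${kc}"
            # -C makes abyss-pe work inside $dir, so runs do not overwrite
            # each other's test_pe-* output files.
            echo "mkdir -p $dir && abyss-pe -C $dir name=test_pe k=$k kc=$kc B=2G in='$READS'"
        done
    done
}

sweep_commands
```

Piping the output to `sh` would launch the runs one after another. Afterwards, something like `abyss-fac k*/test_pe-scaffolds.fa` should put the contiguity statistics of all runs in one table for comparison.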

@alexbacita
Author

Hi Lauren,

The region of interest is unknown, and since it's a synthetic product I don't think BLAST will work. I'm pretty sure there are multiple pieces. If I understand correctly, the scaffolds in the ABySS output represent independent fragments assembled to their original size from the 150PE reads? If I were doing whole-genome analysis and put the PE reads into ABySS, would the output be a continuous sequence of the assembled genome (equivalent to the E. coli genome, for example), or is there a separate function to achieve that? Thank you so much for the discussion and support!

@lcoombe
Member

lcoombe commented May 16, 2024

Hi Alex,

If you have multiple independent fragments sequenced, then yes, even an optimal assembly would generate multiple different pieces.
For a given genome like E. coli, a perfect assembly would yield a single piece, but no assembler, ABySS included, is guaranteed to give you that perfect assembly. That's the same when assembling the genome of any species: ideally you want one piece per chromosome, but that is usually not what you get, especially for larger genomes.
With some parameter settings, you may get a single piece within your assembled scaffolds representing the full genome, or you may have the underlying genome in multiple pieces. If you know what you are looking for, that's when we do parameter sweeps, particularly of k and kc, to try to get as contiguous an assembly as possible.

@alexbacita
Author

Hi @lcoombe

Thank you very much for the clarifications and detailed replies, much appreciated!

Alex
