Genome size assembled 50% shorter than expected #324

frsepulveda · 2023-06-05T13:56:30Z

Hi! I'm doing a hybrid assembly of a red alga with a haploid genome, using PacBio and Illumina PE data. The expected genome size for my species is approx. 100 Mb. MaSuRCA "worked fine", but the final assembly is much shorter than expected (about 60 Mb).
This is my configuration file:

PE=p1 300 45 /media/server/Elements/DNA/DNBseq/soapnuke/clean/1D/POS_Hembra_1.fastq.gz /media/server/Elements/DNA/DNBseq/soapnuke/clean/1D/POS_Hembra_2.fastq.gz
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
PACBIO=/media/server/Elements/DNA/pacBio/Limpios/1A1_CLEAN.fa

PARAMETERS
#PLEASE READ all comments to essential parameters below, and set the parameters according to your project
#set this to 1 if your Illumina mate pair (jumping) library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 0
#specifies whether to run the assembly on the grid
USE_GRID=0
#specifies grid engine to use SGE or SLURM
GRID_ENGINE=SGE
#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=500000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
#can increase this to 30 or 35 if your long reads reads have N50<7000bp
LHE_COVERAGE=25
#this parameter is useful if you have too many Illumina jumping library reads. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler; do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data
CLOSE_GAPS=1
#number of cpus to use, set this to the number of CPUs/threads per node you will be using
NUM_THREADS = 40
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20
JF_SIZE = 2000000000
#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.
#Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
#If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).
#Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.
#A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.
#DO NOT use if you have less than 20x coverage by long reads.
FLYE_ASSEMBLY=1
END
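Two comments in the configuration above give numeric rules of thumb: JF_SIZE should be about estimated_genome_size*20, and Flye should not be used with less than 20x long-read coverage. A minimal Python sketch of those two checks follows. The function names are hypothetical helpers, not part of MaSuRCA, and the coverage figure is only a rough estimate (total long-read bases divided by the expected genome size).

```python
import gzip


def fasta_total_bases(path):
    """Sum the sequence lengths in a FASTA file (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            # Header lines start with '>'; everything else is sequence.
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def suggested_jf_size(genome_size_bp, factor=20):
    """Rule of thumb from the config comment: estimated_genome_size*20."""
    return genome_size_bp * factor


def long_read_coverage(total_bases, genome_size_bp):
    """Rough coverage estimate: total read bases / expected genome size."""
    return total_bases / genome_size_bp


def flye_coverage_ok(coverage, minimum=20):
    """Threshold taken from the config comment on FLYE_ASSEMBLY."""
    return coverage >= minimum
```

For a 100 Mb genome, `suggested_jf_size(100_000_000)` gives 2000000000, which matches the JF_SIZE set above. Passing the result of `fasta_total_bases` on the PACBIO file to `long_read_coverage` would show whether the long reads reach the 20x that the FLYE_ASSEMBLY comment asks for.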

Also, I'm attaching the flye.log.
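For reference, the assembled size reported above can be measured directly from the final assembly FASTA. This is a minimal sketch, assuming a plain-text FASTA file; the helper names are hypothetical and the file path is left to the reader. It reports total length, N50, and the shortfall relative to the expected size.

```python
def contig_lengths(path):
    """Return the length of each sequence in a plain-text FASTA file."""
    lengths = []
    current = 0
    seen_header = False
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            else:
                current += len(line.strip())
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running sum reaches half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def shortfall_fraction(expected_bp, observed_bp):
    """Fraction of the expected genome size missing from the assembly."""
    return (expected_bp - observed_bp) / expected_bp
```

With the numbers from this post, `shortfall_fraction(100_000_000, 60_000_000)` evaluates to 0.4, meaning the assembly is about 40% shorter than the expected 100 Mb.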
