GitHub - alekseyzimin/wgs

These are Release Notes for Celera Assembler version 6.1, released May 1st, 2010.

This distribution package provides a stable, tested, documented version of the software. The
distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary
distributions ready for installation.

The source code package includes full source code, Makefiles, and scripts. A subset of the kmer
package (http://kmer.sourceforge.net/, version r1821), used by some modules of Celera Assembler, is
included.

This package was prepared by scientists at the J. Craig Venter Institute (http://www.jcvi.org/) with
funding provided by the National Institutes of Health (http://www.nih.gov/).

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Citation

Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard
citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science
287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et
al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007],
haplotype separation [Levy et al. 2007; Denisov et al. 2008], and Sanger+pyrosequencing hybrid
assembly [Goldberg et al. 2006; Miller et al. 2008]. There are links to these papers, and more, in
the on-line documentation (http://wgs-assembler.sourceforge.net/).

Compilation and Installation

The source code package needs to be compiled and installed before it can be used. The following
commands will work on nearly all Unix-like platforms.

% bzip2 -dc wgs-6.1.tar.bz2 | tar -xf -
% cd wgs-6.1
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..

Binary distributions need only be unpacked.

% bzip2 -dc wgs-6.1-Linux-amd64.tar.bz2 | tar -xf -

Example

Here is one example of how to use Celera Assembler. We download DNA sequence from the NCBI TraceDB
hosted by the National Institutes of Health. We request paired-end fragment reads generated by
Sanger chemistry on a bacterial genome. We convert the data to Celera Assembler format (a *.frg
file). Then we launch the assembler software.

Please see the expanded version of this example for more details.

% ftp ftp.ncbi.nih.gov
ftp> cd pub/TraceDB/porphyromonas_gingivalis_w83
[...]
ftp> ls -l
229 Entering Extended Passive Mode (|||50055|)
150 Opening ASCII mode data connection for file list
-r--r--r-- 1 ftp anonymous 803288 Feb 19 17:29 anc.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 226576 Feb 19 17:29 clip.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 10261474 Feb 19 17:29 fasta.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 18039942 Feb 19 17:29 qual.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 1143755 Feb 27 08:01 xml.porphyromonas_gingivalis_w83.001.gz
226 Transfer complete.
ftp> bin
ftp> prompt
ftp> mget fasta* qual* xml*
ftp> bye
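
The interactive session above can also be scripted. The following is a minimal sketch, assuming the
files are still served at the host and path shown in the session and that curl is installed. The
helper name tracedb_urls is hypothetical and is not part of Celera Assembler.

```shell
#!/bin/sh
# tracedb_urls ORGANISM VOLUME
# Hypothetical helper: print the FTP URLs of the three file kinds fetched
# in the session above (fasta, qual, xml). Host and path are copied from
# that session; whether the server still offers them is an assumption.
tracedb_urls() {
    organism=$1   # e.g. porphyromonas_gingivalis_w83
    volume=$2     # e.g. 001
    base="ftp://ftp.ncbi.nih.gov/pub/TraceDB/$organism"
    for kind in fasta qual xml; do
        printf '%s/%s.%s.%s.gz\n' "$base" "$kind" "$organism" "$volume"
    done
}

# To download, pass each URL to curl (-O keeps the remote file name):
# tracedb_urls porphyromonas_gingivalis_w83 001 | xargs -n 1 curl -O
```

The download line is left commented out so that the URLs can be inspected before any transfer starts.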

Convert the NCBI files to files suitable as Celera Assembler input. The last trace-to-frg step takes
about a minute. The 'ls' command demonstrates the expected output files.

% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
WARNING: creating bogus distance for library 'T0611'
WARNING: creating bogus distance for library 'T0612'
WARNING: creating bogus distance for library 'T13146'
WARNING: creating bogus distance for library 'T24315'
porphyromonas_gingivalis_w83.001: frags=40039 links=17113 (85.48%)
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
Found 20020 in tafrg-porphyromonas_gingivalis_w83/porphyromonas_gingivalis_w83.001.frglib.
% ls -l *.frg*
-rw-r--r-- 1 bri bri 550 Jul 15 21:54 porphyromonas_gingivalis_w83.1.lib.frg
-rw-r--r-- 1 bri bri 19119148 Jul 15 21:55 porphyromonas_gingivalis_w83.2.001.frg.bz2
-rw-r--r-- 1 bri bri 704432 Jul 15 21:54 porphyromonas_gingivalis_w83.3.lkg.frg

Run Celera Assembler with runCA.

% time perl wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.*
[...lots of status output...]
[...about four minutes later...]
% ls -l testassembly/9-terminator/*fasta
-rw-r--r-- 1 bri bri 2373611 Jul 15 22:07 testassembly/9-terminator/pging.ctg.fasta
-rw-r--r-- 1 bri bri 39177 Jul 15 22:07 testassembly/9-terminator/pging.deg.fasta
-rw-r--r-- 1 bri bri 2392028 Jul 15 22:07 testassembly/9-terminator/pging.scf.fasta
-rw-r--r-- 1 bri bri 445876 Jul 15 22:07 testassembly/9-terminator/pging.singleton.fasta
-rw-r--r-- 1 bri bri 2692602 Jul 15 22:07 testassembly/9-terminator/pging.utg.fasta
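
File sizes give only a rough picture of the result. As a quick sanity check, the number of sequences
and bases in each output file can be counted with awk. This is a minimal sketch; fasta_stats is a
hypothetical helper, not a Celera Assembler utility.

```shell
#!/bin/sh
# fasta_stats FILE
# Print the number of sequences (header lines starting with ">") and the
# total number of sequence characters in a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                              # header: one more sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }    # sequence line: add its length
        END { printf "%d sequences, %d bases\n", n, bases }
    ' "$1"
}

# Example (path taken from the listing above):
# fasta_stats testassembly/9-terminator/pging.scf.fasta
```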

Changes

Celera Assembler 6 made major changes to the format of the data stores used during an
assembly. Because of this, it is not possible to use any previous version of Celera Assembler with
data generated by Celera Assembler 6.1. Likewise, Celera Assembler 6.1 is not able to use data
generated by any previous version. Specifically, any assemblies started with any previous version
cannot be finished with Celera Assembler 6.1; they must be recomputed from scratch.

Features Available Starting in 6.1

1. More Reads: Designed for Sanger sequencing technology, Celera Assembler 4 could comfortably
handle tens of millions of reads. Celera Assembler 5 eventually handled hundreds of millions
of reads. Celera Assembler 6.1 should be able to handle up to a billion reads, though we
haven't tested this.

2. Duplicate 454 Reads: In previous versions, a 454 read that was a prefix of some other 454 read
was removed, but only if the prefix match was exact. Starting in Celera Assembler
6.1, errors are allowed in the prefix match. Additionally, duplicate mate-pairs derived from
the same molecule are detected and removed.

3. Illumina Data: Celera Assembler 6.1 introduces native support for DNA sequencing reads from
the Illumina (or Solexa) platform. Celera Assembler reads three varieties of fastq files,
though it requires an FRG meta-data file that can be generated with Celera Assembler's
fastqToCA utility. Celera Assembler 6.1 has been validated using 100x of Illumina GA II 100bp
reads, paired at 200bp insert size, from Escherichia coli. The quality of support for shorter
reads or larger genomes is still unknown. The minimum read and overlap lengths remain at 64bp
and 40bp, respectively, but these can be adjusted at compile time. Celera Assembler has almost
no code specific to the sequencing platform. Therefore, Celera Assembler can assemble mixtures
of read types including Sanger, 454, and Illumina.

4. Consensus: Celera Assembler 6.1 experiences many fewer consensus failures than Celera
Assembler 5. The consensus code is more reliable due to extensive refactoring, problem
discovery, and improvement.

5. Phasing: By user request, there is a new option to control the phasing of variants. With
phasing off, the majority allele at every variant column is promoted to the consensus. With
phasing on, the choice of majority allele could reflect a group of neighboring variant
columns. As a result, the minority base at some columns could get promoted to the
consensus. Phasing was deemed unacceptable for some applications. Phasing was always enabled
in Celera Assembler 5; it is disabled by default in Celera Assembler 6. See the cnsPhasing
option.

6. Closure/Finishing: Celera Assembler 6.1 offers additional support for finishing. This refers
to improving the quality of an assembly by analysis, manipulation, targeted sequencing, and
re-assembly. The pipeline now accepts closure reads and their placement
constraints. Specifically, the input FRG format has a new PLC placement message type. The
assembler considers these constraints, not as absolutes, but on par with other information. It
uses them, not only to close gaps, but throughout the pipeline. The constraints could be
derived from PCR primer sites and apply to the corresponding PCR reads, for example.

7. Pipeline Engineering: The traditional Celera Assembler used its own message passing interface
wherein information passed from module to module in large (text or binary) files. Over the
years, Celera Assembler has shifted to the use of its own database structure, called
Store. Previous versions of Celera Assembler introduced the read database (gkpStore) and the
overlap database (ovlStore). Celera Assembler 6.1 introduces a database for unitigs and
contigs (tigStore) to replace the *.cgb message files. This change reduces I/O. It also
facilitates debugging and mid-course alterations to problematic assemblies. There are
utilities for dumping, querying, and updating the database.
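
To illustrate the relaxed duplicate test described in item 2 above, the sketch below checks whether
one read matches the start of another with a bounded number of substitutions. It is a simplified
stand-in, not the assembler's actual algorithm: the function name is_near_prefix and the mismatch
limit are assumptions, and insertions and deletions are ignored.

```shell
#!/bin/sh
# is_near_prefix SHORT LONG MAX_MISMATCHES
# Exit status 0 if SHORT equals the first length(SHORT) bases of LONG with
# at most MAX_MISMATCHES substituted bases, and 1 otherwise.
# Hypothetical helper for illustration only.
is_near_prefix() {
    awk -v short="$1" -v long="$2" -v max="$3" 'BEGIN {
        if (length(short) > length(long)) exit 1
        mismatches = 0
        for (i = 1; i <= length(short); i++)
            if (substr(short, i, 1) != substr(long, i, 1)) mismatches++
        if (mismatches > max + 0) exit 1
        exit 0
    }'
}

# Example: an exact-prefix test corresponds to a limit of 0.
# is_near_prefix ACGA ACGTTT 1 && echo "treated as duplicate prefix"
```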

Improvements

1. Unitig Toggling: The unitig toggling process is supported directly from runCA. The algorithm
now recognizes and toggles an additional case: a false-repeat unitig preventing the join of
two scaffolds by being placed on the end of one scaffold and the beginning of the other.

2. Scaffolding: An experimental option was added to eject a contig in a scaffold when it has no
overlaps to its neighbors.

3. Scaffolding: An experimental option was added to revert the current gap in a scaffold to a
saved state while processing rocks/stones, to see if an overlap now exists.

4. Long 454 Reads: Very long bad reads sometimes generated on 454 platforms are correctly
trimmed.

5. Disk Usage: The intermediate output files from the overlap stage are now removed after the
corresponding overlap store is created.

6. Automatic Selection of Unitigger: runCA automatically selects the correct unitigger to use
based on the input fragments. Previously it was not possible to override this selection in
some instances. It is now possible to specify the unitigger to use.

7. Scaffolding: An ancient bug was discovered and fixed. When evaluating if a pair of scaffolds
can be merged, the numbers of satisfied and unsatisfied mate pairs are examined. If the merged
scaffolds add too many unsatisfied mate pairs, the merge is not attempted. This examination
was flawed, resulting in mate pairs never preventing a scaffold merge. The merge could still
be prevented if there is no sequence alignment between the scaffolds.

8. Several existing options were removed:
* bogEjectUnhappyContain was removed. It is always enabled now.
* cgwOutputIntermediate was removed. The need for this option was removed by allowing
terminator to read input directly from the tigStore and a CGW checkpoint file.

9. Several existing options were renamed:
* doOverlapTrimming was renamed to doOverlapBasedTrimming. The original option is still
allowed, but will be removed in a future release.
* bogPromiscuous was renamed to bogBreakAtIntersections. Note that the sense of this
option is reversed: bogPromiscuous=1 is the same as bogBreakAtIntersections=0.

10. Several new options were added:
* utgErrorLimit was added.
* stopBefore was added.
* sgeName was added.
* doDeDuplication was added.
* doMerBasedTrimming was added.
* saveOverlaps was added.
* kickOutNonOvlContigs was added.
* doUnjiggleWhenMerging was added.
* cnsPhasing was added.
* toggleMaxDistance was added.
* toggleDoNotDemote was added.
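
When carrying settings over from an older run, the renames and removals listed above can be applied
mechanically. The following is a minimal sketch, assuming options are written as name=value (the
form used in the bogPromiscuous example above). The helper translate_option is hypothetical and is
not part of runCA.

```shell
#!/bin/sh
# translate_option NAME=VALUE
# Print the Celera Assembler 6.1 equivalent of an older option setting,
# following the rename and removal notes above. Hypothetical helper.
translate_option() {
    name=${1%%=*}
    value=${1#*=}
    case $name in
        doOverlapTrimming)
            # Plain rename; the value keeps its meaning.
            echo "doOverlapBasedTrimming=$value" ;;
        bogPromiscuous)
            # Renamed with the sense reversed: 1 becomes 0 and 0 becomes 1.
            case $value in
                0) echo "bogBreakAtIntersections=1" ;;
                1) echo "bogBreakAtIntersections=0" ;;
                *) echo "unexpected value for bogPromiscuous: $value" >&2
                   return 1 ;;
            esac ;;
        bogEjectUnhappyContain|cgwOutputIntermediate)
            # Removed in 6.1: print nothing, but tell the user.
            echo "note: $name was removed in 6.1; dropping it" >&2 ;;
        *)
            # Anything else passes through unchanged.
            echo "$1" ;;
    esac
}

# Example:
# translate_option bogPromiscuous=1
```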

Bug Fixes

The full list of changes is available; however, many of these changes fix problems introduced after
Celera Assembler 5.4 was released. Major fixes for bugs present in previous releases are
listed in the "Improvements" section above.

Known Problems

These may be fixed in future releases.

1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard'
is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end
libraries.

2. There is no explicit support for the high coverage produced by XLR runs on small
genomes. Coverage such as 80X induces combinations of sequencing errors that confound Celera
Assembler. At best this leads to higher reported rates of allelic variation. At worst this
leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.

3. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired
fragments. This ratio happens when XLR is applied to many fragment libraries but few
paired-end libraries. This ratio impedes the assembly of unitigs into contigs and
scaffolds. It leads to fewer bases in scaffolds and more bases in degenerate contigs.

4. There is no support for bar-coded 454 data. Users with bar-coded data may use some other
utility to remove the bar code sequence and partition reads by bar code into separate SFF
files.

5. There is no support for data from ABI SOLiD.

6. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
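
For the high-coverage case in item 2 above, the fraction of reads to keep follows from simple
arithmetic: coverage is total read bases divided by genome size, and the sampling fraction is the
target coverage divided by the actual coverage. A minimal sketch follows; the helper name
sampling_fraction and the target of 20X in the example are assumptions for illustration, not
recommendations from the Celera Assembler authors.

```shell
#!/bin/sh
# sampling_fraction TOTAL_READ_BASES GENOME_SIZE TARGET_COVERAGE
# Print the estimated coverage and the fraction of reads to keep in order
# to reach the target coverage. The fraction is 1 when coverage is already
# at or below the target. Hypothetical helper.
sampling_fraction() {
    awk -v bases="$1" -v genome="$2" -v target="$3" 'BEGIN {
        coverage = bases / genome
        fraction = (coverage > target) ? target / coverage : 1
        printf "coverage=%.1f fraction=%.3f\n", coverage, fraction
    }'
}

# Example with made-up round numbers: 160 Mbp of reads on a 2 Mbp genome.
# sampling_fraction 160000000 2000000 20
# prints: coverage=80.0 fraction=0.250
```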

Legal

Copyright 1999-2004 by Applera Corporation. Copyright 2005-2009 by the J. Craig Venter
Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source
and available free of charge subject to the GNU General Public License, version 2.