Skip to content

alekseyzimin/wgs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

These are Release Notes for Celera Assembler version 6.1, released May 1st, 2010.

This distribution package provides a stable, tested, documented version of the software. The
distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary
distributions ready for installation.

The source code package includes full source code, Makefiles, and scripts. A subset of the kmer
package (http://kmer.sourceforge.net/, version r1821), used by some modules of Celera Assembler, is
included.

This package was prepared by scientists at the J. Craig Venter Institute (http://www.jcvi.org/) with
funding provided by the National Institutes of Health (http://www.nih.gov/).

Full documentation can be found online at http://wgs-assembly.sourceforge.net/.

Citation

Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard
citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science
287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et
al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007],
haplotype separation [Levy et al. 2007; Denisov et al. 2008], and Sanger+pyrosequencing hybrid
assembly [Goldberg et al. 2006; Miller et al. 2008]. There are links to these papers, and more, in
the on-line documentation (http://wgs-assembler.sourceforge.net/).

Compilation and Installation

The source code package needs to be compiled and installed before it can be used. The following
commands will work on nearly all Unix-like platforms.

% bzip2 -dc wgs-6.1.tar.bz2 | tar -xf -
% cd wgs-6.1
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..

Binary distributions need only be unpacked.

% bzip2 -dc wgs-6.1-Linux-amd64.tar.bz2 | tar -xf -


Example

Here is one example of how to use Celera Assembler. . We download DNA sequence from the NCBI TraceDB
hosted by the National Institutes of Health. We request paired-end fragment reads generated by
Sanger chemistry on a bacterial genome. We convert the data to Celera Assembler format (a *.frg
file). Then we launch the assembler software.

Please see the expanded version of this example for more details.

% ftp ftp ftp.ncbi.nih.gov
ftp> cd pub/TraceDB/porphyromonas_gingivalis_w83
[...]
ftp> ls -l
229 Entering Extended Passive Mode (|||50055|)
150 Opening ASCII mode data connection for file list
-r--r--r--   1 ftp      anonymous   803288 Feb 19 17:29 anc.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous   226576 Feb 19 17:29 clip.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous 10261474 Feb 19 17:29 fasta.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous 18039942 Feb 19 17:29 qual.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous  1143755 Feb 27 08:01 xml.porphyromonas_gingivalis_w83.001.gz
226 Transfer complete.
ftp> bin
ftp> prompt
ftp> mget fasta* qual* xml*
ftp> bye

Convert the NCBI files to files suitable as Celera Assembler input. The last trace-to-frg step takes
about a minute. The 'ls' command demonstrates the expected output files.

% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
WARNING: creating bogus distance for library 'T0611'
WARNING: creating bogus distance for library 'T0612'
WARNING: creating bogus distance for library 'T13146'
WARNING: creating bogus distance for library 'T24315'
porphyromonas_gingivalis_w83.001: frags=40039 links=17113 (85.48%)
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
Found 20020 in tafrg-porphyromonas_gingivalis_w83/porphyromonas_gingivalis_w83.001.frglib.
% ls -l *.frg*
-rw-r--r--  1 bri  bri       550 Jul 15 21:54 porphyromonas_gingivalis_w83.1.lib.frg
-rw-r--r--  1 bri  bri  19119148 Jul 15 21:55 porphyromonas_gingivalis_w83.2.001.frg.bz2
-rw-r--r--  1 bri  bri    704432 Jul 15 21:54 porphyromonas_gingivalis_w83.3.lkg.frg

Run Celera Assembler with runCA.

% time perl wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.*
[...lots of status output...]
[...about four minutes later...]
% ls -l testassembly/9-terminator/*fasta
-rw-r--r--  1 bri  bri  2373611 Jul 15 22:07 testassembly/9-terminator/pging.ctg.fasta
-rw-r--r--  1 bri  bri    39177 Jul 15 22:07 testassembly/9-terminator/pging.deg.fasta
-rw-r--r--  1 bri  bri  2392028 Jul 15 22:07 testassembly/9-terminator/pging.scf.fasta
-rw-r--r--  1 bri  bri   445876 Jul 15 22:07 testassembly/9-terminator/pging.singleton.fasta
-rw-r--r--  1 bri  bri  2692602 Jul 15 22:07 testassembly/9-terminator/pging.utg.fasta

Changes

Celera Assembler 6 made major changes to the format of the data stores used during an
assembly. Because of this, it is not possible to use any previous version of Celera Assembler with
data generated by Celera Assembler 6.1. Likewise, Celera Assembler 6.1 is not able to use data
generated by any previous version. Specifically, any assemblies started with any previous version
cannot be finised with Celera Assembler 6.1; they must be recomputed from scratch.

Features Available Starting in 6.1

   1. More Reads: Designed for Sanger sequencing technology, Celera Assembler 4 could comfortably
      handle tens of millions of reads. Celera Assembler 5 eventually handled hundreds of millions
      of reads. Celera Assembler 6.1 should be able to handle up to a billion reads, though we
      haven't tested this.

   2. Duplicate 454 Reads: In previous versions, 454 reads that were a prefix of some other 454 read
      were removed, but only if the prefix of the reads were identical. Starting in Celera Assembler
      6.1, errors are allowed in the prefix match. Additionally, duplicate mate-pairs derived from
      the same molecule are detected and removed.

   3. Illumina Data: Celera Assembler 6.1 introduces native support for DNA sequencing reads from
      the Illumina (or Solexa) platform. Celera Assembler reads three varieties of fastq files,
      though it requires an FRG meta-data file that can be generated with Celera Assembler 's
      fastqToCA utility. Celera Assembler 6.1 has been validated using 100x of Illumina GA II 100bp
      reads, paired at 200bp insert size, from escherichia coli. The quality of support for shorter
      reads or larger genomes is still unknown. The minimum read and overlap lengths remain at 64bp
      and 40bp, respectively, but these can be adjusted at compile time. Celera Assembler has almost
      no code specific to the sequencing platform. Therefore, Celera Assembler can assemble mixtures
      of read types including Sanger, 454, and Illumina.

   4. Consensus: Celera Assembler 6.1 experiences many fewer consensus failures than Celera
      Assembler 5. The consensus code is more reliable due to extensive refactoring, problem
      discovery, and improvement.

   5. Phasing: By user request, there is a new option to control the phasing of variantes. With
      phasing off, the majority allele at every variant column is promoted to the consensus. With
      phasing on, the choice of majority allele could reflect a group of neighbor variant
      columns. As a result, the minority base at some columns could get promoted to the
      consensus. Phasing was deemed unacceptable for some applications. Phasing was always enabled
      in Celera Assembler 5; it is disabled by default in Celera Assembler 6. See the cnsPhasing
      option.

   6. Closure/Finishing: Celera Assembler 6.1 offers additional support for finishing. This refers
      to improving the quality of an assembly by analysis, manipulation, targeted sequencing, and
      re-assembly. The pipeline now accepts closure reads and their placement
      constraints. Specifically, the input FRG format has a new PLC placement message type. The
      assembler considers these constraints, not as absolutes, but on par with other information. It
      uses them, not only to close gaps, but throughout the pipeline. The constraints could be
      derived from PCR primer sites and apply to the corresponding PCR reads, for example.

   7. Pipeline Engineering: The traditional Celera Assembler used its own message passing interface
      wherein information passed from module to module in large (text or binary) files. Over the
      years, Celera Assembler has shifted to the use of its own database structure, called
      Store. Previous versions of Celera Assembler introduced the read database (gkpStore) and the
      overlap database (ovlStore). Celera Assembler 6.1 introduces a database for unitigs and
      contigs (tigStore) to replace the *.cgb message files. This change reduces I/O. It also
      facilitates debugging and mid-course alterations to problematic assemblies. There are
      utilities for dumping, querying, and updating the database.

Improvements

   1. Unitig Toggling: The unitig toggling process is supported directly from runCA. The algorithm
      now recognizes and toggles an additional case: a false-repeat unitig preventing the join of
      two scaffolds by being placed on the end of one scaffold and the beginning of the other.

   2. Scaffolding: An experimental option was added to eject a contig in a scaffold when it has no
      overlaps to its neighbors ([[runCA#).

   3. Scaffolding: Experimental option to revert current gap in a scaffold to a saved state while
      processing rocks/stones to see if an overlap now exists.

   4. Long 454 Reads: Very long bad reads sometimes generated on 454 platforms are correctly
      trimmed.

   5. Disk Usage: The intermediate output files from the overlap stage are now removed after the
      corresponding overlap store is created.

   6. Automatic Selection of Unitigger: runCA automatically selects the correct unitigger to use
      based on the input fragments. Previously it was not possible to override this selection in
      some instances. It is now possible to specify the unitigger to use.

   7. Scaffolding: An ancient bug was discovered and fixed. When evaluating if a pair of scaffolds
      can be merged, the number of satisfied and unsatisfied mate pairs are examined. If the merged
      scaffolds add too many unsatisfied mate pairs, the merge is not attempted. This examination
      was flawed, resulting in mate pairs never preventing a scaffold merge. The merge could still
      be prevented if there is no sequence alignment between the scaffolds.

   8. Several existing options were removed:
          * bogEjectUnhappyContain was removed. It is always enabled now.
          * cgwOutputIntermediate was removed. The need for this option was removed by allowing
            terminator to read input directly from the tigStore and a CGW checkpoint file.

   9. Several existing options were renamed:
          * doOverlapTrimming was renamed to doOverlapBasedTrimming. The original option is still
            allowed, but will be removed in a future release.
          * bogPromiscuous was renamed to bogBreakAtIntersections. Note that the sense of this
            option is reversed: bogPromiscuous=1 is the same as bogBreakAtIntersections=0.

  10. Several new options were added:
          * utgErrorLimit was added.
          * stopBefore was added.
          * sgeName was added.
          * doDeDuplication was added.
          * doMerBasedTrimming was added.
          * saveOverlaps was added.
          * kickOutNonOvlContigs was added.
          * doUnjiggleWhenMerging was added.
          * cnsPhasing was added.
          * toggleMaxDistance was added.
          * toggleDoNotDemote was added. 

Bug Fixes

The full list of changes is available, however many of these changes are due to problems added since
Celera Assembler 5.4 was released. Major bug fixes relating to bugs present in previous releases are
listed in the "Improvements" section above.

Known Problems

These may be fixed in future releases.

   1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard'
      is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end
      libraries.

   2. There is no explicit support for the high-coverage induced by XLR runs on small
      genomes. Coverage such as 80X induces combinations of sequencing errors that confound Celera
      Assembler. At best this leads to higher reported rates of allelic variation. At worst this
      leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.

   3. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired
      fragments. This ratio happens when XLR is applied to many fragment libraries but few
      paired-end libraries. This ratio impedes the assembly of unitigs into contigs and
      scaffolds. It leads to fewer bases in scaffolds and more bases in degenerate contigs.

   4. There is no support for bar-coded 454 data. Users with bar-coded data may use some other
      utility to remove the bar code sequence and partition reads by bar code into separate SFF
      files.

   5. There is no support for data from ABI SOLiD.

   6. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort. 

Legal

Copyright 1999-2004 by Applera Corporation. Copyright 2005-2009 by the J. Craig Venter
Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source
and available free of charge subject to the GNU General Public License, version 2.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published