sigven/cacao

cacao - callable cancer loci

Overview

cacao is a computational workflow that provides software and data to assess sequencing depth at clinically actionable/pathogenic loci in cancer for a given sequence alignment (BAM/CRAM). Most importantly, the software pinpoints genomic loci of clinical relevance in cancer that have sufficient sequencing coverage for reliable variant calling. In combination with the variants that have actually been identified, it may thus serve to confirm negative findings, a matter of significant clinical value that is underappreciated in current cancer sequencing analysis. The specific requirements to denote loci as callable (i.e. depth and alignment quality) can be configured by the user, and should reflect how the input is used for variant calling (RNA/DNA, germline/somatic calling).

Technically, cacao combines the speed of mosdepth with the powerful R Markdown framework for interactive data reporting. It currently uses Docker for software encapsulation to ease the installation process (a Conda package is in the making).

News

  • December 9th 2020: 0.3.1 release
    • Updated track directory for ClinVar and CIViC
    • Dockerfile uses renv for improved installation of R package dependencies

Annotation resources (v0.3.1)

Three clinical genomic tracks in BED format have been created:

  • Loci with pathogenic and likely pathogenic variants in protein-coding genes related to cancer predisposition and inherited cancer syndromes (BRCA1, BRCA2, ATM etc.)
  • Loci associated with actionable somatic variants (related to prognosis, diagnosis, or drug sensitivity, e.g. BRAF V600E)
    • Variants have been retrieved from CIViC (data harvested December 9th 2020)
    • Only variants that can be mapped unambiguously to the genome are considered as sources of actionable loci
  • Loci identified as somatic mutational hotspots (i.e. likely driver alterations) in cancer

IMPORTANT: For each variant identified from the three sources above, we use a surrounding sequence window of approximately 10bp; the mean depth calculated across this window represents the coverage of the locus.
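The windowing idea can be illustrated with a minimal Python sketch. This is not cacao's implementation (the workflow relies on mosdepth for depth calculation); the function name `window_mean_depth`, the symmetric `flank` of 5 bases, and the in-memory list of per-base depths are all assumptions made for illustration.

```python
from typing import Sequence


def window_mean_depth(depths: Sequence[int], pos: int, flank: int = 5) -> float:
    """Mean depth in a window centred on a variant position.

    depths: per-base depth along one contig (0-based index), hypothetical input.
    pos:    0-based variant position on that contig.
    flank:  bases taken on each side; 5 gives an 11bp window, an assumed
            stand-in for the "approximately 10bp" window described above.
    """
    if not 0 <= pos < len(depths):
        raise ValueError("variant position lies outside the contig")
    # Clip the window at the contig boundaries.
    start = max(0, pos - flank)
    end = min(len(depths), pos + flank + 1)
    window = depths[start:end]
    return sum(window) / len(window)


if __name__ == "__main__":
    example_depths = [30] * 10 + [5] * 10  # coverage drops halfway along
    print(window_mean_depth(example_depths, pos=9))
```

A window that overlaps a sharp drop in coverage yields an intermediate mean, which is why a locus is assessed on its window and not on the single variant position.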

All three tracks (hereditary, somatic_actionable, and somatic_hotspot) are available for GRCh37 and GRCh38, and there are also tab-separated files that link each locus to its associated

  • variants and phenotypes (ClinVar),
  • clinical evidence items (therapeutic context, evidence level, from CIViC)
  • tumor types (cancerhotspots.org)

Example reports

  • An example report from the CACAO workflow showing callable cancer loci in an RNA sequence alignment.

Getting started

Installation

  • Prerequisites:
    • Make sure that Docker is installed and running
    • The CACAO workflow script cacao_wflow.py requires that Python3 is installed
  • Download the latest release
  • Pull the latest Docker image: docker pull sigven/cacao:0.3.1

Usage

Run the CACAO workflow with the cacao_wflow.py Python script, which takes the following required and optional arguments:

usage:
cacao_wflow.py -h [options]
--query_aln BAM/CRAM
--track_dir TRACK_DIR
--output_dir OUTPUT_DIR
--genome_assembly grch37|grch38
--sample_id SAMPLE_ID
--mode hereditary|somatic|any

cacao - assessment of sequencing coverage at pathogenic and actionable loci in
cancer

Required arguments:
  --query_aln QUERY_ALN
                        Query alignment file (BAM/CRAM)
  --track_dir TRACK_DIR
                        Directory with BED tracks of pathogenic/actionable cancer loci for grch37/grch38
  --output_dir OUTPUT_DIR
                        Output directory
  --genome_assembly {grch37,grch38}
                        Human genome assembly build: grch37 or grch38
  --mode {hereditary,somatic,any}
                        Choice of loci and clinical cancer context (cancer predisposition/tumor sequencing)
  --sample_id SAMPLE_ID
                        Sample identifier - prefix for output files

Optional arguments:
  -h, --help            show this help message and exit
  --mapq MAPQ           mapping quality threshold (default: 0)
  --threads THREADS     Number of mosdepth BAM decompression threads. (use 4
                        or fewer) (default: 0)
  --callability_levels_germline CALLABILITY_LEVELS_GERMLINE
                        Simple colon-separated string that defines four levels
                        of variant callability: NO_COVERAGE (0), LOW_COVERAGE
                        (1-9), CALLABLE (10-99), HIGH_COVERAGE (>= 100).
                        Initial value must be 0. (default: 0:10:100)
  --callability_levels_somatic CALLABILITY_LEVELS_SOMATIC
                        Simple colon-separated string that defines four levels
                        of variant callability: NO_COVERAGE (0), LOW_COVERAGE
                        (1-29), CALLABLE (30-199), HIGH_COVERAGE (>= 200).
                        Initial value must be 0. (default: 0:30:200)
  --query_target QUERY_TARGET
                        BED file with genome target regions subject to
                        sequencing/analysis (default: None)
  --force_overwrite     By default, the script will fail with an error if any
                        output file already exists. You can force the
                        overwrite of existing result files by using this flag
                        (default: False)
  --version             show program's version number and exit
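As a rough illustration of the options above, the following Python sketch shows how a colon-separated callability string such as 0:30:200 could be interpreted, and how a command line for cacao_wflow.py might be assembled. It is a sketch under assumptions, not the workflow's own code: the helper names, the handling of fractional mean depths, the `python3 cacao_wflow.py` invocation, and all file paths are hypothetical. Only the flag names and permitted values are taken from the help text.

```python
from typing import List, Sequence

LEVELS = ("NO_COVERAGE", "LOW_COVERAGE", "CALLABLE", "HIGH_COVERAGE")


def parse_callability_levels(spec: str) -> List[int]:
    """Parse a string like '0:30:200' into three increasing thresholds."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated integers")
    thresholds = [int(p) for p in parts]
    if thresholds[0] != 0:
        raise ValueError("initial value must be 0")
    if not thresholds[0] < thresholds[1] < thresholds[2]:
        raise ValueError("thresholds must be strictly increasing")
    return thresholds


def classify_depth(mean_depth: float, thresholds: Sequence[int]) -> str:
    """Map a mean depth to one of the four levels.

    Assumption: depths between 0 and the second threshold (exclusive)
    count as LOW_COVERAGE; cacao may treat fractional depths differently.
    """
    _, callable_min, high_min = thresholds
    if mean_depth <= 0:
        return LEVELS[0]
    if mean_depth < callable_min:
        return LEVELS[1]
    if mean_depth < high_min:
        return LEVELS[2]
    return LEVELS[3]


def build_cacao_command(query_aln: str, track_dir: str, output_dir: str,
                        genome_assembly: str, mode: str, sample_id: str,
                        levels_somatic: str = "0:30:200") -> List[str]:
    """Assemble an argument list using the flags listed in the help text."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError("genome_assembly must be grch37 or grch38")
    if mode not in ("hereditary", "somatic", "any"):
        raise ValueError("mode must be hereditary, somatic or any")
    parse_callability_levels(levels_somatic)  # validate before use
    return [
        "python3", "cacao_wflow.py",  # invocation style is an assumption
        "--query_aln", query_aln,
        "--track_dir", track_dir,
        "--output_dir", output_dir,
        "--genome_assembly", genome_assembly,
        "--mode", mode,
        "--sample_id", sample_id,
        "--callability_levels_somatic", levels_somatic,
    ]


if __name__ == "__main__":
    somatic = parse_callability_levels("0:30:200")
    for depth in (0, 12.5, 30, 250):
        print(depth, classify_depth(depth, somatic))
    # Hypothetical paths, shown only to illustrate the argument order.
    print(" ".join(build_cacao_command(
        "tumor.bam", "tracks", "results", "grch38", "somatic", "SAMPLE_01")))
```

The argument list can be passed to `subprocess.run` once the paths point to real files; printing it first makes it easy to check the command before starting a run.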

Documentation

Coming

Contact

sigven AT ifi.uio.no