Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module/ichorcna/1.0 #178

Merged
merged 9 commits into from
May 4, 2021
Merged

Module/ichorcna/1.0 #178

merged 9 commits into from
May 4, 2021

Conversation

Jwong684
Copy link
Member

@Jwong684 Jwong684 commented Apr 1, 2021

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for New Module

Required

  • I used the cookiecutter template and updated the placeholder rules.

  • The snakemake rules follow the design guidelines.

    • All references to the rules object (e.g. for input files) are wrapped with str().
  • Every rule in the module is either listed under localrules or has the threads and resources directives.

  • Input and output files are being symlinked into the CFG["inputs"] and CFG["outputs"] subdirectories, respectively.

  • I updated the final target rule (*_all) to include every output rule.

  • I explained important module design decisions in CHANGELOG.md.

  • I tested the module on real data for all supported seq_type values.

  • I updated the default.yaml configuration file to provide default values for each rule in the module snakefile.

  • I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.

  • I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).

  • I replaced every value that should (or might need to) be updated in the default configuration file with __UPDATE__.

  • I recursively searched for all comments containing TODO to ensure none were left. For example:

    grep -r TODO modules/<module_name>/1.0

If applicable

  • I added more granular output subdirectories.

  • I added rules to the reference_files workflow to generate any new reference files.

  • I added subdirectories with large intermediate files to the list of scratch_subdirectories in the default.yaml configuration file.

  • I updated the list of available wildcards for the input files in the default.yaml configuration file.

Checklist for Updated Module

To be completed.

Copy link
Collaborator

@rdmorin rdmorin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big thing I'd like to raise is why the source code for ichorcna is seemingly bundled with this module. This should auto-install rather than containing all the source code.

demo/Snakefile Outdated
@@ -83,4 +84,5 @@ rule all:
rules._strelka_all.input,
rules._bwa_mem_all.input,
rules._liftover_all.input,
rules._controlfreec_all.input
rules._controlfreec_all.input,
rules._ichorcna_all.input
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a newline after this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

demo/config.yaml Outdated
ichorcna:
inputs:
sample_bam: "data/{sample_id}.bam"
sample_bai: "data/{sample_id}.bam.bai"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a newline at the end if this is the last line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

readcounter:
readCounterScript: "{MODSDIR}/src/readCounter"
chrs:
hg19: "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we wouldn't rely on the user to specify the chromosomes this way but I can live with this for now.

hs37d5: "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y"
hg38: "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"
grch38: "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY"
qual: 20
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain with a comment what this does.
e.g.
qual: 20 #set the minimum mapping quality (or whatever this actually means)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"500000": "{MODSDIR}/src/inst/extdata/HD_ULP_PoN_{genome_build}_500kb_median_normAutosome_median.rds"
# must use gc wig file corresponding to same binSize (required)
ichorCNA_gcWig:
"1000000": "{MODSDIR}/src/inst/extdata/gc_{genome_build}_1000kb.wig"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this genome_build naming match the one we use ? I assume it does since this was run in GAMBL, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately ichorCNA's github repo is messy and inconsistent with their naming conventions. In my original version, I had to manually rename some of their reference files to fit this format. In the current version, there's one rule with a bunch of symlinks that renames the reference files so it would fit in downstream rules.


### Directories ###

resultsDir = "results/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a standard for LCR-modules. What is the purpose/benefit of this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left them in by accident when I was initially crafting the module. They're not used at all in the module. They've been removed now

CFG["runs"]["binSize"] = str(CFG["options"]["readcounter"]["binSize"])

# Symlinks the input files into the module results directory (under '00-inputs/')
rule _ichorcna_input_bam:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work on crams?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, unfortunately readCounter does not work on CRAMs. readCounter (part of bamtools) never updated for CRAMs, though it seemed like a popular idea (pezmaster31/bamtools#149)

rule _run_ichorcna:
input:
tum = CFG["dirs"]["readDepth"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}.bin{binSize}.wig",
# norm = CFG["dirs"]["readDepth"] + "{seq_type}--{genome_build}/{binSize}/{normal_id}.bin{binSize}.wig"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this and any other extraneous lines

seg = CFG["dirs"]["seg"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}--{normal_id}--{pair_status}/{tumour_id}.seg",
plot = CFG["dirs"]["seg"] + "{seq_type}--{genome_build}/{binSize}/{tumour_id}--{normal_id}--{pair_status}/{tumour_id}/{tumour_id}_genomeWide.pdf",
#rdata = "results/ichorCNA/{sample_id}/{sample_id}.RData"
params:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you name your params identically to the variable name you want to use all of this is redundant because you can use unpacking to set them all (I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow here


# Perform some clean-up tasks, including storing the module-specific
# configuration on disk and deleting the `CFG` variable
op.cleanup_module(CFG)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add newline

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

bai = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam.bai", # specific to readCounter
crai = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam.crai"
run:
op.relative_symlink(input.bam, output.bam)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we started to use the op.absolute_symlink here instead of relative symlink. This will also need the check for oncopipe version, and you can find example in one of the recent modules (battenberg/1.1, pathseq/1.0 etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed bam inputs to op.absolute_symlink

seg = CFG["dirs"]["outputs"] + "{seq_type}--{genome_build}/seg/{binSize}/{tumour_id}--{normal_id}--{pair_status}.seg",
plot = CFG["dirs"]["outputs"] + "{seq_type}--{genome_build}/plot/{binSize}/{tumour_id}--{normal_id}--{pair_status}_genomeWide.pdf"
run:
op.relative_symlink(input.corrDepth, output.corrDepth)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latest version of oncopipe also supports argument in_module=TRUE, which creates "shallow" symlinks. You can add it here as well, and the recent modules have an example of this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for all output relative_symlinks

##### RULES #####
# ---------------------------------------------------------------------------- #

CFG["runs"]["binSize"] = str(CFG["options"]["readcounter"]["binSize"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do? I understand that it basically tells rule all the binsize for each sample. If it is the same for all samples, you can in the rule all just use str(CFG["options"]["readcounter"]["binSize"]) and then you can multiply itty the number of samples, for example *len(CFG["runs"]["tumour_sample_id"] so it expands the length for each sample. There is example in the rule all of module liftover

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your interpretation is correct - I modified it to your method from liftOver.

@Jwong684
Copy link
Member Author

Modified version has been tested - it runs to completion for a grch37, hg19, hg38 sample.

Since I had reference files I needed to link and modify from the github repo of ichorCNA, I decided to get snakemake to clone their repo into the CFG["dirs"]["inputs"] directory so I could access these files in downstream rules.

The naming convention for ichorCNA's reference files is messy (ex. HD_ULP_PoN_1Mb_median_normAutosome_mapScoreFilteredmedian.rds for hg19, and HD_ULP_PoNhg38_1Mb_median_normAutosome_median.rds for hg38). I originally just renamed these files when I had them as part of the module, and symlinked reference files for related genome_builds (ex. hs37d5, hg19, grch37 would all have symlinks to the same reference file). I incorporated this symlinking step inside the module.

Also added Chris's suggestion of tweaking the default parameters.

@@ -0,0 +1,423 @@
# file: ichorCNA.R
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is now downloaded directly, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this one is the one that I had to modify to allow hs37d5 to be included. They hard-coded the available genomes

hg38: "{MODSDIR}/src/inst/extdata/GRCh38.GCA_000001405.2_centromere_acen.txt"
ichorCNA_minMapScore: 0.75
ichorCNA_chrs:
grch37: "c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to snakelike, instead of being in config? There, you can use file listing chromosomes generated by reference files (main_chromosomes.txt) or use function to generate chromosome names (there is example in sage module)

binSize: 1000000 # set window size to compute coverage
# available binSizes are: 1000000, 500000, 50000, 10000
run:
ichorCNA_libdir: ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these files in inst are being auto downloaded? They are probably then not needed in the config since you set the naming convention and the path to these. files in the snakelike rule

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the chromosomes, I could use the function for the readCounter step since I need the command to look like:
"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y"
but for the ichorCNA step, it needs to be in R-vector format:
"c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')"
Wouldn't it be simpler to keep it like this in the config? Otherwise, I'd need to make a function to include c( ), and ' '

'ichorCNA_libdir:' parameter is supposed to be set to where the github repo resides for ichorCNA (it'll use this to search for inst/extdata/ underneath that directory). I just left it in the config since it might be useful if someone needs to modify the path one day

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is that vector being given to ichorCNA?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the command it would be --chrs "c('1', '2', '3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','X')"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could construct that in Python probably and then just add that as a "param" in your rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Jwong684 Jwong684 requested review from rdmorin and Kdreval May 4, 2021 02:04
@rdmorin rdmorin merged commit b701fff into master May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants