arrayexpress-cram-submission

A Python 3 pipeline for submitting CRAM files generated by ArrayExpress to the European Nucleotide Archive (ENA).

Installation

Get a copy of the project and install the Python dependencies with pip install -r requirements.txt.

Or build the Docker image and skip the next section.

Setup

luigid

Running luigid starts the luigi central scheduler. Visit its web interface on port 8082, e.g. http://localhost:8082, then run a pipeline and follow the progress in your browser. Replace localhost with the name of the local (or farm) server you are working on.

Submitting all CRAM files for a species

Bash

export ena_user=webin-xxx
export ena_password=xxxxxxxx
luigi --module pipeline SubmitSpecies --species oryza_sativa

Docker

docker run \
	-e "ena_user=webin-xxx" \
	-e "ena_password=xxxxxxxx" \
	<image> SubmitSpecies --species oryza_sativa

This will

make a request to getRunsByOrganism on the ArrayExpress API to fetch a list of all Oryza sativa CRAM files which have been marked as 'Complete'
upload each CRAM file to the European Nucleotide Archive (ENA) FTP server
collect metadata required for the submission
create 'submission' and 'analysis' XML documents required for programmatic submission to the ENA and submit them
store the resulting submission and analysis accessions in an SQLite database
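
The last step above can be pictured as a small SQLite write. The following is only a minimal sketch using Python's built-in sqlite3 module: the table name submissions, its columns, and the accession values are hypothetical and are not taken from the pipeline's actual schema.

```python
import sqlite3


def open_db(path):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per submitted CRAM file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        " cram_file TEXT PRIMARY KEY,"
        " submission_accession TEXT NOT NULL,"
        " analysis_accession TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def store_accessions(conn, cram_file, submission_acc, analysis_acc):
    """Record the accessions returned for one CRAM file."""
    # INSERT OR REPLACE keeps the write idempotent if a task is retried.
    conn.execute(
        "INSERT OR REPLACE INTO submissions VALUES (?, ?, ?)",
        (cram_file, submission_acc, analysis_acc),
    )
    conn.commit()


def is_submitted(conn, cram_file):
    """Return True if accessions are already stored for this file."""
    row = conn.execute(
        "SELECT 1 FROM submissions WHERE cram_file = ?", (cram_file,)
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    db = open_db(":memory:")
    # Placeholder accessions, for illustration only.
    store_accessions(db, "example.cram", "ERA0000001", "ERZ0000001")
    print(is_submitted(db, "example.cram"))
```

Keying the table on the file name is one way to make a repeated write harmless, which matters when tasks can be retried.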

Testing

Add --test --limit 3 to the luigi command to submit to the ENA test server (results are not publicly visible) and to submit only 3 CRAM files instead of all of them.
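
As a rough illustration of what these two flags imply, here is a minimal sketch. It assumes that a limit of 0 means "no limit" (suggested by SubmitAllSpecies(limit=0) in the execution summary, but not confirmed), and the endpoint URLs are placeholders rather than the real ENA addresses. The function names are hypothetical.

```python
from itertools import islice

# Placeholder endpoints for illustration only; the real URLs are not
# reproduced here.
PRODUCTION_ENDPOINT = "https://submit.example.org/"
TEST_ENDPOINT = "https://submit-test.example.org/"


def choose_endpoint(test=False):
    """Pick the test server when the test flag is set."""
    return TEST_ENDPOINT if test else PRODUCTION_ENDPOINT


def apply_limit(cram_files, limit=0):
    """Return at most `limit` files; 0 is assumed to mean 'all files'."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    if limit == 0:
        return list(cram_files)
    return list(islice(cram_files, limit))


if __name__ == "__main__":
    files = ["a.cram", "b.cram", "c.cram", "d.cram"]
    print(choose_endpoint(test=True), apply_limit(files, limit=3))
```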

Submitting CRAM files for all plant species

Bash

export ena_user=webin-xxx
export ena_password=xxxxxxxx
luigi --module pipeline SubmitAllSpecies

Docker

docker run \
	-e "ena_user=webin-xxx" \
	-e "ena_password=xxxxxxxx" \
	<image> SubmitAllSpecies

This will make a request to getOrganisms on the ArrayExpress API to fetch a list of all plant species, and run SubmitSpecies (described above) for each.

Scaling

Multiple workers can be run in parallel on the same host by adding the --workers parameter, e.g. luigi --module pipeline SubmitAllSpecies --workers 8. Because the pipeline is limited by the throughput of the ArrayExpress and ENA FTP servers, adding workers beyond the point where those servers are saturated will not improve performance.
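
A back-of-the-envelope model makes the saturation point concrete. This is a simplification under stated assumptions: each worker is taken to process files at a fixed rate, and the servers are taken to impose a single hard cap. The rates used below are made-up numbers, not measurements.

```python
import math


def effective_throughput(workers, per_worker_rate, server_cap):
    """Files per minute actually achieved, capped by the servers."""
    return min(workers * per_worker_rate, server_cap)


def saturating_workers(per_worker_rate, server_cap):
    """Smallest worker count that already reaches the server cap."""
    return math.ceil(server_cap / per_worker_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 5 files/min per worker, servers cap at 20 files/min.
    for n in (2, 4, 8):
        print(n, effective_throughput(n, 5, 20))
```

Under these assumed numbers, going from 4 to 8 workers changes nothing, because the cap is already reached at 4.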

Background

Every step, from discovering CRAM files through collecting metadata to submitting to ENA, is implemented as a luigi Task.

This makes it easy to deal with the failures that inevitably happen when, for example, some of 30k+ long-running, interdependent tasks fail. Instead of cleaning up after failed tasks, resetting state, or starting again from scratch, we can rely on luigi to check the completion of (atomic) tasks and resume safely.
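
The idea behind resuming safely can be sketched without luigi itself. The toy classes below are not the pipeline's code: FileTask and FlakyTask are hypothetical names, and the sketch only mimics the pattern in which a task counts as complete when its output exists and the output appears atomically (written to a temporary file, then renamed).

```python
import os
import tempfile


class FileTask:
    """Toy stand-in for a task that is complete once its output file exists."""

    def __init__(self, output_path):
        self.output_path = output_path

    def complete(self):
        return os.path.exists(self.output_path)

    def work(self):
        """Produce the text to store; subclasses override this."""
        raise NotImplementedError

    def run(self):
        """Run unless already complete; return True if work was done."""
        if self.complete():
            return False
        data = self.work()
        directory = os.path.dirname(os.path.abspath(self.output_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(data)
            # The rename is atomic, so a crash never leaves a partial output.
            os.replace(tmp_path, self.output_path)
        finally:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
        return True


class FlakyTask(FileTask):
    """Fails on the first attempt and succeeds on the second."""

    def __init__(self, output_path):
        super().__init__(output_path)
        self.attempts = 0

    def work(self):
        self.attempts += 1
        if self.attempts == 1:
            raise RuntimeError("simulated upload failure")
        return "accession stored\n"
```

Because a failed attempt leaves no output behind, simply running the task again is enough; a completed task is skipped on every later run.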

Programmatic submission to the European Nucleotide Archive requires the creation of 'submission' and 'analysis' XML documents, following the provided schemas. These documents are created with the help of generateDS, which generates an API to match the schemas. The generated code is in ena/schema.

Should the schemas change, a new version of the API can be generated with

generateDS.py -o "SRA_analysis.py" -s "SRA_analysis_sub.py" SRA.analysis.xsd
generateDS.py -o "SRA_submission.py" -s "SRA_submission_sub.py" SRA.submission.xsd
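
For orientation, a 'submission' document of the kind mentioned above can be very small. The sketch below builds one with Python's standard xml.etree.ElementTree rather than with the generateDS-generated classes the pipeline uses. The element names (SUBMISSION, ACTIONS, ACTION, ADD) are an assumption based on the ENA submission schema and should be checked against the current XSD; build_submission_xml is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(action="ADD"):
    """Build a minimal submission document with a single action.

    Element names are assumed from the ENA submission schema and are
    not taken from the pipeline's generated code.
    """
    root = ET.Element("SUBMISSION")
    actions = ET.SubElement(root, "ACTIONS")
    action_element = ET.SubElement(actions, "ACTION")
    ET.SubElement(action_element, action)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_submission_xml())
```

Generated classes have the advantage of following the schema's types and structure, which a hand-built tree like this one does not check.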

Execution Summary

Luigi will print a summary of all work that has been done. Here's a simplified example:

===== Luigi Execution Summary =====
Scheduled 40132 tasks of which:
* 7923 present dependencies were encountered:
    - 7918 StoreEnaSubmissionResult(...)
* 32186 ran successfully:
    - 10705 StoreEnaSubmissionResult(...)
    - 36 SubmitSpecies(species=aegilops_tauschii) ...
    - 10705 SubmitToEna(...)
    - 10702 UploadCramToENA(...)
* 11 failed:
    - 8 SubmitToEna(...)
    - 3 UploadCramToENA(...)
* 3 were left pending, among these:
    * 3 were missing external dependencies:
        - 1 SubmitAllSpecies(limit=0)
        - 2 SubmitSpecies(species=oryza_sativa) and SubmitSpecies(species=zea_mays)

present dependency means the task has completed successfully on a previous run of luigi. StoreEnaSubmissionResult is the task that stores accessions returned from ENA in SQLite. So this section tells us that 7918 files have been submitted before.

ran successfully means the task completed successfully just now. So we can see that 10705 files have been submitted, and 36 species have been processed fully.

failed means just that: errors are in the log or the scheduler web interface. 8 files couldn't be submitted to the ENA REST endpoint, and 3 files couldn't be uploaded to the ENA FTP server.

These failures block the tasks that depend on them, which are listed next as left pending. We can see that oryza_sativa and zea_mays were affected by the submission and upload failures.

The good news is that the 32186 tasks that ran successfully won't be run again. A future run only has to retry the 11 failed tasks and the 3 pending tasks that depend on them.
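
When many runs need to be compared, the headline counts can be pulled out of a saved summary. The following is a minimal sketch that assumes the summary uses exactly the wording shown in the example above; parse_summary and the key names are hypothetical, and other luigi versions may phrase these lines differently.

```python
import re

SAMPLE = """\
===== Luigi Execution Summary =====
Scheduled 40132 tasks of which:
* 7923 present dependencies were encountered:
    - 7918 StoreEnaSubmissionResult(...)
* 32186 ran successfully:
    - 10705 SubmitToEna(...)
* 11 failed:
    - 8 SubmitToEna(...)
    - 3 UploadCramToENA(...)
* 3 were left pending, among these:
    * 3 were missing external dependencies:
        - 1 SubmitAllSpecies(limit=0)
"""

# Patterns match only the top-level lines, not the indented details.
PATTERNS = {
    "scheduled": r"^Scheduled (\d+) tasks",
    "already_done": r"^\* (\d+) present dependencies were encountered",
    "ran_successfully": r"^\* (\d+) ran successfully",
    "failed": r"^\* (\d+) failed",
    "pending": r"^\* (\d+) were left pending",
}


def parse_summary(text):
    """Return the headline counts found in an execution summary."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            counts[key] = int(match.group(1))
    return counts


if __name__ == "__main__":
    print(parse_summary(SAMPLE))
```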

EnsemblGenomes/arrayexpress-cram-submission