michalbukowski/fetch-genomes

fetch-genomes

Requirements

The script should work in any Python 3 environment with the Pandas library installed. Below are the versions with which I tested the script:

  • Python (3.11)
  • Pandas (1.5.3)

Short description

The script downloads genomes from the NCBI GenBank FTP server based on the content of assembly_summary_genbank.txt, or of any TSV file that provides the required data on genome assemblies in the following columns: assembly_accession, taxid, assembly_level, asm_name and ftp_path. For more information, see README.txt at the NCBI GenBank FTP site.
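If you prepare a custom TSV file, it may help to check the required columns before running the script. The following is a minimal sketch with Pandas, not the script's own code: the function name load_summary and the sample record are made up for illustration, and the sketch assumes a plain header row (the original NCBI file starts with comment lines prefixed with #, which may need extra handling).

```python
import io

import pandas as pd

# Columns the description above lists as required.
REQUIRED_COLUMNS = [
    "assembly_accession", "taxid", "assembly_level", "asm_name", "ftp_path",
]


def load_summary(source):
    """Read a tab-separated assembly summary and verify the required columns.

    Hypothetical helper, not part of fetch_genomes.py.
    """
    df = pd.read_csv(source, sep="\t", dtype=str)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    return df[REQUIRED_COLUMNS]


# Tiny in-memory example with made-up values (not a real NCBI record).
example = io.StringIO(
    "assembly_accession\ttaxid\tassembly_level\tasm_name\tftp_path\n"
    "GCA_000000000.1\t1279\tContig\tASM_example\t"
    "https://example.org/GCA_000000000.1_ASM_example\n"
)
summary = load_summary(example)
print(summary.shape)
```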

You can run it without any command line options, or provide one or more taxid values to narrow down the set of genomes to retrieve. Taxon IDs of interest can be found in the NCBI Taxonomy database. Genomes of the requested taxa and of all their subordinate subtaxa will be retrieved.
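This README does not describe how the script resolves subtaxa, so the sketch below only illustrates the idea of expanding requested taxids to all their descendants. It assumes a child-to-parent mapping is available (one could build it, for instance, from the nodes.dmp file of the NCBI taxonomy dump); the function name collect_subtaxa and the toy mapping are hypothetical.

```python
from collections import defaultdict


def collect_subtaxa(parent_of, requested):
    """Return the requested taxids together with all their descendants.

    parent_of maps a child taxid to its parent taxid (assumed input).
    """
    # Invert the mapping: parent -> list of direct children.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    selected = set()
    stack = list(requested)
    while stack:
        taxid = stack.pop()
        if taxid in selected:
            continue
        selected.add(taxid)
        stack.extend(children[taxid])
    return selected


# Toy child -> parent table; the IDs below 1279 and 1350 are illustrative
# placeholders and were not checked against the NCBI Taxonomy database.
parent_of = {1280: 1279, 1282: 1279, 46170: 1280, 1351: 1350}

print(sorted(collect_subtaxa(parent_of, [1279, 1350])))
```

Assembly records whose taxid falls in the returned set would then be kept, which matches the behaviour described above: the requested taxa plus everything below them.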

Example usage

  • Download genomes for taxids 1279 (genus Staphylococcus) and 1350 (genus Enterococcus) and all their subtaxa, i.e. all genomes assigned to the genera as well as to all subordinate species, subspecies etc. Save the genomes to the default directory (genomes) in the current location:
./fetch_genomes.py -t 1279 1350
  • If you have problems with your network connection, you may rerun the script until all genomes are successfully retrieved, i.e. until you see the final message: [INFO] All files have been successfully fetched. Simply resume the previous download, or retry downloading skipped genomes, based on the filtered assembly summary saved from a previous search (existing files will not be redownloaded):
./fetch_genomes.py -a assembly_summary_copy.tsv
  • Retrieve the filtered assembly summary only, i.e. the assembly summary for genomes belonging to the requested taxa, without downloading anything else, in order to examine or modify it before use:
./fetch_genomes.py -t 1279 1350 -s
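The resume behaviour mentioned in the second example (existing files are not redownloaded) can be pictured with a short sketch. It is an illustration only, not the script's actual logic: the helper pending_files and the file names are hypothetical, and the check is by file existence alone.

```python
import tempfile
from pathlib import Path


def pending_files(expected_names, output_dir):
    """Return the expected file names that are not yet present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected_names if not (out / name).is_file()]


# Demonstration in a temporary directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.fna.gz").write_bytes(b"")  # pretend this one was fetched
    todo = pending_files(["a.fna.gz", "b.fna.gz"], tmp)
    print(todo)
```

Note that an existence check like this would also treat a partially downloaded file as complete; whether and how the script guards against that is not stated here.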

Command line options

Option | Use
-a, --assembly-summary | A path to a custom local file in TSV format that contains information on the assemblies to be downloaded, default: the assembly summary will be fetched from the NCBI GenBank FTP site
-c, --summary-copy | A path to a TSV file in which to save the filtered assembly summary for the chosen taxids, default: assembly_summary_copy.tsv
-t, --taxids | Space-separated IDs of taxa to retrieve genomic sequences for, default: all existing(!)
-l, --assembly-levels | Space-separated assembly levels to be taken into consideration: chromosome (chr), scaffold (scff), complete (cmpl), contig (ctg), default: all levels
-o, --output-dir | A path to the directory for downloaded genomes, default: genomes
-f, --formats | Formats of data to be downloaded: genomic sequences in nucleotide fasta format (fna), genomic sequences in GenBank format (gbff), annotation table (gff), RNA sequences in nucleotide fasta format (rna), coding sequences (CDS) in nucleotide fasta format (cds), translations of CDS in protein fasta format (prot), default: fna
-n, --non-interactive | Do not ask questions and overwrite existing data (be absolutely sure of what you are doing)
-s, --summary-only | For the given taxids, or for all taxa, download the assembly summary only
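To give an idea of how the ftp_path column and the format keys could translate into file locations, here is a hedged sketch. It relies on the naming pattern commonly seen in NCBI assembly directories, where files are named after the last component of ftp_path plus a type-specific suffix. The mapping of the keys above to suffixes is an assumption (in particular for rna and prot), and build_urls is a hypothetical helper, not a function of the script.

```python
# Assumed mapping of format keys to NCBI file-name suffixes.
SUFFIXES = {
    "fna": "_genomic.fna.gz",
    "gbff": "_genomic.gbff.gz",
    "gff": "_genomic.gff.gz",
    "rna": "_rna_from_genomic.fna.gz",   # assumption
    "cds": "_cds_from_genomic.fna.gz",
    "prot": "_translated_cds.faa.gz",    # assumption
}


def build_urls(ftp_path, formats=("fna",)):
    """Build one file URL per requested format for a single assembly."""
    base = ftp_path.rstrip("/")
    stem = base.rsplit("/", 1)[-1]  # e.g. <assembly_accession>_<asm_name>
    return [f"{base}/{stem}{SUFFIXES[fmt]}" for fmt in formats]


# Placeholder path, not a real assembly location.
urls = build_urls(
    "https://example.org/genomes/GCA_000000000.1_ASM_example",
    formats=("fna", "gff"),
)
for url in urls:
    print(url)
```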
