GitHub - michalbukowski/fetch-genomes: Download genomes from NCBI GenBank FTP site

fetch-genomes

Requirements

The script should work in any Python 3 environment with the pandas library installed. Below are the versions I tested the script with:

Python (3.11)
Pandas (1.5.3)

Short description

The script downloads genomes from the NCBI GenBank FTP server, based on the content of assembly_summary_genbank.txt or any TSV file that provides the required data on genome assemblies in the following columns: assembly_accession, taxid, assembly_level, asm_name and ftp_path. For more information, see README.txt at the NCBI GenBank FTP site.
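To check whether a custom TSV file has the columns listed above, something like the following pandas sketch could be used. It is not taken from fetch_genomes.py: the function name `load_summary` and the sample row (accession, assembly name, URL) are made up for illustration. The raw NCBI file also prefixes its header lines with `#`, which this sketch does not handle.

```python
import io
import pandas as pd

# Columns the README lists as required in the assembly summary.
REQUIRED_COLUMNS = ["assembly_accession", "taxid", "assembly_level", "asm_name", "ftp_path"]

def load_summary(source):
    """Read a tab-separated summary and keep only the required columns (hypothetical helper)."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    return df[REQUIRED_COLUMNS]

# Made-up single-row summary with one extra column that gets dropped.
sample = io.StringIO(
    "assembly_accession\ttaxid\tassembly_level\tasm_name\tftp_path\textra\n"
    "GCA_000000001.1\t1280\tComplete Genome\tASM_example\t"
    "https://example.org/GCA_000000001.1_ASM_example\tx\n"
)
summary = load_summary(sample)
print(summary.columns.tolist())
```

Reading every column as a string avoids pandas guessing numeric types for taxid values, which keeps later comparisons predictable.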

You can run it without any command line options, or provide one or more taxid values to narrow down the set of genomes you want to retrieve. You can look up the IDs of the taxa you are interested in in the NCBI Taxonomy database. Genomes of the requested taxa and all their subordinate subtaxa will be retrieved.
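The README does not say how the script finds subordinate subtaxa. One way to picture the idea is a walk down a child-to-parent map, as in the sketch below. The function `collect_subtaxa` and the tiny `parent_of` map are assumptions for illustration only, not the script's actual method or data.

```python
from collections import defaultdict

def collect_subtaxa(requested, parent_of):
    """Return the requested taxids plus every taxid below them (hypothetical helper)."""
    # Invert the child -> parent map into parent -> list of children.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    selected = set()
    stack = list(requested)
    while stack:
        taxid = stack.pop()
        if taxid in selected:
            continue
        selected.add(taxid)
        stack.extend(children[taxid])
    return selected

# Toy child -> parent map, for illustration only.
parent_of = {1280: 1279, 46170: 1280, 1351: 1350, 1352: 1350, 1386: 186817}
selected = collect_subtaxa([1279, 1350], parent_of)
print(sorted(selected))
```

The iterative stack avoids recursion limits on deep taxonomies, and the `selected` set guards against visiting a taxid twice.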

Example usage

Download genomes for taxids 1279 (genus Staphylococcus) and 1350 (genus Enterococcus) and all their subtaxa, i.e. all genomes assigned to the genera as well as to all subordinate species, subspecies etc. Save the genomes to the default directory (genomes) in the current location:

./fetch_genomes.py -t 1279 1350

If you have problems with your network connection, you can rerun the script until all genomes are successfully retrieved, i.e. until you see this message at the end: [INFO] All files have been successfully fetched. Simply resume the previous download, or retry downloading skipped genomes, based on the filtered assembly summary saved from a previous search (existing files will not be redownloaded):

./fetch_genomes.py -a assembly_summary_copy.tsv
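The "existing files will not be redownloaded" behaviour can be pictured as a simple existence check before each download. The sketch below is an assumption about how such a check might look, not the script's code. The helper `pending_downloads` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

def pending_downloads(file_names, output_dir):
    """Return the file names not yet present in output_dir (hypothetical helper)."""
    out = Path(output_dir)
    return [name for name in file_names if not (out / name).is_file()]

# Simulate an output directory where one of two files was already fetched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a_genomic.fna.gz").write_bytes(b"")
    todo = pending_downloads(["a_genomic.fna.gz", "b_genomic.fna.gz"], tmp)

print(todo)
```

A check like this treats a partially downloaded file as complete. A size or checksum comparison would be needed to catch that case.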

Retrieve the filtered assembly summary only, i.e. the assembly summary for genomes belonging to the requested taxa, without downloading anything else, in order to examine or modify it before use:

./fetch_genomes.py -t 1279 1350 -s
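Conceptually, a filtered summary is the subset of rows whose taxid is among the selected taxa, written back out as TSV. Below is a minimal pandas sketch of that idea, assuming the list of taxids has already been expanded to include subtaxa. The function `filter_summary` and the two rows are made up for illustration and do not come from the script.

```python
import io
import pandas as pd

def filter_summary(df, taxids):
    """Keep rows whose taxid is in the given collection (hypothetical helper)."""
    wanted = {str(t) for t in taxids}
    return df[df["taxid"].astype(str).isin(wanted)]

# Made-up two-row summary with only two of the columns.
summary = pd.DataFrame({
    "assembly_accession": ["GCA_000000001.1", "GCA_000000002.1"],
    "taxid": ["1280", "1351"],
})
filtered = filter_summary(summary, [1280])

# Write the result as TSV; a file path could be used instead of the buffer.
buffer = io.StringIO()
filtered.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```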

Command line options

Option	Use
`-a`, `--assembly-summary`	A path to a custom local file in TSV format that contains information on the assemblies to be downloaded, default: the assembly summary will be fetched from the NCBI GenBank FTP site
`-c`, `--summary-copy`	A path to the TSV file in which to save the filtered assembly summary for the chosen taxids, default: assembly_summary_copy.tsv
`-t`, `--taxids`	Space-separated IDs of taxa to retrieve genomic sequences for, default: all existing(!)
`-l`, `--assembly-levels`	Space-separated assembly levels that will be taken into consideration: chromosome (`chr`), scaffold (`scff`), complete (`cmpl`), contig (`ctg`), default: all levels
`-o`, `--output-dir`	A path to the directory for downloaded genomes, default: genomes
`-f`, `--formats`	Formats of data to be downloaded: genomic sequences in nucleotide FASTA format (`fna`), genomic sequences in GenBank format (`gbff`), annotation table (`gff`), RNA sequences in nucleotide FASTA format (`rna`), coding sequences (CDS) in nucleotide FASTA format (`cds`), translations of CDS in protein FASTA format (`prot`), default: fna
`-n`, `--non-interactive`	Do not ask questions and overwrite existing data (be absolutely sure what you are doing)
`-s`, `--summary-only`	For the given taxids, or for all taxa, only download the assembly summary
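To show how the ftp_path column and the `-f` format codes could combine into download URLs, here is a rough sketch. It relies on the usual NCBI naming scheme, in which files in an assembly directory are named after the directory plus a suffix. The suffix chosen for each code is an assumption (in particular for `prot`), and the function `build_urls` and the example path are hypothetical. The script itself may do this differently.

```python
# Assumed mapping from the README's format codes to NCBI file name suffixes.
SUFFIXES = {
    "fna": "_genomic.fna.gz",
    "gbff": "_genomic.gbff.gz",
    "gff": "_genomic.gff.gz",
    "rna": "_rna_from_genomic.fna.gz",
    "cds": "_cds_from_genomic.fna.gz",
    "prot": "_translated_cds.faa.gz",  # assumption: could also be _protein.faa.gz
}

def build_urls(ftp_path, formats=("fna",)):
    """Build one URL per requested format from an ftp_path value (hypothetical helper)."""
    base = ftp_path.rstrip("/")
    # The last path component doubles as the file name prefix.
    prefix = base.rsplit("/", 1)[-1]
    return [f"{base}/{prefix}{SUFFIXES[fmt]}" for fmt in formats]

# Made-up ftp_path value, for illustration only.
example_path = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/001/GCA_000000001.1_ASM_example"
urls = build_urls(example_path, ["fna", "gff"])
for url in urls:
    print(url)
```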
