Accessing the inverted repeats of archived plastid genomes

michaelgruenstaeudl/airpg

airpg: Automatically accessing the inverted repeats of archived plastid genomes


A Python package for automatically accessing the inverted repeats of thousands of plastid genomes stored on NCBI Nucleotide.

INSTALLATION

To get the most recent stable version of airpg, run:

pip install airpg

Alternatively, to get the latest development version of airpg, run:

pip install git+https://github.com/michaelgruenstaeudl/airpg.git

Update May 2024 (tested on Debian)

To install airpg from source, open a terminal, clone the repository via git, cd into the cloned directory, and run:

sudo pip install .
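
After any of the installation routes above, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `airpg` is taken from the `pip install airpg` command above.

```python
from importlib import metadata


def installed_version(dist_name="airpg"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    print(installed_version() or "airpg is not installed in this environment")
```

If this reports that airpg is missing although the install succeeded, the install may have targeted a different Python environment (for example, the system-wide one when using `sudo`).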

EXAMPLE USAGE

Tutorial 1: Very short survey (runtime ca. 5 min.; for the impatient)

Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide within the past 10 days.

Tutorial 2: Short survey (runtime ca. 15 min.; for testing)

Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide within the current month.

Tutorial 3: Medium survey (runtime ca. 5 hours)

Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide in 2019 only. Note: The results of this survey are available on Zenodo via DOI 10.5281/zenodo.4335906

Tutorial 4: Full survey (runtime ca. 19 hours; with explanations)

Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide from January 2000 through December 2020. Note: The results of this survey are available on Zenodo via DOI 10.5281/zenodo.4335906

TIPS & TRICKS

How to sort output_script1.tsv

sort -t$'\t' -k7.1,7.4 -k7.6,7.7 -k7.9,7.10 -n output_script1.tsv > output_script1.sorted.tsv
awk '{print $2}' output_script1.sorted.tsv > output_script1.sorted.index
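
For readers who prefer Python, the two commands above can be approximated as follows. This is a sketch under assumptions, not part of airpg: the character ranges in the sort keys (7.1-7.4, 7.6-7.7, 7.9-7.10) suggest that column 7 holds a date in `YYYY-MM-DD` form, and the helper name `sort_rows_by_date` is hypothetical. Unlike the `awk` call, which splits on any whitespace, the sketch splits on tabs only.

```python
from datetime import date


def sort_rows_by_date(lines, date_col=7, key_col=2):
    """Sort tab-separated rows by the date in `date_col` (1-based).

    Returns (sorted_rows, index), where index lists the `key_col` value
    of each row in sorted order.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]

    def sort_key(row):
        # Assumption: the date column starts with YYYY-MM-DD.
        try:
            return date.fromisoformat(row[date_col - 1][:10])
        except (IndexError, ValueError):
            # Rows without a parseable date (e.g. a header line) sort first.
            return date.min

    rows.sort(key=sort_key)  # list.sort is stable, so ties keep input order
    index = [row[key_col - 1] if len(row) >= key_col else "" for row in rows]
    return rows, index


if __name__ == "__main__":
    with open("output_script1.tsv") as handle:
        sorted_rows, index = sort_rows_by_date(handle)
    with open("output_script1.sorted.tsv", "w") as out:
        out.writelines("\t".join(row) + "\n" for row in sorted_rows)
    with open("output_script1.sorted.index", "w") as out:
        out.writelines(key + "\n" for key in index)
```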

How to sort output_script2.tsv and output_script3.tsv

awk 'NR==FNR{o[FNR]=$1; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' output_script1.sorted.index output_script2.tsv > output_script2.sorted.tsv
awk 'NR==FNR{o[FNR]=$1; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' output_script1.sorted.index output_script3.tsv > output_script3.sorted.tsv
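
The reordering performed by the two `awk` one-liners above can also be sketched in Python. The helper name `reorder_by_index` is hypothetical, and the sketch assumes that the first tab-separated field of each row is the key listed in the index file. It differs from the one-liners in two ways: it walks through every key in the index, and it skips keys that have no matching row instead of printing an empty line.

```python
def reorder_by_index(index_keys, lines):
    """Return the rows in `lines` ordered by their first field as in `index_keys`."""
    by_key = {}
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # Later rows with the same key overwrite earlier ones, as in the awk version.
            by_key[line.split("\t", 1)[0]] = line
    return [by_key[key] for key in index_keys if key in by_key]


if __name__ == "__main__":
    with open("output_script1.sorted.index") as handle:
        keys = [line.strip() for line in handle if line.strip()]
    for name in ("output_script2", "output_script3"):
        with open(name + ".tsv") as handle:
            ordered = reorder_by_index(keys, handle)
        with open(name + ".sorted.tsv", "w") as out:
            out.writelines(row + "\n" for row in ordered)
```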

How to count the number of angiosperm families represented by the plastid genomes archived on GenBank

# Using the sorted output of script1 as input
awk -F'\t' '{print $11}' output_script1.sorted.tsv | tr ";" "\n" | grep "aceae" | grep -v "incertae sedis" | sort -u | wc -l
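
The same count can be approximated in Python. This is a sketch, not part of airpg: the helper name `count_families` is hypothetical, and it assumes, as the shell pipeline above implies, that column 11 holds a semicolon-separated taxonomy string. Unlike the pipeline, it strips surrounding whitespace from each name before deduplicating, so the two counts can differ if the same name occurs with and without leading spaces.

```python
def count_families(lines, taxonomy_col=11):
    """Count distinct taxon names containing 'aceae' in the taxonomy column (1-based)."""
    families = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < taxonomy_col:
            continue  # row too short to contain a taxonomy column
        for name in fields[taxonomy_col - 1].split(";"):
            name = name.strip()
            # Keep family-like names; drop unplaced ("incertae sedis") entries.
            if "aceae" in name and "incertae sedis" not in name:
                families.add(name)
    return len(families)


if __name__ == "__main__":
    with open("output_script1.sorted.tsv") as handle:
        print(count_families(handle))
```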

CHANGELOG

See CHANGELOG.md for a list of recent changes to the software.