Skip to content

R package to merge the NCBI taxonomy and the GBIF backbone taxonomy

License

Notifications You must be signed in to change notification settings

MoultDB/taxonbridge

Repository files navigation

taxonbridge

R-CMD-check badge CRAN Status CRAN Downloads

Biological taxonomies establish conventions by which researchers can catalogue and systematically compare their work using nomenclature such as numeric identifiers and binomial names. The ideal taxonomy is unambiguous and exhaustive; however, no perfect taxonomy exists. The degree to which a taxonomy is useful to a researcher depends on context provided by, for example, the taxonomic neighborhood of a species or the geological timeframe of the study. Collating the most relevant taxonomic information from multiple taxonomies is hampered by arbitrary assignment of numeric identifiers by database administrators, ambiguity in scientific names, and duplication. The NCBI is the go-to resource for many scientists, but its taxonomy only includes data on species with sequence data. In contrast, the Global Biodiversity Information Facility (GBIF) backbone taxonomy references a more extensive list of extinct and extant species, and it is integrated with 100 other taxonomic databases. Unfortunately, the GBIF backbone taxonomy excludes the NCBI taxonomy. Since the NCBI and GBIF use different numeric identifiers, it is easy to imagine how using scientific names could lead to errors when mapping from one taxonomy to the other. As a case in point, additional lineage information could be used to validate mapping by recursively comparing parental taxon names. The goal of taxonbridge is hence to provide a set of tools for merging the GBIF backbone and NCBI taxonomies in order to derive a consistent, deduplicated and disambiguated custom taxonomy for any given study. See the data provenance flow diagram and scientific poster for more details.

Installation

CRAN version:

To install taxonbridge from CRAN type:

install.packages("taxonbridge")
library(taxonbridge)

Note that the version on CRAN might not reflect the most recent changes made to the development version of taxonbridge.

Development version:

You can install the development version of taxonbridge with devtools:

install.packages(c("devtools", "rmarkdown"))
devtools::install_github("MoultDB/taxonbridge", build_vignettes = TRUE)
library(taxonbridge)

taxonbridge can be also be updated/re-installed/overwritten with either of the preceding installation options.

Available methods and how to use them

See the taxonbridge documentation for detailed descriptions of the available methods and see the workflow for how to use the methods. Custom taxonomies in taxonbridge always have the following 29 columns. All columns have the character data type. Column names with links are GBIF column names that are also Darwin Core controlled vocabulary terms.

Column name Description
taxonID GBIF identifier
canonicalName GBIF/NCBI scientific name
taxonRank GBIF rank
parentNameUsageID GBIF parent ID
acceptedNameUsageID GBIF accepted ID
originalNameUsageID GBIF original ID
taxonomicStatus GBIF taxonomic status
kingdom GBIF kingdom name
phylum GBIF phylum name
class GBIF class name
order GBIF order name
family GBIF family name
genericName GBIF genus name
specificEpithet GBIF species name
infraspecificEpithet GBIF subspecies name
from_GBIF GBIF provenance indicator
ncbi_id NCBI identifier
ncbi_lineage_names NCBI full lineage names
ncbi_lineage_ids NCBI full lineage IDs
ncbi_rank NCBI rank
ncbi_lineage_ranks NCBI full lineage ranks
ncbi_kingdom NCBI kingdom name
ncbi_phylum NCBI phylum name
ncbi_class NCBI class name
ncbi_order NCBI order name
ncbi_family NCBI family name
ncbi_genus NCBI genus name
ncbi_species NCBI species name
from_NCBI NCBI provenance indicator

Examples

A 2000 row example subset of a previously merged taxonomy is bundled with taxonbridge and can be loaded as follow:

library(taxonbridge)
example_1 <- load_sample()

Want to try more than a sample? Download a larger dataset and load it as follow:

library(taxonbridge)
example_2 <- load_population("path/to/downloaded/dataset")

You can also prepare a dataset yourself which requires the use of external data and software. The most current NCBI and GBIF taxonomic data can be downloaded as follow:

download_gbif()
download_ncbi()

Once the downloads are complete, the paths to the downloaded files will be reported to your terminal. A single file is downloaded from the GBIF (Taxon.tsv) while four files are downloaded from the NCBI (nodes.dmp, names.dmp, delnodes.dmp and merged.dmp). Take note of the location of these files. The NCBI files require further parsing with Taxonkit. Read the download_ncbi() documentation for instructions on how to parse the NCBI files.

Downloading the GBIF and NCBI taxonomic data, parsing the NCBI files, and merging the taxonomies can easily be carried out in one command if Taxonkit is already installed on your system:

library(taxonbridge)
custom_taxonomy <- load_taxonomies(download_gbif(), download_ncbi(taxonkitpath = "/path/to/taxonkit"))

See the workflow and vignette for more ideas on what to do with loaded data in taxonbridge.

References

Global Biodiversity Information Facility (GBIF) backbone taxonomy

National Center for Biotechnology Information (NCBI) taxonomy

Citation

To cite taxonbridge in publications use:

Veldsman WP, Campli G, Dind S, Rech de Laval V, Drage HB, Waterhouse RM and Robinson-Rechavi M (2022) Taxonbridge: an R package to create custom taxonomies based on the NCBI and GBIF taxonomies, bioRxiv, 490269. DOI: https://doi.org/10.1101/2022.05.02.490269

About

R package to merge the NCBI taxonomy and the GBIF backbone taxonomy

Topics

Resources

License

Stars

Watchers

Forks

Languages