Skip to content

Commit

Permalink
Merge pull request #722 from dengzq1234/ete4_tutorial_taxa
Browse files Browse the repository at this point in the history
Ete4 tutorial taxa
  • Loading branch information
jordibc committed Nov 2, 2023
2 parents 1582ea2 + b8d315a commit 415a1cc
Showing 1 changed file with 398 additions and 0 deletions.
398 changes: 398 additions & 0 deletions doc/tutorial/tutorial_taxonomy.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,398 @@
.. currentmodule:: ete4

Connecting with Taxonomy Databases
==================

.. contents::

Overview
--------
ETE4 contains *ncbi_taxonomy* and *gtdb_taxonomy* modules which provide
utilities to efficiently query a local copy of the NCBI or GTDB taxonomy
databases. The class ``NCBITaxa`` and ``GTDBTaxa`` offer methods to convert
from taxid to names (and vice versa), to fetch pruned topologies connecting
a given set of species, or to download rank, names and lineage track information.

It is also fully integrated with PhyloTree instances through the
``PhyloNode.annotate_ncbi_taxa()`` and ``PhyloNode.annotate_gtdb_taxa()``method.
Differences between NCBI and GTDB taxonomies in ETE4
----------------------------------------------------
The NCBI taxonomy database is a comprehensive resource for organism names and
classifications.It is updated daily and offers multiple access points including a web
portal, an FTP server. The database releases its data in a package called "taxdump.tar.gz" which
contains several .dmp files.
Taxon in NCBI taxonomyis usually a numeric identifier, commonly representing
taxa ("TaxID"), but it can also signify other entities like genetic codes or citations, such as
9606 represents Homo Sapiens.
On the other hand, GTDB taxonomy is distributed as simple text files, uses a genome-based
approach for classification, and the identifiers are usually specific to genomes rather
than taxa.
Since ETE Toolkit version 3, ete3 parses taxdump file to local sqlite database to fullfill the
methods in ncbi_taxonomy module. We applied the same strategy to GTDBTaxa. While the original GTDB
taxonomy data differs from NCBI taxonomy files, a conversion step is essential for integration.
To integrate GTDB into the ETE Toolkit v4, a conversion process was necessary. A third-party script
(https://github.com/nick-youngblut/gtdb_to_taxdump) was employed to convert the GTDB taxonomy to the
NCBI-like taxdump format. We already preprared GTDB taxonomy dump file from different releases version
and store in https://github.com/etetoolkit/ete-data/tree/main/gtdb_taxonomy.
Setting up local copies of the NCBI and GTDB taxonomy databases
-------------------------------------------------------------
The first time you attempt to use NCBITaxa or GTDBTaxa, ETE will detect that your local
database is empty and it will attempt to download the latest taxonomy database(NCBI ~600MB;GTDB ~72MB) and will
store a parsed version of it in your home directory: ~/.local/share/ete/.
All future imports of NCBITaxa or GTDBTaxa will detect the local database and will
skip this step.
Example::
# Load NCBI module
from ete4 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()
# Load GTDB module
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
gtdb.update_taxonomy_database()
# Load GTDB module with specific release version
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
# latest release updated in https://github.com/dengzq1234/ete-data/tree/main/gtdb_taxonomy
gtdb.update_taxonomy_database()
# or
gtdb.update_taxonomy_database("gtdbdump.tar.gz")
# update with custom release 202
gtdb.update_taxonomy_database('gtdb202dump.tar.gz')
Getting taxid information
-------------------------
NCBI taxonomy
~~~~~~~~~~~~~
you can fetch species names, ranks and linage track information for your taxids using the following
methods:
- NCBITaxa.get_rank()
- NCBITaxa.get_lineage()
- NCBITaxa.get_taxid_translator()
- NCBITaxa.get_name_translator()
- NCBITaxa.translate_to_names()
The so called get-translator-functions will return a dictionary converting between taxids and species names.
Either species or linage names/taxids are accepted as input.
Example::
from ete4 import NCBITaxa
ncbi = NCBITaxa()
taxid2name = ncbi.get_taxid_translator([9606, 9443])
print(taxid2name)
# {9443: 'Primates', 9606: 'Homo sapiens'}
name2taxid = ncbi.get_name_translator(['Homo sapiens', 'primates'])
print(name2taxid)
# {'Homo sapiens': [9606], 'primates': [9443]}
# when the same name points to several taxa, all taxids are returned
name2taxid = ncbi.get_name_translator(['Bacteria'])
print(name2taxid)
# {'Bacteria': [2, 629395]}
Other functions allow to extract further information using taxid numbers as a query.
Example::
from ete4 import NCBITaxa
ncbi = NCBITaxa()
print(ncbi.get_rank([9606, 9443]))
# {9443: u'order', 9606: u'species'}
print(ncbi.get_lineage(9606))
# [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
# 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
# 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
# 9606]
Combine combine all at once:
Example::
from ete4 import NCBITaxa
ncbi = NCBITaxa()
lineage = ncbi.get_lineage(9606)
print(lineage)
# [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
# 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
# 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
# 9606]
names = ncbi.get_taxid_translator(lineage)
print([names[taxid] for taxid in lineage])
# [u'root', u'cellular organisms', u'Eukaryota', u'Opisthokonta', u'Metazoa',
# u'Eumetazoa', u'Bilateria', u'Deuterostomia', u'Chordata', u'Craniata',
# u'Vertebrata', u'Gnathostomata', u'Teleostomi', u'Euteleostomi',
# u'Sarcopterygii', u'Dipnotetrapodomorpha', u'Tetrapoda', u'Amniota',
# u'Mammalia', u'Theria', u'Eutheria', u'Boreoeutheria', u'Euarchontoglires',
# u'Primates', u'Haplorrhini', u'Simiiformes', u'Catarrhini', u'Hominoidea',
# u'Hominidae', u'Homininae', u'Homo', u'Homo sapiens']
GTDB taxonomy
~~~~~~~~~~~~~
In the NCBI taxonomy database, each species is assigned a unique numeric taxid.
For example, the taxid 9606 refers to Homo sapiens. These taxids serve as
essential keys for tracking lineages within the database.
However, the GTDB database doesn't originally offer numeric taxids like NCBI
does. In the GTDBTaxa module, we've introduced taxids for each species to
facilitate lineage tracking. These taxids, while not officially recognized in
the GTDB database, serve as convenient keys. They help in connecting the lineage
and taxonomic ranks within the local database, making it easier for users to
fetch and relate taxonomic information.
Like NCBITaxa, GTDBTaxa contains similar methods:
- GTDBTaxa.get_rank()
- GTDBTaxa.get_lineage()
- GTDBTaxa.get_taxid_translator()
- GTDBTaxa.get_name_translator()
- GTDBTaxa.translate_to_names()
- GTDBTaxa.get_name_lineage()
Getting descendant taxa
-----------------------
Given a taxid or a taxa name from an internal node in the NCBI/GTDB taxonomy tree,
their descendants can be retrieved as follows:
NCBI taxonomy
Example::
# example in NCBI taxonomy
from ete4 import NCBITaxa
ncbi = NCBITaxa()
descendants = ncbi.get_descendant_taxa('Homo')
print(ncbi.translate_to_names(descendants))
# [u'Homo heidelbergensis', u'Homo sapiens ssp. Denisova',
# u'Homo sapiens neanderthalensis']
# you can easily ignore subspecies, so only taxa labeled as "species" will be reported:
descendants = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True)
print(ncbi.translate_to_names(descendants))
# [u'Homo sapiens', u'Homo heidelbergensis']
# or even returned as an annotated tree
tree = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True, return_tree=True)
print(tree.to_str(props=['sci_name','taxid']))
"""
╭╴environmental samples,2665952╶╌╴Homo sapiens environmental sample,2665953
├╴Homo sapiens,9606
╴Homo,9605╶┤
├╴Homo heidelbergensis,1425170
╰╴unclassified Homo,2813598╶╌╴Homo sp.,2813599
"""
GTDB taxonomy
Example::
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae')
print(descendants)
# ['GB_GCA_003662765.1', 'GB_GCA_003662805.1', 'GB_GCA_003345555.1', 'GB_GCA_003345595.1', 'GB_GCA_001940705.1', 'GB_GCA_001563335.1', 'GB_GCA_011364905.1', 'GB_GCA_004376265.1', 'GB_GCA_002825515.1', 'GB_GCA_001563325.1', 'GB_GCA_011364985.1', 'GB_GCA_011365025.1', 'GB_GCA_004524565.1', 'GB_GCA_004524595.1', 'GB_GCA_002825465.1', 'GB_GCA_002825535.1', 'GB_GCA_003345545.1', 'GB_GCA_004524445.1', 'GB_GCA_013388835.1', 'GB_GCA_008080745.1', 'GB_GCA_004524435.1', 'GB_GCA_013138615.1']
#ignore subspecies, so only taxa labeled as "species" will be reported
descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae', collapse_subspecies=True)
print(descendants)
#['s__MP8T-1 sp002825535', 's__MP8T-1 sp003345545', 's__MP8T-1 sp002825465', 's__MP8T-1 sp004524565', 's__MP8T-1 sp004524595', 's__SMTZ1-83 sp011364985', 's__SMTZ1-83 sp011365025', 's__SMTZ1-83 sp001563325', 's__TEKIR-14 sp004524445', 's__SHMX01 sp008080745', 's__OWC5 sp003345595', 's__OWC5 sp003345555', 's__JACAEL01 sp013388835', 's__B65-G9 sp003662765', 's__SMTZ1-45 sp001563335', 's__SMTZ1-45 sp011364905', 's__SMTZ1-45 sp001940705', 's__SMTZ1-45 sp004376265', 's__SMTZ1-45 sp002825515', 's__WTCK01 sp013138615', 's__TEKIR-12S sp004524435']
#returned as an annotated tree
descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae', collapse_subspecies=True, return_tree=True)
print(descendants.to_str(props=['sci_name','rank']))
"""
╭╴s__MP8T-1 sp002825535,species
├╴s__MP8T-1 sp003345545,species
╭╴g__MP8T-1,genus╶┼╴s__MP8T-1 sp002825465,species
│ │
│ ├╴s__MP8T-1 sp004524565,species
│ │
│ ╰╴s__MP8T-1 sp004524595,species
│ ╭╴s__SMTZ1-83 sp011364985,species
│ │
├╴g__SMTZ1-83,genus╶┼╴s__SMTZ1-83 sp011365025,species
│ │
│ ╰╴s__SMTZ1-83 sp001563325,species
├╴g__TEKIR-14,genus╶╌╴s__TEKIR-14 sp004524445,species
├╴g__SHMX01,genus╶╌╴s__SHMX01 sp008080745,species
│ ╭╴s__OWC5 sp003345595,species
├╴g__OWC5,genus╶┤
╴f__Thorarchaeaceae,family╶┤ ╰╴s__OWC5 sp003345555,species
├╴g__JACAEL01,genus╶╌╴s__JACAEL01 sp013388835,species
├╴g__B65-G9,genus╶╌╴s__B65-G9 sp003662765,species
│ ╭╴s__SMTZ1-45 sp001563335,species
│ │
│ ├╴s__SMTZ1-45 sp011364905,species
│ │
├╴g__SMTZ1-45,genus╶┼╴s__SMTZ1-45 sp001940705,species
│ │
│ ├╴s__SMTZ1-45 sp004376265,species
│ │
│ ╰╴s__SMTZ1-45 sp002825515,species
├╴g__WTCK01,genus╶╌╴s__WTCK01 sp013138615,species
╰╴g__TEKIR-12S,genus╶╌╴s__TEKIR-12S sp004524435,species
"""
Getting species tree topology
----------------------------------
Getting the taxonomy tree for a given set of species is one of the most useful ways
to get all information at once. The method NCBITaxa.get_topology() or GTDBTaxa.get_topology() allows to query your
local NCBI/GTDB database and extract the smallest tree that connects all your query taxids.
It returns a normal ETE tree in which all nodes, internal or leaves, are annotated for
lineage, scientific names, ranks, and so on.
NCBI taxonomy
Example::
from ete4 import NCBITaxa
ncbi = NCBITaxa()
tree = ncbi.get_topology([9606, 9598, 10090, 7707, 8782])
print(tree.to_str(props=["sci_name", "rank"]))
"""
╭╴Dendrochirotida,order
│ ╭╴Homo sapiens,species
╴Deuterostomia,clade╶┤ ╭╴Homininae,subfamily╶┤
│ ╭╴Euarchontoglires,superorder╶┤ ╰╴Pan troglodytes,species
│ │ │
╰╴Amniota,clade╶┤ ╰╴Mus musculus,species
╰╴Aves,class
"""
# all intermediate nodes connecting the species can also be kept in the tree
tree = ncbi.get_topology([2, 33208], intermediate_nodes=True)
print(tree.to_str(props=["sci_name"]))
"""
╭╴Eukaryota╶╌╴Opisthokonta╶╌╴Metazoa
╴cellular organisms╶┤
╰╴Bacteria
"""
GTDB taxonomy
Example::
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
tree = gtdb.get_topology(["p__Huberarchaeota", "o__Peptococcales", "f__Korarchaeaceae"])
print(tree.to_str(props=['sci_name', 'rank']))
"""
╭╴p__Huberarchaeota,phylum
╭╴d__Archaea,superkingdom╶┤
╴root,no rank╶┤ ╰╴f__Korarchaeaceae,family
╰╴o__Peptococcales,order
"""
# all intermediate nodes connecting the species can also be kept in the tree
tree = gtdb.get_topology(["p__Huberarchaeota", "o__Peptococcales", "f__Korarchaeaceae"], intermediate_nodes=True, collapse_subspecies=True, annotate=True)
print(tree.to_str(props=['sci_name', 'rank']))
"""
╭╴p__Huberarchaeota,phylum
╭╴d__Archaea,superkingdom╶┤
╴root,no rank╶┤ ╰╴p__Thermoproteota,phylum╶╌╴c__Korarchaeia,class╶╌╴o__Korarchaeales,order╶╌╴f__Korarchaeaceae,family
╰╴d__Bacteria,superkingdom╶╌╴p__Firmicutes_B,phylum╶╌╴c__Peptococcia,class╶╌╴o__Peptococcales,order
"""
Automatic tree annotation using NCBI/GTDB taxonomy
---------------------------------------------
NCBI/GTDB taxonomy annotation consists of adding additional information to any internal a leaf node
in a give user tree. Only an property containing the taxid associated to each node
is required for the nodes in the query tree. The annotation process will add the
following features to the nodes:
- sci_name
- taxid
- named_lineage
- lineage
- rank
Note that, for internal nodes, taxid can be automatically inferred based on their sibling
nodes. The easiest way to annotate a tree is to use a PhyloTree instance where the species
name attribute is transparently used as the taxid attribute. Note that
the :PhyloNode:`annotate_ncbi_taxa`: or :PhyloNode:`annotate_gtdb_taxa`: function will also return the used name, lineage and
rank translators.
Remember that species names in PhyloTree instances are automatically extracted from leaf names. The parsing method can be easily adapted to any formatting:
NCBI taxonomy
Example::
from ete4 import PhyloTree
# load the whole leaf name as species taxid
tree = PhyloTree('((9606, 9598), 10090);', sp_naming_function=lambda name: name)
tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()
# split names by '|' and return the first part as the species taxid
tree = PhyloTree('((9606|protA, 9598|protA), 10090|protB);', sp_naming_function=lambda name: name.split('|')[0])
tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()
print(tree.to_str(props=["name", "sci_name", "taxid"]))
"""
╭╴9606|protA,Homo sapiens,9606
╭╴(empty),Homininae,207598╶┤
╴(empty),Euarchontoglires,314146╶┤ ╰╴9598|protA,Pan troglodytes,9598
╰╴10090|protB,Mus musculus,10090
"""
GTDB taxonomy
Example::
from ete4 import PhyloTree
# load the whole leaf name as species taxid
newick = '((p__Huberarchaeota,f__Korarchaeaceae)d__Archaea,o__Peptococcales);'
tree= PhyloTree(newick)
tax2name, tax2track, tax2rank = gtdb.annotate_tree(tree, taxid_attr="name")
print(tree.to_str(props=['sci_name', 'rank']))
"""
╭╴p__Huberarchaeota,phylum
╭╴d__Archaea,superkingdom╶┤
╴root,no rank╶┤ ╰╴f__Korarchaeaceae,family
╰╴o__Peptococcales,order
"""

0 comments on commit 415a1cc

Please sign in to comment.