Merge pull request #722 from dengzq1234/ete4_tutorial_taxa

Ete4 tutorial taxa
etetoolkit · Nov 2, 2023 · 415a1cc · 415a1cc
2 parents 1582ea2 + b8d315a
commit 415a1cc
Showing 1 changed file with 398 additions and 0 deletions.
diff --git a/doc/tutorial/tutorial_taxonomy.rst b/doc/tutorial/tutorial_taxonomy.rst
@@ -0,0 +1,398 @@
+.. currentmodule:: ete4
+
+Connecting with Taxonomy Databases
+==================
+
+.. contents::
+
+Overview
+--------
+ETE4 contains *ncbi_taxonomy* and *gtdb_taxonomy* modules which provide 
+utilities to efficiently query a local copy of the NCBI or GTDB taxonomy 
+databases. The class ``NCBITaxa`` and ``GTDBTaxa`` offer methods to convert 
+from taxid to names (and vice versa), to fetch pruned topologies connecting 
+a given set of species, or to download rank, names and lineage track information.
+
+It is also fully integrated with PhyloTree instances through the 
+``PhyloNode.annotate_ncbi_taxa()`` and ``PhyloNode.annotate_gtdb_taxa()``method.
+
+Differences between NCBI and GTDB taxonomies in ETE4
+----------------------------------------------------
+
+The NCBI taxonomy database is a comprehensive resource for organism names and 
+classifications.It is updated daily and offers multiple access points including a web 
+portal, an FTP server. The database releases its data in a package called "taxdump.tar.gz" which 
+contains several .dmp files. 
+
+Taxon in NCBI taxonomyis usually a numeric identifier, commonly representing 
+taxa ("TaxID"), but it can also signify other entities like genetic codes or citations, such as 
+9606 represents Homo Sapiens. 
+
+On the other hand, GTDB taxonomy is distributed as simple text files, uses a genome-based 
+approach for classification, and the identifiers are usually specific to genomes rather 
+than taxa. 
+
+Since ETE Toolkit version 3, ete3 parses taxdump file to local sqlite database to fullfill the 
+methods in ncbi_taxonomy module. We applied the same strategy to GTDBTaxa. While the original GTDB 
+taxonomy data differs from NCBI taxonomy files, a conversion step is essential for integration. 
+
+To integrate GTDB into the ETE Toolkit v4, a conversion process was necessary. A third-party script
+(https://github.com/nick-youngblut/gtdb_to_taxdump) was employed to convert the GTDB taxonomy to the 
+NCBI-like taxdump format. We already preprared GTDB taxonomy dump file from different releases version 
+and store in https://github.com/etetoolkit/ete-data/tree/main/gtdb_taxonomy.
+
+Setting up local copies of the NCBI and GTDB taxonomy databases
+-------------------------------------------------------------
+The first time you attempt to use NCBITaxa or GTDBTaxa, ETE will detect that your local 
+database is empty and it will attempt to download the latest taxonomy database(NCBI ~600MB;GTDB ~72MB) and will 
+store a parsed version of it in your home directory: ~/.local/share/ete/. 
+All future imports of NCBITaxa or GTDBTaxa will detect the local database and will 
+skip this step.
+
+Example::
+  # Load NCBI module
+  from ete4 import NCBITaxa
+  ncbi = NCBITaxa()
+  ncbi.update_taxonomy_database()
+  
+  # Load GTDB module
+  from ete4 import GTDBTaxa
+  gtdb = GTDBTaxa()
+  gtdb.update_taxonomy_database()
+
+  # Load GTDB module with specific release version
+  from ete4 import GTDBTaxa
+  gtdb = GTDBTaxa()
+
+  # latest release updated in https://github.com/dengzq1234/ete-data/tree/main/gtdb_taxonomy
+  gtdb.update_taxonomy_database()
+  # or 
+  gtdb.update_taxonomy_database("gtdbdump.tar.gz") 
+
+  # update with custom release 202
+  gtdb.update_taxonomy_database('gtdb202dump.tar.gz')
+
+Getting taxid information
+-------------------------
+
+NCBI taxonomy
+~~~~~~~~~~~~~
+you can fetch species names, ranks and linage track information for your taxids using the following 
+methods:
+
+- NCBITaxa.get_rank()
+- NCBITaxa.get_lineage()
+- NCBITaxa.get_taxid_translator()
+- NCBITaxa.get_name_translator()
+- NCBITaxa.translate_to_names()
+The so called get-translator-functions will return a dictionary converting between taxids and species names. 
+Either species or linage names/taxids are accepted as input.
+
+Example::
+    from ete4 import NCBITaxa
+    ncbi = NCBITaxa()
+    taxid2name = ncbi.get_taxid_translator([9606, 9443])
+    print(taxid2name)
+    # {9443: 'Primates', 9606: 'Homo sapiens'}
+
+    name2taxid = ncbi.get_name_translator(['Homo sapiens', 'primates'])
+    print(name2taxid)
+    # {'Homo sapiens': [9606], 'primates': [9443]}
+
+    # when the same name points to several taxa, all taxids are returned
+    name2taxid = ncbi.get_name_translator(['Bacteria'])
+    print(name2taxid)
+    # {'Bacteria': [2, 629395]}
+
+Other functions allow to extract further information using taxid numbers as a query.
+
+Example::
+    from ete4 import NCBITaxa
+    ncbi = NCBITaxa()
+
+    print(ncbi.get_rank([9606, 9443]))
+    # {9443: u'order', 9606: u'species'}
+
+    print(ncbi.get_lineage(9606))
+    # [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
+    # 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
+    # 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
+    # 9606]
+
+Combine combine all at once:
+
+Example::
+    from ete4 import NCBITaxa
+    ncbi = NCBITaxa()
+
+    lineage = ncbi.get_lineage(9606)
+    print(lineage)
+
+    # [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
+    # 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
+    # 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
+    # 9606]
+
+    names = ncbi.get_taxid_translator(lineage)
+    print([names[taxid] for taxid in lineage])
+
+    # [u'root', u'cellular organisms', u'Eukaryota', u'Opisthokonta', u'Metazoa',
+    # u'Eumetazoa', u'Bilateria', u'Deuterostomia', u'Chordata', u'Craniata',
+    # u'Vertebrata', u'Gnathostomata', u'Teleostomi', u'Euteleostomi',
+    # u'Sarcopterygii', u'Dipnotetrapodomorpha', u'Tetrapoda', u'Amniota',
+    # u'Mammalia', u'Theria', u'Eutheria', u'Boreoeutheria', u'Euarchontoglires',
+    # u'Primates', u'Haplorrhini', u'Simiiformes', u'Catarrhini', u'Hominoidea',
+    # u'Hominidae', u'Homininae', u'Homo', u'Homo sapiens']
+
+
+GTDB taxonomy
+~~~~~~~~~~~~~
+In the NCBI taxonomy database, each species is assigned a unique numeric taxid. 
+For example, the taxid 9606 refers to Homo sapiens. These taxids serve as 
+essential keys for tracking lineages within the database.
+
+However, the GTDB database doesn't originally offer numeric taxids like NCBI 
+does. In the GTDBTaxa module, we've introduced taxids for each species to 
+facilitate lineage tracking. These taxids, while not officially recognized in 
+the GTDB database, serve as convenient keys. They help in connecting the lineage 
+and taxonomic ranks within the local database, making it easier for users to 
+fetch and relate taxonomic information.
+
+Like NCBITaxa, GTDBTaxa contains similar methods:
+
+- GTDBTaxa.get_rank()
+- GTDBTaxa.get_lineage()
+- GTDBTaxa.get_taxid_translator()
+- GTDBTaxa.get_name_translator()
+- GTDBTaxa.translate_to_names()
+- GTDBTaxa.get_name_lineage()
+
+Getting descendant taxa
+-----------------------
+Given a taxid or a taxa name from an internal node in the NCBI/GTDB taxonomy tree, 
+their descendants can be retrieved as follows:
+
+NCBI taxonomy
+Example::
+    # example in NCBI taxonomy
+    from ete4 import NCBITaxa
+    ncbi = NCBITaxa()
+
+    descendants = ncbi.get_descendant_taxa('Homo')
+    print(ncbi.translate_to_names(descendants))
+
+    # [u'Homo heidelbergensis', u'Homo sapiens ssp. Denisova', 
+    # u'Homo sapiens neanderthalensis']
+
+    # you can easily ignore subspecies, so only taxa labeled as "species" will be reported:
+    descendants = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True)
+    print(ncbi.translate_to_names(descendants))
+
+    # [u'Homo sapiens', u'Homo heidelbergensis']
+
+    # or even returned as an annotated tree
+    tree = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True, return_tree=True)
+
+    print(tree.to_str(props=['sci_name','taxid']))
+
+    """
+               ╭╴environmental samples,2665952╶╌╴Homo sapiens environmental sample,2665953
+               │
+               ├╴Homo sapiens,9606
+    ╴Homo,9605╶┤
+               ├╴Homo heidelbergensis,1425170
+               │
+               ╰╴unclassified Homo,2813598╶╌╴Homo sp.,2813599
+    """
+    
+GTDB taxonomy
+Example::
+    from ete4 import GTDBTaxa
+    gtdb = GTDBTaxa()
+    descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae')
+    print(descendants)
+    # ['GB_GCA_003662765.1', 'GB_GCA_003662805.1', 'GB_GCA_003345555.1', 'GB_GCA_003345595.1', 'GB_GCA_001940705.1', 'GB_GCA_001563335.1', 'GB_GCA_011364905.1', 'GB_GCA_004376265.1', 'GB_GCA_002825515.1', 'GB_GCA_001563325.1', 'GB_GCA_011364985.1', 'GB_GCA_011365025.1', 'GB_GCA_004524565.1', 'GB_GCA_004524595.1', 'GB_GCA_002825465.1', 'GB_GCA_002825535.1', 'GB_GCA_003345545.1', 'GB_GCA_004524445.1', 'GB_GCA_013388835.1', 'GB_GCA_008080745.1', 'GB_GCA_004524435.1', 'GB_GCA_013138615.1']
+
+    #ignore subspecies, so only taxa labeled as "species" will be reported
+    descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae', collapse_subspecies=True)
+    
+    print(descendants)
+    
+    #['s__MP8T-1 sp002825535', 's__MP8T-1 sp003345545', 's__MP8T-1 sp002825465', 's__MP8T-1 sp004524565', 's__MP8T-1 sp004524595', 's__SMTZ1-83 sp011364985', 's__SMTZ1-83 sp011365025', 's__SMTZ1-83 sp001563325', 's__TEKIR-14 sp004524445', 's__SHMX01 sp008080745', 's__OWC5 sp003345595', 's__OWC5 sp003345555', 's__JACAEL01 sp013388835', 's__B65-G9 sp003662765', 's__SMTZ1-45 sp001563335', 's__SMTZ1-45 sp011364905', 's__SMTZ1-45 sp001940705', 's__SMTZ1-45 sp004376265', 's__SMTZ1-45 sp002825515', 's__WTCK01 sp013138615', 's__TEKIR-12S sp004524435']
+
+    
+    #returned as an annotated tree
+    descendants = gtdb.get_descendant_taxa('f__Thorarchaeaceae', collapse_subspecies=True, return_tree=True)
+    print(descendants.to_str(props=['sci_name','rank']))
+    """
+                                                ╭╴s__MP8T-1 sp002825535,species
+                                                │
+                                                ├╴s__MP8T-1 sp003345545,species
+                                                │
+                            ╭╴g__MP8T-1,genus╶┼╴s__MP8T-1 sp002825465,species
+                            │                 │
+                            │                 ├╴s__MP8T-1 sp004524565,species
+                            │                 │
+                            │                 ╰╴s__MP8T-1 sp004524595,species
+                            │
+                            │                   ╭╴s__SMTZ1-83 sp011364985,species
+                            │                   │
+                            ├╴g__SMTZ1-83,genus╶┼╴s__SMTZ1-83 sp011365025,species
+                            │                   │
+                            │                   ╰╴s__SMTZ1-83 sp001563325,species
+                            │
+                            ├╴g__TEKIR-14,genus╶╌╴s__TEKIR-14 sp004524445,species
+                            │
+                            ├╴g__SHMX01,genus╶╌╴s__SHMX01 sp008080745,species
+                            │
+                            │               ╭╴s__OWC5 sp003345595,species
+                            ├╴g__OWC5,genus╶┤
+    ╴f__Thorarchaeaceae,family╶┤               ╰╴s__OWC5 sp003345555,species
+                            │
+                            ├╴g__JACAEL01,genus╶╌╴s__JACAEL01 sp013388835,species
+                            │
+                            ├╴g__B65-G9,genus╶╌╴s__B65-G9 sp003662765,species
+                            │
+                            │                   ╭╴s__SMTZ1-45 sp001563335,species
+                            │                   │
+                            │                   ├╴s__SMTZ1-45 sp011364905,species
+                            │                   │
+                            ├╴g__SMTZ1-45,genus╶┼╴s__SMTZ1-45 sp001940705,species
+                            │                   │
+                            │                   ├╴s__SMTZ1-45 sp004376265,species
+                            │                   │
+                            │                   ╰╴s__SMTZ1-45 sp002825515,species
+                            │
+                            ├╴g__WTCK01,genus╶╌╴s__WTCK01 sp013138615,species
+                            │
+                            ╰╴g__TEKIR-12S,genus╶╌╴s__TEKIR-12S sp004524435,species
+    """
+
+Getting species tree topology
+----------------------------------
+Getting the taxonomy tree for a given set of species is one of the most useful ways 
+to get all information at once. The method NCBITaxa.get_topology() or GTDBTaxa.get_topology() allows to query your 
+local NCBI/GTDB database and extract the smallest tree that connects all your query taxids. 
+It returns a normal ETE tree in which all nodes, internal or leaves, are annotated for 
+lineage, scientific names, ranks, and so on.
+
+NCBI taxonomy
+Example::
+    from ete4 import NCBITaxa
+    ncbi = NCBITaxa()
+
+    tree = ncbi.get_topology([9606, 9598, 10090, 7707, 8782])
+
+    print(tree.to_str(props=["sci_name", "rank"]))
+    """
+                        ╭╴Dendrochirotida,order
+                        │
+                        │                                                                   ╭╴Homo sapiens,species
+    ╴Deuterostomia,clade╶┤                                             ╭╴Homininae,subfamily╶┤
+                        │               ╭╴Euarchontoglires,superorder╶┤                     ╰╴Pan troglodytes,species
+                        │               │                             │
+                        ╰╴Amniota,clade╶┤                             ╰╴Mus musculus,species
+                                        │
+                                        ╰╴Aves,class
+    """
+
+    # all intermediate nodes connecting the species can also be kept in the tree
+    tree = ncbi.get_topology([2, 33208], intermediate_nodes=True)
+    print(tree.to_str(props=["sci_name"]))
+    """
+                        ╭╴Eukaryota╶╌╴Opisthokonta╶╌╴Metazoa
+    ╴cellular organisms╶┤
+                        ╰╴Bacteria
+    """
+
+GTDB taxonomy
+Example::
+    from ete4 import GTDBTaxa
+    gtdb = GTDBTaxa()
+
+    tree = gtdb.get_topology(["p__Huberarchaeota", "o__Peptococcales", "f__Korarchaeaceae"])
+    print(tree.to_str(props=['sci_name', 'rank']))
+
+    """
+                                            ╭╴p__Huberarchaeota,phylum
+                ╭╴d__Archaea,superkingdom╶┤
+    ╴root,no rank╶┤                         ╰╴f__Korarchaeaceae,family
+                │
+                ╰╴o__Peptococcales,order
+
+    """
+
+    # all intermediate nodes connecting the species can also be kept in the tree
+    tree = gtdb.get_topology(["p__Huberarchaeota", "o__Peptococcales", "f__Korarchaeaceae"], intermediate_nodes=True, collapse_subspecies=True, annotate=True)
+    print(tree.to_str(props=['sci_name', 'rank']))
+    """
+                                            ╭╴p__Huberarchaeota,phylum
+                ╭╴d__Archaea,superkingdom╶┤
+    ╴root,no rank╶┤                         ╰╴p__Thermoproteota,phylum╶╌╴c__Korarchaeia,class╶╌╴o__Korarchaeales,order╶╌╴f__Korarchaeaceae,family
+                │
+                ╰╴d__Bacteria,superkingdom╶╌╴p__Firmicutes_B,phylum╶╌╴c__Peptococcia,class╶╌╴o__Peptococcales,order
+    """
+
+Automatic tree annotation using NCBI/GTDB taxonomy
+---------------------------------------------
+NCBI/GTDB taxonomy annotation consists of adding additional information to any internal a leaf node 
+in a give user tree. Only an property containing the taxid associated to each node 
+is required for the nodes in the query tree. The annotation process will add the 
+following features to the nodes:
+
+- sci_name
+- taxid
+- named_lineage
+- lineage
+- rank
+
+Note that, for internal nodes, taxid can be automatically inferred based on their sibling 
+nodes. The easiest way to annotate a tree is to use a PhyloTree instance where the species 
+name attribute is transparently used as the taxid attribute. Note that 
+the :PhyloNode:`annotate_ncbi_taxa`: or :PhyloNode:`annotate_gtdb_taxa`: function will also return the used name, lineage and 
+rank translators.
+
+Remember that species names in PhyloTree instances are automatically extracted from leaf names. The parsing method can be easily adapted to any formatting:
+
+NCBI taxonomy
+Example::
+    from ete4 import PhyloTree
+
+    # load the whole leaf name as species taxid
+    tree = PhyloTree('((9606, 9598), 10090);', sp_naming_function=lambda name: name)
+    tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()
+
+
+    # split names by '|' and return the first part as the species taxid
+    tree = PhyloTree('((9606|protA, 9598|protA), 10090|protB);', sp_naming_function=lambda name: name.split('|')[0])
+    tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()
+
+    print(tree.to_str(props=["name", "sci_name", "taxid"]))
+
+    """
+                                                                ╭╴9606|protA,Homo sapiens,9606
+                                     ╭╴(empty),Homininae,207598╶┤
+    ╴(empty),Euarchontoglires,314146╶┤                          ╰╴9598|protA,Pan troglodytes,9598
+                                     │
+                                     ╰╴10090|protB,Mus musculus,10090
+    """
+
+GTDB taxonomy
+Example::
+    from ete4 import PhyloTree
+
+    # load the whole leaf name as species taxid
+    newick = '((p__Huberarchaeota,f__Korarchaeaceae)d__Archaea,o__Peptococcales);'
+
+    tree= PhyloTree(newick)
+    tax2name, tax2track, tax2rank = gtdb.annotate_tree(tree, taxid_attr="name")
+
+    print(tree.to_str(props=['sci_name', 'rank']))
+
+    """
+                                            ╭╴p__Huberarchaeota,phylum
+                  ╭╴d__Archaea,superkingdom╶┤
+    ╴root,no rank╶┤                         ╰╴f__Korarchaeaceae,family
+                  │
+                  ╰╴o__Peptococcales,order
+    """