Name Alignment

datasets

url	enabled	type
names.csv	true	text/csv
https://docs.google.com/spreadsheets/u/0/d/1d-4X2xFdf-PkhXRsRu63Wx00kJiZsrNyCM9QqG2rvPA/export?format=tsv	false	text/tab-separated-values
https://example.org/data.tsv	false	text/tab-separated-values
https://serv.biokic.asu.edu/ecdysis/content/dwca/UCSB-IZC_DwC-A.zip	false	application/dwca
https://scan-bugs.org:443/portal/webservices/dwc/rss.xml	false	application/rss+xml

id	enabled	name	type
mdd	false	Mammal Diversity Database	application/nomer

taxonomies

id	enabled	name
itis	true	Integrated Taxonomic Information System
ncbi	true	NCBI Taxonomy
discoverlife	true	Discover Life Taxonomy
batnames	false	Bat Names
col	false	Catalogue of Life
gbif	false	GBIF Backbone Taxonomy
globi	false	GloBI Taxon Graph
indexfungorum	false	Index Fungorum
mdd	false	Mammal Diversity Database
ott	false	Open Tree of Life Taxonomy
pbdb	false	Paleobiology Database
plazi	false	Plazi Treatments
tpt	false	Terrestrial Parasite Tracker Taxonomies
wfo	false	World of Flora Online

Name Alignment

To find your automatically created name alignment report, click on "Name Alignment by Nomer" above, click on a workflow run, and download the alignment-report artifact.

💡 Note that only logged-in GitHub users with access can download the alignment report generated by GitHub Actions.

Background

Aligning taxonomic names is a common task in biodiversity informatics.

This template repository offers an automated method to align scientific names in CSV/TSV files and Darwin Core archives (DwC-A) with common taxonomic name lists such as Catalogue of Life, NCBI Taxonomy, Integrated Taxonomic Information System (ITIS), and GBIF Backbone Taxonomy.

Getting Started

Follow the steps below, or visit the Big Bee pages with materials and recordings of the Name Alignment Workshop held on 18 January 2023 to learn more.

1. Create your own repository using this repository as a template.
2. Edit the README.md and add the URLs / filenames of the resources you'd like to review. Note that only the following types are supported at the time of writing (June 2022): text/csv, text/tab-separated-values, application/dwca, and application/rss+xml. Also, delete any taxonomy entries that you are not interested in: the fewer taxonomies to align with, the faster the review.
3. Edit the taxonomies list in the README.md front-matter to select those you are interested in working with. Many are configured by default, and you can customize the list to make the configuration work best for your names.
4. For now, only names in the column "scientificName" (tsv/csv) or "http://rs.tdwg.org/dwc/terms/scientificName" (DwC-A) will be aligned.
5. Commit the changes to GitHub.
6. Inspect the results of the name alignment in "GitHub Actions" (e.g., sample results).
7. Download the name alignment report from the "artifacts" section.
8. To re-create/re-run the results, change your name list in GitHub or select "re-run jobs" in GitHub Actions.
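Because only the "scientificName" column is aligned, it can help to check a csv/tsv file before committing it. The following is a minimal sketch, not part of this repository's workflow: the function name has_scientific_name_column is hypothetical, and it assumes a simple header line without embedded delimiters.

```shell
# Hypothetical helper: succeeds if the header (first line) of a delimited
# file has a column named exactly "scientificName".
# Assumption: the header contains no embedded delimiters or newlines.
has_scientific_name_column() {
  file="$1"
  delimiter="${2:-,}"   # "," for csv; pass a literal tab character for tsv
  # Take the header, drop carriage returns and double quotes, put each
  # column name on its own line, then look for an exact match.
  head -n 1 "$file" | tr -d '\r"' | tr "$delimiter" '\n' | grep -qx 'scientificName'
}

# Example use:
#   has_scientific_name_column names.csv && echo "names.csv can be aligned"
#   has_scientific_name_column data.tsv "$(printf '\t')"
```

The check only looks at the header; it does not validate the names themselves.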

Origin

This repository was conceived on 2022-03-08 during the Alien CSI Hack-a-thon in Romania by Christina, Quentin, Jorrit, Jasmijn, .... For more information see https://github.com/alien-csi/alien-csi-hackathon .

Contributors

name	affiliation	orcid
Jorrit Poelen	GloBI; Ronin Institute	https://orcid.org/0000-0003-3138-4118
your name	your affiliation	your orcid

Feedback / issues

This repository uses scripts in https://github.com/globalbioticinteractions/globinizer. These scripts use command-line tools such as GloBI's nomer, cut, and sed.

Misc Notes

Install nomer (Java 8 / Java 11):

https://github.com/globalbioticinteractions/nomer

See also related tools, e.g., Carl Boettiger's taxadb R package.

Add a tab in front of each name, leaving the id column empty, to prepare the list for nomer:

cat foodorganisms.txt | sed 's/^/\t/g' > foodorganisms.tsv

Nomer expects the format to be:

[id][tab][name]

e.g., id\tname, as in NCBI:9606\tHomo sapiens (where \t stands for a tab character)
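The tab-prefixing step and the expected two-column layout can be sketched with awk, which interprets "\t" consistently, whereas "\t" in a sed replacement may not be understood by every sed implementation. The function names below are hypothetical and not part of nomer or globinizer.

```shell
# Hypothetical helper: turn a plain list of names (one per line) into the
# [id][tab][name] layout, leaving the id column empty.
# Blank lines are skipped and trailing carriage returns are removed.
names_to_nomer_input() {
  awk 'NF > 0 { sub(/\r$/, ""); printf "\t%s\n", $0 }' "$@"
}

# Hypothetical check: succeeds only if every line has exactly two
# tab-separated columns (the id column may be empty).
is_nomer_input() {
  awk -F '\t' 'NF != 2 { bad = 1 } END { exit bad }' "$@"
}

# Example use:
#   names_to_nomer_input foodorganisms.txt > foodorganisms.tsv
#   is_nomer_input foodorganisms.tsv && echo "two columns on every line"
```

Both functions read standard input when no file is given, so they can be chained in a pipeline.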

Append the ITIS taxonomic interpretation to each name and redirect the output to a file 'name-itis.tsv':

cat foodorganisms.tsv | nomer append itis > name-itis.tsv

Open name-itis.tsv in LibreOffice Calc.

Repeat with 'gbif' instead of 'itis'
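Repeating the append step for several taxonomies can be wrapped in a small loop. This is only a sketch: the function name align_with_each and the name-<taxonomy>.tsv output naming are assumptions that extend the 'name-itis.tsv' example above, and it relies solely on the `nomer append <taxonomy>` form shown in these notes.

```shell
# Hypothetical helper: run "nomer append" once per taxonomy.
# Usage: align_with_each INPUT.tsv itis gbif ...
# Writes name-<taxonomy>.tsv into the current directory for each taxonomy
# and stops at the first failure.
align_with_each() {
  input="$1"
  shift
  for taxonomy in "$@"; do
    nomer append "$taxonomy" < "$input" > "name-$taxonomy.tsv" || return 1
  done
}

# Example use (requires nomer to be installed):
#   align_with_each foodorganisms.tsv itis gbif
```

Keeping one output file per taxonomy makes it easy to compare the interpretations side by side in a spreadsheet.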

Provenance of DwC-A Names

The context of each name extracted from a DwC-A is captured in an unusual-looking identifier:

line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10

extracted from a generated names-aligned.tsv:

$ cat names-aligned.tsv | grep hash | grep occurrence | head -n1
line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10	Lasioglossum	SAME_AS	line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10	Lasioglossum								HAS_ACCEPTED_NAME	COL:5B4P	Lasioglossum	genus		Biota | Animalia | Arthropoda | Insecta | Hymenoptera | Apoidea | Halictidae | Halictinae | Halictini | Lasioglossum	COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HYM | COL:625GP | COL:625H4 | COL:JMV | COL:KV7 | COL:5B4P	unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus	https://www.catalogueoflife.org/data/taxon/5B4P

This text identifies the row from which the name was extracted: in this case, line 10 of the file occurrences.csv contained in the zip file with content id hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb . If you retain the tracked dataset (in this case, the UC Santa Barbara Invertebrate Zoology Collection, accessed on 2022-06-30) provided in the data/ folder of the name alignment archive, you can use Preston to dig up the original record using:

$ preston cat 'line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10' 
881449,UCSB,IZC,,b03a3f0c-bfa5-4e02-b5d3-56ff38626302,PreservedSpecimen,a8a4f8b1-38f1-4e10-9b75-b2e86ac196fc,UCSB-IZC00038312,,Animalia|Arthropoda|Hexapoda|Insecta|Pterygota|Neoptera|Hymenoptera|Apocrita|Aculeata|Apoidea|Halictidae|Halictinae|Halictini,Animalia,Arthropoda,Insecta,Hymenoptera,Halictidae,Lasioglossum,186125,"Curtis, 1833",Lasioglossum,,,,,Genus,"EEMB/ENV S 96",24-May-2022,,,,,,"Sophie Cameron",,2022-04-26,2022,4,26,116,,,,"Newly restored salt marsh",PAN2,,,,,,,"on flower of Eschscholzia californica",,,Adult,Female,1,Pinned,"United States",California,"Santa Barbara",,"University of California Santa Barbara North Campus Open Space",,34.42174,-119.87186,WGS84,10,,,,GPS,,,,,,,,,,,,"2022-05-31 10:52:55",http://creativecommons.org/publicdomain/zero/1.0/,"The Regents of the University of California",https://www.ccber.ucsb.edu/collections/databases-searching-specimen-data-and-images,urn:uuid:a8a4f8b1-38f1-4e10-9b75-b2e86ac196fc,https://serv.biokic.asu.edu/ecdysis/collections/individual/index.php?occid=881449

which links to a preserved specimen with occurrenceId b03a3f0c-bfa5-4e02-b5d3-56ff38626302 and landing page at https://serv.biokic.asu.edu/ecdysis/collections/individual/index.php?occid=881449 . Also see screenshot made on 2022-06-30.

With this context, you can trace the origin and context of the name in great detail. This detail can be used to troubleshoot bugs in the name alignment process, or provide granular feedback to those that maintain the dataset or taxonomy.
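To work with many such identifiers at once, the three parts described above (content id, file inside the zip, line number) can be split out with plain shell parameter expansion. This is a sketch under assumptions: the function name parse_line_locator is hypothetical, and the pattern is inferred only from the single example shown in this section, not from Preston's documentation.

```shell
# Hypothetical helper: split "line:zip:<content id>!/<file>!/L<number>"
# into three tab-separated fields: content id, file, line number.
# Returns non-zero when the argument does not have that shape.
parse_line_locator() {
  locator="$1"
  case "$locator" in
    line:zip:*'!/'*'!/L'*) ;;
    *) return 1 ;;
  esac
  rest="${locator#line:zip:}"     # drop the "line:zip:" prefix
  content_id="${rest%%!/*}"       # everything before the first "!/"
  rest="${rest#*!/}"              # drop the content id and the first "!/"
  file="${rest%!/L*}"             # path of the file inside the zip
  line_number="${rest##*!/L}"     # text after the final "!/L"
  printf '%s\t%s\t%s\n' "$content_id" "$file" "$line_number"
}

# Example use: list the provenance parts of every aligned name
#   cut -f1 names-aligned.tsv | while IFS= read -r id; do
#     parse_line_locator "$id"
#   done
```

The content id can then be used to find the matching archive, and the file and line number to locate the record within it.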
