Define rules for scientific name identity #35

mdoering · 2017-10-23T09:53:25Z

Define clear rules for what exactly makes a name the same name. A scientific name in the sense of the Clearinghouse of CoL+ has an identity and a unique, stable identifier. If possible, these identifiers should reuse IDs issued by participating nomenclators such as IPNI.

A name includes its authorship. Two homonyms with different authors therefore represent two different name entities.

The same name can usually be represented by many different strings which we refer to as lexical variations. For each name a standard representation, the canonical form, exists. Lexical variations exist for various reasons. Author spelling, transliterations, epithet gender, additional infrageneric or infraspecific indications or cited species authors in infraspecific names are common reasons. Listed here are 7 distinct names with some of their string representations:

 1. Aus bus Linnaeus 1758
    - Aus bus Linn. 1758
    - Aus bus Linn 1758
    - Aus bus L.
    - Aus ba Linn 1758.
    - Aus (Hus) bus L.

 2. Xus bus (Linn, 1758)
    - Xus bus (Linn) Smith

 3. Xus cus Smith, 1850
    - Xus cus Sm.

 4. Xus cus Jones 1900

 5. Xus bus cus Smith 1850
    - Xus bus subsp. cus Smith 1850

 6. Xus dus Pyle 2000

 7. Foo bar var. lion Smith 1850
    - Foo bar L. var. lion Smith
    - Foo bar subsp. dar var. lion Smith 1850
    - Foo bar Lin. subsp. dar Mill. var. lion Smith 1850

New names (sp./gen. nov.), new recombinations of the same epithet (comb. nov.), a name at a new rank (stat. nov.) or replacement names (nom. nov.) are all treated as distinct names.

Open questions to be addressed:

How to treat various spelling variations. Should (some) misspellings, different transliterations, ligatures, umlauts or a wrong gender ending be considered a different name or just a lexical variant?
Is the (intended) publication of a name a requirement?
What about chresonyms?
Is an ambiregnal name published both under the botanical and zoological code a single or two names?

We need to capture examples of the various cases.


deepreef · 2017-10-23T23:31:06Z

The word "name" by itself usually causes more confusion than clarity in these kinds of conversations. Every time we try to pin down a "clean" definition of "name", it usually isn't successful in the end. I don't think there will be much value in attempting to define a "name" in the COL+ context. Without using the word "name", I suggest there are two classes of things we need to reconcile:

Class 1: Literal Text String. These are literal strings of text (UTF-8) characters that are purported to represent the scientific label for an organism. These are what GNI indexes. Other than eliminating redundant whitespace, etc., these are the actual character strings associated to biodiversity data that we wish to use as taxonomic handles.

Class 2: Nomenclatural Objects: These are conceptual nomenclatural entities that have key properties associated with them (e.g., authorship, year, corrected spelling, combination, rank, etc.), and represent the core "things" that we hope to assign persistent reusable identifiers to, and use to cross-link biodiversity data to each other through nomenclature and taxonomy.

For example, your item #1 in your list is a single instance of a Class-2 object, which has been represented by at least six different Class-1 literal character strings. The first string you include ("Aus bus Linnaeus 1758") is not "the" name -- it is just one of potentially many text-string variants used to represent the abstract "name".

As to your questions:

The uniqueness of a "name" should be represented as the unique set of Protonyms (with all the associated properties implied) used to construct the name. For single-part names, only one Protonym is needed. For species and subgenera, a set of two protonyms represents the "name". For species within subgenera, and infraspecific names without subgenera, the set includes three protonyms. Infraspecific names within subgenera are represented by a set of four Protonyms.... and so on. Any/all spelling variants should all link to the same "name" as defined by the Protonym set. So none of the variants you list should be thought of as distinct "names", rather, they should all be treated as lexical variants of the same conceptual (Class 2) name-object, defined by the set of Protonyms. Coming back to your first example, there are actually two different such objects represented -- one defined by two Protonyms (the first five text-strings listed), and one defined by three Protonyms (the last listed text string).
I'm not sure what you mean by "Is the (intended) publication of a name a requirement?" A requirement for what? To be regarded as a "Name"? Also, how are you defining "publication" in this context -- only "traditional" publications, or does a web page count? Also, what about the publication is "required" -- full literature citation metadata?
Chresonyms are a subset of TNUs, and should be treated as such. They should not be treated as "names".
There is no easy answer to the ambiregnal question. My feeling is that there should be ONE Protonym, which is indicated as being compliant with more than one Code. However, this probably requires more discussion.

If you're asking what "thing" a CoL+ identifier should be attached to -- the answer is easy. It should be attached to a specific TNU instance (i.e., not an abstract "name", but an explicit, specific usage instance of a name, which includes the set of protonyms and all the other relevant properties of the usage instance). COL is about taxon concept circumscriptions, and this will always require a more explicit anchor point than a "name", no matter how we define a "name".

mdoering · 2017-10-24T08:41:35Z

Thanks Rich. I believe your class 1 literals do not deserve an identifier, as the string already identifies itself. It is the nomenclatural entity that we talk about here. And exactly because a "name" per se is understood in various ways, I want a clear definition of what it is we identify by a scientificNameID.

I think it is mostly terminology that makes our views (GNUB / CoL+) appear different. It is the set of protonyms that I would like to identify just as you say, but not via the set of protonym ids but with an opaque identifier on its own. I am not convinced we want anything more than a set of 3 protonyms though. In the case of a species name string that includes a subgenus or a variety with a subspecies given: Do we care about the inherent taxonomic classification? I think we should not and regard these as the same name.

As for ambiregnal my hope and gut feeling is just like yours, but needs to be tested.

I also agree that we should treat chresonyms as taxon concepts. It will just be hard to detect them without a sec/sensu/auct indication. So they will likely slip into the names dataset before we know it, and we then have to deal with them. As they will probably have (stable) ids by then, the practical solution would be to keep them as names and mark them as chresonyms. Not nice, but likely to work...

deepreef · 2017-10-24T17:01:08Z

Thanks, Markus. Yes, for the Class1 Namestrings, the UTF8 string itself is, by definition, unique. Dima creates hashed UUID values from the strings themselves, and these can be useful for a variety of reasons as surrogate identifiers for the strings themselves. However, these are not "assigned" UUIDs, they are derived UUIDs, and as such can be thought of as simply a different rendering of the string itself.
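The idea of a derived (rather than assigned) identifier can be illustrated with a minimal Python sketch. It assumes a name-based UUID (version 5) computed over the whitespace-normalized string; the namespace value and the function names are hypothetical and are not taken from GNI or any other existing service.

```python
import re
import uuid

# Hypothetical namespace for this sketch only; a real service would
# publish and freeze its own namespace value.
NAME_STRING_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def normalize_whitespace(name_string: str) -> str:
    """Collapse runs of whitespace and trim both ends of the string."""
    return re.sub(r"\s+", " ", name_string).strip()


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Derive a UUID from the string itself; nothing is assigned or stored."""
    return uuid.uuid5(NAME_STRING_NAMESPACE, normalize_whitespace(name_string))
```

Because the value is computed from the characters alone, anyone holding the same string and the same namespace arrives at the same UUID, which is why it can be read as another rendering of the string rather than as an independent identifier.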

I fully agree that what we need to uniquely identify for COL+ purposes (as well as every other taxonomic purpose in biodiversity informatics) is the Class-2 nomenclatural object/entity. In the same way that "name" can be defined many different ways, I think we need to avoid the text-string components of a "name" (e.g., Genus part, species part, infraspecific part, qualifiers, authorships, year, etc.) as part of the defining properties of the nomenclatural objects. That was my main point.

I certainly have no objection to issuing a single unique identifier to a "Protonym Set". The benefits in doing so are similar to the benefits that Dima has found for assigning hashed UUIDs for Class-1 text-string names. The object itself would be uniquely identified by the set of protonyms involved (in the same way that a Class-1 name string is uniquely identified by the set of UTF-8 characters it contains), but the surrogate identifier would be extremely convenient for data modelling and for performance purposes, etc.

I also agree that we can limit the set to three identifiers:

- One identifier for all names at the rank of genus and higher.
- Two identifiers (Genus + terminal epithet) for all subgenus names and all species names.
- Three identifiers for all infraspecific and infrasubspecific names (Genus + species + terminal epithet).

In other words, the only "middle" identifier is a species. Thus "Aus (Xus) bus" would be the same as "Aus bus" (both with the same two Protonyms for Aus and bus). All other properties would be derived either from the Protonyms themselves (authorship, literature citations, type specimens, objective synonymies, homonyms, etc.) or from TNUs anchored to the same Protonym sets (=Nomenclatural Object/Entity). TNU properties would capture full classifications, spellings of each component of the names, rank information, subjective synonymies, etc.
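The reduction to at most three protonyms can be sketched in Python as follows. This is only an illustration of the rule as described in this comment: the `ParsedName` structure, the `protonym_set` function, and the protonym identifiers used in the usage note are all hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    """Protonym identifiers for each part of a name (all IDs are made up)."""
    genus: str
    subgenus: Optional[str] = None
    species: Optional[str] = None
    # Infraspecific parts in the order given, e.g. (subspecies, variety).
    infraspecific: Tuple[str, ...] = ()


def protonym_set(name: ParsedName) -> Tuple[str, ...]:
    """Reduce a parsed name to an ordered set of one, two or three protonyms."""
    if name.infraspecific:
        if name.species is None:
            raise ValueError("an infraspecific name needs a species protonym")
        # Genus + species + terminal epithet; the subgenus and any
        # intermediate infraspecific ranks are dropped.
        return (name.genus, name.species, name.infraspecific[-1])
    if name.species is not None:
        # Species name: the subgenus, if present, is not part of the identity.
        return (name.genus, name.species)
    if name.subgenus is not None:
        # Subgenus name: genus + terminal epithet.
        return (name.genus, name.subgenus)
    # Genus or higher: a single protonym.
    return (name.genus,)
```

With made-up identifiers, `protonym_set(ParsedName("p:Aus", "p:Xus", "p:bus"))` and `protonym_set(ParsedName("p:Aus", species="p:bus"))` both give `("p:Aus", "p:bus")`, so "Aus (Xus) bus" and "Aus bus" resolve to the same nomenclatural object. Classification details such as the subgenus would then live on the TNU rather than in the identity.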

So, I think we're in agreement on this.

Agreed on need for testing Ambiregnals.

Yes, Chresonyms will slip into the system, so we'll need a mechanism for treating them differently from homonyms. I don't see that as being a problem. We just have to accept that identifiers will be assigned to three different classes of Nomenclatural objects: 1) Legitimate Nomenclatural objects/entities; 2) inadvertent duplicates for legitimate Nomenclatural objects/entities (e.g. chresonyms, when discovered as such); and 3) entries that are out of scope (e.g., derived from text strings that are not actually names of organisms). I have a very robust/elegant mechanism for dealing with these three classes of objects, which I can discuss in detail if you wish.

I think that the important thing is that we seem to agree that a "name" (Nomenclatural Object/entity) can be defined as a unique set of one, two or three protonyms (in sequence*), and should be represented by its own surrogate identifier (integer, or UUID hashed from the three Protonym identifiers?).

Now we just need to define "Protonym"... :-)

*The "in sequence" bit may be important for cases where "Aus bus cus" and "Aus cus bus" both exist separately (because someone made an error in determining which name is the species, and which is the subspecies). My gut feeling is that these should be regarded as separate Nomenclatural Objects/Entities, but they represent an edge case.
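One way to get a surrogate identifier that respects the sequence is to hash the protonym identifiers in order. The following Python sketch assumes string protonym IDs, a `|` separator that never occurs inside an ID, and a made-up namespace; none of these choices come from GNUB or CoL+.

```python
import uuid
from typing import Sequence

# Hypothetical namespace, used only for this illustration.
PROTONYM_SET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def protonym_set_uuid(protonym_ids: Sequence[str]) -> uuid.UUID:
    """Derive a surrogate UUID from one to three protonym IDs, in sequence."""
    if not 1 <= len(protonym_ids) <= 3:
        raise ValueError("expected one, two or three protonym identifiers")
    if any("|" in pid for pid in protonym_ids):
        raise ValueError("protonym identifiers must not contain '|'")
    # Joining in the given order keeps the sequence significant:
    # (Aus, bus, cus) and (Aus, cus, bus) produce different keys.
    key = "|".join(protonym_ids)
    return uuid.uuid5(PROTONYM_SET_NAMESPACE, key)
```

Under these assumptions "Aus bus cus" and "Aus cus bus" receive different identifiers, matching the suggestion that they be treated as separate objects. Sorting the IDs before joining would instead merge them, so the choice of ordered versus unordered hashing is where the edge case gets decided.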

proceps · 2017-10-24T18:12:04Z

@deepreef, this is definitely applicable to zoological names, but in botany Aus bus cus and Aus dus cus are both accepted subspecies names associated with two different species. The same will apply to all other ranks. You cannot just pick terminal epithets for botanical names.

mjy · 2017-10-24T19:31:20Z

For reference-

I've thrown up some conceptual documentation for TaxonWorks. It was heavily influenced by ideas from Rich, and I'm almost positive the approach can capture what Rich wants to express. A somewhat pertinent point is that we actually built the tool, tested the concept, etc., i.e. we've gone far beyond conceptual discussions.

Models used in TW (these are the actual tables/models/ontology classes), as currently implemented for hundreds of thousands of names in our testing framework: https://github.com/SpeciesFileGroup/taxonworks_doc/blob/master/concepts/TaxonWorksNomenclature.pdf (it's a vector PDF, download and zoom around)
At its core we use the controlled vocabularies in NOMEN. See a recent talk for an intro: https://github.com/SpeciesFileGroup/nomen/blob/master/docs/presentations/Ballroom_A_Tuesday_1445_Yoder_TDWG17.pptx
Corresponding in-line code documentation of TW models: http://rdoc.taxonworks.org/frames.html#!file.README.html (various levels of completion)
We're still working on the API documentation, but conceptually it's already outlined in the resources above.

Note that the system allows for the application of unique identifiers at many different levels. In addition to the maintenance ids that are used (e.g. auto incremented ids in a RDB) global identifiers can be tacked on to any instances of any of these data. All instances of data classes can also be cited (linked to a reference). Some of these citation instances correspond to Rich's concepts.

Amending to actually answer @mdoering questions:

How to treat various spelling variations. Should (some) misspellings, different transliterations, ligatures, umlauts or a wrong gender ending be considered a different name or just a lexical variant?

Frankly, this is up to the curator. Sometimes it matters, sometimes it doesn't. Give them the relationships/object properties to express why it matters to them and it becomes less of an issue.

Is the (intended) publication of a name a requirement?

I very much wanted this, however the rules of nomenclature must let you interpret any name/string, so again this is up to the curator. What we hope to capture is that some curator is interpreting some name according to some nomenclatural rules. A best practice for a GSD would be to "only include names used as if they were intended to be governed by a set of rules". Again, unenforceable, and nothing can be done about it.

What about chresonyms?

See Citation in TaxonWorks.

Is an ambiregnal name published both under the botanical and zoological code a single or two names?

Up to the curators, but if the curators are sane, 2. Thanks for "ambiregnal" btw!

mdoering assigned dremsen and mdoering Oct 23, 2017

mdoering mentioned this issue Sep 29, 2018

CoL name rendering rules #47

Closed

mdoering added the Taxonomy Group label Sep 29, 2018

mdoering mentioned this issue Jan 30, 2019

Name variants and the definition of tnu:TaxonomicName tdwg/tnc#30

Closed

mdoering mentioned this issue Aug 25, 2019

Include subgenus in name matching and DUPLICATE_NAME issue CatalogueOfLife/backend#451

Closed

mdoering mentioned this issue Jun 12, 2020

Should taxonomicName be represented as a Subclass of taxonomicNameUsage tdwg/tnc#57

Closed

mdoering mentioned this issue Aug 22, 2020

Define objective rules for taxon concept identity #6

Open

mdoering mentioned this issue Oct 16, 2020

What exactly does /name/matching do? #79

Closed

mdoering mentioned this issue Aug 14, 2023

are Catalogue of Life ids expected to be stable over time? #98

Closed

mdoering mentioned this issue Dec 1, 2023

25.000+ self-synonymizes taxa CatalogueOfLife/data#598

Open
