Define rules for scientific name identity #35

Open
mdoering opened this issue Oct 23, 2017 · 5 comments
@mdoering
Member

Define clear rules for what exactly makes a name the same name. A scientific name in the sense of the Clearinghouse of CoL+ has an identity and a unique, stable identifier. If possible, these identifiers should reuse IDs issued by the participating nomenclators such as IPNI.


A name includes its authorship. Two homonyms with different authors therefore represent two different name entities.

The same name can usually be represented by many different strings which we refer to as lexical variations. For each name a standard representation, the canonical form, exists. Lexical variations exist for various reasons. Author spelling, transliterations, epithet gender, additional infrageneric or infraspecific indications or cited species authors in infraspecific names are common reasons. Listed here are 7 distinct names with some of their string representations:

 1. Aus bus Linnaeus 1758
    - Aus bus Linn. 1758
    - Aus bus Linn 1758
    - Aus bus L.
    - Aus ba Linn 1758.
    - Aus (Hus) bus L.

 2. Xus bus (Linn, 1758)
    - Xus bus (Linn) Smith

 3. Xus cus Smith, 1850
    - Xus cus Sm.

 4. Xus cus Jones 1900

 5. Xus bus cus Smith 1850
    - Xus bus subsp. cus Smith 1850

 6. Xus dus Pyle 2000

 7. Foo bar var. lion Smith 1850
    - Foo bar L. var. lion Smith
    - Foo bar subsp. dar var. lion Smith 1850
    - Foo bar Lin. subsp. dar Mill. var. lion Smith 1850

New names (sp./gen. nov.), new recombinations of the same epithet (comb. nov.), a name at a new rank (stat. nov.) or replacement names (nom. nov.) are all treated as distinct names.
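
As a rough illustration of the idea above, here is a minimal sketch of name entities that each carry one canonical form plus any number of lexical variants. The `NameEntity` class, the `build_index` helper and the identifiers `n1`, `n3`, `n4` are hypothetical and only serve the example; nothing here is an agreed CoL+ model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NameEntity:
    name_id: str                 # hypothetical opaque, stable identifier
    canonical: str               # the standard representation of the name
    variants: List[str] = field(default_factory=list)  # other lexical variations


# A few of the names listed above (identifiers are made up for the example).
names = [
    NameEntity("n1", "Aus bus Linnaeus 1758",
               ["Aus bus Linn. 1758", "Aus bus Linn 1758", "Aus bus L.",
                "Aus ba Linn 1758.", "Aus (Hus) bus L."]),
    NameEntity("n3", "Xus cus Smith, 1850", ["Xus cus Sm."]),
    NameEntity("n4", "Xus cus Jones 1900"),
]


def build_index(entities: List[NameEntity]) -> Dict[str, str]:
    """Map every known string (canonical or variant) to its name identifier."""
    index: Dict[str, str] = {}
    for entity in entities:
        for string in [entity.canonical, *entity.variants]:
            index[string] = entity.name_id
    return index


index = build_index(names)
```

With this index, `Aus bus L.` and `Aus bus Linn. 1758` resolve to the same identifier, while the homonyms `Xus cus Sm.` and `Xus cus Jones 1900` resolve to different ones.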

Open questions to be addressed:

  • How to treat various spelling variations. Should (some) misspellings, different transliterations, ligatures, umlauts or a wrong gender ending be considered a different name or just a lexical variant?
  • Is the (intended) publication of a name a requirement?
  • What about chresonyms?
  • Is an ambiregnal name published both under the botanical and zoological code a single or two names?

We need to capture examples of the various cases.

@deepreef
Collaborator

The word "name" by itself usually causes more confusion than clarity in these kinds of conversations. Every time we try to pin down a "clean" definition of "name", it usually isn't successful in the end. I don't think there will be much value in attempting to define a "name" in the COL+ context. Without using the word "name", I suggest there are two classes of things we need to reconcile:

Class 1: Literal Text String. These are literal strings of text (UTF-8) characters that are purported to represent the scientific label for an organism. These are what GNI indexes. Other than eliminating redundant whitespace, etc., these are the actual character strings associated to biodiversity data that we wish to use as taxonomic handles.

Class 2: Nomenclatural Objects: These are conceptual nomenclatural entities that have key properties associated with them (e.g., authorship, year, corrected spelling, combination, rank, etc.), and represent the core "things" that we hope to assign persistent reusable identifiers to, and use to cross-link biodiversity data to each other through nomenclature and taxonomy.

For example, your item #1 in your list is a single instance of a Class-2 object, which has been represented by at least six different Class-1 literal character strings. The first string you include ("Aus bus Linnaeus 1758") is not "the" name -- it is just one of potentially many text-string variants used to represent the abstract "name".

As to your questions:

  1. The uniqueness of a "name" should be represented as the unique set of Protonyms (with all the associated properties implied) used to construct the name. For single-part names, only one Protonym is needed. For species and subgenera, a set of two protonyms represents the "name". For species within subgenera, and infraspecific names without subgenera, the set includes three protonyms. Infraspecific names within subgenera are represented by a set of four Protonyms.... and so on. Any/all spelling variants should all link to the same "name" as defined by the Protonym set. So none of the variants you list should be thought of as distinct "names"; rather, they should all be treated as lexical variants of the same conceptual (Class 2) name-object, defined by the set of Protonyms. Coming back to your first example, there are actually two different such objects represented -- one defined by two Protonyms (the first five text-strings listed), and one defined by three Protonyms (the last listed text string).

  2. I'm not sure what you mean by "Is the (intended) publication of a name a requirement?" A requirement for what? To be regarded as a "Name"? Also, how are you defining "publication" in this context -- only "traditional" publications, or does a web page count? Also, what about the publication is "required" -- full literature citation metadata?

  3. Chresonyms are a subset of TNUs, and should be treated as such. They should not be treated as "names".

  4. There is no easy answer to the ambiregnal question. My feeling is that there should be ONE Protonym, which is indicated as being compliant with more than one Code. However, this probably requires more discussion.

If you're asking what "thing" a CoL+ identifier should be attached to -- the answer is easy. It should be attached to a specific TNU instance (i.e., not an abstract "name", but an explicit, specific usage instance of a name, which includes the set of protonyms and all the other relevant properties of the usage instance). COL is about taxon concept circumscriptions, and this will always require a more explicit anchor point than a "name", no matter how we define a "name".

@mdoering
Member Author

Thanks Rich. I believe your class 1 literals do not deserve an identifier, as the string already identifies itself. It is the nomenclatural entity that we talk about here. And exactly because a "name" per se is understood in various ways, I want a clear definition of what it is we identify by a scientificNameID.

I think it is mostly terminology that makes our views (GNUB / CoL+) appear different. It is the set of protonyms that I would like to identify just as you say, but not via the set of protonym ids but with an opaque identifier on its own. I am not convinced we want anything more than a set of 3 protonyms though. In the case of a species name string that includes a subgenus or a variety with a subspecies given: Do we care about the inherent taxonomic classification? I think we should not and regard these as the same name.

As for ambiregnal my hope and gut feeling is just like yours, but needs to be tested.

I also agree that we should treat chresonyms as taxon concepts. It will just be hard to detect them without a sec/sensu/auct indication. So they will likely slip into the names dataset before we know it, and we will then have to deal with them. As they will probably already have (stable) ids by then, the practical solution would be to keep them as names and mark them as chresonyms. Not nice, but likely to work...

@deepreef
Collaborator

Thanks, Markus. Yes, for the Class1 Namestrings, the UTF8 string itself is, by definition, unique. Dima creates hashed UUID values from the strings themselves, and these can be useful for a variety of reasons as surrogate identifiers for the strings themselves. However, these are not "assigned" UUIDs, they are derived UUIDs, and as such can be thought of as simply a different rendering of the string itself.
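
To illustrate what a derived (rather than assigned) identifier means here, the following is a minimal sketch using a name-based UUID (version 5). The namespace value, the `normalize` step and the function names are assumptions made for the example; the namespace and normalization actually used for the hashed name-string UUIDs are not stated in this thread.

```python
import uuid

# Hypothetical namespace, chosen only for this example.
NAMESTRING_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def normalize(name_string: str) -> str:
    """Collapse redundant whitespace; everything else is kept verbatim."""
    return " ".join(name_string.split())


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Derive a UUID deterministically from the (normalized) string itself."""
    return uuid.uuid5(NAMESTRING_NS, normalize(name_string))
```

Because the identifier is computed from the characters, the same string always yields the same UUID without any registry lookup, and any change to the string (for example `Linn` versus `L.`) yields a different one.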

I fully agree that what we need to uniquely identify for COL+ purposes (as well as every other taxonomic purpose in biodiversity informatics) is the Class 2 nomenclatural object/entity. In the same way that "name" can be defined many different ways, I think we need to avoid the text-string components of a "name" (e.g., Genus part, species part, infraspecific part, qualifiers, authorships, year, etc.) as part of the defining properties of the nomenclatural objects. That was my main point.

I certainly have no objection to issuing a single unique identifier to a "Protonym Set". The benefits in doing so are similar to the benefits that Dima has found for assigning hashed UUIDs for Class-1 text-string names. The object itself would be uniquely identified by the set of protonyms involved (in the same way that a Class-1 name string is uniquely identified by the set of UTF-8 characters it contains), but the surrogate identifier would be extremely convenient for data modelling and for performance purposes, etc.

I also agree that we can limit the set to three identifiers:

  • One identifier for all names at the rank of genus and higher.
  • Two identifiers (Genus + terminal epithet) for all subgenus names and all species names.
  • Three identifiers (Genus + species + terminal epithet) for all infraspecific and infrasubspecific names.

In other words, the only "middle" identifier is a species. Thus "Aus (Xus) bus" would be the same as "Aus bus" (both with the same two Protonyms for Aus and bus). All other properties would be derived either from the Protonyms themselves (authorship, literature citations, type specimens, objective synonymies, homonyms, etc.) or from TNUs anchored to the same Protonym sets (=Nomenclatural Object/Entity). TNU properties would capture full classifications, spellings of each component of the names, rank information, subjective synonymies, etc.
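
The one/two/three rule can be sketched roughly as follows. This is only an illustration: `ParsedName` and `protonym_key` are hypothetical names, and plain epithet strings stand in for real Protonym identifiers (which would also distinguish homonyms by authorship).

```python
from typing import NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    genus: str                           # or the uninomial for genus rank and above
    subgenus: Optional[str] = None
    species: Optional[str] = None
    infraspecific: Tuple[str, ...] = ()  # infraspecific epithets in order, terminal one last


def protonym_key(name: ParsedName) -> Tuple[str, ...]:
    """Reduce a parsed name to its ordered set of one, two or three protonyms."""
    if name.infraspecific:
        # Genus + species + terminal epithet; intermediate ranks are ignored.
        return (name.genus, name.species, name.infraspecific[-1])
    if name.species:
        # Genus + species; a cited subgenus is ignored.
        return (name.genus, name.species)
    if name.subgenus:
        # A subgenus name itself: Genus + terminal epithet.
        return (name.genus, name.subgenus)
    return (name.genus,)
```

Under these assumptions, "Aus (Xus) bus" and "Aus bus" reduce to the same two-element key, and "Foo bar subsp. dar var. lion" reduces to the same three-element key as "Foo bar var. lion".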

So, I think we're in agreement on this.

Agreed on need for testing Ambiregnals.

Yes, Chresonyms will slip into the system, so we'll need a mechanism for treating them differently from homonyms. I don't see that as being a problem. We just have to accept that identifiers will be assigned to three different classes of Nomenclatural objects: 1) Legitimate Nomenclatural objects/entities; 2) inadvertent duplicates of legitimate Nomenclatural objects/entities (e.g. chresonyms, when discovered as such); and 3) entries that are out of scope (e.g., derived from text strings that are not actually names of organisms). I have a very robust/elegant mechanism for dealing with these three classes of objects, which I can discuss in detail if you wish.

I think that the important thing is that we seem to agree that a "name" (Nomenclatural Object/entity) can be defined as a unique set of one, two or three protonyms (in sequence*), and should be represented by its own surrogate identifier (integer, or UUID hashed from the three Protonym identifiers?).

Now we just need to define "Protonym"... :-)

*The "in sequence" bit may be important for cases where "Aus bus cus" and "Aus cus bus" both exist separately (because someone made an error in determining which name is the species, and which is the subspecies). My gut feeling is that these should be regarded as separate Nomenclatural Objects/Entities, but they represent an edge case.
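
A minimal sketch of the "UUID hashed from the Protonym identifiers" option, with the sequence preserved. The namespace, the `|` separator and the identifiers such as `p-aus` are assumptions for the example; it also assumes protonym identifiers never contain the separator character.

```python
import uuid
from typing import Sequence

# Hypothetical namespace, chosen only for this example.
PROTONYM_SET_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "protonym-set.example.org")


def protonym_set_id(protonym_ids: Sequence[str]) -> uuid.UUID:
    """Derive a surrogate identifier from one to three protonym ids, in order."""
    if not 1 <= len(protonym_ids) <= 3:
        raise ValueError("expected one, two or three protonym identifiers")
    # Joining in the given order makes the sequence part of the identity.
    return uuid.uuid5(PROTONYM_SET_NS, "|".join(protonym_ids))
```

Because the identifiers are joined in order, the sets for "Aus bus cus" and "Aus cus bus" hash to different surrogate identifiers, which matches treating them as separate objects.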

@proceps

proceps commented Oct 24, 2017

@deepreef, this is definitely applicable to zoological names, but in botany Aus bus cus and Aus dus cus are both accepted subspecies names associated with two different species. The same applies to all other ranks. You cannot just pick terminal epithets for botanical names.

@mjy

mjy commented Oct 24, 2017

For reference-

I've thrown up some conceptual documentation for TaxonWorks. It was heavily influenced by ideas from Rich, and I'm almost positive the approach can capture what Rich wants to express. A somewhat pertinent point is that we actually built the tool, tested the concept, etc., i.e. we've gone far beyond conceptual discussions.

Note that the system allows for the application of unique identifiers at many different levels. In addition to the maintenance ids that are used (e.g. auto-incremented ids in an RDB), global identifiers can be tacked on to any instances of any of these data. All instances of data classes can also be cited (linked to a reference). Some of these citation instances correspond to Rich's concepts.

Amending to actually answer @mdoering questions:

How to treat various spelling variations. Should (some) misspellings, different transliterations, ligatures, umlauts or a wrong gender ending be considered a different name or just a lexical variant?

Frankly, this is up to the curator. Sometimes it matters, sometimes it doesn't. Give them the relationships/object properties to express why it matters to them and it becomes less of an issue.

Is the (intended) publication of a name a requirement?

I very much wanted this, however the rules of nomenclature must let you interpret any name/string, so again this is up to the curator. What we hope to capture is that some curator is interpreting some name according to some nomenclatural rules. A best practice for a GSD would be to "only include names used as if they were intended to be governed by a set of rules". Again, unenforceable, and nothing can be done about it.

What about chresonyms?

See Citation in TaxonWorks.

Is an ambiregnal name published both under the botanical and zoological code a single or two names?

Up to the curators, but if the curators are sane, 2. Thanks for "ambiregnal" btw!

5 participants