Define objective rules for taxon concept identity #6

Open
mdoering opened this issue May 4, 2017 · 129 comments
Comments

@mdoering
Member

mdoering commented May 4, 2017

Define rules for a stable taxonID. We need to understand when a taxon changes sufficiently to warrant an identifier change.

@deepreef
Collaborator

deepreef commented May 4, 2017

To define what is meant by a "taxon" instance (to which a taxonID is assigned), we need to establish the "core properties" of an instance of a "taxon", whereby if one of the core properties changes, a new taxonID must be issued. I think it's best to narrow the scope of those properties to representing the "contents" of a taxon, rather than the combination of contents and "context".

"Contents" in this sense are the items contained within the circumscription of a taxon. For example, a taxon representing a genus would be defined by the set of species contained within it. Suppose two different assertions of a genus contain different sets of species: Aus sensu Smith contains (Aus bus+Aus cus), with species "dus" placed in genus "Xus", whereas Aus sensu Jones contains (Aus bus+Aus cus+Aus dus). Then Aus sensu Smith would have a different taxonID from Aus sensu Jones, because they have different contents.

"Context" in this sense means placement within a hierarchical classification. Changing the context of a taxon instance should not cause a change in taxonID. For example, if Smith and Jones both assert the same contents of the genus Aus (e.g., A.bus+A.cus+A.dus), but Smith places the genus in the family Aiidae, and Jones places the genus in the family Xiidae, we do not need a different taxonID to represent Aus sensu Smith and Aus sensu Jones.

Logically, this means that for a species-level concept, if the circumscriptions of both Smith and Jones for the species "bus" are the same (i.e., same heterotypic synonymy), then they have the same taxonID even if Jones treats it as "Aus bus" and Smith treats it as "Xus bus". This needs to be fleshed out in a full document.
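The contents-versus-context rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GenusAssertion` class, the `taxon_id` function, and the choice of a truncated SHA-1 hash are all hypothetical and not part of any CoL design. The point is simply that the identifier is derived from the included species alone, so a change of family leaves it untouched.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GenusAssertion:
    name: str           # genus name, e.g. "Aus"
    sec: str            # who asserted it, e.g. "Smith"
    family: str         # context: placement in the classification
    species: frozenset  # contents: the included species epithets


def taxon_id(assertion: GenusAssertion) -> str:
    """Hypothetical ID: a hash of the sorted contents only.

    Name, sec and family (the "context") are deliberately ignored.
    """
    key = "|".join(sorted(assertion.species))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


smith = GenusAssertion("Aus", "Smith", "Aiidae", frozenset({"bus", "cus"}))
jones = GenusAssertion("Aus", "Jones", "Aiidae", frozenset({"bus", "cus", "dus"}))
jones_moved = GenusAssertion("Aus", "Jones", "Xiidae", frozenset({"bus", "cus", "dus"}))

print(taxon_id(smith) == taxon_id(jones))        # False: different contents
print(taxon_id(jones) == taxon_id(jones_moved))  # True: only the context differs
```

Whether the contents should be species, protonyms, or types is exactly what the rest of this thread debates; the sketch only fixes the principle that context is excluded.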

@mdoering
Member Author

mdoering commented May 4, 2017

I agree with excluding the position in the classification, the "context", from the taxon concept identity. As for the included species, we should find a way to allow newly described species to be added to a genus without changing its identity, as long as the species were not moved from another genus. See also

As for a final output I agree this needs to be written up somewhere else, probably as part of the API documentation. But in general I would like to get an agreement on the key points in these issues first instead of creating lots of documents that tend to consume a lot of overhead just for styling and explaining the context.

@deepreef
Collaborator

deepreef commented May 4, 2017

Yes, exactly! Dave and I discussed this at some length at Woods Hole. What it boils down to is this:
When a new species is described, what would its type specimen have been identified as prior to the new species description? If it would have been identified as an earlier-named species, then what we have is a case where one larger species was split into two smaller species. As such, the circumscription of the genus doesn't change. On the other hand, in the cases of "brand new" species, which would not have had ANY taxonomic identity prior to their description, the concept for the genus would need to change. Obviously, this is subjective in many cases, and not obvious. But from an informatics perspective, I think the cleanest answer involves how the link between names and their corresponding name-bearing types are made. I suggested to Dave that we could have a "retroactive identification" system, whereby it can be asserted that the type specimen of a new species would have been identified as species "X" prior to the new species description. This can be proxied through assertions of heterotypic synonymy, if we don't want to get all the way down to identifications of specimens. I will take some time this weekend to come up with a better diagram than what is shown above. I actually started one in Woods Hole, so I will finish that and then share it.
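A rough sketch of how such a "retroactive identification" could drive the decision, assuming the prior identification is recorded as a plain species label (or `None` when the type specimen had no earlier taxonomic identity). The function names and the string labels are hypothetical.

```python
from typing import Optional, Set


def classify_new_species(prior_identification: Optional[str]) -> str:
    """Return "split" if the type specimen would previously have been
    identified as an existing species, otherwise "new"."""
    return "split" if prior_identification is not None else "new"


def genus_concept_changes(prior_identification: Optional[str],
                          genus_species: Set[str]) -> bool:
    """The genus circumscription stays the same only when the new species
    was carved out of a species already included in that genus."""
    return prior_identification not in genus_species


aus = {"Aus bus", "Aus cus"}

# "Aus dus" split from "Aus bus": the genus contents are unchanged.
print(classify_new_species("Aus bus"), genus_concept_changes("Aus bus", aus))

# A brand new species with no prior identity: the genus concept changes.
print(classify_new_species(None), genus_concept_changes(None, aus))
```

The second function also covers the case raised earlier in the thread, where a species split off from a member of another genus would still change the receiving genus.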

@deepreef
Collaborator

deepreef commented May 4, 2017

By the way, if we can address this informatically, then we will have also created a REALLY valuable tool to distinguish "true" new species from "split" new species. An analogy of the difference is the two kinds of "Gaps" in CoL (actual taxonomic gaps, vs. synonym name gaps). This is important because it distinguishes cases of new species that increase our understanding of the scope of biodiversity, vs. cases of drawing new lines within our already-existing understanding of the scope of biodiversity. We've never been able to do this before.

@mdoering
Member Author

mdoering commented May 4, 2017

Would we want a genus concept to change when a brand new species gets added? It would mean genus identifiers change quite a lot over time and we lose stability. It might be more useful to restrict changes of genus concepts to true splits & merges of genera, ignoring the exact number of included species for the most part, and to focus on the genus types as we discussed at some point. This needs real world examples to test.

@deepreef
Collaborator

deepreef commented May 5, 2017

It depends on how much you want to reflect reality, and it also depends on what you mean by stability. The unfortunate reality is that the meaning of a genus-level taxon concept DOES change when a truly new species is added. However, if we want less precise but more stable taxon identifiers for genera, then we can treat them the same way as species. That is, instead of defining them by the circumscription of all individuals, we can limit the definition to be circumscriptions of types (type species for genus concepts, and type specimens for species concepts). Unfortunately, as we discovered in our discussions at Woods Hole, we lose important information about taxa when we fail to distinguish the case of one species-level taxon that is split into two, vs. a brand new species being added (impetus for the diagram in the photo you included above).

Also, "stability" is actually INCREASED with increased precision, because there is less subjectivity in the definition. The problem isn't a loss of stability, the problem is a proliferation of subtle variants (e.g., Aus Smith sec. Smith vs. Aus Smith sec. Jones). All of these variants are themselves stable; but they confuse matters because we have no good way to reflect the differences in meaning between two precisely-defined genus-level taxon concepts.

@mdoering
Member Author

mdoering commented May 8, 2017

Right, the genus concept changes when a new species is added when you look at the included species. But is this really useful for anyone?

It seems to me that what matters for defining the concept is rather how a genus is delimited from other genera. Merging and splitting again. For example, the genus Acacia can be referred to as the concept sensu lato, including all species nowadays in Vachellia, or sensu stricto, when you also acknowledge the existence of Vachellia.

@deepreef
Collaborator

deepreef commented May 8, 2017

Personally, I'm happy with defining a taxon by the set of "types" it contains. That is, a "species" concept represents the sum of the species-group protonyms (as proxies for type specimens) assigned to it as heterotypic synonyms, and a "genus" concept is the set of genus-group protonyms (as proxies for type species) assigned to it as heterotypic synonyms. To me, that solves 80% of the problem with 20% of the effort. However, as we discussed in Woods Hole, this completely misses the ability to discern the "sensu lato/sensu stricto" cases where an existing species is split into two. That is, there is no way to distinguish "Aus bus Smith sec. Smith" (sensu lato) from "Aus bus Smith sec. Jones" (sensu stricto) when Jones splits Aus bus into Aus bus Smith sec. Jones and Aus dus Jones sec. Jones. The same applies to all ranks (Genus and above).

Like I said, limiting it to heterotypic synonymy gets 80% of the job done with 20% of the effort. If we want to go beyond that, I think it would be better handled by a system of "RelationshipAssertions" (sensu TCS).
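To make the "80%" trade-off concrete, here is a small sketch in which a concept is keyed by nothing but its set of protonyms. The `concept_key` function and the protonym labels (including the extra synonym "eus Brown") are hypothetical. Lumping is detected because the set grows, but the sensu lato and sensu stricto versions of "bus" end up with the same key after a split.

```python
def concept_key(protonyms) -> frozenset:
    """Hypothetical concept key: the set of protonyms treated as
    heterotypic synonyms of the taxon, and nothing else."""
    return frozenset(protonyms)


# Before the split: Aus bus sec. Smith (sensu lato).
bus_sec_smith = concept_key({"bus Smith"})

# Jones splits the species: Aus bus sec. Jones (sensu stricto)
# plus the newly named Aus dus Jones.
bus_sec_jones = concept_key({"bus Smith"})
dus_sec_jones = concept_key({"dus Jones"})

# A lumping event is visible, because a protonym is added to the set.
bus_lumped = concept_key({"bus Smith", "eus Brown"})

print(bus_sec_smith == bus_sec_jones)  # True: the split is invisible
print(bus_sec_smith == bus_lumped)     # False: the lump is detected
```

Telling the first two apart would need extra information such as the RelationshipAssertions mentioned above, which this sketch does not attempt to model.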

@mdoering
Member Author

mdoering commented May 9, 2017

Three implementations dealing with tracking taxon concept changes:

@mdoering
Member Author

mdoering commented May 9, 2017

Should the identity stay the same if just the name changes? E.g., one of the synonyms gets accepted, or the name changes its rank (a species is now considered a subspecies). Type-wise and concept-wise these are the same, so the identifier should not change, correct?

@ThierryBourgoin
Collaborator

ThierryBourgoin commented May 9, 2017 via email

@mdoering
Member Author

mdoering commented May 9, 2017

@ThierryBourgoin I see your point and it makes a lot of sense. There are various ways to look at what the essence of a taxon is and exactly this is why we need to agree on one definition.

We should probably step back and approach the problem from a user's perspective. What does a user want from a CoL taxon and why does it need an identifier at all?

  1. Someone uses the catalogue at some point and wants a persistent reference to the exact version they were looking at at that time. That would require a fully versioned CoL with every change triggering a new identifier.

  2. People have identified an organism to a CoL taxon, e.g. a specimen or observation. They want access to the current view of the "same taxon" in the CoL that still represents the organism observed, but maybe with a different name, classification or other updated "metadata". This does not require a taxon concept id per se, just a way to get to the (different) identifier for the latest version of the same concept. The concept identifier is basically internal only, but the system still needs to know about concepts. This mostly applies to species and infraspecific taxa, so we probably would not need to worry about higher taxa, but maybe genera.

  3. Researchers want to aggregate species related information from different systems, all linked to CoL taxa. They want to be sure the different systems talk about the same taxon concept and that information can safely be transferred and merged. This seems to require shared concept ids.

From the above I feel we need two identifiers: one for the exact version, and one for the taxon concept to assert that a concept is the same.

The question now is how to know that a concept (as in the set of all theoretically included individuals) is the same. We can either find a way to detect that automatically or rely on experts to tell us. The problem with experts is that they will apply different judgments to what concepts are, so we will see concepts equated very inconsistently across groups. Something that can be asserted by a computer will be much more useful, as it's predictable and comparable across all groups.
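A minimal sketch of the two-identifier idea, assuming (purely for illustration) that the concept is approximated by the set of protonyms in the synonymy and that a record is a small JSON-serializable dict. The field names and the `concept_id` / `version_id` helpers are hypothetical.

```python
import hashlib
import json


def _digest(obj) -> str:
    """Short, deterministic hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:10]


def concept_id(record: dict) -> str:
    """Changes only when the computed concept (here: the protonym set) changes."""
    return _digest(sorted(record["protonyms"]))


def version_id(record: dict) -> str:
    """Changes whenever anything in the record changes."""
    return _digest(record)


original = {"name": "Aus bus", "parent": "Aus", "protonyms": ["bus Smith"]}
renamed = {**original, "name": "Xus bus", "parent": "Xus"}

print(concept_id(original) == concept_id(renamed))  # True: same concept
print(version_id(original) == version_id(renamed))  # False: new version
```

Use case 1 would cite the version identifier, use case 3 the concept identifier, and use case 2 would resolve from an old version identifier to the newest version sharing its concept identifier.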

@deepreef
Collaborator

Thanks, @ThierryBourgoin and @mdoering -- this is helpful. This conversation is touching on the same problems of communication that have plagued these discussions for several decades now (going back at least to the 1980's). Fundamentally, the problem is that we have different ideas about two issues:

Issue 1 is about what "things" (conceptual entities) do we care enough about to label with a persistent identity. Included within this issue is the question of how to explicitly define these "things", so we know when the properties of one thing (represented by its persistent identifier) should be changed (without changing the identifier), vs. when a new "thing" is needed (with its own distinct identifier). At the heart of this issue is which properties of a "thing" define it (i.e., collectively represent its "essence"), and which merely represent relevant metadata associated with that "thing", which may be altered without altering the essence of the "thing".

Issue 2 is about semantics, that is, which terms do we use to label each class of "thing". The most problematic terms are "name" and "concept". Both have various synonyms and homonyms in our conversations. What has become clear as a result of MANY conversations almost exactly like this one is that we probably have five or six different classes of "things" that we have, over the years, tried to force-fit into two terms ("name" and "concept").

My fear is that if we do not confront these two issues now, we will make very little progress solving these problems from an informatic perspective. Having dealt with these issues (from an informatics perspective) for many years, these are the "things" that I have found useful for persistently representing conceptual objects in the biological taxonomy realm:

Thing 1: An individual human being, or an entity representing an organization created by human beings. I have used the term "Agent" to refer to this Thing.

Thing 2: A text-string label used to represent an instance of Thing 1 ("Agent"), often parsable into "Surname" and "GivenName" (for people), or a hierarchy of names (for organizations). I have used the term "AgentName" to refer to this Thing.

Thing 3: Documentation instance representing assertions made by one or more instances of Thing 1 ("Agent"), at a particular moment in time. The documentation may be a type of publication, or it may be some other form of static documentation. The word "static" here is critical, because the documentation instance represents a snapshot in time, and thus does not change. For retrieval purposes, it is best to associate each instance of Thing 3 with instances of Thing 2 (AgentName), instead of directly to instances of Thing 1 (Agent). I have used the term "Reference" to refer to this Thing.

Thing 4: A string of text characters, typically represented electronically in the form of UTF-8 encoded text, or printed in the form of glyphs rendered as ink on paper, which serves as a Linnean-style scientific name. These text strings may or may not include components representing taxonomic rank, delimiters (such as parentheses), and authorship information (various styles, formatting and with or without years). I have used the term "NameString" to refer to this Thing.

Thing 5: A specific instance of a Linnean-style taxon name represented as a conceptual entity. This applies to a particular unit of a compound name (not the full combination), which has a particular type (specimen or name) in the context of Codes, a particular rank (in the sense of Linnean ranks), and a particular authorship associated with the creation of the name. This is different from instances of Thing 4 (NameString) in that it is conceptual, not literal. The essence of an instance of Thing 5 is independent of the text string used to represent it. For example, the same instance of Thing 5 might be represented by different text strings (e.g., different genus combinations for a species, different ranks, different spellings, etc.), and more than one instance of Thing 5 might share the same text string (e.g., homonyms, homographs). I have used the term "Protonym" to refer to this Thing.

Thing 6: A particular treatment or usage of an instance of Thing 5 (Protonym) within the context of an instance of Thing 3 (Reference). Important properties of instances of Thing 6 include the exact spelling of the specific name unit (e.g., the species epithet) as it appears within the instance of Thing 3 (Reference), what taxonomic rank the instance of Thing 5 (Protonym) was asserted as within Thing 3 (Reference), whether or not the instance of Thing 5 (Protonym) was treated as a valid taxon, or as a heterotypic synonym of another taxon, and a link to another instance of Thing 6 representing the immediate hierarchical taxonomic parent (e.g., the genus into which a species is placed). I have used the term "TaxonNameUsage" to refer to this Thing, but it could also be referred to as "TaxonTreatment" or just "Treatment" (following how PLAZI uses that term).

Thing 7: The set of biological organisms, including individuals that are dead, alive, and yet-to-be-born, which are explicitly or implicitly included within an asserted Taxon. THIS IS THE THING ABOUT WHICH WE ARE DISCUSSING. Most people I have discussed these issues with over the years have applied the terms "TaxonConcept" and "Circumscription" interchangeably to refer to this Thing. However, as per @ThierryBourgoin comments above, perhaps we do not have universal agreement that "Concept" and "Circumscription" are synonymous terms. Therefore I propose we use the term "Circumscription" to represent this Thing, to avoid confusion going forward.

Thing 8: This is the Thing that @ThierryBourgoin refers to in his comment above as a "Concept". Basically, its properties include elements of both Thing 7 (Circumscription, or set of included child entities), as well as Thing 6 (TaxonNameUsage/Treatment), such as the hierarchical classification, treatment as valid or not, and how the name is spelled. Therefore, it is different from Thing 7 (Circumscription) because it is defined by more than just the child items it contains, but it's not the same as an instance of Thing 6 (TaxonNameUsage/Treatment), because there may be many instances of Thing 6 (TaxonNameUsage/Treatment) that all imply the same instance of Thing 8.
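For readers who find a type sketch easier to follow than prose, Things 1 to 6 could be laid out roughly as below. This is an assumption-laden illustration, not a proposed data model: the field names, the example agents, and the years are invented, Thing 4 (NameString) is represented by plain `str` values, and Things 7 and 8 are left out because their definition is the open question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:  # Thing 1: a person or organization
    agent_id: str


@dataclass(frozen=True)
class AgentName:  # Thing 2: a text label for an Agent
    agent: Agent
    surname: str
    given_name: str = ""


@dataclass(frozen=True)
class Reference:  # Thing 3: static documentation of assertions
    authors: tuple  # tuple of AgentName, per the retrieval advice above
    year: int
    title: str


@dataclass(frozen=True)
class Protonym:  # Thing 5: the conceptual name unit
    epithet: str
    rank: str
    original_reference: Reference


@dataclass(frozen=True)
class TaxonNameUsage:  # Thing 6: a Protonym as treated in a Reference
    protonym: Protonym
    reference: Reference
    spelling: str  # Thing 4 (NameString) as it appears in the Reference
    rank: str
    is_valid: bool
    parent: Optional["TaxonNameUsage"] = None


smith = AgentName(Agent("a1"), "Smith")
jones = AgentName(Agent("a2"), "Jones")
ref_smith = Reference((smith,), 1900, "Original description")
ref_jones = Reference((jones,), 1950, "Revision")

bus = Protonym("bus", "species", ref_smith)
aus = TaxonNameUsage(Protonym("Aus", "genus", ref_smith), ref_smith, "Aus", "genus", True)
xus = TaxonNameUsage(Protonym("Xus", "genus", ref_jones), ref_jones, "Xus", "genus", True)

# One Protonym, two usages: "Aus bus" sec. Smith and "Xus bus" sec. Jones.
usage_smith = TaxonNameUsage(bus, ref_smith, "bus", "species", True, parent=aus)
usage_jones = TaxonNameUsage(bus, ref_jones, "bus", "species", True, parent=xus)

print(usage_smith.protonym == usage_jones.protonym)  # True
print(usage_smith == usage_jones)                    # False
```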

I apologize for this long post, but there is a reason we've never solved this issue as a community during the past few decades. Unfortunately, most of that reason has to do with miscommunication, and most of the miscommunication has to do with a mixture of how we define our core objects (Issue 1) and what terms we use to represent them (Issue 2; i.e., semantics).

I believe that we already have well-tested, non-contentious definitions for Things 1, 2, 3, and 4. After the dinner conversation in Woods Hole, I am confident we can fairly quickly settle on a clear definition for Thing 5. If we can achieve that, then the definition of Thing 6 is extremely easy. Therefore, the real issue for us to deal with is whether Thing 7 and Thing 8 need to be different Things, or if we can adequately accommodate them with a single Thing. Originally I thought we could get by with a single Thing, but after the comments by @ThierryBourgoin and @mdoering above, it seems we should seriously consider defining them as separate things, each with their own identifiers.

In either case, I think it's important that we understand the difference between defining what Things we need to manage in CoL-Plus, and deciding which terms to use to refer to those defined things. I think it would be a grave mistake to start defining data models and such until after we come to consensus on the Things we're managing, and the terms we're using to refer to those things.

Phew... and this is just the BEGINNING of the discussion!

@deepreef
Collaborator

One more point.... in response to the comment by @mdoering above, "versioning" of CoL representations can be handled in several ways:

  1. Internally using version histories for the same identifiers plus a date-stamp;
  2. Generating new identifiers to represent each version;
  3. Capturing each new version via a new instance of Thing 6 (with Reference representing CoL as the Author and the date of the change as the date, and the properties of spelling, validity, classification, etc.)

There are other ways as well, but #3 above represents the simplest in terms of coding and implementation.
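Option 3 might look something like the following, where each CoL change is appended as a new usage whose Reference is CoL itself and whose date is the date of the change. The `Usage` fields, the in-memory `history` list, and the dates are hypothetical simplifications.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Usage:
    protonym: str
    reference: str     # here always "CoL", acting as the author
    asserted_on: date  # date of the change
    spelling: str
    parent: str
    is_valid: bool


history: list = []


def record_version(protonym, asserted_on, spelling, parent, is_valid=True):
    """Append a new usage instead of editing the previous one."""
    usage = Usage(protonym, "CoL", asserted_on, spelling, parent, is_valid)
    history.append(usage)
    return usage


def current(protonym):
    """The latest CoL usage of a protonym."""
    usages = (u for u in history if u.protonym == protonym)
    return max(usages, key=lambda u: u.asserted_on)


record_version("bus Smith", date(2017, 5, 1), "Aus bus", "Aus")
record_version("bus Smith", date(2018, 5, 1), "Xus bus", "Xus")

print(current("bus Smith").spelling)  # Xus bus
print(len(history))                   # 2: the earlier version is still there
```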

@dremsen
Collaborator

dremsen commented Aug 25, 2020

Rich, I reviewed everything you've written. It covers a lot of ground. I'll respond to the things that are most relevant to me in small bites.

You wrote in a response to Matt:

_OK, so walk me through what "changes" here. Let's take FishBase as a GSD. They have an identifier for Chromis agilis: 5669

Their concept of C. agilis is currently s.l., found throughout the vast tropical Indo-Pacific, the same as pretty much everyone else who has treated this species since about 1973. They (FishBase) no-doubt now have a copy of Allen & Erdmann 2020, and I'm pretty sure they'll probably follow that treatment, which defines a s.s. version of C. agilis confined to the Indian Ocean, and establishes a new name/concept for what used to be the Pacific populations of C. agilis, but which they have now named as C. pacifica.

FishBase will mint a new ID for Chromis pacifica, and generate a map of it throughout the Pacific, and cite Allen & Erdmann 2020 as the Main Reference. And for C. agilis, they will generate a new map showing only locations in the Indian Ocean, and will probably update the "Main Reference" from the current Allen 1991 (one of probably hundreds that represent the s.l. version of C. agilis) to Allen & Erdmann 2020 (the one that defined the s.s. version of C. agilis).

Are you saying that FishBase should/will abandon ID 5669 for C. agilis and/or relegate it to the s.l. version of C. agilis, and then mint a new ID for C. agilis s.s.? I'm almost certain they won't do this. And I'm also leaning towards the idea that they shouldn't do this._

Based on this, as well as the statement you made regarding the subsequent update from FishBase into the COL, there are three taxon concepts that are the result of all this. And to me they should have three distinct IDs. The three concepts represent the original stable concept and the two resulting from the 2020 split.

  1. Chromis agilis, Smith 1960 sensu Allen 1991
  2. Chromis agilis, Smith 1960 sensu Allen & Erdmann 2020
  3. Chromis pacifica, Allen & Erdmann 2020, sensu Allen & Erdmann 2020

My use case is that I am a curator of a C. agilis specimen and I want to apply a taxon identifier instead of a name in order to improve precision in the identification of my specimen.

I want a future multi-concept, multi-classification, multi-versioning version of COL to maintain a list of these concepts such that I can ask

"How has Chromis agilis been treated?"

The system then responds with the three concepts below.

  1. Chromis agilis, Smith 1960 sensu Allen 1991
  2. Chromis agilis, Smith 1960 sensu Allen & Erdmann 2020
  3. Chromis pacifica, Allen & Erdmann 2020, sensu Allen & Erdmann 2020
    • Chromis agilis (pro-parte synonym)

I might
Use (1) if I don't support the basis of the new split and I stay with Allen 1991
Use (2) I agree with the Allen & Erdmann 2020 split (and my specimen came from the Indian Ocean)
Use (3) if I agree with Allen & Erdmann 2020 (and my specimen came from the South Pacific)

Now it may be helpful to inquire further and be exposed to a whole bunch of other TNUs that essentially point me back to the concept identified by (1). In fact (1) might really be referred to as Chromis agilis, Smith 1960 sensu Smith 1960 for all I know. But I would be very confused if, to represent the summary information in that list of three above, the system made me weed through dozens of TNUs with little to discriminate them.
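The query I have in mind could be as simple as the sketch below. The concept identifiers ("c1" to "c3"), the `Concept` class, and the `treatments_of` function are made up for illustration; only the three concepts themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str  # hypothetical identifiers
    name: str
    sec: str
    pro_parte_synonyms: tuple = ()


CONCEPTS = [
    Concept("c1", "Chromis agilis", "Allen 1991"),
    Concept("c2", "Chromis agilis", "Allen & Erdmann 2020"),
    Concept("c3", "Chromis pacifica", "Allen & Erdmann 2020",
            pro_parte_synonyms=("Chromis agilis",)),
]


def treatments_of(name: str) -> list:
    """How has this name been treated? Returns every concept that uses the
    name as accepted or lists it as a pro-parte synonym."""
    return [c for c in CONCEPTS
            if c.name == name or name in c.pro_parte_synonyms]


for concept in treatments_of("Chromis agilis"):
    print(concept.concept_id, concept.name, "sensu", concept.sec)
```

The many near-identical TNUs would sit behind each of these three entries rather than in the answer itself.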

Lastly,

I don't know if FishBase should abandon identifier 5669, but I don't think 5669 should refer to anything but the original concept (1) above. If they keep the identifier but it now refers to (3) above then this is a problem for me because then the ID isn't a taxon ID. At least it's not a taxon concept ID and I'll be really confused if we have taxon IDs as well as taxon concept IDs. The identifier changes when and only when the concept changes.

It wouldn't make sense for someone to apply 5669 to their specimen, thinking it referred to one concept, only for someone else to later change its meaning. I might be quite peeved if 5669 went from Aus bus sensu Trustworthy 2019 to Aus bus sensu WTF 2020 that lumped my fish with gorillas.

@nfranz

nfranz commented Aug 25, 2020

Objective rules. Here I have a section "Role of trained judgment", which was a/my take on Daston & Galison (2007). We don't need "objective rules" in the sense that I might presume is meant here to somehow be devoid of particular human interest(s). We need to develop means to engage these different and evolving human interests, foster a culture of greater confidence in expressing them, more transparency in documenting them, and better machine interpretability of these interests.

@mdoering
Member Author

mdoering commented Aug 25, 2020

@rdmpage @dremsen @gdower if you had the choice as a user for a CoL Taxon ID to be based on a) the current name, b) the protonym of the accepted name or c) the set of protonyms appearing in the entire synonymy, what would be your pick?

@mdoering
Member Author

If the CoL taxonID is based on the inferred or explicit information of homotypic names, this might lead to different identifiers in the classic and extended catalogue for the same edition! As we have more names, synonyms and basionym relations being added in the extended catalogue, the taxonID for the same name in the two catalogues might be different. Is that a problem? I would think not so much, as long as the classic CoL ids still resolve to something in the extended catalogue.

What about other homotypic name relations like replacement names? @deepreef do you treat them as a new protonym or rather refer to the replaced name?

@rdmpage

rdmpage commented Aug 25, 2020

@mdoering

if you had the choice as a user for a CoL Taxon ID to be based on a) the current name, b) the protonym of the accepted name or c) the set of protonyms appearing in the entire synonymy, what would be your pick?

Confession: I don't use CoL. But if I did, I think I would want the identifier that changed the least, so it didn't disrupt links I had made, so I would pick (b). If (c) changed, I might like a feed that I could follow that would tell me that, as then I could make a note to see what the implications of that change were.

But, as I'll expand on below, I'm beginning to doubt that what a CoL Taxon ID means is the key question.
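To see why (b) changes the least, the three options can be compared side by side. This is a toy sketch: the record fields, the `SCHEMES` table, and the example names and years are hypothetical.

```python
SCHEMES = {
    "a": lambda t: t["accepted_name"],                # current name
    "b": lambda t: t["accepted_protonym"],            # protonym of accepted name
    "c": lambda t: "+".join(sorted(t["protonyms"])),  # whole synonymy
}


def changed(before: dict, after: dict) -> dict:
    """For each scheme, would the identifier differ between two states?"""
    return {key: make_id(before) != make_id(after)
            for key, make_id in SCHEMES.items()}


original = {
    "accepted_name": "Aus bus",
    "accepted_protonym": "bus Smith 1900",
    "protonyms": {"bus Smith 1900"},
}
# The species is moved to another genus.
moved = {**original, "accepted_name": "Xus bus"}
# A heterotypic synonym is added to the synonymy.
lumped = {**moved, "protonyms": {"bus Smith 1900", "eus Brown 1950"}}

print(changed(original, moved))  # {'a': True, 'b': False, 'c': False}
print(changed(moved, lumped))    # {'a': False, 'b': False, 'c': True}
```

A change feed for (c) would then amount to publishing every transition for which the "c" entry is `True`.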

@rdmpage

rdmpage commented Aug 25, 2020

Thinking a bit more about this topic, I wonder whether much of the effort being expended here is ultimately beside the point.

It seems to me that taxonomic databases don't give you taxa (at least not in the ways I think we are talking about here). They give you lists of names, connected to literature if you are lucky, and sometimes putative connections between names (e.g., synonyms). But what can you do with that, other than endlessly discuss what is a taxon, what is a concept, when do they change, and when should ids change?

Meanwhile, there are databases that define taxa explicitly, in the sense that it is this set of sequences (e.g., GenBank, BOLD), this set of specimens (e.g., GBIF), this set of images and observations (e.g., iNaturalist). These databases all contain data that you can work with, you can derive information about taxa ("this species is red, it lives here, it differs from other species at these positions in this gene"). And, of course, each one has its own set of taxon identifiers, which are managed in different ways. How these taxa map across databases is another issue, but the point is that each database can point to something that is a taxon.

Membership in a taxon in these databases is a trivial operation, in the sense that if a notion of a taxon changes, they can simply point their data to that new local taxon id (as happens, say, when iNaturalist splits a taxon). Now, these taxa may be "wrong" in some sense (for example, GBIF may have misidentifications, or identifications made when people's conception of a particular taxon was very different), but that "data item -> taxon link" is their core feature. The issue for these databases isn't whether two ill-defined "taxon concepts" are the same or not, it's "what taxon id do we stick on this observation/sequence/specimen?". Given that the taxon id is local (an NCBI taxon integer, an iNaturalist integer, a GBIF integer), that becomes a local decision (informed by, say, sequence similarity, or community voting, or what a museum label says, or a publication).
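As a toy illustration of that "data item -> taxon link", a split in such a database is just a reassignment of local ids. The records, the integer ids 100 and 101, and the `split_taxon` function are invented for the example and do not describe how any of the databases named above actually work.

```python
# Each data item carries a local taxon id plus whatever evidence is at hand.
observations = {
    "obs1": {"taxon": 100, "ocean": "Indian"},
    "obs2": {"taxon": 100, "ocean": "Pacific"},
    "obs3": {"taxon": 200, "ocean": "Pacific"},
}


def split_taxon(records, old_id, new_id, belongs_to_new):
    """Repoint the matching items of old_id to new_id; return their keys."""
    moved = []
    for key, record in records.items():
        if record["taxon"] == old_id and belongs_to_new(record):
            record["taxon"] = new_id
            moved.append(key)
    return moved


moved = split_taxon(observations, 100, 101,
                    lambda record: record["ocean"] == "Pacific")
print(moved)                          # ['obs2']
print(observations["obs1"]["taxon"])  # 100
```

The rule passed in as `belongs_to_new` stands in for the local decision (locality, sequence similarity, community vote, and so on).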

So, it seems to me that the role of something like CoL is NOT to say "these are the taxa", NOR "this is the accepted taxonomy", but rather "these are the names and where people have said things about those names". Databases that actually have taxa can use CoL to help decide what name should be applied to the sets of things they have in their database ("taxa"). CoL isn't an arbiter of taxonomy, it's a source of information.

So, I would rule taxon concepts as being out of scope. Give us names, give us how they are used in the literature, let us track protonyms where we can so that people can get a sense of how name usage has changed over time, provide identifiers so that these things can all be linked to. Move any discussion about taxon concepts to those databases that can actually speak to them.

Not sure how coherent any of this is, I confess I have reached saturation point. I think the key design decision here is to say "no" to a lot of things. With that, I will leave the booze I brought, pack up my loud speakers, and head home...

@mjy

mjy commented Aug 25, 2020

Thanks @rdmpage, wonderful to wake and read this, it's the essence of the issue distilled. I have a longer response that I'm refraining from posting here, to force myself to activate a blog I've been meaning to set up for our group. Will link out when that is done.

@deepreef - I agree that semantics, syntax, granularity keep your and my concepts from aligning rather than core differences. One day I will step you through real data in TaxonWorks and we can further align.

@dremsen
Collaborator

dremsen commented Aug 26, 2020

Rod said

It seems to me that taxonomic databases don't give you taxa (at least not in the ways I think we are talking about here). They give you lists of names, connected to literature if you are lucky, and sometimes putative connections between names (e.g., synonyms). But what can you do with that, other than endlessly discuss what is a taxon, what is a concept, when do they change, and when should ids change?

It's a fair question. What can you do with what the COL has got? I worked on that question for my virtual presentation for Biodiversity Next. The core COL data is fairly lightweight. What's it for? Seems like something we should be clear on. That would answer whether concepts are in or out of scope, as well as a bunch of other things. Maybe even verify priorities.

@mdoering
Member Author

Clearly the CoL is used by GBIF and others to organize data taxonomically without the need to manage their own taxonomy.
Ideally it gets used as a standard like ISO country codes: there you also cannot track what a specific country actually means (there are no shape files), but people still happily use the codes to share and expose data because they trust them.

@rdmpage

rdmpage commented Aug 26, 2020

@mdoering Markus, just to be clear, I wasn't trying to devalue CoL, I was trying to clarify the limits of what it can do.

Having a list of stable identifiers for sets of names (e.g., an identifier for all variations and objective synonyms of a name) and links between names that have been asserted at various times (e.g., links between heterotypic synonyms) would be great. These are all facts (this name is objectively a synonym of that, or it has been asserted by so-and-so that these two names are synonyms), and stable identifiers for facts enables others to link their data to those facts.

My concern was that the whole issue of taxon concept equivalence not only brought with it uncertainty over what a concept actually is (hence the endless discussion) but was not something CoL could really speak to (hence, "out of scope").

Sorry, thought I'd left my jacket behind... REALLY going home this time ;)

@ghwhitbread

+1 for @rdmpage & @mjy.

Concepts are stock in trade for taxonomists, in the @nfranz realm of “trained judgment”.
For others … unfortunately, we’ve not had much success convincing these clients that they should be using names.

@nfranz

nfranz commented Aug 26, 2020

I think it would be fair (and accurate) to say that taxonomic concepts are not out of scope for CoL. I would presume that, in order to justify its continued growth, CoL will need to maintain the rather assertive attitude about the properties and significance of its products here: http://www.catalogueoflife.org/content/about And I think historically, users have accessed CoL and similarly structured resources, have perhaps added some implicit or explicit contextualization that reflected and served their needs, and then proceeded as if CoL is their legitimate source of taxonomic concepts.

So, in absence of a new family of disclaimers that CoL is not (ever) to be taken as a viable source for taxon-approximating name usages, I think the question is really to what extent CoL can and will make a new contribution (now, as this thread aspires) to making its concept identities and relationships more explicit. For instance, if one does not wish to claim "genuine authorship" of parts or all of a represented classification, how should that be very explicitly shown?

The proposal was that this exploration to do concepts better could be achieved by seeking and agreeing on "objective criteria". I personally think that pathway is misguided (more bluntly: I think there is an unhelpful notion of objectivity under the surface, and it detracts from more productive actions), and should instead be reformulated as: how can CoL gradually and better accommodate information systems and services that are already more explicitly designed for and committed to taxonomic concept identification and linking (such as Avibase)? How can CoL engage more trained users directly in this process, in some not so distant future? I think those are strategies or criteria that would serve CoL more in this context.

@dremsen
Collaborator

dremsen commented Aug 28, 2020

I just re-read the AviBase concept paper, having first read it when it originally came out. It's worth a read, but my interpretation of the data model and presentation is that he is using the same protonym proxy for circumscription (explicitly and/or via inference) as we are discussing, so this thread is not particularly far off. He does explicitly say that a fundamental problem with names is that they only refer unambiguously to type specimens, instead of the biological circumscriptions that underlie most name usages. And his model, of course, supports trait and geographic data to address this. But these data are not the basis for his computation of circumscription. They are properties of concepts, probably fairly incomplete for most taxa, and super valuable for qualifying a concept and making a determination, but they do not appear to play a role in modeling relationships. His circumscriptions are based on the lumping and splitting of previously created (or newly split) named taxa.

I do like the relationship hierarchies. They certainly help identify and record splits in a more persistent way, while we have been discussing different ways of modeling them. I believe, however, that those hierarchies can be derived from the same protonym-proxy concept circumscriptions we have been discussing. His Concepts are akin to Rich's TNUs, a Taxon sec Reference. Concepts map to distinct AviBase IDs, which are the normalized sets of congruent circumscriptions.

It's worth repeating: his determinations of overlap or congruency are not based on computational comparisons of biological properties. They are based on "taxonomic lumps, splits and partly overlapping relationships which can be easily detected by looking for any additions or deletions of concepts (via named taxa - Remsen) or reassignments of subspecies." The biological properties annotate these circumscriptions and supply the information one can use to help make determinations.

The computational components, however, are parallel to what we have been discussing. It would be worth taking some of his examples and mapping them out using the alternative methodology to see whether we can recreate the relationship hierarchies. I think you can. So it's a good model, and what we are discussing is in line with it. We have discussed fractional weighting before and here he has done it. It's good, but I think there is convergence rather than our being off track.

P.S. When I said "explicitly or via inference" I meant that he can and does use domain knowledge to infer circumscription (via heterotypic synonymy) for sources that do not explicitly provide synonyms. See the Parus major example.
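
To make the protonym-proxy comparison concrete, here is a minimal sketch that treats each concept as the set of protonyms it contains and classifies the relationship between two such sets. The `Relation` enum, the `compare` function and the sample data are hypothetical illustrations modelled on the Aus example at the top of this thread; this is not AviBase's or CoL's actual implementation.

```python
from enum import Enum
from typing import FrozenSet


class Relation(Enum):
    CONGRUENT = "congruent"
    INCLUDES = "includes"        # first concept is the broader one (a lump)
    INCLUDED_IN = "included in"  # first concept is the narrower one (a split)
    OVERLAPS = "overlaps"
    DISJOINT = "disjoint"


def compare(a: FrozenSet[str], b: FrozenSet[str]) -> Relation:
    """Compare two circumscriptions given as sets of protonym identifiers."""
    if a == b:
        return Relation.CONGRUENT
    if a > b:  # proper superset
        return Relation.INCLUDES
    if a < b:  # proper subset
        return Relation.INCLUDED_IN
    if a & b:  # some, but not all, protonyms shared
        return Relation.OVERLAPS
    return Relation.DISJOINT


# Hypothetical circumscriptions (assumed data, not taken from any real source).
aus_sec_smith = frozenset({"bus", "cus"})
aus_sec_jones = frozenset({"bus", "cus", "dus"})
xus_sec_smith = frozenset({"dus"})

print(compare(aus_sec_smith, aus_sec_jones).value)
print(compare(aus_sec_jones, xus_sec_smith).value)
print(compare(aus_sec_smith, xus_sec_smith).value)
```

Biological properties such as traits or ranges play no part in this comparison. As described above, they would annotate the concepts rather than drive the relationship computation.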

@mdoering
Member Author

mdoering commented Aug 28, 2020

Fully agree @dremsen. Avibase apparently just has some cases where the algorithm cannot determine the matching Avibase ID, which then require manual mapping. I do not fully understand the reasons for that. It would be great to have this 100% automated.

Also, if a taxonomist is to judge and compare the concepts of different usages of a name that they already know, this usually happens by looking at the synonymy, the presence/absence of certain "key" species that indicate a split/merge, and the placement and presence/absence of some entire genera (think of Acacia, Vachellia, Racosperma, Senegalia). This happens all the time in a taxonomist's head and I cannot see a reason why a machine should not be able to do the same. Looking at biological circumscriptions happens when establishing new concepts, but I strongly doubt they are needed to identify them, at least not in the vast majority of cases, as Avibase has already proven. If someone strongly thinks differently I would be very interested to look at some real world examples.

It clearly is also a question of granularity. Surely there can be subtle differences between the taxon concepts of two different usages of the same name without a split or merge happening. You would not be able to recognise these by looking just at the synonymy and classification. But all larger concept changes will be reflected in a split or merge, and it is those that I would like a machine to be able to track and identify.
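
As a rough illustration of tracking splits and merges by machine, the sketch below compares two snapshots of a checklist, each mapping a taxonID to the set of protonyms in its synonymy, and labels what happened to every old taxonID. The function name `trace_changes`, the labels and all identifiers are assumptions made for this example. Subtle concept changes that leave the synonymy untouched are invisible to it, as noted above.

```python
from typing import Dict, Set

# taxonID -> protonyms in the synonymy (accepted name's protonym included)
Snapshot = Dict[str, Set[str]]


def trace_changes(old: Snapshot, new: Snapshot) -> Dict[str, str]:
    """Label each old taxonID by what happened to its circumscription."""
    result = {}
    for old_id, old_names in old.items():
        # New concepts that share at least one protonym with the old concept.
        heirs = [new_id for new_id, names in new.items() if names & old_names]
        if not heirs:
            result[old_id] = "dropped"
        elif len(heirs) > 1:
            result[old_id] = "split into " + ", ".join(sorted(heirs))
        elif new[heirs[0]] == old_names:
            result[old_id] = "unchanged"
        elif new[heirs[0]] > old_names:
            result[old_id] = "merged into " + heirs[0]
        else:
            result[old_id] = "changed"  # narrowed or only partly overlapping
    return result


# Hypothetical snapshots (assumed data).
old = {"T1": {"bus", "cus"}, "T2": {"dus"}, "T3": {"eus"}, "T4": {"fus"}}
new = {"N1": {"bus"}, "N2": {"cus"}, "N3": {"dus", "eus"}, "N4": {"fus"}}

for taxon_id, fate in trace_changes(old, new).items():
    print(taxon_id, fate)
```

Under rules like those discussed in this thread, only the "unchanged" cases would be candidates for keeping their identifier; how the other cases should be handled is exactly the open question.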

@nfranz

nfranz commented Aug 28, 2020

In my experience, whenever one is to assert congruence (or lack thereof) between two parent-level concepts (say, species level) yet their respective sets of children (say, subspecies level) are purposefully not sampled (accounted for) in a comprehensive manner - perhaps because there is an upfront regional constraint that is non-matching; or the real aim is higher-level phylogenetic relationships; or only concepts with associated DNA data are considered - then we need an element of intentionality to get reliable relationships. Someone, or something, has to say: taxonomically congruent in spite of these lower-level incongruences.

Think Quercus sec. http://beta.floranorthamerica.org/Quercus vs. Quercus sec. http://www.efloras.org/florataxon.aspx?flora_id=2&taxon_id=127839. An algorithm that starts with matching child-level concepts and propagates those signals up to the next level will be highly sensitive to that strictly extensional signal.

I agree that in principle this is not out of reach to program in logic/AI. But how best to do so might need to be informed by sources where humans are enabled to develop the rules (first).
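
The sensitivity described above can be shown with a small sketch: two regional treatments of the same genus sample different children, so a purely extensional comparison reports "overlaps", while a recorded human assertion ("congruent in spite of these lower-level incongruences") takes precedence. The labels, the child identifiers and the `relation` function are hypothetical; no content of the two Quercus treatments linked above is reproduced here.

```python
from typing import Dict, Set, Tuple


def extensional_relation(a: Set[str], b: Set[str]) -> str:
    """Relationship derived only from the sampled child concepts."""
    if a == b:
        return "congruent"
    if a > b:
        return "includes"
    if a < b:
        return "included in"
    if a & b:
        return "overlaps"
    return "disjoint"


def relation(
    label_a: str,
    children_a: Set[str],
    label_b: str,
    children_b: Set[str],
    asserted: Dict[Tuple[str, str], str],
) -> Tuple[str, str]:
    """Prefer an explicit human assertion; otherwise infer from the children."""
    key = (label_a, label_b)
    if key in asserted:
        return asserted[key], "asserted"
    return extensional_relation(children_a, children_b), "inferred"


# Hypothetical regional treatments that each sample only local species.
flora_a = {"sp1", "sp2", "sp3"}
flora_b = {"sp2", "sp3", "sp4"}

# Curator decisions, assumed to be kept under version control with the data.
assertions = {("Genus sec. Flora A", "Genus sec. Flora B"): "congruent"}

print(relation("Genus sec. Flora A", flora_a, "Genus sec. Flora B", flora_b, {}))
print(relation("Genus sec. Flora A", flora_a, "Genus sec. Flora B", flora_b, assertions))
```

Returning the provenance ("asserted" or "inferred") alongside the relationship keeps it visible which results rest on trained judgment and which on the sampled children alone.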

@mdoering
Member Author

Genus or higher level taxon concepts are definitely difficult and probably not that useful either for users. What interests me most is really species and infraspecific taxa as these are the units we mostly use for identifications and other linking of data.

@nfranz do you see the same problems there?

@nfranz

nfranz commented Aug 28, 2020

I think this is both awesome and yet still deliberately designed for human curation (as part of a semi-automated workflow, as one might call it). https://github.com/jar398/cldiff/blob/master/doc/ncbi-gbif.csv

@mjy

mjy commented Aug 28, 2020

Algorithm-based rules are objective even if human guided/assisted, as long as the human decisions are persisted in code/repositories. Let's not forget Open Tree's effort, which includes interesting (and implemented) solutions to some of the practical issues outlined here. It includes code, data, and reference to the bigger picture (e.g. NCBI) that has so far eluded CoL. https://github.com/OpenTreeOfLife/reference-taxonomy.

@AlasdairGray

Having a list of stable identifiers for sets of names (e.g., an identifier for all variations and objective synonyms of a name) and links between names that have been asserted at various times (e.g., links between heterotypic synonyms) would be great. These are all facts (this name is objectively a synonym of that, or it has been asserted by so-and-so that these two names are synonyms), and stable identifiers for facts enable others to link their data to those facts.

My concern was that the whole issue of taxon concept equivalence not only brought with it uncertainty over what a concept actually is (hence the endless discussion) but was not something CoL could really speak to (hence, "out of scope").

In Bioschemas, we are moving to having a Taxon for the concept and a separate TaxonName for capturing the naming information that has been associated with the Taxon.

@frmichel

frmichel commented Aug 31, 2020

@rdmpage @AlasdairGray:

In Bioschemas, we are moving to having a Taxon for the concept and a separate TaxonName for capturing the naming information that has been associated with the Taxon.

Plus, the properties name/alternateName and scientificName/alternateScientificName are used to relate names to a taxon, either as the accepted name or as a synonym (the "alternate*").

links between names that have been asserted at various times (e.g., links between heterotypic synonyms) would be great.

This is certainly a need for CoL, but Bioschemas does not intend to be or replace a rich domain ontology; it is just a markup vocabulary. Drawing the limit between the two is hard. Still, I feel that trying to go further by tracking the use of names and their changes throughout time goes beyond Bioschemas' goal here. I would say let's keep the vocabulary sufficient to give a snapshot, at a certain time, of how names are used to refer to a taxon. What do you think?
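
For readers unfamiliar with the markup, here is a minimal sketch of what such a snapshot could look like, built as a Python dictionary and serialized to JSON-LD. Only the type names (Taxon, TaxonName) and the property names (name, alternateName, scientificName, alternateScientificName) come from the comments above. The `@context` value, the nesting of TaxonName objects, the helper `taxon_markup` and the sample names are assumptions for illustration and should be checked against the Bioschemas profile before use.

```python
import json
from typing import Dict, List


def taxon_markup(accepted: str, synonyms: List[str]) -> Dict[str, object]:
    """Build one point-in-time description of how names refer to a taxon."""
    return {
        "@context": "https://schema.org",  # assumed context
        "@type": "Taxon",
        # Plain-string forms: accepted name and its synonyms.
        "name": accepted,
        "alternateName": list(synonyms),
        # Structured forms (assumed nesting): one TaxonName per name.
        "scientificName": {"@type": "TaxonName", "name": accepted},
        "alternateScientificName": [
            {"@type": "TaxonName", "name": synonym} for synonym in synonyms
        ],
    }


# Hypothetical names, following the Aus/Xus example used in this thread.
snapshot = taxon_markup("Aus bus", ["Xus bus"])
print(json.dumps(snapshot, indent=2))
```

Nothing in this structure records when or by whom a synonymy was asserted, which matches the snapshot-only scope suggested above.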

@AlasdairGray

links between names that have been asserted at various times (e.g., links between heterotypic synonyms) would be great.

This is certainly a need for CoL, but Bioschemas does not intend to be or replace a rich domain ontology; it is just a markup vocabulary. Drawing the limit between the two is hard. Still, I feel that trying to go further by tracking the use of names and their changes throughout time goes beyond Bioschemas' goal here. I would say let's keep the vocabulary sufficient to give a snapshot, at a certain time, of how names are used to refer to a taxon. What do you think?

For Bioschemas yes I would agree, although I believe the mechanism that you have outlined would support names through time to some extent.

For the CoL, a more fully fleshed out approach would be required.

@jliljeblad

Genus or higher level taxon concepts are definitely difficult and probably not that useful either for users. What interests me most is really species and infraspecific taxa as these are the units we mostly use for identifications and other linking of data.

Agreed. Even if, say, a family changes conceptually whenever its children are divided into two families, I don't always see it being helpful to the end user if the original family ID is changed, especially if (as in the case of our national database Dyntaxa) only the original family is represented in the dataset. As you say, it is the species level and below that most users rely on to link various biological data.

@jliljeblad

For a great number of names in CoL there is only one concept ever tied to them (not counting parent changes as concept changes). They might have changed names, but only to homotypic ones. For these, it really doesn't matter much if we don't know the definition of the concepts, only that they are still the same. If these would be tied to stable taxon IDs, a lot would already be won. So, it's not like Markus' ordeal is in vain just because we don't know the concepts behind these names. Just wanted to point this out. Of course, it's still a bit of a problem if we can't tell if we have one of these "unproblematic" taxa or not...
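
One way to look for these "unproblematic" taxa, sketched under the assumption that every snapshot lists the protonyms in each accepted taxon's synonymy: a protonym that never shares a synonymy with another protonym in any available snapshot has only ever had one concept visible in the data, whatever combination it is currently placed in. The function name `unproblematic` and the sample snapshots are hypothetical.

```python
from typing import Dict, List, Set

# accepted name -> protonyms in its synonymy (its own protonym included)
Snapshot = Dict[str, Set[str]]


def unproblematic(snapshots: List[Snapshot]) -> Set[str]:
    """Protonyms never lumped with another protonym in any snapshot."""
    seen: Set[str] = set()
    lumped: Set[str] = set()
    for snapshot in snapshots:
        for protonyms in snapshot.values():
            seen |= protonyms
            if len(protonyms) > 1:
                # Heterotypic synonymy: these protonyms shared a concept.
                lumped |= protonyms
    return seen - lumped


# Hypothetical history: "bus" only changes combination (Aus bus -> Xus bus),
# while "dus" is sunk into the synonymy of "cus" in the second snapshot.
snapshots = [
    {"Aus bus": {"bus"}, "Aus cus": {"cus"}, "Aus dus": {"dus"}},
    {"Xus bus": {"bus"}, "Aus cus": {"cus", "dus"}},
]

print(sorted(unproblematic(snapshots)))
```

This can only certify taxa relative to the snapshots at hand. A concept change that predates the data, or one that never touched the synonymy, would go unnoticed, which is the residual problem mentioned above.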

@jar398

jar398 commented Sep 2, 2020

For what it's worth I tried to write down all the different cases around this issue of id sharing a few years ago, with particular examples to help sharpen thinking. Document is here.

I agree that species are more important than higher taxa, but we do sometimes see higher taxon identifiers being used in statements about biology, so I wouldn't dismiss higher taxon identity as completely unimportant. I've recently come to think that an aggregator might choose to apply different rules for identifier choice at species (or genus) and below vs. above that, and maybe this is OK - I don't know.

Agreed that a taxonomic database may be practically adequate in "defining" taxa (concepts), especially at higher levels. There will be failures relative to a proper circumscription as new borderline taxa come along (think: fossils that might or might not be called mammals depending on the chosen circumscription of 'mammal'), but that's a rare event and the risk of it may not warrant the effort required to track down rigorous circumscriptions consistent with an existing database ahead of time - circumscriptions don't even always exist. I'd rather say that a database snapshot gives its own 'taxon identifier usage' for a particular 'taxon identifier' and that usage pattern is or is not biologically equivalent (in practice) to the usage of the identifier in another snapshot.

@mdoering
Member Author

Linking an interesting discussion in iNaturalist about similar problems initiated by Rod: https://forum.inaturalist.org/t/what-do-the-inaturalist-the-taxa-urls-represent-taxa-or-taxonomic-names/15551
