Figure out how to treat diacritics better #201

dimus · 2021-11-10T17:54:33Z

@abubelinha raised the following in #199:

In summary, for the ö case, I think o is a much more conservative approach than oe (which looks like a Germanic phonetic replacement, but gnparser does not do that in other cases like ñ, which is replaced by n even though it sounds more like ny in Spanish).

New comment now:

As there could be different opinions about this, I wonder if in a future version it could be possible to feed gnparser with an array of replacements (i.e. a config file, or something we can post through the api) so we can force it to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice to override defaults).

Perhaps the Cyrillic characters issue (keyboard-originated / OCR-originated / orthographic corrector-originated?) could be frequent in some scenarios, and it would be good to let gnparser correct this when we know it's happening.
Orthographic correctors have the side effect of capitalizing the first letter of some of your words (after "subsp." or "var."); and depending on the orthographic corrector language, they could be the origin of some of the accented characters in Latin names.
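
A rough sketch of what such a user-supplied replacement table could look like. This is purely illustrative: `DEFAULTS`, `transliterate` and `user_overrides` are hypothetical names, not part of gnparser, and the two defaults shown are only the ones described in this thread (ö to oe, ñ to n).

```python
# Hypothetical sketch, NOT gnparser's API or its real default table.
# The two defaults below are the behaviours described in this thread.
DEFAULTS = {"ö": "oe", "ñ": "n"}


def transliterate(name, overrides=None):
    """Replace characters using the defaults, with user overrides taking priority."""
    table = {**DEFAULTS, **(overrides or {})}
    return "".join(table.get(ch, ch) for ch in name)


# A user-chosen table like the one requested above, including the
# Cyrillic look-alikes п and г.
user_overrides = {
    "ó": "o", "ò": "o", "ô": "o", "ø": "o", "ö": "o",
    "п": "n", "г": "r",
}

print(transliterate("Aus bös"))                  # default behaviour
print(transliterate("Aus bös", user_overrides))  # user-forced behaviour
```

With the overrides applied, the ö case falls back to the conservative single-letter replacement while everything not listed keeps the default treatment.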

dimus · 2021-11-10T18:04:01Z

ICN and ICZN treat diacritics differently; on top of that, people transliterate them inconsistently from case to case. So we may have several lexical variants for the same name:

1. Aus bös
2. Aus boes
3. Aus bos

While 1 and 2 will get the same canonical form "Aus boes", the 3rd one will get "Aus bos".

For long names this is still not a huge problem, as the names will match fuzzily, but for short names fuzzy matching is not used, to avoid false positives.

Proposed idea:

Keep Canonical.Full and Canonical.Simple the same as now
When generating Canonical.Stem, transliterate all "oe" to "o", and do the same for all other German diacritics.

Positive outcome:
All 3 cases from the example above will match

Negative outcome:
We might create a significant number of false positives.
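
A minimal sketch of the proposed idea, assuming the input is already the current canonical form (so "Aus bös" has become "Aus boes"). The function name `stem_key` is made up for illustration, and gnparser's real stemming does more than this.

```python
# Hypothetical sketch of the proposal, not gnparser's stemmer.
# Input is assumed to be the existing canonical form, where ö is already "oe".
GERMAN_DIGRAPHS = (("ae", "a"), ("oe", "o"), ("ue", "u"))


def stem_key(canonical):
    """Collapse German-style digraphs so that lexical variants share one key."""
    for digraph, vowel in GERMAN_DIGRAPHS:
        canonical = canonical.replace(digraph, vowel)
    return canonical


# Canonical forms of the three variants listed above.
variants = ["Aus boes", "Aus boes", "Aus bos"]
print({stem_key(v) for v in variants})
```

All three variants end up with the same key, which is the positive outcome. The same collapse applies to every "ae", "oe" and "ue" in any name, German or not, which is where the false positives would come from.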

tobymarsden · 2021-11-19T01:45:41Z

@dimus A name demonstrating an issue that I'm seeing is Leptochloöpsis virgata.

Currently the output is (incorrectly for ICN):

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochlooepsis virgata",
  "canonical": {
    "stemmed": "Leptochlooepsis uirgat",
    "simple": "Leptochlooepsis virgata",
    "full": "Leptochlooepsis virgata"
  },

With the new --diaereses option enabled, this is the output:

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochloöpsis virgata",
  "canonical": {
    "stemmed": "Leptochloopsis uirgat",
    "simple": "Leptochloöpsis virgata",
    "full": "Leptochloöpsis virgata"
  },

(Note the transliteration of the ö in stemmed.)

While I think your proposed idea is an improvement, I wonder if ä, ö, ü should always be transliterated to a, o, u when they come after a vowel, everywhere (not just in stemmed). Otherwise there's no way to correctly parse e.g. Leptochloöpsis virgata without choosing to preserve the diaereses. I confess I don't know the implications of that...
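
To make the suggestion concrete, here is a small sketch of the "after a vowel" rule. It is only an illustration of the heuristic: the names are hypothetical, only lowercase ä, ö, ü are handled, and this is not how gnparser currently behaves.

```python
# Hypothetical sketch of the heuristic suggested above, not gnparser code.
VOWELS = set("aeiou")
# character -> (reading as a diaeresis, reading as a German umlaut)
MARKED = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue")}


def transliterate(name):
    """Treat ä/ö/ü as a diaeresis after a vowel, otherwise as a German umlaut."""
    out = []
    prev = ""
    for ch in name:
        if ch in MARKED:
            plain, german = MARKED[ch]
            out.append(plain if prev.lower() in VOWELS else german)
        else:
            out.append(ch)
        prev = ch
    return "".join(out)


print(transliterate("Leptochloöpsis virgata"))
print(transliterate("Ortygospiza atricollis mülleri"))
```

The ö in Leptochloöpsis follows an o, so it is read as a diaeresis and becomes a plain o, while the ü in mülleri follows a consonant and still becomes ue.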

Archilegt · 2022-08-22T14:24:36Z

I don't know if parsing is meant to match the Codes. It is mentioned above that names are treated differently among the Codes.
As per the ZooCode, article 32.5.2:

32.5.2. A name published with a diacritic or other mark, ligature, apostrophe, or hyphen, or a
species-group name published as separate words of which any is an abbreviation, is to be corrected.

32.5.2.1. In the case of a diacritic or other mark, the mark concerned is deleted, except that in a
name published before 1985 and based upon a German word, the umlaut sign is deleted from a vowel
and the letter "e" is to be inserted after that vowel (if there is any doubt that the name is based upon a
German word, it is to be so treated).

Examples. nuñezi is corrected to nunezi, and mjøbergi to mjobergi, but mülleri (published before
1985) is corrected to muelleri.

Forcing ñ into n is ZooCode-compliant. Go for it.

For German umlauts, the parser would have to:

read [article metadata] year,
read [article metadata] language,
if year < 1985 and language = DE,
then ä = ae, ö = oe, ü = ue,
if year < 1985 and language = non-DE,
then ä = a, ö = o, ü = u, ó/ò/ô/ø = o,
if year >= 1985 and language = any,
then ä = a, ö = o, ü = u, ó/ò/ô/ø = o
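
The steps above could be sketched roughly as follows. This assumes the caller already knows the publication year and whether the name is based on a German word (the parser itself does not receive that metadata); `correct_epithet` and `german_based` are hypothetical names.

```python
import unicodedata

# Hypothetical sketch of ICZN Art. 32.5.2.1 as quoted above, not gnparser code.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}
# ø has no Unicode decomposition, so it needs an explicit entry.
NON_DECOMPOSING = {"ø": "o"}


def correct_epithet(epithet, year, german_based):
    """Delete diacritics; insert "e" only for pre-1985 German-based umlauts."""
    out = []
    for ch in epithet:
        if ch in GERMAN_UMLAUTS and year < 1985 and german_based:
            out.append(GERMAN_UMLAUTS[ch])
        elif ch in NON_DECOMPOSING:
            out.append(NON_DECOMPOSING[ch])
        else:
            # Split the character into base letter plus combining marks,
            # then drop the marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(correct_epithet("nuñezi", 1900, False))
print(correct_epithet("mjøbergi", 1900, False))
print(correct_epithet("mülleri", 1900, True))
print(correct_epithet("mülleri", 1990, True))
```

The first three calls reproduce the examples from the article; the last one shows the same epithet published in or after 1985, where the umlaut is simply deleted.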

Archilegt · 2022-08-22T15:25:55Z

If this "German issue" is fixed, we can definitely include it in the Verhoeff paper GNA module.

dimus · 2022-08-25T11:12:11Z

The German issue is "fixed" to the best of our abilities, for example:

http://parser.globalnames.org/?format=html&names=Ortygospiza+atricollis+m%C3%BClleri&with_details=on

GNparser treats names with ü, ö, ä as German names published before 1985. As names come to the parser without context, it is the best we could come up with.
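
For anyone who wants to try other names the same way, the link above can be rebuilt from its query parameters. This sketch only constructs the URL (it does not call the service) and uses nothing beyond the parameters visible in that link.

```python
from urllib.parse import urlencode

# Query parameters taken from the example link above.
params = {
    "format": "html",
    "names": "Ortygospiza atricollis mülleri",
    "with_details": "on",
}

# urlencode percent-encodes ü as UTF-8 (%C3%BC) and turns spaces into "+".
url = "http://parser.globalnames.org/?" + urlencode(params)
print(url)
```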

Archilegt · 2022-08-25T11:32:41Z

One question: Are "wordType" values open to changes?
For example: genus to genericName, species to specificEpithet, infraspecies to infraspecificEpithet.

Archilegt · 2022-08-25T11:35:07Z

One question: Are "wordType" values open to changes?
For example:
genus to genericName
species to specificEpithet
infraspecies to infraspecificEpithet.

dimus · 2022-08-25T13:10:43Z

One question: Are "wordType" values open to changes? For example: genus to genericName species to specificEpithet infraspecies to infraspecificEpithet.

I decided on shorter names because it saves a bit of bandwidth, and I considered that genus, species and infraspecies would be enough to explain the intention of the field. I guess for a real taxonomist these terms do sound kind of weird.

I can change the terms, @Archilegt, however it would create a backward incompatibility. I did ask a few taxonomists (when I was developing the first version of the parser in 2008) if the shortened values bothered them, and got the answer that it was not a biggie for them. So the values have stayed as they are since then, but maybe your suggestion is better. Can you tell me your motivation for the change?

Archilegt · 2022-08-25T13:28:32Z

@dimus, great to read about the background! The main motivation for the change is aiming at all of us speaking the same language. In a way, the Codes of Nomenclature are biodiversity informatics standards, and terms and definitions contained in the Codes are being adopted by other standards like DarwinCore. As DC becomes more widely used and understood, that creates a larger community speaking "the language". Any software that reuses the same language would benefit from better understanding by the community. In general, the less "mapping" we need from software to (human or machine) user, the better. :)

dimus · 2022-08-25T13:40:42Z

@Archilegt I think it is a valid motivation for this change, and clarity is worth eating a little more bandwidth. It would create a compatibility problem for people though, and would probably require v2.x.x for the parser.

That means people who use the v1 API will not automatically receive improvements anymore. It would also create a necessity to keep several API versions running on our side "in perpetuity".

So I will make an issue from your suggestion, mark it with 'v2' tag and see if other issues demanding backward incompatibility will tip the balance and v2 will need to become a reality.

dimus mentioned this issue Nov 17, 2021

Option to retain diacritics? #208

Closed

dimus mentioned this issue Aug 25, 2022

As a User I want more standard terms instead of genus, species, infraspecies #235

Open
