Figure out how to treat diacritics better #201

Open
dimus opened this issue Nov 10, 2021 · 10 comments

@dimus
Member

dimus commented Nov 10, 2021

@abubelinha raised the following in #199:

In summary, for the ö case, I think o is a much more conservative approach than oe (which looks like a Germanic phonetic replacement, but gnparser does not do that in other cases like ñ, which is replaced by n even though it sounds more like ny in Spanish).

New comment now:

As there could be different opinions about this, I wonder if a future version could let us feed gnparser an array of replacements (i.e. a config file, or something we can post through the API) so we can force it to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice to override defaults).

Perhaps the Cyrillic character issue (originating from keyboards, OCR, or spell checkers?) could be frequent in some scenarios, and it would be good to let gnparser correct it when we know it is happening.
Spell checkers have the side effect of capitalizing the first letter of some words (after "subsp." or "var."), and depending on the spell checker's language, they could be the origin of some of the accented characters in Latin names.

@dimus
Member Author

dimus commented Nov 10, 2021

ICN and ICZN treat diacritics differently; on top of that, people transliterate them inconsistently from case to case. So we may have several lexical variants of the same name:

1. Aus bös
2. Aus boes
3. Aus bos

While 1 and 2 will get the same canonical form "Aus boes", the 3rd one will get "Aus bos".

For long names this is still not a huge problem, as the names will match fuzzily, but for short names fuzzy matching is not applied, to avoid false positives.

Proposed idea:

  1. Keep Canonical.Full and Canonical.Simple the same as now
  2. When generating Canonical.Stem, transliterate all "oe" to "o", and do the same for all other German diacritics.

Positive outcome:
All 3 cases from the example above will match

Negative outcome:
We might create a significant number of false positives.
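
A minimal sketch of the proposed idea, assuming the stem is built by plain string replacement. The function name `stem_key` and the replacement tables are hypothetical and are not gnparser's actual stemmer:

```python
# Hypothetical illustration of the proposal; not gnparser's real code.
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}
GERMAN_DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def stem_key(canonical: str) -> str:
    """Collapse umlauts and their 'e' digraphs so lexical variants compare equal."""
    for src, dst in UMLAUTS.items():
        canonical = canonical.replace(src, dst)
    for src, dst in GERMAN_DIGRAPHS.items():
        canonical = canonical.replace(src, dst)
    return canonical


variants = ["Aus bös", "Aus boes", "Aus bos"]
print({stem_key(v) for v in variants})  # all three collapse to one key

# The same rule also rewrites genuine Latin digraphs, which is where
# false positives could come from:
print(stem_key("Aus caerulea"))
```

The last call shows the negative outcome: an ordinary Latin "ae" is collapsed as if it were a transliterated "ä".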

@tobymarsden

@dimus A name demonstrating an issue that I'm seeing is Leptochloöpsis virgata.

Currently the output is (incorrectly for ICN):

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochlooepsis virgata",
  "canonical": {
    "stemmed": "Leptochlooepsis uirgat",
    "simple": "Leptochlooepsis virgata",
    "full": "Leptochlooepsis virgata"
  },

With the new --diaereses option enabled, this is the output:

  "verbatim": "Leptochloöpsis virgata",
  "normalized": "Leptochloöpsis virgata",
  "canonical": {
    "stemmed": "Leptochloopsis uirgat",
    "simple": "Leptochloöpsis virgata",
    "full": "Leptochloöpsis virgata"
  },

(Note the transliteration of the ö in stemmed.)

While I think your proposed idea is an improvement, I wonder if ä, ö, ü should always be transliterated to a, o, u when they come after a vowel, everywhere (not just in stemmed). Otherwise there's no way to correctly parse e.g. Leptochloöpsis virgata without choosing to preserve the diaereses. I confess I don't know the implications of that...
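
To make the suggestion concrete, here is a rough sketch of the "after a vowel" rule. It assumes that a mark directly following a, e, i, o or u is a diaeresis and that any other ä/ö/ü is a German umlaut. The function name and tables are hypothetical, and only lowercase marked letters are handled:

```python
import re

# Hypothetical tables for illustration only.
STRIP = {"ä": "a", "ö": "o", "ü": "u"}
EXPAND = {"ä": "ae", "ö": "oe", "ü": "ue"}

# A marked vowel that directly follows another vowel is read as a diaeresis.
AFTER_VOWEL = re.compile(r"(?<=[aeiouAEIOU])[äöü]")
ANY_MARKED = re.compile(r"[äöü]")


def normalize_marks(name: str) -> str:
    # First drop the mark from diaereses (vowel + marked vowel).
    name = AFTER_VOWEL.sub(lambda m: STRIP[m.group()], name)
    # Whatever is left is treated as a German umlaut and gets an "e".
    return ANY_MARKED.sub(lambda m: EXPAND[m.group()], name)


print(normalize_marks("Leptochloöpsis virgata"))
print(normalize_marks("Ortygospiza atricollis mülleri"))
```

Under these assumptions the first name loses its diaeresis while the second still gets the "ue" expansion.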

@Archilegt

I don't know if parsing is meant to match the Codes. It is mentioned above that names are treated differently among the Codes.
As per the ZooCode, article 32.5.2:

32.5.2. A name published with a diacritic or other mark, ligature, apostrophe, or hyphen, or a
species-group name published as separate words of which any is an abbreviation, is to be corrected.
32.5.2.1. In the case of a diacritic or other mark, the mark concerned is deleted, except that in a
name published before 1985 and based upon a German word, the umlaut sign is deleted from a vowel
and the letter "e" is to be inserted after that vowel (if there is any doubt that the name is based upon a
German word, it is to be so treated).
Examples. nuñezi is corrected to nunezi, and mjøbergi to mjobergi, but mülleri (published before
1985) is corrected to muelleri.

Forcing ñ into n is ZooCode-compliant. Go for it.

For German umlauts, the parser would have to:

read [article metadata] year,
read [article metadata] language,
if year < 1985 and language = DE,
then ä = ae, ö = oe, ü = ue, 
if year < 1985 and language = non-DE,
then ä = a, ö = o, ü = u, ó/ò/ô/ø/ö = o,
if year >= 1985 and language = any,
then ä = a, ö = o, ü = u, ó/ò/ô/ø/ö = o
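
The rule above could be sketched roughly like this. It assumes the caller already has the publication year and a language code from article metadata; `correct_epithet` and both tables are hypothetical names, and the tables only cover the characters mentioned in this thread:

```python
# Hypothetical sketch of the year/language rule; not part of gnparser.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}
PLAIN = {
    "ä": "a", "ö": "o", "ü": "u",
    "ó": "o", "ò": "o", "ô": "o", "ø": "o",
    "ñ": "n",
}


def correct_epithet(epithet: str, year: int, language: str) -> str:
    """Apply the pre-1985 German umlaut rule, otherwise just drop the mark."""
    german_rule = year < 1985 and language.upper() == "DE"
    out = []
    for ch in epithet:
        if german_rule and ch in GERMAN_UMLAUTS:
            out.append(GERMAN_UMLAUTS[ch])  # umlaut becomes vowel + "e"
        else:
            out.append(PLAIN.get(ch, ch))   # mark deleted, other chars kept
    return "".join(out)


print(correct_epithet("mülleri", 1900, "DE"))
print(correct_epithet("mülleri", 1990, "DE"))
print(correct_epithet("nuñezi", 1900, "ES"))
```

Note that the Code speaks of a name "based upon a German word", so using the article language as a proxy is itself an assumption.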

@Archilegt

If this "German issue" is fixed, we can definitely include it in the Verhoeff paper GNA module.

@dimus
Member Author

dimus commented Aug 25, 2022

The German issue is "fixed" to the best of our abilities, for example:

http://parser.globalnames.org/?format=html&names=Ortygospiza+atricollis+m%C3%BClleri&with_details=on

GNparser treats names with ü, ö, ä as German names published before 1985. As names come to the parser without context, it is the best we could come up with.
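
For anyone who wants to try other names, a small sketch that builds the same kind of link. The query parameters are copied from the URL above; the helper name `parser_url` is made up, and nothing here sends a request:

```python
from urllib.parse import urlencode

BASE_URL = "http://parser.globalnames.org/"


def parser_url(name: str) -> str:
    """Build a web-interface link like the one in this comment."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 and turns spaces into "+".
    query = urlencode({"format": "html", "names": name, "with_details": "on"})
    return f"{BASE_URL}?{query}"


print(parser_url("Ortygospiza atricollis mülleri"))
```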

@Archilegt

One question: Are "wordType" values open to changes?
For example: genus to genericName, species to specificEpithet, infraspecies to infraspecificEpithet.

@dimus
Member Author

dimus commented Aug 25, 2022

One question: Are "wordType" values open to changes? For example: genus to genericName species to specificEpithet infraspecies to infraspecificEpithet.

I decided on shorter names because it saves a bit of bandwidth, and I considered that genus, species and infraspecies would be enough to explain the intention of the field. I guess for a real taxonomist these terms do sound kind of weird.

I can change the terms, @Archilegt, however it would create a backward incompatibility. I did ask a few taxonomists (when I was developing the first version of the parser in 2008) if the shortened values bothered them, and got the answer that it was not a biggie for them. So since then the values have stayed as they are now, but maybe your suggestion is better. Can you tell me your motivation for the change?

@Archilegt

@dimus, great to read about the background! The main motivation for the change is aiming at all of us speaking the same language. In a way, the Codes of Nomenclature are biodiversity informatics standards, and terms and definitions contained in the Codes are being adopted by other standards like DarwinCore. As DC becomes more widely used and understood, that creates a larger community speaking "the language". Any software that reuses the same language would benefit from better understanding by the community. In general, the less "mapping" we need from software to (human or machine) user, the better. :)

@dimus
Member Author

dimus commented Aug 25, 2022

@Archilegt I think it is a valid motivation for this change, and clarity is worth eating a little more bandwidth. It would create a compatibility problem for people though, and would probably require v2.x.x of the parser.

That means people who use the v1 API will not automatically receive improvements anymore. It would also create a necessity to keep several API versions running on our side "in perpetuity".

So I will make an issue from your suggestion, mark it with 'v2' tag and see if other issues demanding backward incompatibility will tip the balance and v2 will need to become a reality.
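
Until a v2 exists, clients who prefer the longer terms could do the mapping on their side. A minimal sketch, assuming the three values exactly as written in this thread (the casing in real output may differ); the table and function names are hypothetical:

```python
# Hypothetical client-side shim; not part of gnparser.
V1_TO_PROPOSED_WORD_TYPE = {
    "genus": "genericName",
    "species": "specificEpithet",
    "infraspecies": "infraspecificEpithet",
}


def rename_word_type(value: str) -> str:
    """Return the proposed term, leaving unknown values unchanged."""
    return V1_TO_PROPOSED_WORD_TYPE.get(value, value)


print(rename_word_type("species"))
```

Values not listed in the table pass through untouched, so the shim does not break on word types it does not know about.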
