English translation of examples sometimes missing, included in the original text #604

Open
29jm opened this issue Apr 24, 2024 · 12 comments

@29jm

29jm commented Apr 24, 2024

Some examples for certain Czech words are malformed: instead of the example object having both text and english, it has only text, which contains a concatenation of both, e.g. "the original text ― the translated text".

Some examples:

  • ani: only the last example has the issue
  • dobrý: same
  • It isn't always the last example, though: see zničit, for instance.
  • Some more like this: vzít, hlavní, čas, žít, paní, jméno, smrt, kniha, psát, názor, světový, osobní, minulý, onen, umění, věk, telefon, zástupce, ženský.

If it's a problem in the page markup, I can fix it there, but I didn't see what could cause it.
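A quick way to list affected entries is to scan the extracted JSON for examples that have no english field but whose text contains the " ― " separator. This is only a minimal sketch: it assumes one JSON object per line with examples nested under senses, and `find_concatenated_examples` is a hypothetical helper name, not part of the project.

```python
import json

# Separator seen in the report: U+2015 (horizontal bar) with a space on each side.
SEPARATOR = " ― "


def find_concatenated_examples(jsonl_lines):
    """Yield (word, text) for examples that lack "english" but contain the separator.

    Assumes one JSON object per line, with examples under entry["senses"][i]["examples"].
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            for example in sense.get("examples", []):
                text = example.get("text", "")
                if "english" not in example and SEPARATOR in text:
                    yield entry.get("word"), text


# Made-up sample data mirroring the shape described in the report.
sample = [
    json.dumps({"word": "ani", "senses": [{"examples": [
        {"text": "the original text", "english": "the translated text"},
        {"text": "the original text ― the translated text"},
    ]}]}),
]
hits = list(find_concatenated_examples(sample))
print(hits)
```

With a real dump, passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.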

@kristian-clausal
Collaborator

kristian-clausal commented Apr 24, 2024

Yeah, this is buggy, I'll take a look at it tomorrow.

@kristian-clausal
Collaborator

Sorry, I got stuck trying to figure out a bug(?) in our logging system, so this might have to wait a while.

@29jm
Author

29jm commented Apr 25, 2024

No worries at all, I know how it is :)

@kristian-clausal
Collaborator

kristian-clausal commented Apr 26, 2024

Found part of the issue, and it's a silly one: the examples I looked at (ani and vzít) contained a word that was blacklisted from our "what is an English word" set: He, with a capital. Our function that determines what kind of language or function a string has rejects the string because 1/4 of it is classified as non-English, so it all gets lumped together as a romanization string and the example is concatenated.

Actually, looking at vzít, the broken example doesn't get through because it's two-thirds Czech names with one English word.
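To make the two failure modes concrete, here is a rough sketch of a word-fraction heuristic of the kind described. It is not the project's actual classifier: the vocabulary, the 0.8 threshold, and the function names are all assumptions chosen so that a sentence with one unrecognized word out of four is rejected.

```python
import string

# Tiny stand-in vocabulary (assumption); note it holds lowercase forms only.
VOCABULARY = {"he", "took", "the", "book", "married"}


def tokenize(text):
    """Split on whitespace and strip surrounding ASCII punctuation."""
    tokens = (token.strip(string.punctuation) for token in text.split())
    return [token for token in tokens if token]


def english_fraction(text, vocabulary=VOCABULARY):
    """Fraction of tokens found in the vocabulary (case-sensitive on purpose)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def looks_english(text, threshold=0.8, vocabulary=VOCABULARY):
    """Hypothetical acceptance rule: enough of the tokens must be known words."""
    return english_fraction(text, vocabulary) >= threshold


# Capitalized "He" is not in the set, so 1/4 of the sentence counts as non-English.
print(english_fraction("He took the book."))
# Lowercasing before the lookup would let the same sentence through.
print(looks_english("He took the book.".lower()))
# Two Czech names and one English word: no simple threshold rescues this one.
print(english_fraction("Eliška married Tomáš"))
```

The last case is the hard one: without a list of names, a sentence that is mostly proper nouns looks the same to this kind of check as a sentence in another language.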

I am writing this message as I'm going through examples, and hlavní is a completely different issue: the example isn't in a template that we accept as an 'example' (ux) template, but in a coi template, which is related to collocation. It produces the same output as an example template and holds the same kind of content (looks like an example, quacks like an example), so I just added the template and its aliases to our list of "ux" templates.

I'm going to commit these, and hopefully most, if not all, of the issues will be addressed. If you find that some weren't fixed, just point them out.

Unless we add a bunch of Czech names (which is the simplest way), examples that are just "Eliška married Tomáš" are impossible to classify with a simple heuristic.

@29jm
Author

29jm commented Apr 26, 2024

Unfortunate that Wiktionary doesn't tag the translation of examples with a language! If I understood correctly, this is why wiktextract has to implement these heuristics.

Thanks for the fixes already, I'll have a look at the JSON within the next week!

@29jm
Author

29jm commented May 1, 2024

I'm looking at the most recent JSON, and some problematic words like ani, čas and žít are fixed, while others aren't, e.g. hlavní, názor, osobní. I think many of the remaining ones use coi, though not all, for instance Ukrajina which uses uxi.

(in case it helps, a pastebin of all of them here)

@kristian-clausal
Collaborator

Many of those fail simply because they contain words that are not in nltk.corpus.brown: words like "cellphone", "mousetrap", "He’s" (with a unicode apostrophe or other character), "dumbfuck", "peppermint"... Hrm, many of these could be fixed if we could somehow cheaply detect compound words. If anyone has an idea of how to do this super-cheaply ("peppermint" -> contains "pepper" and "mint" smooshed together)...

Another category of problem is translations that are basically things like "common noun" or "adjective", phrases that will get classified as tags by the classifier.
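One cheap option for the compound-word question is to try every split point and accept the word if both halves are already known. The sketch below assumes a plain Python set of known words; `is_two_part_compound` and the `min_part` cutoff are made up for illustration, and the approach will produce some false positives on words that merely happen to split into two valid pieces.

```python
def is_two_part_compound(word, vocabulary, min_part=3):
    """Return True if word splits into two vocabulary words of at least min_part letters.

    Cost is one pair of set lookups per candidate split point, so it is linear
    in the word length.
    """
    word = word.lower()
    for split_at in range(min_part, len(word) - min_part + 1):
        if word[:split_at] in vocabulary and word[split_at:] in vocabulary:
            return True
    return False


# Stand-in vocabulary (assumption); a real run would use the existing word set.
known_words = {"pepper", "mint", "mouse", "trap", "cell", "phone"}

for candidate in ("peppermint", "mousetrap", "cellphone", "zničit"):
    print(candidate, is_two_part_compound(candidate, known_words))
```

The `min_part` cutoff is there to avoid accepting splits like a one-letter word plus a remainder; its value is a guess and would need tuning against real data.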

@xxyzz
Collaborator

xxyzz commented May 6, 2024

Can't we use template arguments or expanded HTML tags? The "ux" and "coi" templates put the translation text in the third argument; finding that argument should be easier and more reliable than checking whether words are English.
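As a rough illustration of reading the translation from the template arguments, here is a self-contained sketch that splits a template call at top-level pipes and picks out the second and third arguments. It does not use the project's own parser: `split_template_args` and `example_from_template` are hypothetical helpers, the template set contains only the names mentioned in this thread, and treating `t=` and `translation=` as aliases for the third argument is an assumption.

```python
# Template names taken from this thread; real alias lists are longer.
EXAMPLE_TEMPLATES = {"ux", "uxi", "coi"}


def split_template_args(wikitext):
    """Split '{{name|a|b|k=v}}' into (name, positional, named).

    Pipes inside nested {{...}} or [[...]] are not treated as separators.
    """
    wikitext = wikitext.strip()
    if not (wikitext.startswith("{{") and wikitext.endswith("}}")):
        raise ValueError("not a template call")
    inner = wikitext[2:-2]
    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))
    positional, named = [], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        # Treat as named only if the "=" is not inside a nested construct.
        if sep and not any(char in key for char in "{}[]"):
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return parts[0].strip(), positional, named


def example_from_template(wikitext):
    """Return {"text": ..., "english": ...} for a known example template, else None."""
    name, positional, named = split_template_args(wikitext)
    if name not in EXAMPLE_TEMPLATES:
        return None

    def arg(index, *aliases):
        for key in (str(index), *aliases):
            if named.get(key):
                return named[key]
        return positional[index - 1] if len(positional) >= index else None

    text = arg(2)
    translation = arg(3, "t", "translation")  # alias names are an assumption
    if not text:
        return None
    example = {"text": text}
    if translation:
        example["english"] = translation
    return example


print(example_from_template("{{ux|cs|the original text|the translated text}}"))
```

The appeal of this route is that no language guessing is involved: the split between original and translation comes straight from where the editor put each string.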

@kristian-clausal
Collaborator

I was thinking of that, yeah. We can check whether the arguments map onto the template's expanded output and exit early if they conform to the formatting of examples. There might be some pitfalls with this approach, for example if example templates are used for other things, but in context it might be fine.
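The "map onto the expanded output" check could look roughly like the following: accept the argument-based split only when the expanded string is the text argument, then nothing but separator characters, then the translation argument. This is a sketch under simplifying assumptions; `split_expanded_example` is a hypothetical name, and it compares raw strings, whereas real arguments would first need their wiki markup cleaned before they could match the expanded output.

```python
# Characters allowed between text and translation (assumption based on the
# "text ― translation" shape reported in this issue).
SEPARATOR_CHARS = " \t\n―"


def split_expanded_example(expanded, text, translation):
    """Return {"text", "english"} if expanded == text + separators + translation, else None."""
    expanded = expanded.strip()
    if not text or not translation:
        return None
    if len(text) + len(translation) > len(expanded):
        return None  # the two arguments would have to overlap
    if not expanded.startswith(text) or not expanded.endswith(translation):
        return None
    middle = expanded[len(text):len(expanded) - len(translation)]
    if middle.strip(SEPARATOR_CHARS):
        return None  # something other than a separator sits in between
    return {"text": text, "english": translation}


print(split_expanded_example(
    "the original text ― the translated text",
    "the original text",
    "the translated text",
))
```

Returning None on any mismatch keeps the existing heuristics available as a fallback for templates that are used in unexpected ways.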

@kristian-clausal
Collaborator

kristian-clausal commented May 6, 2024

EDIT: This is a post that was left unwritten earlier today, posting it here just for completeness.

"cause of death" slipped through because decode_tags classified it as tags. "of tags" is parsed as a tag (one that includes a space, so not in valid_tags), and "cause" is classified as a "topic" for some reason. There's a small piece of boolean logic that accepts the result if there are any topics, or a flag is set, or there are no tags with spaces in the collected tags. Because "cause" is in topics, the no-tags-with-spaces check never triggers. This is a super annoying, probably quite rare edge case.

EDIT: This edge case should be fixed with the template-arguments fix.
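For readers trying to picture the edge case, the condition as described can be paraphrased like this. It is a reconstruction from the comment only, not the actual code: `accept_as_tags` and its parameters are made-up names, and the real logic around decode_tags may differ in its details.

```python
def accept_as_tags(tags, topics, flag=False):
    """Paraphrase of the described condition: topics OR flag OR no space-containing tags."""
    has_space_tag = any(" " in tag for tag in tags)
    # "or" short-circuits: once topics is non-empty, the space check no longer matters.
    return bool(topics) or flag or not has_space_tag


# Values as given in the comment: one space-containing tag and one topic.
print(accept_as_tags(tags=["of tags"], topics=["cause"]))
# Without the topic, the space-containing tag would have blocked acceptance.
print(accept_as_tags(tags=["of tags"], topics=[]))
```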

@29jm
Author

29jm commented May 14, 2024

That last PR fixed most of the issues! Here's a pastebin of the remaining ones, from a JSON downloaded today: https://pastebin.com/hMyZBXnh

EDIT: If those remaining ones are due to issues on the Wiktionary side, let me know how and I'll fix them one by one.

@kristian-clausal
Collaborator

I'll take a look at these later, thanks for keeping your eye on the output!


3 participants