English translation of examples sometimes missing, included in the original text #604

29jm · 2024-04-24T10:58:42Z

Some examples for some Czech words have an issue: instead of having both text and english fields in the example object, there's only text, containing a concatenation of both, e.g. "the original text ― the translated text".
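For anyone who wants to list the affected entries themselves, here is a minimal sketch. It assumes the extracted data is one JSON object per line, with examples nested under senses → examples; that layout, the function name and the separator constant are assumptions for illustration, not something confirmed in this thread. The separator is taken to be the horizontal bar (U+2015) with a space on each side, as in the example above.

```python
import json

# Assumed separator: U+2015 HORIZONTAL BAR surrounded by spaces.
SEPARATOR = " \u2015 "


def find_concatenated_examples(lines):
    """Yield (word, text) for examples that look like 'original SEP translation'.

    `lines` is any iterable of JSON strings, one entry per line.
    An example is reported when it has no "english" key and its "text"
    contains the separator.
    """
    for line in lines:
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            for example in sense.get("examples", []):
                text = example.get("text", "")
                if "english" not in example and SEPARATOR in text:
                    yield entry.get("word"), text
```

Passing an open file object as `lines` works, since iterating a file yields one line at a time.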

Some examples:

ani: only the last example has the issue
dobrý: same
It's not necessarily for the last entry though, for instance with zničit.
Some more like this: vzít, hlavní, čas, žít, paní, jméno, smrt, kniha, psát, názor, světový, osobní, minulý, onen, umění, věk, telefon, zástupce, ženský.

If it's a problem in the page markup, I can fix it there, but I didn't see what could cause it.


kristian-clausal · 2024-04-24T11:20:41Z

Yeah, this is buggy, I'll take a look at it tomorrow.

kristian-clausal · 2024-04-25T09:43:31Z

Sorry, I got stuck trying to figure out a bug? with our logging system, this might have to wait a while.

29jm · 2024-04-25T10:14:07Z

No worries at all, I know how it is :)

kristian-clausal · 2024-04-26T10:12:54Z

Found part of the issue, and it's a silly one; the examples I looked at (ani and vzít) had a word that was blacklisted from our "what is an English word" set: He, with a capital. Our function that determines what kind of language or function a string has rejects the string because 1/4 of it is classified as non-English, so it all gets lumped together as a romanization string and the example is concatenated.

Actually looking at vzít, the example that is broken doesn't get through because it's 2/3rds Czech names with one English word.
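To make the two failure modes above concrete, here is a toy sketch of a "fraction of English words" check. It is not the actual classifier; the word set, the `excluded` parameter and the function name are all made up for illustration, and no real threshold is implied.

```python
# Toy stand-in for a real English word list.
ENGLISH_WORDS = {"he", "took", "the", "book", "married"}


def english_fraction(text, known=ENGLISH_WORDS, excluded=frozenset()):
    """Return the share of tokens in `text` that count as English.

    A token counts when its lowercase form is in `known` and the exact
    token is not in `excluded` (a hypothetical blacklist).
    """
    tokens = [t.strip(".,!?;:\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t not in excluded and t.lower() in known)
    return hits / len(tokens)
```

With `excluded={"He"}`, the string "He took the book" scores 3/4, which is the "1/4 non-English" case. "Eliška married Tomáš" scores 1/3 whatever the blacklist says, because two of its three tokens are names.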

I am writing this message as I'm going through examples, and hlavní is a completely different issue: the example isn't in a template that we accept as an 'example' (ux) template, but in a coi template, which is for collocations. It renders the same as an example template and outputs the same kind of content (looks like an example, quacks like an example), so I just added the template and its aliases to our list of "ux" templates.

I'm going to commit these, and hopefully most or all of the issues will be addressed. If you find that some weren't fixed, just point them out.

Unless we add a bunch of Czech names (which is the simplest way), the examples that are just Eliška married Tomáš are impossible to determine with a simple heuristic.

29jm · 2024-04-26T11:07:25Z

Unfortunate that Wiktionary doesn't tag the translation of examples with a language! If I understood correctly, this is why wiktextract has to implement these heuristics.

Thanks for the fixes already, I'll have a look at the json within the next week!

29jm · 2024-05-01T10:59:08Z

I'm looking at the most recent JSON, and some problematic words like ani, čas and žít are fixed, while others aren't, e.g. hlavní, názor, osobní. I think many of the remaining ones use coi, though not all, for instance Ukrajina which uses uxi.

(in case it helps, a pastebin of all of them here)

kristian-clausal · 2024-05-06T07:10:09Z

Many of those are just that they contain words that are not in nltk.corpus.brown; words like "cellphone", "mousetrap", "He’s" (with a unicode apostrophe or other character), "dumbfuck", "peppermint"... Hrm, many of these could be fixed if we somehow could cheaply detect compound words. If anyone has an idea of how to do this super-cheaply ("peppermint" -> contains "pepper" and "mint" and smooshed together)...
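One cheap possibility, sketched under assumptions: try every split point and check whether both halves are already in the word set. The function name and the `min_part` parameter are hypothetical, and the toy set in the usage note below stands in for whatever word list is actually in use.

```python
def is_compound(word, known, min_part=3):
    """Return True if `word` splits into two parts that are both in `known`.

    `known` is a set of lowercase words. `min_part` is the minimum length
    of each half, to avoid trivial splits on very short fragments.
    Costs at most len(word) pairs of set lookups per word.
    """
    w = word.lower()
    return any(
        w[:i] in known and w[i:] in known
        for i in range(min_part, len(w) - min_part + 1)
    )
```

For example, with `known = {"pepper", "mint"}`, `is_compound("peppermint", known)` is True. The obvious downside is false positives: any word that happens to split into two known words will pass, so this would probably only be safe as a fallback for words that fail the normal lookup.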

Another category of problem is translations that are basically stuff like "common noun" or "adjective", phrases that will get classified as tags by the classifier.

xxyzz · 2024-05-06T07:47:30Z

Can't we use the template arguments or the expanded HTML tags? The "ux" and "coi" templates put the translation text in the third argument; finding that argument should be easier and more reliable than checking whether words are in English.

kristian-clausal · 2024-05-06T09:03:13Z

I was thinking of that, yeah. We can check to see if the arguments map on to the template expanded output and exit early if they conform to the formatting of examples. The problem is that there might be some pitfalls with this approach, for example if example templates are used for other things, but in context it might be fine.
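A rough sketch of the argument-based idea, under stated assumptions: it only handles a flat template call with no nested templates or links, it only looks at the positional third argument, and the template set lists just the names mentioned in this thread. The helper names are hypothetical and the sample wikitext in the note below is invented; real code would go through a proper wikitext parser rather than string splitting.

```python
# Only the template names mentioned in this thread; the real alias list is longer.
EXAMPLE_TEMPLATES = {"ux", "uxi", "coi"}


def split_template(wikitext):
    """Split a flat '{{name|a|b|k=v}}' call into (name, positional, named).

    Returns None if the string is not wrapped in double braces.
    Limitation: a positional argument containing '=' is treated as named.
    """
    if not (wikitext.startswith("{{") and wikitext.endswith("}}")):
        return None
    parts = wikitext[2:-2].split("|")
    positional, named = [], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return parts[0].strip(), positional, named


def example_from_template(wikitext):
    """Build an example dict from the 2nd and 3rd positional arguments."""
    parsed = split_template(wikitext)
    if parsed is None:
        return None
    name, positional, _named = parsed
    if name not in EXAMPLE_TEMPLATES or len(positional) < 2:
        return None
    example = {"text": positional[1]}  # positional[0] is the language code
    if len(positional) >= 3 and positional[2]:
        example["english"] = positional[2]
    return example
```

With this, `example_from_template("{{ux|cs|Eliška si vzala Tomáše.|Eliška married Tomáš.}}")` yields separate text and english values without asking whether any word is English, which is the point of the suggestion.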

kristian-clausal · 2024-05-06T10:40:05Z

EDIT: This is a post that was left unwritten earlier today, posting it here just for completeness.

"cause of death" slipped through because decode_tags classified it as tags. "of death" is parsed as a tag (a space-including tag, so not in valid_tags), and "cause" is classified as a "topic" for some reason. There's a small piece of boolean logic that accepts the string if there are any topics, or a flag is set, or there are no tags with spaces in the collected tags. So because "cause" is in topics, the "no tags with spaces" check never gets to reject it. This is a super annoying, probably quite rare edge case.

EDIT: This edge case should be fixed with the template-arguments fix.

29jm · 2024-05-14T08:47:54Z

That last PR fixed most of the issues! Here's a pastebin of the remaining ones, from a JSON downloaded today: https://pastebin.com/hMyZBXnh

EDIT: If those remaining ones are due to issues on the wiktionary side, let me know how and I'll fix them one by one.

kristian-clausal · 2024-05-14T08:52:40Z

I'll take a look at these later, thanks for keeping your eye on the output!

kristian-clausal mentioned this issue Apr 26, 2024

Classify desc #607

Merged

kristian-clausal mentioned this issue May 7, 2024

Use example template args to determine example #615

Merged
