Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates #611

Open
myrriad opened this issue May 4, 2024 · 3 comments

Comments

@myrriad

myrriad commented May 4, 2024

Here is the wikitext for Etymology 1 of bot#Old_Javanese:

Inherited from {{inh-lite|kaw|poz-pro|sc=Latn|*bəʀəqat}} (compare {{cog-lite|ms|berat}}). {{doublet|kaw|bwat|wrat}}.

Here is the parsed wikitext in the latest version (2024/05/01):
https://gist.github.com/myrriad/24429fe70924a39d27cfae7a692979a2

There is an excessive number of "str left" and "str right" templates, which repeatedly take substrings of strings (often extracting only one character at a time). The etymology_text appears good. I suppose these are affected templates: url

I detected these examples by sorting entries by number of etymology templates. Accordingly, here are all entries with >= 90 etymology templates. These templates empirically appear in conjunction with "-lite" templates.

For debugging purposes, here is a list of filtered entries with >100 etymology templates:
https://gist.github.com/myrriad/f676ea15c5e0da4022473f790d5432c9
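The sorting step described above could be reproduced roughly as follows. This is a minimal sketch, not the script actually used for the gists: it assumes the extracted data is JSON Lines with one entry per line, and that each entry carries "word", "lang_code", and an optional "etymology_templates" list. The function names are hypothetical.

```python
import json
from typing import Iterable


def count_etymology_templates(lines: Iterable[str]) -> list[tuple[str, str, int]]:
    """Return (word, lang_code, template count) rows, largest count first."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        entry = json.loads(line)
        # Assumption: entries without etymology simply lack this key.
        templates = entry.get("etymology_templates", [])
        rows.append((entry.get("word", ""), entry.get("lang_code", ""), len(templates)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def entries_over(lines: Iterable[str], threshold: int) -> list[tuple[str, str, int]]:
    """Keep only entries with at least `threshold` etymology templates."""
    return [row for row in count_etymology_templates(lines) if row[2] >= threshold]
```

An open file object can be passed directly as `lines`, for example `entries_over(open(path, encoding="utf-8"), 90)`, where `path` points at a local copy of the dump.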

@myrriad changed the title from 'Large amount of substring ("str left") templates in etymology , possibly in relation to "lite" templates' to 'Large amount of substring ("str left") templates in etymology, possibly in relation to "lite" templates' on May 4, 2024
@xxyzz
Collaborator

xxyzz commented May 6, 2024

I think the cause is the code here:

text = clean_node(
    wxr,
    None,
    contents,
    template_fn=etym_template_fn,
    post_template_fn=etym_post_template_fn,
)

It uses the post_template_fn argument, which also adds the templates used within other templates. I guess the "etymology_templates" field should only contain the templates that appear in the wikitext, unless it's intended to include nested templates.
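To illustrate why a callback that fires after every template expansion picks up nested templates, here is a toy model. It does not reproduce the real parser or the actual Wiktionary template definitions: `TOY_TEMPLATES`, `expand`, and `record` are all hypothetical, and the template bodies are made up for the example.

```python
import re

# Hypothetical template bodies; each body may call further templates.
TOY_TEMPLATES = {
    "inh-lite": "{{langname-lite}} word",
    "langname-lite": "{{str left}}{{str right}}",
    "str left": "P",
    "str right": "MP",
}

# Matches an innermost template call such as {{name|arg1|arg2}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)(?:\|[^{}]*)?\}\}")


def expand(text, post_fn, depth=0):
    """Expand toy templates recursively, calling post_fn after each expansion."""
    def replace(match):
        name = match.group(1).strip()
        # Expand the body first, so nested templates report before their parent.
        expanded = expand(TOY_TEMPLATES.get(name, ""), post_fn, depth + 1)
        post_fn(name, depth)
        return expanded

    return TEMPLATE_RE.sub(replace, text)


seen_all, seen_top_level = [], []


def record(name, depth):
    seen_all.append(name)
    if depth == 0:  # depth 0 means the call was written in the page's wikitext
        seen_top_level.append(name)


result = expand("{{inh-lite|kaw|poz-pro}}", record)
```

In this toy, `seen_all` ends up with four names while `seen_top_level` only has `inh-lite`, which mirrors the difference between recording every expansion and recording only the templates written in the wikitext.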

@kristian-clausal
Collaborator

https://github.com/tatuylonen/wiktextract/blob/f4fd8c9b5a125a23fa319ecda64e0c1649487d02/src/wiktextract/extractor/en/page.py#L3039C1-L3042C21

seems to indicate that Tatu meant to capture nested etymology templates, and to ignore unwanted templates with the blacklist. In this case, I guess the culprit is Template:langname-lite, because only 1/199 lines in the filtered examples in the original post. If we add that to ignored_etymology_templates, it should hopefully clean up a lot of these.

@kristian-clausal
Collaborator

kristian-clausal commented May 6, 2024

I've added langname-lite to the blacklist. If the run goes smoothly we should see some improvements, and after that we can look at adding other templates.
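Until a new extraction run is available, consumers of existing data could apply a similar name-based filter themselves. This is only a sketch: `EXTRA_IGNORED` and `drop_ignored` are hypothetical names, the set below is not the project's ignored_etymology_templates list, and only langname-lite is confirmed as blacklisted in this thread (the other two names are taken from the original report).

```python
# Hypothetical local blacklist; only "langname-lite" is confirmed in this thread.
EXTRA_IGNORED = frozenset({"langname-lite", "str left", "str right"})


def drop_ignored(templates, ignored=EXTRA_IGNORED):
    """Return the template records whose "name" is not in the ignored set.

    Assumes each record is a dict with a "name" key, as in the
    "etymology_templates" field. Filtering is by name only, so it does not
    know which templates were nested inside an ignored one.
    """
    return [t for t in templates if t.get("name") not in ignored]
```

Because the filter is purely name-based, any other helper templates expanded inside langname-lite would still need to be added to the set by hand.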
