Proposition of changes to `WiktionaryConfig` class #278
Comments
It sounds like we should just move most of the stuff from …

I think the reason why the data is returned from `page_handler()` in this way is multiprocessing. Writing into the same .json file from several processes sounds dangerous. Writing into separate files that are then merged doesn't sound optimal either, so keeping the stats and errors in memory, returning them up from the multiprocessing pool, and then processing them (keeping them in memory or maybe writing them to a file at that point) still seems sensible. I will ask Tatu to comment on this, so we hear what the reasoning behind the structure is.
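A minimal sketch of that pattern, assuming a hypothetical `page_handler()` and result shape (this is not wiktextract's actual API): each worker collects its own page's counts and messages and returns them, and only the parent merges the results, so no two processes ever write to the same file.

```python
# Sketch of the described design: workers return per-page results,
# and only the parent process aggregates them. page_handler() and the
# result shape here are illustrative, not wiktextract's actual API.
import multiprocessing
from collections import Counter

def page_handler(title: str) -> tuple[str, Counter, list[str]]:
    # Parse one page; collect its stats and error messages locally.
    section_counts = Counter({"Noun": 1})
    errors = [f"{title}: example error"]
    return title, section_counts, errors

if __name__ == "__main__":
    total_counts: Counter = Counter()
    all_errors: list[str] = []
    with multiprocessing.Pool() as pool:
        for title, counts, errors in pool.imap_unordered(
            page_handler, ["water", "fire"]
        ):
            total_counts += counts     # merged only in the parent,
            all_errors.extend(errors)  # so nothing is written to
                                       # from two processes at once
    print(total_counts, all_errors)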
Returned error message data (errors, warnings, debugs) can be written to separate files line by line in the final for loop in the parent process, so they won't be raced. And if …
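A sketch of that idea, under the same assumptions (the shape of the `results` iterable and the file names are illustrative): since only the parent process drains the pool, it can append each message to its level's file without any locking.

```python
# Sketch: the parent process alone writes the per-level log files
# line by line while draining the pool, so writes cannot race.
# results is assumed to yield (title, level, message) tuples.
import json
from contextlib import ExitStack

def write_messages(results) -> None:
    with ExitStack() as stack:
        files = {
            level: stack.enter_context(
                open(f"{level}.jsonl", "w", encoding="utf-8")
            )
            for level in ("error", "warning", "debug")
        }
        for title, level, message in results:
            files[level].write(
                json.dumps({"title": title, "msg": message}) + "\n"
            )
```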
The `wxr` and `wxr.wtp` objects are forked by the worker processes, so each process keeps its own version in memory; they're duplicated because of the limitations of Python multiprocessing. As far as I understand it: …
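A small self-contained demonstration of that duplication (the `BIG_STATE` dict is a hypothetical stand-in for a large object like `wxr.wtp`): each worker gets its own copy of module-level state, so mutations made in a child never reach the parent.

```python
# Sketch: each worker in a multiprocessing.Pool operates on its own
# copy of module-level state, so mutations in children are invisible
# to the parent. BIG_STATE stands in for a large object like wxr.wtp.
import multiprocessing

BIG_STATE = {"errors": []}

def worker(page: str) -> str:
    # This mutates the *child's* copy only; the parent never sees it.
    BIG_STATE["errors"].append(f"error on {page}")
    return page

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        pool.map(worker, ["page1", "page2"])
    print(BIG_STATE["errors"])  # [] -- the parent's copy is unchanged
```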
These callback functions make things more complicated, and it looks like the …
#296 removes the code that creates new `WiktionaryConfig` objects.
Currently a new `WiktionaryConfig` object is created for every page in `wiktionary.page_handler()` to save and return `WiktionaryConfig.language_counts`, `WiktionaryConfig.pos_counts`, `WiktionaryConfig.section_counts`, `WiktionaryConfig.errors`, `WiktionaryConfig.warnings`, and `WiktionaryConfig.debugs`. But `WiktionaryConfig.__init__()` loads many JSON files and converts language data, so rerunning this code for every page is very inefficient, especially when a JSON file is quite large, like the `languages.json` file. And the `--statistics` feature currently doesn't work because only `WiktionaryConfig.section_counts` is filled with data; even if the other data were added, the feature would still be unusable because it prints really long output and also uses lots of memory. `WiktionaryConfig.merge_return()` also saves error messages from the `Wtp` object to an array, which could also become a huge array.
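For illustration, one generic way to make repeated construction cheap would be to cache the expensive JSON load at module level. This is a minimal sketch, not wiktextract's actual code: `load_languages()` is a hypothetical helper, and `Config` stands in for `WiktionaryConfig`.

```python
# Sketch: cache the expensive JSON load so that creating many config
# objects does not re-read and re-convert the data each time.
# load_languages() is a hypothetical helper, not wiktextract's real API.
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def load_languages(path: str = "languages.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # parsed once, then served from the cache

class Config:
    def __init__(self) -> None:
        # Every instance shares the single cached dict.
        self.languages = load_languages()
```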
Here are my propositions:

1. Remove the `--statistics` option and delete `WiktionaryConfig.language_counts`, `WiktionaryConfig.pos_counts`, and `WiktionaryConfig.section_counts`, so a new `WiktionaryConfig` object won't have to be created for each page. We could write separate code that processes the final JSON file and creates some statistics charts in a scheduled GitHub Actions job (see the sketch after this list).
2. Generate the language data in the `languages` folder after all pages are added to the database, and also save that data to the database, so we'll always use up-to-date language data. But since this data would not be available from the start, code that requires the language data before the dump file is parsed will need to be changed.
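For proposition 1, a minimal sketch of what such separate statistics code could look like. The file name and the `lang`/`pos` field names are assumptions about the final output format; it is assumed here to be one JSON object per line.

```python
# Sketch: recompute the per-language and per-POS counts that
# --statistics tracked, from the final output instead of during
# parsing. Adjust the file name and field names to the real format.
import json
from collections import Counter

language_counts: Counter = Counter()
pos_counts: Counter = Counter()

with open("wiktextract-data.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        language_counts[entry.get("lang", "?")] += 1
        pos_counts[entry.get("pos", "?")] += 1

for lang, count in language_counts.most_common(20):
    print(f"{lang}\t{count}")
```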
I commented out the code in `page_handler()` that creates a new `WiktionaryConfig` object, and the processing time on the Chinese Wiktionary dump file decreased from over 20 minutes to 13 minutes.