
Simple Good-Turing Frequency Estimation #2344

Open
wants to merge 12 commits into master
Conversation

@rimonim rimonim commented Feb 12, 2024

This is a feature I'd love to have in Quanteda. For a great introduction to the topic, see here.

I wrote this implementation based on the instructions in section 6 of the paper, Good–Turing frequency estimation without tears. It's not particularly efficient, but it works. It performs the calculation separately for each document in the DFM.

In the future it would be nice to add more customizability. For example, the algorithm has a threshold for switching between raw and smoothed proxies (implemented as crit in the code). I used the threshold recommended by the paper, but it could fairly easily be made user-configurable.
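In outline, the calculation for a single document looks something like this (a simplified sketch with illustrative names, not the PR code itself; it assumes at least two distinct observed counts):

```r
# Simple Good-Turing, following section 6 of Gale & Sampson (1995).
# crit is the switching threshold mentioned above (the paper uses 1.96).
simple_good_turing <- function(counts, crit = 1.96) {
  counts <- counts[counts > 0]
  N <- sum(counts)

  # frequencies of frequencies: N_r for each observed count r
  tab <- table(counts)
  r  <- as.numeric(names(tab))
  Nr <- as.numeric(tab)
  k  <- length(r)

  # Z_r spreads each N_r over the gap to the neighbouring observed counts,
  # so the regression is not biased by unobserved r values
  q <- c(0, r[-k])                 # previous observed count (0 before the first)
  s <- c(r[-1], 2 * r[k] - q[k])   # next observed count (extrapolated at the top)
  Z <- Nr / (0.5 * (s - q))

  # the smoothing step: fit log(Z_r) = a + b * log(r); SGT requires b < -1
  b <- coef(lm(log(Z) ~ log(r)))[[2]]

  # smoothed proxy r* and the raw Turing estimate (r + 1) * N_{r+1} / N_r
  r_smooth <- r * (1 + 1 / r)^(b + 1)
  Nr1      <- Nr[match(r + 1, r)]  # N_{r+1}; NA where r + 1 was unobserved
  r_turing <- (r + 1) * Nr1 / Nr

  # use raw Turing estimates for small r, then switch permanently to the
  # smoothed proxy once the two differ by less than crit standard deviations
  sd_turing  <- (r + 1) / Nr * sqrt(Nr1 * (1 + Nr1 / Nr))
  use_smooth <- cumsum(is.na(r_turing) |
                         abs(r_turing - r_smooth) <= crit * sd_turing) > 0
  r_star <- ifelse(use_smooth, r_smooth, r_turing)

  # P0 = N_1 / N is the total probability of unseen types; rescale the
  # seen types so their probabilities sum to 1 - P0
  P0 <- if (r[1] == 1) Nr[1] / N else 0
  p  <- (1 - P0) * r_star / sum(Nr * r_star)
  list(p = p[match(counts, r)], P0 = P0)
}
```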

@kbenoit kbenoit self-requested a review April 8, 2024 16:20
Collaborator
@kbenoit kbenoit left a comment

Thanks @rimonim for this, we delayed it simply because of the focus on getting quanteda 4.0 out. I made a few tweaks to this, but feel it would benefit further from the following.

  • Rather than add a new function, this could be seen as a form of weighting a dfm. So we could add it to dfm_weight() as two new options to scheme, call them goodturing_count and goodturing_prop (a usage sketch follows this list).

  • The implementation currently coerces the input dfm to a dense matrix, which will not scale to larger matrices. But it should be fairly straightforward to compute this while keeping the matrix sparse (see the sketch at the end of this comment). I didn't have time to address this, but it should be possible, probably without the lapply(). Our process is usually: (1) prototype the function so that it provides the correct answer, (2) rewrite it to be more efficient, and then (often but not always) (3) @koheiw rewrites it in C++.

  • The function will need unit tests.
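As a sketch, the proposed interface might look like this (hypothetical: neither scheme value exists in quanteda yet):

```r
library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))

# smoothed expected counts r* within each document
dfm_weight(dfmat, scheme = "goodturing_count")

# smoothed probabilities, summing to 1 - P0 within each document
dfm_weight(dfmat, scheme = "goodturing_prop")
```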

@koheiw feel free of course to weigh in on any of the above points
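For the sparse point, something along these lines might work (a sketch only, assuming a plain dgCMatrix with documents in rows rather than a dfm method):

```r
library(Matrix)

# Apply a per-document transformation f directly to the non-zero entries
# of a sparse matrix, instead of converting with as.matrix().
weight_sparse <- function(x, f) {
  xt <- as(t(x), "CsparseMatrix")        # documents become columns
  for (j in seq_len(ncol(xt))) {
    if (xt@p[j + 1] > xt@p[j]) {
      idx <- (xt@p[j] + 1):xt@p[j + 1]   # non-zero entries of document j
      xt@x[idx] <- f(xt@x[idx])
    }
  }
  t(xt)
}
```

Here f would compute the Good-Turing weights for one document's non-zero counts; the zeros are never touched, so the result stays sparse.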

Author
@rimonim rimonim commented Apr 11, 2024

@kbenoit thank you for the detailed feedback. I reimplemented the algorithm as two schemes for dfm_weight(), as you suggested. It now keeps the matrix sparse unless estimate_zeros = TRUE, in which case it distributes the estimated probability of unobserved cases between the zero values in each document.
I don't think I'm clever enough to get around the lapply(), especially for the linear regression step.

I'll start working on unit tests shortly.
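For concreteness, the estimate_zeros = TRUE step described above amounts to something like this per document (a sketch with a hypothetical helper name, assuming the unseen mass P0 is split evenly across the zero cells; note this is the step that forces a dense result, since the zeros become non-zero):

```r
# x:  one document's weighted probability vector (zeros = unseen features)
# P0: that document's estimated total probability of unseen types
redistribute_zeros <- function(x, P0) {
  zeros <- x == 0
  if (any(zeros)) x[zeros] <- P0 / sum(zeros)
  x
}
```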

Author
@rimonim rimonim commented Apr 17, 2024

Step 1 is complete: the function provides the correct answer. It's also slightly more efficient than it was previously. The unit tests use output from the simple-good-turing Python library, with a slight tweak to allow for different crit values.
As it happens, a C++ implementation of Simple Good-Turing exists already, though with somewhat less functionality than my implementation here. I'm not literate in C++, so I can't comment further on whether it might be useful in making this implementation more efficient.
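Each test looks roughly like this (a sketch only; the reference numbers below are placeholders, not actual output of the Python library):

```r
library(testthat)
library(quanteda)

test_that("goodturing_prop matches simple-good-turing reference values", {
  dfmat <- dfm(tokens(c(d1 = "a a a a b b b c c d")))
  # placeholder values standing in for the precomputed Python reference
  expected <- c(0.35, 0.25, 0.15, 0.05)
  expect_equal(
    as.numeric(dfm_weight(dfmat, scheme = "goodturing_prop")),
    expected,
    tolerance = 1e-6
  )
})
```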
