Simple Good-Turing Frequency Estimation #2344
base: master
Conversation
Thanks @rimonim for this; we delayed it simply because of the focus on getting quanteda 4.0 out. I made a few tweaks to this, but feel it would benefit further from the following.
- Rather than add a new function, it could be seen as a form of weighting a dfm. So we could add it to `dfm_weight()` as two new options to `scheme`, calling them `goodturing_count` and `goodturing_prop`.
- The implementation currently coerces the input dfm to a dense matrix, which will not scale to larger matrices. But it should be fairly straightforward to compute this while keeping the matrix sparse. I didn't have time to address this, but it should be possible, and probably without the `lapply()`. Our process is usually: 1) prototype the function so that it provides the correct answer, 2) rewrite it to be more efficient, and (often but not always) 3) @koheiw rewrites it in C++.
- The function will need unit tests.
@koheiw feel free of course to weigh in on any of the above points
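To illustrate the sparse-computation point above outside of R: Good-Turing needs only the frequency-of-frequencies table (N_r) per document, and that can be read straight off a sparse matrix's nonzero entries without ever densifying. This is a minimal Python/scipy sketch of that idea, not quanteda's Matrix-based internals; the function name `row_freq_of_freq` is hypothetical.

```python
import numpy as np
from scipy import sparse

def row_freq_of_freq(dfm):
    """Per-document frequency-of-frequencies N_r, computed from the CSR
    representation. Only the nonzero counts of each row are touched, so
    memory stays O(nnz) rather than O(rows * cols)."""
    dfm = sparse.csr_matrix(dfm)
    out = []
    for i in range(dfm.shape[0]):
        # Slice this row's nonzero counts directly out of the CSR buffers.
        row = dfm.data[dfm.indptr[i]:dfm.indptr[i + 1]]
        counts, n = np.unique(row, return_counts=True)
        out.append(dict(zip(counts.astype(int), n.astype(int))))
    return out
```

The same slicing trick applies in R via the `@x`/`@p` slots of a `dgCMatrix`, which is presumably what replacing the `lapply()` would look like.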
Additionally, the implementation no longer coerces the input dfm to a dense matrix, unless explicitly requested with `estimate_zeros = TRUE`.
@kbenoit thank you for the detailed feedback. I reimplemented the algorithm as two schemes for `dfm_weight()`. I'll start working on unit tests shortly.
Step 1 is complete: the function provides the correct answer. It's also slightly more efficient than it was previously. The unit tests use output from the simple-good-turing Python library, with a slight tweak to allow for different …
This is a feature I'd love to have in Quanteda. For a great introduction to the topic, see here.
I wrote this implementation based on the instructions in section 6 of the paper, Good–Turing frequency estimation without tears. It's not particularly efficient, but it works. It performs the calculation separately for each document in the DFM.
In the future it would be nice to add additional customizability. For example, the algorithm has a threshold for switching between raw and smoothed proxies (implemented as `crit` in the code). I used the threshold recommended by the paper, but it could fairly easily be made customizable.
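For readers following along with section 6 of Gale & Sampson, here is a hedged sketch of the Simple Good-Turing procedure in Python (not the PR's R code). The 1.96-standard-deviation switching rule below corresponds to the `crit` threshold discussed above; the function name and return shape are illustrative choices.

```python
import math

def simple_good_turing(freq_of_freq):
    """Simple Good-Turing smoothing per Gale & Sampson (1995), section 6.

    freq_of_freq: dict mapping count r -> number of types seen r times (N_r).
    Returns (p0, probs): p0 is the total probability mass reserved for
    unseen types; probs maps r -> smoothed probability of a type seen r times.
    """
    rs = sorted(freq_of_freq)
    N = sum(r * freq_of_freq[r] for r in rs)  # total tokens observed

    # Z_r: average each N_r over the gap to its neighbouring observed counts,
    # so the log-log regression is not biased by zero rows at large r.
    Z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q
        Z[r] = 2 * freq_of_freq[r] / (t - q)

    # Least-squares fit of log Z = a + b * log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(Z[r]) for r in rs]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar

    def smoothed(r):  # S(r) = exp(a) * r^b
        return math.exp(a + b * math.log(r))

    # Smoothed counts r*: use the raw Turing estimate for small r, switching
    # permanently to the linear-fit estimate once the two differ by less than
    # 1.96 standard deviations (the "crit" threshold).
    r_star = {}
    use_lgt = False
    for r in rs:
        lgt = (r + 1) * smoothed(r + 1) / smoothed(r)
        nr, nr1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        if not use_lgt and nr1 > 0:
            turing = (r + 1) * nr1 / nr
            sd = math.sqrt((r + 1) ** 2 * (nr1 / nr ** 2) * (1 + nr1 / nr))
            if abs(lgt - turing) > 1.96 * sd:
                r_star[r] = turing
                continue
        use_lgt = True
        r_star[r] = lgt

    p0 = freq_of_freq.get(1, 0) / N  # unseen mass = N_1 / N
    total = sum(freq_of_freq[r] * r_star[r] for r in rs)
    probs = {r: (1 - p0) * r_star[r] / total for r in rs}
    return p0, probs
```

By construction, p0 plus the probabilities of all observed types sums to one, which is a convenient invariant for the unit tests to check against the simple-good-turing library's output.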