
Simple Good-Turing Frequency Estimation #2344

Open
wants to merge 12 commits into master
Conversation

@rimonim rimonim commented Feb 12, 2024

This is a feature I'd love to have in Quanteda. For a great introduction to the topic, see here.

I wrote this implementation based on the instructions in section 6 of the paper, Good–Turing frequency estimation without tears. It's not particularly efficient, but it works. It performs the calculation separately for each document in the DFM.

In the future it would be nice to add more customizability. For example, the algorithm has a threshold for switching between raw and smoothed proxies (implemented as crit in the code). I used the threshold recommended by the paper, but it could fairly easily be made user-configurable.
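In outline, the calculation for a single document looks something like this (a simplified sketch with illustrative names, not the PR code itself; it assumes at least two distinct observed counts):

```r
# Simple Good-Turing, following section 6 of Gale & Sampson (1995).
# crit is the switching threshold mentioned above (the paper uses 1.96).
simple_good_turing <- function(counts, crit = 1.96) {
  counts <- counts[counts > 0]
  N <- sum(counts)

  # frequencies of frequencies: N_r for each observed count r
  tab <- table(counts)
  r  <- as.numeric(names(tab))
  Nr <- as.numeric(tab)
  k  <- length(r)

  # Z_r spreads each N_r over the gap to the neighbouring observed counts,
  # so the regression is not biased by unobserved r values
  q <- c(0, r[-k])                 # previous observed count (0 before the first)
  s <- c(r[-1], 2 * r[k] - q[k])   # next observed count (extrapolated at the top)
  Z <- Nr / (0.5 * (s - q))

  # the smoothing step: fit log(Z_r) = a + b * log(r); SGT requires b < -1
  b <- coef(lm(log(Z) ~ log(r)))[[2]]

  # smoothed proxy r* and the raw Turing estimate (r + 1) * N_{r+1} / N_r
  r_smooth <- r * (1 + 1 / r)^(b + 1)
  Nr1      <- Nr[match(r + 1, r)]  # N_{r+1}; NA where r + 1 was unobserved
  r_turing <- (r + 1) * Nr1 / Nr

  # use raw Turing estimates for small r, then switch permanently to the
  # smoothed proxy once the two differ by less than crit standard deviations
  sd_turing  <- (r + 1) / Nr * sqrt(Nr1 * (1 + Nr1 / Nr))
  use_smooth <- cumsum(is.na(r_turing) |
                         abs(r_turing - r_smooth) <= crit * sd_turing) > 0
  r_star <- ifelse(use_smooth, r_smooth, r_turing)

  # P0 = N_1 / N is the total probability of unseen types; rescale the
  # seen types so their probabilities sum to 1 - P0
  P0 <- if (r[1] == 1) Nr[1] / N else 0
  p  <- (1 - P0) * r_star / sum(Nr * r_star)
  list(p = p[match(counts, r)], P0 = P0)
}
```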

@kbenoit kbenoit self-requested a review April 8, 2024 16:20
Collaborator
@kbenoit kbenoit left a comment

Thanks @rimonim for this, we delayed it simply because of the focus on getting quanteda 4.0 out. I made a few tweaks to this, but feel it would benefit further from the following.

  • Rather than add a new function, this could be seen as a form of weighting a dfm. So we could add it to dfm_weight() as two new options to scheme, call them goodturing_count and goodturing_prop (a usage sketch follows this list).

  • The implementation currently coerces the input dfm to a dense matrix, which will not scale to larger matrices. But it should be fairly straightforward to compute this while keeping the matrix sparse (see the sketch at the end of this comment). I didn't have time to address this, but it should be possible, probably without the lapply(). Our process is usually: (1) prototype the function so that it provides the correct answer, (2) rewrite it to be more efficient, and then (often but not always) (3) @koheiw rewrites it in C++.

  • The function will need unit tests.
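As a sketch, the proposed interface might look like this (hypothetical: neither scheme value exists in quanteda yet):

```r
library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))

# smoothed expected counts r* within each document
dfm_weight(dfmat, scheme = "goodturing_count")

# smoothed probabilities, summing to 1 - P0 within each document
dfm_weight(dfmat, scheme = "goodturing_prop")
```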

@koheiw feel free of course to weigh in on any of the above points
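For the sparse point, something along these lines might work (a sketch only, assuming a plain dgCMatrix with documents in rows rather than a dfm method):

```r
library(Matrix)

# Apply a per-document transformation f directly to the non-zero entries
# of a sparse matrix, instead of converting with as.matrix().
weight_sparse <- function(x, f) {
  xt <- as(t(x), "CsparseMatrix")        # documents become columns
  for (j in seq_len(ncol(xt))) {
    if (xt@p[j + 1] > xt@p[j]) {
      idx <- (xt@p[j] + 1):xt@p[j + 1]   # non-zero entries of document j
      xt@x[idx] <- f(xt@x[idx])
    }
  }
  t(xt)
}
```

Here f would compute the Good-Turing weights for one document's non-zero counts; the zeros are never touched, so the result stays sparse.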

Author
@rimonim rimonim commented Apr 11, 2024

@kbenoit thank you for the detailed feedback. I reimplemented the algorithm as two schemes for dfm_weight(), as you suggested. It now keeps the matrix sparse unless estimate_zeros = TRUE, in which case it distributes the estimated probability of unobserved cases between the zero values in each document.
I don't think I'm clever enough to get around the lapply(), especially for the linear regression step.

I'll start working on unit tests shortly.
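For concreteness, the estimate_zeros = TRUE step described above amounts to something like this per document (a sketch with a hypothetical helper name, assuming the unseen mass P0 is split evenly across the zero cells; note this is the step that forces a dense result, since the zeros become non-zero):

```r
# x:  one document's weighted probability vector (zeros = unseen features)
# P0: that document's estimated total probability of unseen types
redistribute_zeros <- function(x, P0) {
  zeros <- x == 0
  if (any(zeros)) x[zeros] <- P0 / sum(zeros)
  x
}
```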

Author
@rimonim rimonim commented Apr 17, 2024

Step 1 is complete: the function provides the correct answer. It's also slightly more efficient than it was previously. The unit tests use output from the simple-good-turing Python library, with a slight tweak to allow for different crit values.
As it happens, a C++ implementation of Simple Good-Turing exists already, though with somewhat less functionality than my implementation here. I'm not literate in C++, so I can't comment further on whether it might be useful in making this implementation more efficient.
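Each test looks roughly like this (a sketch only; the reference numbers below are placeholders, not actual output of the Python library):

```r
library(testthat)
library(quanteda)

test_that("goodturing_prop matches simple-good-turing reference values", {
  dfmat <- dfm(tokens(c(d1 = "a a a a b b b c c d")))
  # placeholder values standing in for the precomputed Python reference
  expected <- c(0.35, 0.25, 0.15, 0.05)
  expect_equal(
    as.numeric(dfm_weight(dfmat, scheme = "goodturing_prop")),
    expected,
    tolerance = 1e-6
  )
})
```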
