Output of fcm(x, context = "window", count = "boolean") #2113
Hi - Yes, that is intended, although I can see how it could have been implemented differently. Think of the window as a "width" where the default "document" size is the entire document span. There is only a single window in that case, so fcm(x, context = "document", count = "boolean") is the same as the current implementation of fcm(x, context = "window", count = "boolean"). When multiple windows fit inside the same document, then in your example, you'd expect the boolean count to apply within each window, not across all windows. This could be added as an option. What's the use case for this? @koheiw what do you think?
I find it hard to understand why people want to set count = "boolean" together with context = "window".
My idea was to use this result as a basis for computing PMI or NPMI values for word pairs. It requires the joint and marginal probabilities of words to compute log( p(x,y) / ( p(x)*p(y) ) ). The issue if you don't set count = "boolean" per window is that the estimated joint probability of a pair can exceed its marginal probabilities. If I were to compute the NPMI between "C" and "E" in the above example, assuming I get the joint probabilities from the fcm: > p <- fcm(toks, context = "window", count = "frequency", window = 2) / ntoken(toks)
> diag(p) <- colSums(dfm(toks)) / ntoken(toks)
> log( p["C", "E"] / (p["C", "C"] * p["E", "E"]) ) / - log( p["C", "E"] )
[1] 1.207519

This is incorrect since NPMI should be bounded between -1 and 1. It happens because the joint probability estimate for "C" and "E" (4/16) exceeds the marginal probability of "C" (3/16). I agree that for small windows this is unlikely to be a practical issue, but well, you never know.
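To see why the bound fails, here is a small pure-Python sketch of the NPMI formula (illustrative numbers, not quanteda code): NPMI stays within [-1, 1] only while the joint probability estimate does not exceed either marginal.

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: log(p(x,y) / (p(x) * p(y))) / -log(p(x,y))."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Consistent estimates (joint <= both marginals): NPMI stays within [-1, 1].
print(round(npmi(0.10, 0.20, 0.25), 3))  # 0.301

# The C/E numbers from the example: p(C,E) = 4/16 exceeds p(C) = 3/16
# because raw window counts were divided by the token total,
# so NPMI exceeds 1, matching the 1.2075 value above.
print(round(npmi(0.25, 0.1875, 0.25), 3))  # 1.208
```

The second call reproduces the out-of-range value from the R session, confirming that the problem lies in the probability estimates rather than in the NPMI formula itself.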
It is a nice way to use the FCM. I thought about it many times. I think the value becomes higher than 1.0 because the count for the pair of C and E is 4 when ordered = FALSE: > toks <- tokens("A B C D E C E A D F E B A C E D")
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = FALSE)
Feature co-occurrence matrix of: 6 by 6 features.
features
features A B C D E F
A 0 2 3 1 3 1
B 0 0 2 1 1 1
C 0 0 0 3 4 0
D 0 0 0 0 4 1
E 0 0 0 0 2 1
F 0 0 0 0 0 0
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = TRUE)
Feature co-occurrence matrix of: 6 by 6 features.
features
features A B C D E F
A 0 1 2 1 1 1
B 1 0 2 1 0 0
C 1 0 0 2 3 0
D 0 0 1 0 2 1
E 2 1 1 2 1 0
F 0 1 0 0 1 0
I don't think it is related to the setting of ordered. Here is a modified example: > toks <- tokens("A B C D E C E A D F E B A E C D")
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = TRUE)
Feature co-occurrence matrix of: 6 by 6 features.
features
features A B C D E F
A 0 1 2 1 1 1
B 1 0 1 1 1 0
C 1 0 0 2 2 0
D 0 0 1 0 2 1
E 2 1 2 2 1 0
F 0 1 0 0 1 0

It is also possible to modify the example in the other way, so that you get 4 even after setting ordered = TRUE.
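The windowed frequency count under discussion can be reproduced in a few lines of plain Python (a sketch of the counting rule, not quanteda's C++ implementation; the ordered variant keys each pair as (earlier token, later token)):

```python
from collections import Counter

def window_pairs(tokens, window=2, ordered=False):
    """Frequency co-occurrence counts within +/- `window` positions.

    A plain-Python sketch of the assumed counting rule. With
    ordered=True the key is (earlier token, later token); otherwise
    pairs are canonicalised into sorted order.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for w in range(1, window + 1):
            if i + w >= len(tokens):
                break
            a, b = tokens[i], tokens[i + w]
            counts[(a, b) if ordered else tuple(sorted((a, b)))] += 1
    return counts

toks = "A B C D E C E A D F E B A C E D".split()
print(window_pairs(toks)[("C", "E")])                # 4, as in ordered = FALSE
print(window_pairs(toks, ordered=True)[("C", "E")])  # 3
print(window_pairs(toks, ordered=True)[("E", "C")])  # 1
```

The unordered C-E count of 4 splits into 3 (C before E) plus 1 (E before C) in the ordered variant, matching the two matrices above.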
You are right. Below is a less efficient but easy way to get the right frequency. Is this what you need?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.9000
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens("A B C D E C E A D F E B A C E D")
toks_seg <- tokens_chunk(toks, size = 3, overlap = 2)
toks_seg
#> Tokens consisting of 16 documents.
#> text1.1 :
#> [1] "A" "B" "C"
#>
#> text1.2 :
#> [1] "B" "C" "D"
#>
#> text1.3 :
#> [1] "C" "D" "E"
#>
#> text1.4 :
#> [1] "D" "E" "C"
#>
#> text1.5 :
#> [1] "E" "C" "E"
#>
#> text1.6 :
#> [1] "C" "E" "A"
#>
#> [ reached max_ndoc ... 10 more documents ]
fcmt_seg <- fcm(toks_seg, context = "document", count = "boolean")
fcmt_seg
#> Feature co-occurrence matrix of: 6 by 6 features.
#> features
#> features A B C D E F
#> A 0 3 4 2 4 1
#> B 0 0 3 1 2 1
#> C 0 0 0 4 6 0
#> D 0 0 0 0 6 2
#> E 0 0 0 0 1 2
#> F 0 0 0 0 0 0
dfmt_seg <- dfm(toks_seg, tolower = FALSE)
p_seg <- fcmt_seg / sum(dfmt_seg)
diag(p_seg) <- colSums(dfmt_seg) / sum(dfmt_seg)
log(p_seg["C", "E"] / (p_seg["C", "C"] * p_seg["E", "E"]) ) / - log(p_seg["C", "E"] )
#> [1] 0.4547567

Created on 2021-04-28 by the reprex package (v2.0.0)
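The chunking trick can be sketched in plain Python to show why it yields per-window Boolean counts (a hypothetical analogue of tokens_chunk() followed by fcm(context = "document", count = "boolean"); it handles only off-diagonal pairs):

```python
from collections import Counter
from itertools import combinations

def chunk_boolean_cooc(tokens, size=3, overlap=2):
    """Boolean co-occurrence over overlapping chunks: each unordered
    pair of distinct types is counted at most once per chunk.

    step = size - overlap, so overlap = size - 1 slides one token
    at a time, mimicking tokens_chunk(size = 3, overlap = 2).
    """
    step = size - overlap
    counts = Counter()
    for start in range(0, len(tokens), step):
        types = sorted(set(tokens[start:start + size]))
        for pair in combinations(types, 2):
            counts[pair] += 1
    return counts

toks = "A B C D E C E A D F E B A C E D".split()
counts = chunk_boolean_cooc(toks)
print(counts[("C", "E")])  # 6, matching fcmt_seg["C", "E"] above
print(counts[("C", "D")])  # 4
```

Each chunk plays the role of one document, so the Boolean cap applies per chunk rather than per original document, which is exactly what the reprex above exploits.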
Thanks @odelmarcelle for explaining the use case. In earlier versions of what became the current fcm() implementation, similar questions came up. @koheiw it would be possible, however, to add a third option to count called "booleanfreq" (or something like that) to count the frequency of Boolean occurrences within a local context, by modifying the C++ function. Also, we could add this as a textstat to quanteda.textstats if you both think it useful enough. Nota bene: I think the denominator in the NPMI calculation should use a different log base.
Thanks for your answers. If you need some help on this, I can perhaps prepare a pull request to implement this "booleanfreq" option you mentioned? @kbenoit Conveniently, the NPMI remains the same regardless of the log base used (in the denominator or the numerator). That is not the case for the PMI, however.
PRs are always welcome, but let me first understand the desired behavior. What should the FCM look like for the example above?
You're right, I'm not sure I'm going in the right direction either. This other example challenges my approach... Using a window of 2, from the point of view of "b", "a" co-occurs 3 times. From the point of view of "a" and applying a boolean count, "b" co-occurs 2 times. This doesn't add up... I need to give it some thought.
There is no rush. The first question is whether the tokens_chunk() result above is what you need.
I took some time to read a few papers using PMI or NPMI in the hope of coming up with a nice solution. "Word Association Norms, Mutual Information, and Lexicography" mentions the possible lack of symmetry when retrieving co-occurrence frequencies, which also applies to the counts discussed here. I gave it a try with the "boolean window" approach and it gave interesting results. Using a geometric average with frequencies from the "boolean window" seems to yield a correct measure of NPMI. Let me show an example. Apologies for the lengthy function.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
FCMbool <- function(toks, W = 2) {
FCM <- as.matrix(fcm(toks))
FCM[] <- 0
boolCheck <- FCM[1, ]
list <- as.list(toks)
for (t in seq_along(list)) {
char_tokens <- list[[t]]
for (i in seq(length(char_tokens))) {
## reset the boolean flag. has the current word already paired with one of the
## other tokens?
boolCheck[] <- 0
## to the right
for (w in seq(W)) {
if (i + w > length(char_tokens)) break
if (boolCheck[char_tokens[i+w]] == 0) {
FCM[char_tokens[i], char_tokens[i+w]] = FCM[char_tokens[i], char_tokens[i+w]] + 1
boolCheck[char_tokens[i+w]] <- 1
}
}
## to the left
for (w in seq(W)) {
if (i - w < 1) break
if (boolCheck[char_tokens[i-w]] == 0) {
FCM[char_tokens[i], char_tokens[i-w]] = FCM[char_tokens[i], char_tokens[i-w]] + 1
boolCheck[char_tokens[i-w]] <- 1
}
}
}
}
FCM
}
# NPMI on the whole fcm
NPMI <- function(p) {
p <- as.matrix(p)
p <- p + 10^-99 # avoid log(0)
NPMI <- p / diag(p)
NPMI <- t(NPMI) / diag(p)
NPMI <- log(NPMI) / -log(p)
round(NPMI, 3)
}
toks <- tokens(c("b a b c", "a a c b e", "a c e f g"))
dfm <- dfm(toks, tolower = FALSE)
## boolean window approach
fcm_bool <- FCMbool(toks, W = 2)
fcm_bool # asymmetric
#> features
#> features b a c e f g
#> b 2 3 2 1 0 0
#> a 2 2 4 1 0 0
#> c 2 3 0 2 1 0
#> e 1 1 2 0 1 1
#> f 0 0 1 1 0 1
#> g 0 0 0 1 1 0
## classic fcm
fcm <- fcm(toks, context = "window", count = "frequency", window = 2, tri = FALSE)
fcm
#> Feature co-occurrence matrix of: 6 by 6 features.
#> features
#> features b a c e f g
#> b 2 3 2 1 0 0
#> a 3 2 4 1 0 0
#> c 2 4 0 2 1 0
#> e 1 1 2 0 1 1
#> f 0 0 1 1 0 1
#> g 0 0 0 1 1 0
## compound approach
toks_seg <- tokens_chunk(toks, size = 3, overlap = 2)
fcmt_seg <- fcm(toks_seg, context = "document", count = "boolean", tri = FALSE)
fcmt_seg
#> Feature co-occurrence matrix of: 6 by 6 features.
#> features
#> features b a c e f g
#> b 1 3 4 2 0 0
#> a 3 1 4 1 0 0
#> c 4 4 0 3 1 0
#> e 2 1 3 0 2 1
#> f 0 0 1 2 0 2
#> g 0 0 0 1 2 0
## boolean window NPMI
p_bool <- fcm_bool / sum(dfm)
diag(p_bool) <- colSums(dfm) / sum(dfm)
p_bool <- sqrt(p_bool * t(p_bool)) # force symmetry (geometric average)
NPMI(p_bool) # note how ["c", "a"] is equal to 1
#> features
#> features b a c e f g
#> b 1.000 0.602 0.583 0.321 -0.982 -0.982
#> a 0.602 1.000 1.000 0.212 -0.983 -0.983
#> c 0.583 1.000 1.000 0.792 0.584 -0.982
#> e 0.321 0.212 0.792 1.000 0.737 0.737
#> f -0.982 -0.983 0.584 0.737 1.000 1.000
#> g -0.982 -0.983 -0.982 0.737 1.000 1.000
## classic NPMI
p <- fcm / sum(dfm)
diag(p) <- colSums(dfm(toks)) / sum(dfm)
NPMI(p) # note how ["c", "a"] is outside the NPMI range
#> features
#> features b a c e f g
#> b 1.000 0.813 0.583 0.321 -0.982 -0.982
#> a 0.813 1.000 1.230 0.212 -0.983 -0.983
#> c 0.583 1.230 1.000 0.792 0.584 -0.982
#> e 0.321 0.212 0.792 1.000 0.737 0.737
#> f -0.982 -0.983 0.584 0.737 1.000 1.000
#> g -0.982 -0.983 -0.982 0.737 1.000 1.000
## compound NPMI
dfmt_seg <- dfm(toks_seg, tolower = FALSE)
p_seg <- fcmt_seg / sum(dfmt_seg)
diag(p_seg) <- colSums(dfmt_seg) / sum(dfmt_seg)
NPMI(p_seg) # ["c", "a"] is low despite always co-occurring
#> features
#> features b a c e f g
#> b 1.000 0.358 0.406 0.161 -0.983 -0.983
#> a 0.358 1.000 0.479 -0.025 -0.982 -0.982
#> c 0.406 0.479 1.000 0.302 0.091 -0.983
#> e 0.161 -0.025 0.302 1.000 0.463 0.173
#> f -0.983 -0.982 0.091 0.463 1.000 0.711
#> g -0.983 -0.982 -0.983 0.173 0.711 1.000
# Furthermore, this approach also works well on the previous example. See the
# comparison below between the two approaches. In particular, ["C", "D"] is
# estimated as 1 in the classic approach although one "D" does not co-occur with
# "C". ["C", "E"] is also spurious in the classic approach.
toks <- tokens("A B C D E C E A D F E B A C E D")
dfm <- dfm(toks, tolower = FALSE)
fcm_bool <- FCMbool(toks, W = 2)
p_bool <- fcm_bool / sum(dfm)
diag(p_bool) <- colSums(dfm) / sum(dfm)
p_bool <- sqrt(p_bool * t(p_bool)) # force symmetry (geometric average)
fcm <- fcm(toks, context = "window", count = "frequency", window = 2, tri = FALSE)
p <- fcm / sum(dfm)
diag(p) <- colSums(dfm(toks)) / sum(dfm)
NPMI(p_bool)
#> features
#> features A B C D E F
#> A 1.000 0.805 1.000 0.208 0.631 0.604
#> B 0.805 1.000 0.805 0.354 0.250 0.750
#> C 1.000 0.805 1.000 0.784 0.828 -0.980
#> D 0.208 0.354 0.784 1.000 1.000 0.604
#> E 0.631 0.250 0.828 1.000 1.000 0.500
#> F 0.604 0.750 -0.980 0.604 0.500 1.000
NPMI(p)
#> features
#> features A B C D E F
#> A 1.000 0.805 1.000 0.208 0.828 0.604
#> B 0.805 1.000 0.805 0.354 0.250 0.750
#> C 1.000 0.805 1.000 1.000 1.208 -0.980
#> D 0.208 0.354 1.000 1.000 1.208 0.604
#> E 0.828 0.250 1.208 1.208 1.000 0.500
#> F 0.604 0.750 -0.980 0.604 0.500 1.000
## Note that the geometric average can be applied on the fcm directly to obtain
## a symmetric result
sqrt(fcm_bool * t(fcm_bool))
#> features
#> features A B C D E F
#> A 0.00000 2 3.00000 1.000000 2.449490 1
#> B 2.00000 0 2.00000 1.000000 1.000000 1
#> C 3.00000 2 0.00000 2.449490 3.000000 0
#> D 1.00000 1 2.44949 0.000000 3.464102 1
#> E 2.44949 1 3.00000 3.464102 2.000000 1
#> F 1.00000 1 0.00000 1.000000 1.000000 0

Created on 2021-06-01 by the reprex package (v2.0.0)

This also makes sense when you think about the NPMI measure. NPMI is equal to 1 whenever p(x,y)^2 = p(x) * p(y). With the "boolean window" approach, two estimates of p(x,y) are created, one in the upper and one in the lower part of the co-occurrence matrix. One is bounded by p(x), the other by p(y). When both reach their bounds, their geometric average yields an NPMI of exactly 1.
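That closing argument can be checked numerically: if each directed joint estimate is bounded by its own marginal, their geometric average is bounded by sqrt(p(x)*p(y)), so NPMI cannot exceed 1 and equals 1 exactly at the bound. A sketch with illustrative numbers (not quanteda code):

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized PMI, as used throughout the thread."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

p_x, p_y = 0.2, 0.3

# Extreme case: each directed estimate sits exactly at its marginal bound,
# so the geometric average satisfies p(x,y)^2 == p(x) * p(y).
p_sym = math.sqrt(p_x * p_y)
print(round(npmi(p_sym, p_x, p_y), 6))  # 1.0

# A non-extreme case (directed estimates 0.15 <= p_x and 0.2 <= p_y)
# stays strictly below 1.
p_sym2 = math.sqrt(0.15 * 0.2)
print(npmi(p_sym2, p_x, p_y) < 1)  # True
```

This is why forcing symmetry with the geometric average, rather than an arithmetic one, keeps the resulting NPMI inside its theoretical range.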
Hello,

The result of fcm with the settings context = "window" and count = "boolean" looks a bit odd to me. I'm writing this as a question as I'm not sure whether this is the expected behaviour or not.

Whenever using "window" and "frequency", it counts the number of co-occurrences around each token in the specified window. In the following example, there are 3 "C" and, for those, 4 "E" are counted within the window size. That leads to the following fcm:

When specifying count = "boolean", I would expect it to return the number of times "C" co-occurred with "E" at least once within the window around each "C". Using the previous example, this would lead to 3 "boolean" co-occurrences between "C" and "E". But instead, it seems that the "boolean" setting limits the number of co-occurrences to 1 per document. Hence it produces:

where all co-occurrences between words are limited to 1, despite multiple windows being computed. Is this behaviour intended?