
Output of fcm(x, context = "window", count = "boolean") #2113

Open
odelmarcelle opened this issue Apr 26, 2021 · 12 comments

@odelmarcelle
Collaborator

odelmarcelle commented Apr 26, 2021

Hello,

The result of fcm with the settings context = "window" and count = "boolean" looks a bit odd to me.
I'm writing this as a question, as I'm not sure whether this is the expected behaviour or not.

When using "window" with "frequency", fcm counts the number of co-occurrences around each token within the specified window. In the following example, there are three occurrences of "C", and around them four occurrences of "E" fall within the window. That leads to the following fcm:

> library(quanteda)
> toks <- tokens("A B C D E C E A D F E B A C E D")
> fcm(toks, context = "window", count = "frequency", window = 2)
Feature co-occurrence matrix of: 6 by 6 features.
        features
features A B C D E F
       A 0 2 3 1 3 1
       B 0 0 2 1 1 1
       C 0 0 0 3 4 0
       D 0 0 0 0 4 1
       E 0 0 0 0 2 1
       F 0 0 0 0 0 0
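
As a sanity check on that count of 4 (a quick base-R sketch of the windowed count, not quanteda output):

> x <- c("A","B","C","D","E","C","E","A","D","F","E","B","A","C","E","D")
> pos_c <- which(x == "C")  # positions 3, 6, 14
> # count occurrences of "E" within +/- 2 positions of each "C"
> sum(sapply(pos_c, function(i) sum(x[max(1, i - 2):min(length(x), i + 2)] == "E")))
[1] 4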

When specifying count = "boolean", I would expect it to return the number of times "C" co-occurred with "E" at least once within the window around each "C". Using the previous example, this would lead to 3 "boolean" co-occurrences between "C" and "E".
But instead, it seems that the "boolean" setting limits the number of co-occurrences to 1 per document. Hence it produces:

> fcm(toks, context = "window", count = "boolean", window = 2)
Feature co-occurrence matrix of: 6 by 6 features.
        features
features A B C D E F
       A 0 1 1 1 1 1
       B 0 0 1 1 1 1
       C 0 0 0 1 1 0
       D 0 0 0 0 1 1
       E 0 0 0 0 2 1
       F 0 0 0 0 0 0

where all co-occurrences between words are capped at 1, despite multiple windows being computed. Is this behaviour intended?

@kbenoit
Collaborator

kbenoit commented Apr 26, 2021

Hi - Yes, that is intended, although I can see how it could have been implemented differently. Think of window as a "width", where the default "document" context is the entire document span. There is only a single window in that case, so

fcm(x, context = "document", count = "boolean")

is the same as the current implementation of context = "window", count = "boolean" when window is < ntoken(x).

When multiple windows fit inside the same document, as in your example, you'd expect boolean to apply within each window, not across all windows.

This could be added as an option. What's the use case for this?

@koheiw what do you think?

@koheiw
Collaborator

koheiw commented Apr 26, 2021

I find it hard to understand why people want to set count = "boolean". It is even more so when the context is a window, because the same word pairs do not occur twice within a window unless window is very large or the author repeats words like "bara bara bara" or "so so". Just don't use boolean for window.

@odelmarcelle
Collaborator Author

odelmarcelle commented Apr 27, 2021

My idea was to use this result as a basis for computing PMI or NPMI values for word pairs. This requires the joint and marginal probabilities of words to compute log( p(x,y) / ( p(x) * p(y) ) ). The issue, if you don't set count = "boolean", is that you might end up with p(x,y) > p(x) or p(x,y) > p(y).
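
For reference, the normalised PMI I'm computing is the standard one, bounded in [-1, 1]:

$$ \mathrm{npmi}(x,y) \;=\; \frac{\log \frac{p(x,y)}{p(x)\,p(y)}}{-\log p(x,y)} $$

which reaches 1 exactly when the two words always co-occur, i.e. when p(x,y) = p(x) = p(y).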

If I were to compute the NPMI between "C" and "E" in the above example, assuming I get the joint probabilities from fcm and the marginal probabilities from token counts, I would do:

> p <- fcm(toks, context = "window", count = "frequency", window = 2) / ntoken(toks)
> diag(p) <- colSums(dfm(toks)) / ntoken(toks)
> log( p["C", "E"] / (p["C", "C"] * p["E", "E"]) ) / - log( p["C", "E"] )
[1] 1.207519

This is incorrect, since the NPMI should be bounded between -1 and 1. It happens because p["C", "E"] is greater than p["C", "C"]. With my expected behaviour of context = "window", count = "boolean", p["C", "E"] and p["C", "C"] would be equal and the NPMI would be slightly below 1.
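
Spelled out with the counts I would expect (3 boolean "C"-"E" co-occurrences out of 16 tokens; my own arithmetic):

> p_CE <- 3/16; p_C <- 3/16; p_E <- 4/16
> log(p_CE / (p_C * p_E)) / -log(p_CE)
[1] 0.8281445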

I agree that for small windows this is unlikely to be a practical issue, but well, you never know.

@koheiw
Collaborator

koheiw commented Apr 27, 2021

It is a nice way to use an FCM. I have thought about it many times.

I think the value becomes higher than 1.0 because the count for the pair C and E is 4 when ordered = FALSE (the default). When ordered = TRUE, it is 3 in the upper triangle, which is the co-occurrence frequency needed for PMI.

> toks <- tokens("A B C D E C E A D F E B A C E D")
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = FALSE)
Feature co-occurrence matrix of: 6 by 6 features.
        features
features A B C D E F
       A 0 2 3 1 3 1
       B 0 0 2 1 1 1
       C 0 0 0 3 4 0
       D 0 0 0 0 4 1
       E 0 0 0 0 2 1
       F 0 0 0 0 0 0
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = TRUE)
Feature co-occurrence matrix of: 6 by 6 features.
        features
features A B C D E F
       A 0 1 2 1 1 1
       B 1 0 2 1 0 0
       C 1 0 0 2 3 0
       D 0 0 1 0 2 1
       E 2 1 1 2 1 0
       F 0 1 0 0 1 0

@odelmarcelle
Collaborator Author

I don't think it is related to the ordered setting, which simply splits the total number of co-occurrences between the upper and lower triangles of the matrix. It is simply by chance that you get 3 in the upper triangle using ordered = TRUE. By shifting the positions of the last "C" and "E", I instead get 2 in the upper triangle:

> toks <- tokens("A B C D E C E A D F E B A E C D")
> fcm(toks, context = "window", count = "frequency", window = 2, ordered = TRUE)
Feature co-occurrence matrix of: 6 by 6 features.
        features
features A B C D E F
       A 0 1 2 1 1 1
       B 1 0 1 1 1 0
       C 1 0 0 2 2 0
       D 0 0 1 0 2 1
       E 2 1 2 2 1 0
       F 0 1 0 0 1 0

It is also possible to modify the example the other way, so that you get 4 even after setting ordered = TRUE.

@koheiw
Collaborator

koheiw commented Apr 28, 2021

You are right. Below is a less efficient but easy way to get the right frequencies. Is this what you need?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.9000
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens("A B C D E C E A D F E B A C E D")
toks_seg <- tokens_chunk(toks, size = 3, overlap = 2)
toks_seg
#> Tokens consisting of 16 documents.
#> text1.1 :
#> [1] "A" "B" "C"
#> 
#> text1.2 :
#> [1] "B" "C" "D"
#> 
#> text1.3 :
#> [1] "C" "D" "E"
#> 
#> text1.4 :
#> [1] "D" "E" "C"
#> 
#> text1.5 :
#> [1] "E" "C" "E"
#> 
#> text1.6 :
#> [1] "C" "E" "A"
#> 
#> [ reached max_ndoc ... 10 more documents ]
fcmt_seg <- fcm(toks_seg, context = "document", count = "boolean")
fcmt_seg
#> Feature co-occurrence matrix of: 6 by 6 features.
#>         features
#> features A B C D E F
#>        A 0 3 4 2 4 1
#>        B 0 0 3 1 2 1
#>        C 0 0 0 4 6 0
#>        D 0 0 0 0 6 2
#>        E 0 0 0 0 1 2
#>        F 0 0 0 0 0 0
dfmt_seg <- dfm(toks_seg, tolower = FALSE)
p_seg <- fcmt_seg / sum(dfmt_seg)
diag(p_seg) <- colSums(dfmt_seg) / sum(dfmt_seg)
log(p_seg["C", "E"] / (p_seg["C", "C"] * p_seg["E", "E"]) ) / - log(p_seg["C", "E"] )
#> [1] 0.4547567

Created on 2021-04-28 by the reprex package (v2.0.0)

@kbenoit
Collaborator

kbenoit commented Apr 28, 2021

Thanks @odelmarcelle for explaining the use case. In earlier versions of what is now quanteda.textstats::textstat_collocations(), we used a similar method for computing PMI for ordered, adjacent word co-occurrences. What @koheiw suggests is exactly what I would suggest given the currently available functions.

@koheiw it would be possible, however, to add a third option to count, called "booleanfreq" (or something like that), to count the frequency of Boolean occurrences within a local context, by modifying the C++ function.

Also we could add this as a textstat to quanteda.textstats if you both think it useful enough.

Nota bene: I think the denominator in the NPMI calculation should be -log2(...).

@odelmarcelle
Collaborator Author

odelmarcelle commented Apr 28, 2021

Thanks for your answers.
@koheiw this is indeed something I thought of, but I realized that this approach is fundamentally different, because it computes the probability of a co-occurrence happening in any window. In contrast, fcm counts the co-occurrences within a window around each token. The result of this approach is that p_seg["C", "C"] is not equal to p_seg["C", "E"], while it should be (or at least I believe so).

If you need some help on this, I can perhaps prepare a pull request to implement the "booleanfreq" option you mentioned?

@kbenoit Conveniently, the NPMI remains the same regardless of the log base, since the base cancels between the numerator and the denominator. That is not the case for the PMI, however.
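
Since log_b(t) = ln(t) / ln(b), the 1/ln(b) factors cancel in the ratio:

$$ \frac{\log_b \frac{p(x,y)}{p(x)\,p(y)}}{-\log_b p(x,y)} \;=\; \frac{\ln \frac{p(x,y)}{p(x)\,p(y)}}{-\ln p(x,y)} $$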

@koheiw
Collaborator

koheiw commented Apr 28, 2021

PRs are always welcome, but let me understand the desired behavior. What should the FCM look like for c("b a b c", "a a c b e", "a c e f g") with window = 2 or window = 10?

@odelmarcelle
Collaborator Author

odelmarcelle commented Apr 28, 2021

You're right, I'm not sure I'm going in the right direction either. This other example challenges my approach... Using a window of 2, from the point of view of "b", "a" co-occurs 3 times. From the point of view of "a", applying a boolean count, "b" co-occurs 2 times. This doesn't add up... I need to give it some thought.

Using tokens_chunk is perhaps the most appropriate way to compute these probabilities.

@koheiw
Collaborator

koheiw commented Apr 29, 2021

There is no rush. If the above result with tokens_chunk() is useful for computing PMI, we can try to do it more efficiently using fcm().

@odelmarcelle
Collaborator Author

odelmarcelle commented Jun 1, 2021

I took some time to read a few papers using PMI or NPMI, in the hope of coming up with a nice solution.

"Word Association Norms, Mutual Information, and Lexicography" mentions the possible lack of symmetry when retrieving co-occurrence frequencies. The current implementation of fcm with ordered = TRUE replicates their methodology, except for the normalizing weight. (I guess this is not a surprise, since this paper is referenced in the help for fcm.) However, they suggest tackling the asymmetry by averaging the co-occurrence matrix with its transpose.

I gave it a try with the "boolean window" approach, and it gave interesting results. Taking a geometric average of the "boolean window" frequencies seems to yield a correct measure of NPMI.
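
In matrix form, with $P$ the matrix of joint probabilities and $\circ$ the element-wise (Hadamard) product, the symmetrised estimate used below is

$$ P_{\mathrm{sym}} \;=\; \sqrt{P \circ P^{\top}} \qquad \text{(element-wise square root)} $$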

Let me show an example. Apologies for the lengthy function.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

FCMbool <- function(toks, W = 2) {
  ## zero matrix with the right dimensions and feature names
  FCM <- as.matrix(fcm(toks))
  FCM[] <- 0
  boolCheck <- FCM[1, ]
  toks_list <- as.list(toks)
  for (t in seq_along(toks_list)) {
    char_tokens <- toks_list[[t]]
    for (i in seq_along(char_tokens)) {
      ## reset the boolean flags: has the current word already been paired
      ## with each of the other features?
      boolCheck[] <- 0
      ## to the right
      for (w in seq_len(W)) {
        if (i + w > length(char_tokens)) break
        if (boolCheck[char_tokens[i + w]] == 0) {
          FCM[char_tokens[i], char_tokens[i + w]] <- FCM[char_tokens[i], char_tokens[i + w]] + 1
          boolCheck[char_tokens[i + w]] <- 1
        }
      }
      ## to the left
      for (w in seq_len(W)) {
        if (i - w < 1) break
        if (boolCheck[char_tokens[i - w]] == 0) {
          FCM[char_tokens[i], char_tokens[i - w]] <- FCM[char_tokens[i], char_tokens[i - w]] + 1
          boolCheck[char_tokens[i - w]] <- 1
        }
      }
    }
  }
  FCM
}
# NPMI on the whole fcm; expects marginal probabilities p(x) on the diagonal
NPMI <- function(p) {
  p <- as.matrix(p)
  p <- p + 10^-99            # avoid log(0)
  NPMI <- p / diag(p)        # divide each row x by p(x): p(x,y) / p(x)
  NPMI <- t(NPMI) / diag(p)  # divide by p(y) as well: p(x,y) / (p(x) * p(y))
  NPMI <- log(NPMI) / -log(p)
  round(NPMI, 3)
}

toks <- tokens(c("b a b c", "a a c b e", "a c e f g"))
dfm <- dfm(toks, tolower = FALSE)

## boolean window approach
fcm_bool <- FCMbool(toks, W = 2)
fcm_bool # asymmetric
#>         features
#> features b a c e f g
#>        b 2 3 2 1 0 0
#>        a 2 2 4 1 0 0
#>        c 2 3 0 2 1 0
#>        e 1 1 2 0 1 1
#>        f 0 0 1 1 0 1
#>        g 0 0 0 1 1 0

## classic fcm
fcm <- fcm(toks, context = "window", count = "frequency", window = 2, tri = FALSE)
fcm
#> Feature co-occurrence matrix of: 6 by 6 features.
#>         features
#> features b a c e f g
#>        b 2 3 2 1 0 0
#>        a 3 2 4 1 0 0
#>        c 2 4 0 2 1 0
#>        e 1 1 2 0 1 1
#>        f 0 0 1 1 0 1
#>        g 0 0 0 1 1 0

## compound approach
toks_seg <- tokens_chunk(toks, size = 3, overlap = 2)
fcmt_seg <- fcm(toks_seg, context = "document", count = "boolean", tri = FALSE)
fcmt_seg
#> Feature co-occurrence matrix of: 6 by 6 features.
#>         features
#> features b a c e f g
#>        b 1 3 4 2 0 0
#>        a 3 1 4 1 0 0
#>        c 4 4 0 3 1 0
#>        e 2 1 3 0 2 1
#>        f 0 0 1 2 0 2
#>        g 0 0 0 1 2 0

## boolean window NPMI
p_bool <- fcm_bool / sum(dfm)
diag(p_bool) <- colSums(dfm) / sum(dfm)
p_bool <- sqrt(p_bool * t(p_bool)) # force symmetry (geometric average)
NPMI(p_bool) # note how ["c", "a"] is equal to 1
#>         features
#> features      b      a      c     e      f      g
#>        b  1.000  0.602  0.583 0.321 -0.982 -0.982
#>        a  0.602  1.000  1.000 0.212 -0.983 -0.983
#>        c  0.583  1.000  1.000 0.792  0.584 -0.982
#>        e  0.321  0.212  0.792 1.000  0.737  0.737
#>        f -0.982 -0.983  0.584 0.737  1.000  1.000
#>        g -0.982 -0.983 -0.982 0.737  1.000  1.000

## classic NPMI
p <- fcm / sum(dfm)
diag(p) <- colSums(dfm(toks)) / sum(dfm)
NPMI(p) # note how ["c", "a"] is outside the NPMI range
#>         features
#> features      b      a      c     e      f      g
#>        b  1.000  0.813  0.583 0.321 -0.982 -0.982
#>        a  0.813  1.000  1.230 0.212 -0.983 -0.983
#>        c  0.583  1.230  1.000 0.792  0.584 -0.982
#>        e  0.321  0.212  0.792 1.000  0.737  0.737
#>        f -0.982 -0.983  0.584 0.737  1.000  1.000
#>        g -0.982 -0.983 -0.982 0.737  1.000  1.000

## compound NPMI
dfmt_seg <- dfm(toks_seg, tolower = FALSE)
p_seg <- fcmt_seg / sum(dfmt_seg)
diag(p_seg) <- colSums(dfmt_seg) / sum(dfmt_seg)
NPMI(p_seg) # ["c", "a"] is low despite always co-occuring
#>         features
#> features      b      a      c      e      f      g
#>        b  1.000  0.358  0.406  0.161 -0.983 -0.983
#>        a  0.358  1.000  0.479 -0.025 -0.982 -0.982
#>        c  0.406  0.479  1.000  0.302  0.091 -0.983
#>        e  0.161 -0.025  0.302  1.000  0.463  0.173
#>        f -0.983 -0.982  0.091  0.463  1.000  0.711
#>        g -0.983 -0.982 -0.983  0.173  0.711  1.000


# Furthermore, this approach also works well on the previous example. See the
# comparison below between the two approaches. In particular, ["C", "D"] is
# estimated as 1 in the classic approach although one "D" does not co-occur with
# "C". ["C", "E"] is also spurious in the classic approach.


toks <- tokens("A B C D E C E A D F E B A C E D")
dfm <- dfm(toks, tolower = FALSE)
fcm_bool <- FCMbool(toks, W = 2)
p_bool <- fcm_bool / sum(dfm)
diag(p_bool) <- colSums(dfm) / sum(dfm)
p_bool <- sqrt(p_bool * t(p_bool)) # force symmetry (geometric average)

fcm <- fcm(toks, context = "window", count = "frequency", window = 2, tri = FALSE)
p <- fcm / sum(dfm)
diag(p) <- colSums(dfm(toks)) / sum(dfm)

NPMI(p_bool)
#>         features
#> features     A     B      C     D     E      F
#>        A 1.000 0.805  1.000 0.208 0.631  0.604
#>        B 0.805 1.000  0.805 0.354 0.250  0.750
#>        C 1.000 0.805  1.000 0.784 0.828 -0.980
#>        D 0.208 0.354  0.784 1.000 1.000  0.604
#>        E 0.631 0.250  0.828 1.000 1.000  0.500
#>        F 0.604 0.750 -0.980 0.604 0.500  1.000
NPMI(p)
#>         features
#> features     A     B      C     D     E      F
#>        A 1.000 0.805  1.000 0.208 0.828  0.604
#>        B 0.805 1.000  0.805 0.354 0.250  0.750
#>        C 1.000 0.805  1.000 1.000 1.208 -0.980
#>        D 0.208 0.354  1.000 1.000 1.208  0.604
#>        E 0.828 0.250  1.208 1.208 1.000  0.500
#>        F 0.604 0.750 -0.980 0.604 0.500  1.000


## Note that the geometric average can be applied to the fcm directly to obtain
## a symmetric result
sqrt(fcm_bool * t(fcm_bool))
#>         features
#> features       A B       C        D        E F
#>        A 0.00000 2 3.00000 1.000000 2.449490 1
#>        B 2.00000 0 2.00000 1.000000 1.000000 1
#>        C 3.00000 2 0.00000 2.449490 3.000000 0
#>        D 1.00000 1 2.44949 0.000000 3.464102 1
#>        E 2.44949 1 3.00000 3.464102 2.000000 1
#>        F 1.00000 1 0.00000 1.000000 1.000000 0

Created on 2021-06-01 by the reprex package (v2.0.0)

This also makes sense when you think about the NPMI measure. The NPMI is equal to 1 whenever p(x,y)^2 = p(x) * p(y). With the "boolean window" approach, two estimates of p(x,y) are created, in the upper and lower parts of the co-occurrence matrix. One is bounded by p(x), the other by p(y). When both reach their bounds, their geometric average yields an NPMI of exactly 1.
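
Concretely, if the row-wise estimate reaches its bound p(x) and the column-wise estimate reaches its bound p(y), then

$$ p_{\mathrm{sym}}(x,y) = \sqrt{p(x)\,p(y)} \;\Rightarrow\; p_{\mathrm{sym}}(x,y)^2 = p(x)\,p(y) \;\Rightarrow\; \mathrm{npmi}(x,y) = 1 $$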

@kbenoit kbenoit added this to the v4 release milestone Apr 12, 2023
@kbenoit kbenoit removed this from the v4 release milestone Oct 18, 2023