
Allow reshaping in tokens? #2061

Open · koheiw opened this issue Feb 19, 2021 · 14 comments

koheiw commented Feb 19, 2021

I am often frustrated that I can only segment documents into sentences at the corpus level, but I came up with an idea to make it possible on tokens using a boundary marker. I would use this often for word embedding, but I wonder whether there are broader use cases. If not, I will just add this to my LSX package.

require(stringi)
require(quanteda)

txt <- 'Mr. Jones and Mrs. Brown are very happy. So am I, Prof. Smith.'

corpus_partition <- function(x, marker = " | ") {
    # split each text into sentences and rejoin them with a visible boundary marker
    temp <- stri_split_boundaries(x, type = "sentence", locale = "en_US@ss=standard")
    unlist(lapply(temp, paste, collapse = marker))
}

txt2 <- corpus_partition(txt)
txt2
#> [1] "Mr. Jones and Mrs. Brown are very happy.  | So am I, Prof. Smith."
toks <- tokens(txt2)
toks
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "Mr"    "."     "Jones" "and"   "Mrs"   "."     "Brown" "are"   "very" 
#> [10] "happy" "."     "|"    
#> [ ... and 8 more ]
tokens_segment(toks, "|", extract_pattern = TRUE)
#> Tokens consisting of 2 documents and 1 docvar.
#> text1.1 :
#>  [1] "Mr"    "."     "Jones" "and"   "Mrs"   "."     "Brown" "are"   "very" 
#> [10] "happy" "."    
#> 
#> text1.2 :
#> [1] "So"    "am"    "I"     ","     "Prof"  "."     "Smith" "."

Created on 2021-02-19 by the reprex package (v1.0.0)

koheiw commented Feb 19, 2021

Keeping the original documents in the corpus also has an advantage in tokenization speed:

> corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
> corp_sent <- corpus_reshape(corp)
> 
> microbenchmark::microbenchmark(
+     tokens(corp_sent, verbose = FALSE),
+     tokens(corp, verbose = FALSE), 
+     times = 10
+ )
Unit: seconds
                               expr      min       lq     mean   median       uq      max neval
 tokens(corp_sent, verbose = FALSE) 19.05009 19.57867 20.12166 20.08268 20.78302 21.14905    10
      tokens(corp, verbose = FALSE) 16.63722 16.84975 17.52424 17.17449 17.67806 19.43895    10

kbenoit commented Feb 19, 2021

I like this idea in general. This could be very useful for many reasons, including collocation boundary detection. We already do this in textstat_collocations() (don't span punctuation) but keeping track of sentence boundaries would help make this process more systematic.

The question is whether we want to explore other, more fundamental ways to achieve this. The solution of adding the "|" is a bit like the way we currently have to add the POS to the type: intervening by adding something that's not really a token. It works, but could cause problems if a user sets this and then runs functions that don't need it.

What about splitting every corpus on sentences and storing it that way, but letting most operations group them on doc_id so that they don't appear to be split? An extreme version that I've been talking about off and on for some time involves splitting into sentences and then into tokens, or into tokens with a record of each token's sentence id. Then "tokenization" is just a matter of selecting from the tokens. Maybe too fundamental for v3, but if we could design the functionality to accomplish what you want above, with the option to change how it works through a more fundamental fix later, then that's an ideal way to proceed.

koheiw commented Feb 19, 2021

I actually apply corpus_reshape() and then dfm_group() to restore the original unit, but I have never been happy with this inefficient approach: many short documents take more storage space than a few long documents.
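
For reference, a minimal sketch of that reshape-then-regroup workflow (assuming the Guardian corpus loaded above, and that docid() returns the original document IDs for the reshaped object):

corp_sent <- corpus_reshape(corp, to = "sentences")            # many short sentence documents
dfmt_sent <- dfm(tokens(corp_sent))                            # sentence-level dfm
dfmt_doc  <- dfm_group(dfmt_sent, groups = docid(dfmt_sent))   # restore the original unit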

I agree that "|" is a bit low-tech. We could allocate a special token ID for sentence boundaries, but that would require a lot of changes in the C++ code.

This is still just an idea so we do not need to do this for v3.0.

koheiw commented Feb 19, 2021

The best way might be to shift the index by 100 for all the special tokens (padding, sentence boundaries, and future additions). This is the easiest option and a generalization of the current approach (a shift by 1 for padding).

Then what would be the visible character for sentence boundaries? There are candidates in https://en.wikipedia.org/wiki/Unicode_symbols
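
A toy illustration of the index-shift idea (this is not quanteda's actual internal encoding; the special-token names and the encode() helper are made up for the example):

offset  <- 100L                              # reserved block for special tokens
special <- c("{PAD}" = 1L, "{EOS}" = 2L)     # hypothetical special-token IDs
types   <- c("Mr", ".", "Jones")             # regular types start after the offset
encode  <- function(tok) {
    if (tok %in% names(special)) special[[tok]] else offset + match(tok, types)
}
ids <- vapply(c("Mr", ".", "{EOS}", "Jones"), encode, integer(1))
ids  # 101 102 2 103: "{EOS}" keeps a reserved low ID; regular types begin at 101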

kbenoit commented Feb 20, 2021

What about making a new subclass of the tokens object, call it "tokens3" for v3, that is not a list of tokens but a list of lists of tokens, where:

  • level 1 is a document (as now)
  • level 2 is a sentence within a document, similar to the results of corpus_reshape(x, to = "sentences")
  • level 3 is the token in the sentence

We'd need to adapt as.tokens() for the functions that need the ordinary tokens, which is simple to do. For the function you are talking about, and for textstat_collocations() and other functions that need sentence boundaries, we would write tokens3 methods.

One advantage is that we are using the list structure itself to record the sentences, which is both natural and efficient. nsentence(), for instance, would work instantly on tokens3 objects, but so would ntoken(), ntype(), etc. In fact this seems like a much more natural structure than considering an entire sentence a "token", as we do with tokens(x, what = "sentence").

This also has the advantage of not breaking anything current, but extending the functionality in a way that can be adapted separately and independently. I'm not suggesting we do this in a separate package, but that is how the experimental quanteda.sentiment works, by defining a "dictionary3" class, so that I could define extra functions that only operate if dictionary2 has been extended by creating this subclass.
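
A minimal sketch of what such a nested structure could look like in plain R lists (the object name and structure are hypothetical, not an implemented class):

# hypothetical "tokens3" structure: document > sentence > token
toks3 <- list(
    text1 = list(
        c("Mr", ".", "Jones", "and", "Mrs", ".", "Brown", "are", "very", "happy", "."),
        c("So", "am", "I", ",", "Prof", ".", "Smith", ".")
    )
)

lengths(toks3)                                          # sentences per document, cf. nsentence()
toks_doc  <- lapply(toks3, unlist, use.names = FALSE)   # flatten level 3: ordinary tokens
toks_sent <- unlist(toks3, recursive = FALSE)           # unnest level 2: one element per sentence,
                                                        # like the tokens_segment() result above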

@kbenoit kbenoit added this to the v3 release milestone Feb 20, 2021

koheiw commented Feb 20, 2021

It is not a good idea, because such a nested structure would require redesigning all of the C++ and R functions for tokens.

kbenoit commented Feb 20, 2021

We could easily implement it so that it would not, since we'd unlist level three in as.tokens() so that it became regular tokens for the vast majority of the tokens methods.

For the 2-3 that need the sentence structure, we could just unnest level three so that it became exactly like the result of the tokens_segment(toks, "|", extract_pattern = TRUE) you have above.

This could be trialled experimentally in a way that does not require any modification of existing functions, since it would be defined for an extension class "tokens3". I'm not suggesting we change or do away with the "tokens" class. (That would be a massive amount of work and break a massive amount of existing code!)

koheiw commented Feb 21, 2021

We need a tokens object that can be passed to C++ functions without transformation. Unlisting and re-nesting cause too much overhead.

kbenoit commented Feb 21, 2021

But in your implementation above, you have a stri_split_boundaries call, then lapply a paste, and then a tokens_segment. Unnesting one part of a list to get the equivalent tokens object would not involve more steps. Then an object identical to your end result above can be passed as is to our existing C++ functions.

koheiw commented Feb 21, 2021

I am thinking about a basic workflow (sketched after the list below). Step 2 requires unlisting and re-nesting, although steps 1 and 3 are more or less the same.

  1. tokenize documents with sentence boundary information
  2. select tokens by tokens_select()
  3. reshape documents into sentences by tokens_reshape()
  4. form a DFM
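
A sketch of that workflow in code, with the caveat that neither the sentence-boundary option in step 1 nor tokens_reshape() in step 3 exists in quanteda; both are hypothetical here:

toks <- tokens(corp, sentence_boundaries = TRUE)        # 1. hypothetical argument
toks <- tokens_select(toks, stopwords("en"),            # 2. select/remove on whole documents
                      selection = "remove")
toks_sent <- tokens_reshape(toks, to = "sentences")     # 3. hypothetical function
dfmt <- dfm(toks_sent)                                  # 4. sentence-level dfm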

kbenoit commented Feb 21, 2021

That sounds good. I'd prefer though to have a slightly different workflow wherein 1. allows us to choose the sentence segmenter, and does not require insertion of markers. This allows us to substitute the stringi sentence boundary detection with any sentence segmenter. I regularly use spacyr for this for instance since it's a smarter sentence tokeniser than the stringi equivalent. I am a bit uneasy about adding a new token to demarcate sentences, since this is a bit of a hack. It could change our token counts, for instance, and cause issues if for some reason a text actually contained the segment marker.

I wasn't sure what you meant exactly by 2 and 3. Is 2 supposed to be tokens_segment()? What are you selecting as part of this workflow?

If for this option we did not nest the list, but used doc_id for the original document - which we already do - and wrote a few extra methods for this extended type of object, then we would only need a new specialty class with a few new methods, and the rest could automatically reshape the object into a regular tokens object, e.g. your 3. and 4. above for a dfm, or for document-level kwic, etc. Since the tokens would not be nested under this scheme - just split into text1.1, text1.2, text2.1, etc. - all existing C++ code would work.
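
As a rough illustration of swapping in spacyr for the sentence detection (assuming spacyr and a spaCy model are installed; txt is the example text from the first comment):

library(spacyr)
spacy_initialize()

# one row per token, with doc_id and sentence_id columns marking sentence membership
parsed <- spacy_parse(txt, pos = FALSE, lemma = FALSE, entity = FALSE)

# split tokens by document and sentence to recover sentence-level token vectors
toks_sent <- split(parsed$token, list(parsed$doc_id, parsed$sentence_id), drop = TRUE)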

koheiw commented Feb 21, 2021

This is a more concrete version of the workflow. The point here is that the tokens object does not hold many short sentence vectors during steps 1 to 3.

  1. tokenize documents with sentence boundary information
  2. remove stopwords by tokens_select()
  3. form a DFM to perform analysis on the document level (e.g. topic modeling)
  4. reshape documents into sentences by tokens_reshape()
  5. form a DFM to perform analysis on the sentence level (e.g. word embedding)

We should not spend too much time on this idea as this is not for v3.

@kbenoit kbenoit modified the milestones: v3 release, post v3 actions Mar 14, 2021

koheiw commented Mar 20, 2021

We can do something similar with corpus_group() and tokens_segment(). It is slower, but it saves a lot of space in RAM.

> corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
> corp_sent <- corpus_reshape(corp)
> system.time({
+     corp <- corpus_group(corp_sent, concatenator = " #EOS ")
+     toks1 <- tokens(corp, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+     toks1 <- tokens_remove(toks1, stopwords())
+     dfmt <- dfm(tokens_segment(toks1, "#EOS", pattern_position = "after"))
+ })
   user  system elapsed 
 78.686   0.000  42.293 
> print(object.size(toks1), units = "Mb")
36.3 Mb
> ndoc(toks1)
[1] 10000
> 
> system.time({
+     toks2 <- tokens(corp_sent, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+     toks2 <- tokens_remove(toks2, stopwords())
+     dfmt <- dfm(toks2)
+ })
   user  system elapsed 
 64.768   0.000  33.606 
> print(object.size(toks2), units = "Mb")
134.9 Mb
> ndoc(toks2)
[1] 391395

kbenoit commented Mar 21, 2021

Very interesting. This is definitely a promising and worthwhile direction for more work.

Our existing fcm construction does not span sentences or elements that have been removed, when computing co-occurrence windows, but the issue of segmenting sentences has more uses than just word embeddings.
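
For the word-embedding use case, a brief example of how the sentence-level tokens from the benchmark above (toks2) keep co-occurrence windows within sentences, since each sentence is its own document:

# a 5-token window cannot cross a sentence boundary when each sentence is a document
fcmat <- fcm(toks2, context = "window", window = 5)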
