
Allow reshaping in tokens? #2061

Open · koheiw opened this issue Feb 19, 2021 · 14 comments

koheiw commented Feb 19, 2021

I am often frustrated that I can only segment documents into sentences at the corpus level, but I came up with an idea to make it possible on tokens using a boundary marker. I would use this often for word embedding, but I wonder whether there are broader use cases. If not, I will just add this to my LSX package.

require(stringi)
require(quanteda)

txt <- 'Mr. Jones and Mrs. Brown are very happy. So am I, Prof. Smith.'

corpus_partition <- function(x, marker = " | ") {
    # split each text into sentences and rejoin them with a visible boundary marker
    temp <- stri_split_boundaries(x, type = "sentence", locale = "en_US@ss=standard")
    unlist(lapply(temp, paste, collapse = marker))
}

txt2 <- corpus_partition(txt)
txt2
#> [1] "Mr. Jones and Mrs. Brown are very happy.  | So am I, Prof. Smith."
toks <- tokens(txt2)
toks
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "Mr"    "."     "Jones" "and"   "Mrs"   "."     "Brown" "are"   "very" 
#> [10] "happy" "."     "|"    
#> [ ... and 8 more ]
tokens_segment(toks, "|", extract_pattern = TRUE)
#> Tokens consisting of 2 documents and 1 docvar.
#> text1.1 :
#>  [1] "Mr"    "."     "Jones" "and"   "Mrs"   "."     "Brown" "are"   "very" 
#> [10] "happy" "."    
#> 
#> text1.2 :
#> [1] "So"    "am"    "I"     ","     "Prof"  "."     "Smith" "."

Created on 2021-02-19 by the reprex package (v1.0.0)

koheiw commented Feb 19, 2021

Keeping the original documents in the corpus also has an advantage in tokenization speed:

> corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
> corp_sent <- corpus_reshape(corp)
> 
> microbenchmark::microbenchmark(
+     tokens(corp_sent, verbose = FALSE),
+     tokens(corp, verbose = FALSE), 
+     times = 10
+ )
Unit: seconds
                               expr      min       lq     mean   median       uq      max neval
 tokens(corp_sent, verbose = FALSE) 19.05009 19.57867 20.12166 20.08268 20.78302 21.14905    10
      tokens(corp, verbose = FALSE) 16.63722 16.84975 17.52424 17.17449 17.67806 19.43895    10

kbenoit commented Feb 19, 2021

I like this idea in general. This could be very useful for many reasons, including collocation boundary detection. We already do this in textstat_collocations() (don't span punctuation) but keeping track of sentence boundaries would help make this process more systematic.

The question is whether we want to explore other, more fundamental ways to achieve this. The solution of adding the "|" is a bit like the way we currently have to add the POS to the type: intervening by adding something that's not really a token. It works, but could cause problems if a user sets this and then runs functions that don't need it.

What about splitting every corpus on sentences and storing it that way, but letting most operations group them on doc_id so that they don't appear to be split? An extreme version that I've been talking about off and on for some time involves splitting into sentences and then into tokens, or into tokens with a record of each token's sentence id. Then "tokenization" is just a matter of selecting from the tokens. Maybe too fundamental for v3, but if we could design the functionality to accomplish what you want above, with the option to change how it works through a more fundamental fix later, then that's an ideal way to proceed.

koheiw commented Feb 19, 2021

I actually apply corpus_reshape() and then dfm_group() to restore the original unit, but I have never been happy with this inefficient approach: many short documents take more storage space than a few long documents.
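
For reference, a minimal sketch of that reshape-then-regroup workflow (assuming the Guardian corpus loaded above, and that docid() returns the original document IDs for the reshaped object):

corp_sent <- corpus_reshape(corp, to = "sentences")            # many short sentence documents
dfmt_sent <- dfm(tokens(corp_sent))                            # sentence-level dfm
dfmt_doc  <- dfm_group(dfmt_sent, groups = docid(dfmt_sent))   # restore the original unit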

I agree that "|" is a bit low-tech. We could allocate a special token ID for sentence boundaries, but that would require a lot of changes in the C++ code.

This is still just an idea so we do not need to do this for v3.0.

koheiw commented Feb 19, 2021

The best way might be to shift the index by 100 for all the special tokens (padding, sentence boundaries, and future additions). This is the easiest option and a generalization of the current approach (a shift by 1 for padding).

Then what would be the visible character for sentence boundaries? There are candidates in https://en.wikipedia.org/wiki/Unicode_symbols
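
A toy illustration of the index-shift idea (this is not quanteda's actual internal encoding; the special-token names and the encode() helper are made up for the example):

offset  <- 100L                              # reserved block for special tokens
special <- c("{PAD}" = 1L, "{EOS}" = 2L)     # hypothetical special-token IDs
types   <- c("Mr", ".", "Jones")             # regular types start after the offset
encode  <- function(tok) {
    if (tok %in% names(special)) special[[tok]] else offset + match(tok, types)
}
ids <- vapply(c("Mr", ".", "{EOS}", "Jones"), encode, integer(1))
ids  # 101 102 2 103: "{EOS}" keeps a reserved low ID; regular types begin at 101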

kbenoit commented Feb 20, 2021

What about making a new subclass of the tokens object, call it "tokens3" for v3, that is not a list of tokens but a list of lists of tokens, where:

  • level 1 is a document (as now)
  • level 2 is a sentence within a document, similar to the results of corpus_reshape(x, to = "sentences")
  • level 3 is the token in the sentence

We'd need to adapt as.tokens() for the functions that need the ordinary tokens, which is simple to do. For the function you are talking about, and for textstat_collocations() and other functions that need sentence boundaries, we would write tokens3 methods.

One advantage is that we are using the list structure itself to record the sentences, which is both natural and efficient. nsentence(), for instance, would work instantly on tokens3 objects, but so would ntoken(), ntype(), etc. In fact this seems like a much more natural structure than considering an entire sentence a "token", as we do with tokens(x, what = "sentence").

This also has the advantage of not breaking anything current, but extending the functionality in a way that can be adapted separately and independently. I'm not suggesting we do this in a separate package, but that is how the experimental quanteda.sentiment works, by defining a "dictionary3" class, so that I could define extra functions that only operate if dictionary2 has been extended by creating this subclass.
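
A minimal sketch of what such a nested structure could look like in plain R lists (the object name and structure are hypothetical, not an implemented class):

# hypothetical "tokens3" structure: document > sentence > token
toks3 <- list(
    text1 = list(
        c("Mr", ".", "Jones", "and", "Mrs", ".", "Brown", "are", "very", "happy", "."),
        c("So", "am", "I", ",", "Prof", ".", "Smith", ".")
    )
)

lengths(toks3)                                          # sentences per document, cf. nsentence()
toks_doc  <- lapply(toks3, unlist, use.names = FALSE)   # flatten level 3: ordinary tokens
toks_sent <- unlist(toks3, recursive = FALSE)           # unnest level 2: one element per sentence,
                                                        # like the tokens_segment() result above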

@kbenoit kbenoit added this to the v3 release milestone Feb 20, 2021

koheiw commented Feb 20, 2021

It is not a good idea, because such a nested structure would require redesigning all of the C++ and R functions for tokens.

kbenoit commented Feb 20, 2021

We could easily implement it so that it would not, since we'd unlist level three in as.tokens() so that it became regular tokens for the vast majority of the tokens methods.

For the 2-3 that need the sentence structure, we could just unnest level three so that it became exactly like the result of the tokens_segment(toks, "|", extract_pattern = TRUE) you have above.

This could be trialled experimentally in a way that does not require any modification of existing functions, since it would be defined for an extension class "tokens3". I'm not suggesting we change or do away with the "tokens" class. (That would be a massive amount of work and break a massive amount of existing code!)

koheiw commented Feb 21, 2021

We need a tokens object that can be passed to C++ functions without transformation. Unlisting and re-nesting cause too much overhead.

kbenoit commented Feb 21, 2021

But in your implementation above, you have a stri_split_boundaries call, then lapply a paste, and then a tokens_segment. Unnesting one part of a list to get the equivalent tokens object would not involve more steps. Then an object identical to your end result above can be passed as is to our existing C++ functions.

koheiw commented Feb 21, 2021

I am thinking about a basic workflow (sketched after the list below). Step 2 requires unlisting and re-nesting, although steps 1 and 3 are more or less the same.

  1. tokenize documents with sentence boundary information
  2. select tokens by tokens_select()
  3. reshape documents into sentences by tokens_reshape()
  4. form a DFM
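
A sketch of that workflow in code, with the caveat that neither the sentence-boundary option in step 1 nor tokens_reshape() in step 3 exists in quanteda; both are hypothetical here:

toks <- tokens(corp, sentence_boundaries = TRUE)        # 1. hypothetical argument
toks <- tokens_select(toks, stopwords("en"),            # 2. select/remove on whole documents
                      selection = "remove")
toks_sent <- tokens_reshape(toks, to = "sentences")     # 3. hypothetical function
dfmt <- dfm(toks_sent)                                  # 4. sentence-level dfm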

kbenoit commented Feb 21, 2021

That sounds good. I'd prefer though to have a slightly different workflow wherein 1. allows us to choose the sentence segmenter, and does not require insertion of markers. This allows us to substitute the stringi sentence boundary detection with any sentence segmenter. I regularly use spacyr for this for instance since it's a smarter sentence tokeniser than the stringi equivalent. I am a bit uneasy about adding a new token to demarcate sentences, since this is a bit of a hack. It could change our token counts, for instance, and cause issues if for some reason a text actually contained the segment marker.

I wasn't sure what you meant exactly by 2 and 3. Is 2 supposed to be tokens_segment()? What are you selecting as part of this workflow?

If for this option we did not nest the list, but used doc_id for the original document - which we already do - and wrote a few extra methods for this extended type of object, then we would only need a new specialty class with a few new methods, and the rest could automatically reshape the object into a regular tokens object, e.g. your 3. and 4. above for a dfm, or for document-level kwic, etc. Since the tokens would not be nested under this scheme - just split into text1.1, text1.2, text2.1, etc. - all existing C++ code would work.
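
As a rough illustration of swapping in spacyr for the sentence detection (assuming spacyr and a spaCy model are installed; txt is the example text from the first comment):

library(spacyr)
spacy_initialize()

# one row per token, with doc_id and sentence_id columns marking sentence membership
parsed <- spacy_parse(txt, pos = FALSE, lemma = FALSE, entity = FALSE)

# split tokens by document and sentence to recover sentence-level token vectors
toks_sent <- split(parsed$token, list(parsed$doc_id, parsed$sentence_id), drop = TRUE)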

koheiw commented Feb 21, 2021

This is a more concrete version of the workflow. The point here is that the tokens object does not hold many short sentence vectors during steps 1 to 3.

  1. tokenize documents with sentence boundary information
  2. remove stopwords by tokens_select()
  3. form a DFM to perform analysis on the document level (e.g. topic modeling)
  4. reshape documents into sentences by tokens_reshape()
  5. form a DFM to perform analysis on the sentence level (e.g. word embedding)

We should not spend too much time on this idea as this is not for v3.

@kbenoit kbenoit modified the milestones: v3 release, post v3 actions Mar 14, 2021

koheiw commented Mar 20, 2021

We can do something similar with corpus_group() and tokens_segment(). It is slower, but it saves a lot of space in RAM.

> corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
> corp_sent <- corpus_reshape(corp)
> system.time({
+     corp <- corpus_group(corp_sent, concatenator = " #EOS ")
+     toks1 <- tokens(corp, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+     toks1 <- tokens_remove(toks1, stopwords())
+     dfmt <- dfm(tokens_segment(toks1, "#EOS", pattern_position = "after"))
+ })
   user  system elapsed 
 78.686   0.000  42.293 
> print(object.size(toks1), units = "Mb")
36.3 Mb
> ndoc(toks1)
[1] 10000
> 
> system.time({
+     toks2 <- tokens(corp_sent, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+     toks2 <- tokens_remove(toks2, stopwords())
+     dfmt <- dfm(toks2)
+ })
   user  system elapsed 
 64.768   0.000  33.606 
> print(object.size(toks2), units = "Mb")
134.9 Mb
> ndoc(toks2)
[1] 391395

kbenoit commented Mar 21, 2021

Very interesting. This is definitely a promising and worthwhile direction for more work.

Our existing fcm construction does not span sentences or elements that have been removed, when computing co-occurrence windows, but the issue of segmenting sentences has more uses than just word embeddings.
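
For the word-embedding use case, a brief example of how the sentence-level tokens from the benchmark above (toks2) keep co-occurrence windows within sentences, since each sentence is its own document:

# a 5-token window cannot cross a sentence boundary when each sentence is a document
fcmat <- fcm(toks2, context = "window", window = 5)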
