Allow reshaping in tokens? #2061
Keeping the original documents in the corpus also has an advantage in tokenization.
I like this idea in general. This could be very useful for many reasons, including collocation boundary detection. We already do this in one place; the question is whether we want to explore other, more fundamental ways to achieve it. The solution of adding the "|" is a bit like the way we currently have to add the POS to the type: intervening by adding something that's not really a token. It works, but it could cause problems if a user sets this and then runs functions that don't need it. What about splitting every corpus on sentences and storing it that way, but letting most operations group them on doc_id so that they don't appear to be split? An extreme version that I've been talking about off and on for some time involves splitting into sentences and then into tokens, or into tokens with a record of each token's sentence id. Then "tokenization" is just a matter of selecting from the tokens. Maybe too fundamental for v3, but if we could design the functionality to accomplish what you want above, with the option to change how it works through a more fundamental fix later, then that's an ideal way to proceed.
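To make the "split into sentences, regroup on doc_id" idea concrete, here is a minimal sketch using the existing corpus_reshape(); the toy corpus and document names are made up for illustration:

```r
library(quanteda)

corp <- corpus(c(d1 = "First sentence. Second sentence.",
                 d2 = "Another document. It has two sentences."))

# store the corpus at the sentence level
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)   # 4

# most operations could regroup on doc_id so the split stays invisible to the user
corp_back <- corpus_reshape(corp_sent, to = "documents")
ndoc(corp_back)   # 2
```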
I actually apply this myself. I agree that "|" is a bit low-tech. We could allocate a special token ID for sentence boundaries, but that would require a lot of changes in C++. This is still just an idea, so we do not need to do this for v3.0.
The best way might be to shift the index by 100 for all the special tokens (padding, sentence boundaries, and future additions). This is the easiest option and a generalization of the current approach (shifting by 1 for padding). Then what would be the visible character for sentence boundaries? There are candidates in https://en.wikipedia.org/wiki/Unicode_symbols
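To illustrate the ID-shifting scheme, here is a hypothetical sketch in plain R; this is not how quanteda stores tokens internally, and the names, offsets, and the broken-bar display character are placeholders only:

```r
OFFSET <- 100L     # reserve IDs 1..100 for special tokens
ID_PAD <- 1L       # padding, currently the only special token (shift by 1)
ID_EOS <- 2L       # hypothetical sentence-boundary ID

types   <- c("quick", "brown", "fox")   # regular types would start at OFFSET + 1
special <- c("", "\u00A6")              # display: "" for padding, broken bar for EOS

ids <- c(OFFSET + 1L, OFFSET + 2L, ID_EOS, OFFSET + 3L)   # "quick brown <EOS> fox"

# map integer IDs back to visible tokens
id_to_token <- function(id) if (id > OFFSET) types[id - OFFSET] else special[id]
vapply(ids, id_to_token, character(1))
#> "quick" "brown" "¦" "fox"
```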
What about making a new subclass of the tokens object, call it "tokens3" for v3, that is not a list of token vectors but a list of lists of token vectors, with one inner element per sentence?
We'd need to adapt some functions, but one advantage is that we are using the list structure itself to record the sentences, which is both natural and efficient. This also has the advantage of not breaking anything current, while extending the functionality in a way that can be adapted separately and independently. I'm not suggesting we do this in a separate package, but that is how the experimental quanteda.sentiment works: by defining a "dictionary3" class, so that I could define extra functions that only operate if dictionary2 has been extended by creating this subclass.
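As a rough sketch of what the nested structure could look like (the "tokens3" class does not exist; plain lists are used here just to illustrate the unlisting step):

```r
# one element per document; within each document, one character vector per sentence
toks3 <- list(
  doc1 = list(c("This", "is", "one", "sentence"),
              c("And", "a", "second")),
  doc2 = list(c("Only", "one", "sentence", "here"))
)

# flatten level three for the many functions that ignore sentence structure,
# recovering the familiar one-vector-per-document shape
toks_flat <- lapply(toks3, unlist, use.names = FALSE)

# sentence boundaries remain recoverable from the nesting
lapply(toks3, lengths)   # doc1: 4 3 ; doc2: 4
```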
It is not a good idea because such a nested structure requires redesigning all the C++ and R functions for tokens.
We could easily implement it so that it would not, since we'd unlist level three for most functions. For the 2–3 that need the sentence structure, we could just unnest level three so that it became exactly like the result of the segmentation approach above. This could be trialled experimentally in a way that does not require any modification of existing functions, since it would be defined for an extension class "tokens3". I'm not suggesting we change or do away with the "tokens" class. (That would be a massive amount of work and break a massive amount of existing code!)
We need a tokens object that can be passed to C++ functions without transformation. Unlisting and re-nesting cause too much overhead.
But in your implementation above, you have a stri_split_boundaries() call, then an lapply() over paste(), and then a tokens_segment(). Unnesting one part of a list to get the equivalent tokens object would not involve more steps. Then an object identical to your end result above can be passed as is to our existing C++ functions.
I am thinking about a basic workflow. Step 2 requires unlisting and re-nesting, although steps 1 and 3 are more or less the same.
That sounds good. I'd prefer, though, to have a slightly different workflow wherein step 1 allows us to choose the sentence segmenter and does not require insertion of markers. This would let us substitute the stringi sentence boundary detection with any sentence segmenter. I regularly use spacyr for this, for instance, since it's a smarter sentence tokeniser than the stringi equivalent. I am a bit uneasy about adding a new token to demarcate sentences, since this is a bit of a hack. It could change our token counts, for instance, and cause issues if for some reason a text actually contained the segment marker. I wasn't sure what you meant exactly by 2 and 3; is 2 supposed to be the unlisting and re-nesting step? And what if, for this option, we did not nest the list but instead kept a record of each token's sentence id, as mentioned above?
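On the substitutable-segmenter point, a small sketch of the two options; the spacyr lines assume a working spaCy installation, so they are left commented out:

```r
txt <- c(d1 = "Dr. Smith arrived. He left early.")

# stringi's rule-based sentence boundary detection
stringi::stri_split_boundaries(txt, type = "sentence")

# spacyr's model-based sentence tokenizer, typically better on hard cases
# library(spacyr)
# spacy_initialize()
# spacy_tokenize(txt, what = "sentence")
```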
This is the more concrete version of the workflow. The point here is that the tokens object does not consist of many short sentence vectors between steps 1 and 3.
We should not spend too much time on this idea, as this is not for v3.
We can do something similar with corpus_reshape() and corpus_group():
> corp <- readRDS("/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds")
> corp_sent <- corpus_reshape(corp)
> system.time({
+ corp <- corpus_group(corp_sent, concatenator = " #EOS ")
+ toks1 <- tokens(corp, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+ toks1 <- tokens_remove(toks1, stopwords())
+ dfmt <- dfm(tokens_segment(toks1, "#EOS", pattern_position = "after"))
+ })
user system elapsed
78.686 0.000 42.293
> print(object.size(toks1), units = "Mb")
36.3 Mb
> ndoc(toks1)
[1] 10000
>
> system.time({
+ toks2 <- tokens(corp_sent, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
+ toks2 <- tokens_remove(toks2, stopwords())
+ dfmt <- dfm(toks2)
+ })
user system elapsed
64.768 0.000 33.606
> print(object.size(toks2), units = "Mb")
134.9 Mb
> ndoc(toks2)
[1] 391395
Very interesting. This is definitely a promising and worthwhile direction for more work. Our existing fcm construction does not span sentences, or elements that have been removed, when computing co-occurrence windows, but the issue of segmenting sentences has more uses than just word embeddings.
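As a quick illustration of the co-occurrence point: once each sentence is its own document, fcm() windows cannot extend across sentence boundaries. A toy example, assuming nothing beyond exported quanteda functions:

```r
library(quanteda)

toks_sent <- tokens(corpus_reshape(corpus("The cat sat. The dog barked."),
                                   to = "sentences"),
                    remove_punct = TRUE)

# 5-token windows are confined to each sentence-level document,
# so "cat" and "dog" never co-occur here
fcm(toks_sent, context = "window", window = 5)
```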
I am often frustrated because I can only segment documents into sentences on a corpus, but I came up with an idea to make it possible on tokens with a boundary marker. I would use this often for word embedding, but I wonder if there are broader use cases. If not, I will just add this to my LSX package.
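The reprex output itself is not reproduced here; the following is a minimal sketch of the boundary-marker idea, along the lines of the benchmark above (toy text, and the marker string "#EOS" chosen arbitrarily):

```r
library(quanteda)

corp <- corpus(c(d1 = "One sentence here. And a second one."))

# reshape to sentences, then glue them back with a visible marker between sentences
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_mark <- corpus_group(corp_sent, concatenator = " #EOS ")

# tokenize at the document level, then split back into sentences at the marker
toks <- tokens(corp_mark, remove_punct = TRUE)
toks_sent <- tokens_segment(toks, "#EOS", pattern_position = "after")
ndoc(toks_sent)   # 2
```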
Created on 2021-02-19 by the reprex package (v1.0.0)