Let's imagine we need to make artificial paragraphs by combining 2 adjacent sentences, run some operation on them, and then restore the original documents. This might be a rare case, but it raises a few questions about quanteda's functions.

I came up with the following pipeline, but it is too complex, so I propose adding a dedicated function for grouping segments of texts, although I am not sure what we should call it.

At the very least, we should add `segid()` and `tokens_group(reset_docid = FALSE)`. With the first, we can avoid accessing the system docvar directly (`attr(toks, "docvars")$segid_`); with the second, we don't need to back up the original docid (`toks$docid_org <- docid(toks)`).
```r
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.4
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
dat <- data.frame(text = c("A aaa aaa. Aa aaaa.",
                           "B bb bbb bb. Bbb b. B bbb. Bb bb bb."),
                  var = c(10, 20))
corp <- corpus(dat) %>%
    corpus_reshape()
toks <- tokens(corp)
toks
#> Tokens consisting of 6 documents and 1 docvar.
#> text1.1 :
#> [1] "A"   "aaa" "aaa" "."
#>
#> text1.2 :
#> [1] "Aa"   "aaaa" "."
#>
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."
#>
#> text2.2 :
#> [1] "Bbb" "b"   "."
#>
#> text2.3 :
#> [1] "B"   "bbb" "."
#>
#> text2.4 :
#> [1] "Bb" "bb" "bb" "."
docid(toks)
#> [1] text1 text1 text2 text2 text2 text2
#> Levels: text1 text2
docvars(toks)
#>   var
#> 1  10
#> 2  10
#> 3  20
#> 4  20
#> 5  20
#> 6  20

# back up the original docid
toks$docid_org <- docid(toks)
f <- paste0(docid(toks), ".", ceiling(attr(toks, "docvars")$segid_ / 2))
f <- factor(f, unique(f))

# chunk sentences to create artificial paragraphs
toks_grp <- tokens_group(toks, f) # the original docid is overwritten
toks_grp
#> Tokens consisting of 3 documents and 2 docvars.
#> text1.1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."
#>
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."
#>
#> text2.2 :
#> [1] "B"   "bbb" "."   "Bb"  "bb"  "bb"  "."
docid(toks_grp)
#> [1] text1.1 text2.1 text2.2
#> Levels: text1.1 text2.1 text2.2
docvars(toks_grp)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2
#> 3  20     text2

# restore original documents
toks2 <- tokens_group(toks_grp, docid_org) # group by the original docid
toks2
#> Tokens consisting of 2 documents and 2 docvars.
#> text1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."
#>
#> text2 :
#> [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."   "B"   "bbb" "."   "Bb"
#> [ ... and 3 more ]
docid(toks2)
#> [1] text1 text2
#> Levels: text1 text2
docvars(toks2)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2
```
Created on 2023-01-11 with reprex v2.0.2