How best can we make artificial paragraphs? #2199

Open
koheiw opened this issue Jan 11, 2023 · 0 comments
koheiw commented Jan 11, 2023

Let's imagine we need to make artificial paragraphs by combining two adjacent sentences, run some operation on them, and then restore the original documents. This might be a rare case, but it raises a few questions about quanteda's functions.

I came up with the following pipeline, but it is too complex, so I propose adding a dedicated function for grouping segments of texts, although I do not know what we should call it.

At a minimum, we should add segid() and tokens_group(reset_docid = FALSE). With the first, we can avoid accessing the system docvar directly (attr(toks, "docvars")$segid_); with the second, we don't need to back up the original docid (toks$docid_org <- docid(toks)). A rough sketch of how the pipeline could look with these additions follows the reprex below.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.4
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
dat <- data.frame(text = c("A aaa aaa. Aa aaaa.",  
                           "B bb bbb bb. Bbb b. B bbb. Bb bb bb."),
                  var = c(10, 20))
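# reshape the corpus so that each sentence becomes its own document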
corp <- corpus(dat) %>% 
    corpus_reshape()
toks <- tokens(corp)
toks
#> Tokens consisting of 6 documents and 1 docvar.
#> text1.1 :
#> [1] "A"   "aaa" "aaa" "."  
#> 
#> text1.2 :
#> [1] "Aa"   "aaaa" "."   
#> 
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."  
#> 
#> text2.2 :
#> [1] "Bbb" "b"   "."  
#> 
#> text2.3 :
#> [1] "B"   "bbb" "."  
#> 
#> text2.4 :
#> [1] "Bb" "bb" "bb" "."
docid(toks)
#> [1] text1 text1 text2 text2 text2 text2
#> Levels: text1 text2
docvars(toks)
#>   var
#> 1  10
#> 2  10
#> 3  20
#> 4  20
#> 5  20
#> 6  20

# back up the original docid
toks$docid_org <- docid(toks) 

f <- paste0(docid(toks), ".", ceiling(attr(toks, "docvars")$segid_ / 2))
f <- factor(f, unique(f))
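# (segid_, set by corpus_reshape(), numbers the sentences within each document,
#  so ceiling(segid_ / 2) pairs adjacent sentences: 1,2 -> 1; 3,4 -> 2; ...)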

# chunk sentences to create artificial paragraphs
toks_grp <- tokens_group(toks, f) # the original docid is overwritten
toks_grp
#> Tokens consisting of 3 documents and 2 docvars.
#> text1.1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."   
#> 
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."  
#> 
#> text2.2 :
#> [1] "B"   "bbb" "."   "Bb"  "bb"  "bb"  "."
docid(toks_grp)
#> [1] text1.1 text2.1 text2.2
#> Levels: text1.1 text2.1 text2.2
docvars(toks_grp)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2
#> 3  20     text2

# restore original documents
toks2 <- tokens_group(toks_grp, docid_org) # group by the original docid
toks2
#> Tokens consisting of 2 documents and 2 docvars.
#> text1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."   
#> 
#> text2 :
#>  [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."   "B"   "bbb" "."   "Bb" 
#> [ ... and 3 more ]
docid(toks2)
#> [1] text1 text2
#> Levels: text1 text2
docvars(toks2)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2

Created on 2023-01-11 with reprex v2.0.2
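
For illustration, here is a rough sketch of how the chunking step above could look with the proposed additions. This is a hypothetical, not-yet-existing API: segid() and the reset_docid argument are only the proposals above, so their exact names and semantics are assumptions.

# hypothetical sketch of the proposed API; segid() and reset_docid do not exist yet
f <- paste0(docid(toks), ".", ceiling(segid(toks) / 2))
f <- factor(f, unique(f))

# assuming reset_docid = FALSE keeps the original docid instead of overwriting it
toks_grp <- tokens_group(toks, f, reset_docid = FALSE)

# ... run some operation on the artificial paragraphs ...

# restore the original documents without needing a docid_org backup
toks2 <- tokens_group(toks_grp, docid(toks_grp))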
