How best can we make artificial paragraphs? #2199

Open
koheiw opened this issue Jan 11, 2023 · 0 comments
koheiw commented Jan 11, 2023

Let's imagine we need to make artificial paragraphs by combining two adjacent sentences, run some operation on them, and then restore the original documents. This might be a rare case, but it raises a few questions about quanteda's functions.

I came up with the following pipeline, but it is too complex, so I propose adding a dedicated function for grouping segments of texts, although I do not know what we should call it.

At a minimum, we should add segid() and tokens_group(reset_docid = FALSE). With the first, we can avoid accessing the system docvar directly (attr(toks, "docvars")$segid_); with the second, we don't need to back up the original docid (toks$docid_org <- docid(toks)). A rough sketch of how the pipeline could look with these additions follows the reprex below.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.4
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
dat <- data.frame(text = c("A aaa aaa. Aa aaaa.",  
                           "B bb bbb bb. Bbb b. B bbb. Bb bb bb."),
                  var = c(10, 20))
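# reshape the corpus so that each sentence becomes its own document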
corp <- corpus(dat) %>% 
    corpus_reshape()
toks <- tokens(corp)
toks
#> Tokens consisting of 6 documents and 1 docvar.
#> text1.1 :
#> [1] "A"   "aaa" "aaa" "."  
#> 
#> text1.2 :
#> [1] "Aa"   "aaaa" "."   
#> 
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."  
#> 
#> text2.2 :
#> [1] "Bbb" "b"   "."  
#> 
#> text2.3 :
#> [1] "B"   "bbb" "."  
#> 
#> text2.4 :
#> [1] "Bb" "bb" "bb" "."
docid(toks)
#> [1] text1 text1 text2 text2 text2 text2
#> Levels: text1 text2
docvars(toks)
#>   var
#> 1  10
#> 2  10
#> 3  20
#> 4  20
#> 5  20
#> 6  20

# back up the original docid
toks$docid_org <- docid(toks) 

f <- paste0(docid(toks), ".", ceiling(attr(toks, "docvars")$segid_ / 2))
f <- factor(f, unique(f))
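# (segid_, set by corpus_reshape(), numbers the sentences within each document,
#  so ceiling(segid_ / 2) pairs adjacent sentences: 1,2 -> 1; 3,4 -> 2; ...)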

# chunk sentences to create artificial paragraphs
toks_grp <- tokens_group(toks, f) # the original docid is overwritten
toks_grp
#> Tokens consisting of 3 documents and 2 docvars.
#> text1.1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."   
#> 
#> text2.1 :
#> [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."  
#> 
#> text2.2 :
#> [1] "B"   "bbb" "."   "Bb"  "bb"  "bb"  "."
docid(toks_grp)
#> [1] text1.1 text2.1 text2.2
#> Levels: text1.1 text2.1 text2.2
docvars(toks_grp)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2
#> 3  20     text2

# restore original documents
toks2 <- tokens_group(toks_grp, docid_org) # group by the original docid
toks2
#> Tokens consisting of 2 documents and 2 docvars.
#> text1 :
#> [1] "A"    "aaa"  "aaa"  "."    "Aa"   "aaaa" "."   
#> 
#> text2 :
#>  [1] "B"   "bb"  "bbb" "bb"  "."   "Bbb" "b"   "."   "B"   "bbb" "."   "Bb" 
#> [ ... and 3 more ]
docid(toks2)
#> [1] text1 text2
#> Levels: text1 text2
docvars(toks2)
#>   var docid_org
#> 1  10     text1
#> 2  20     text2

Created on 2023-01-11 with reprex v2.0.2
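
For illustration, here is a rough sketch of how the chunking step above could look with the proposed additions. This is a hypothetical, not-yet-existing API: segid() and the reset_docid argument are only the proposals above, so their exact names and semantics are assumptions.

# hypothetical sketch of the proposed API; segid() and reset_docid do not exist yet
f <- paste0(docid(toks), ".", ceiling(segid(toks) / 2))
f <- factor(f, unique(f))

# assuming reset_docid = FALSE keeps the original docid instead of overwriting it
toks_grp <- tokens_group(toks, f, reset_docid = FALSE)

# ... run some operation on the artificial paragraphs ...

# restore the original documents without needing a docid_org backup
toks2 <- tokens_group(toks_grp, docid(toks_grp))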
