docvarsfrom = "filepaths" not working as expected #141

Open
kbenoit opened this issue Oct 29, 2018 · 1 comment

kbenoit commented Oct 29, 2018

Error

With docvarsfrom = "filepaths", this should derive docvars from the file paths only, not from both the file paths and the filenames.

> (rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"), 
+                  docvarsfrom = "filepaths", docvarnames = "sentiment"))
readtext object consisting of 10 documents and 4 docvars.
# data.frame [10 × 6]
  doc_id       text         sentiment                                    docvar2    docvar3 docvar4
  <chr>        <chr>        <chr>                                        <chr>      <chr>   <chr>  
1 neg_cv000_2… "\"plot : t… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv000   29416.…
2 neg_cv001_1… "\"the happ… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv001   19502.…
3 neg_cv002_1… "\"it is mo… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv002   17424.…
4 neg_cv003_1… "\" \" ques… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv003   12683.…
5 neg_cv004_1… "\"synopsis… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv004   12641.…
6 pos_cv000_2… "\"films ad… /Library/Frameworks/R.framework/Versions/3.… reviews/p… cv000   29590.…
# ... with 4 more rows
Warning message:
In get_docvars_filenames(files, dvsep, docvarnames, docvarsfrom ==  :
  Fewer docnames supplied than existing docvars - last 3 docvars given generic names.

Expected behaviour

The idea behind docvarsfrom = "filepaths" is not to parse the filenames, but rather to take as docvars the folder parts of the paths matched by the supplied file pattern.

So in the example:

DATA_DIR <- system.file("extdata/", package = "readtext")
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"), 
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))

it should return:

readtext object consisting of 10 documents and 1 docvar.
# data.frame [10 × 3]
  doc_id              text                 sentiment
  <chr>               <chr>                <chr>    
1 neg_cv000_29416.txt "\"plot : two\"..."  neg      
2 neg_cv001_19502.txt "\"the happy \"..."  neg      
3 neg_cv002_17424.txt "\"it is movi\"..."  neg      
4 neg_cv003_12683.txt "\" \" quest f\"..." neg      
5 neg_cv004_12641.txt "\"synopsis :\"..."  neg      
6 pos_cv000_29590.txt "\"films adap\"..."  pos      
# ... with 4 more rows

where the neg and pos labels come not from the filenames but from the path at the match level, i.e. the part before the / in:

> list.files(path = paste0(DATA_DIR, "txt/movie_reviews/"), recursive = TRUE)
 [1] "neg/neg_cv000_29416.txt" "neg/neg_cv001_19502.txt" "neg/neg_cv002_17424.txt"
 [4] "neg/neg_cv003_12683.txt" "neg/neg_cv004_12641.txt" "pos/pos_cv000_29590.txt"
 [7] "pos/pos_cv001_18431.txt" "pos/pos_cv002_15918.txt" "pos/pos_cv003_11664.txt"
[10] "pos/pos_cv004_11636.txt"

When docvarsfrom = "filepaths", the filenames should not be parsed into docvars.
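
The expected behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not readtext's implementation: `docvars_from_relpaths()` is a hypothetical helper, and it assumes the paths are already relative to the matched directory (as in the `list.files()` output) and that every file sits inside at least one subfolder.

```r
# Hypothetical helper (not part of readtext): build docvars from the folder
# part of each relative path, ignoring the filename entirely.
docvars_from_relpaths <- function(relpaths, docvarnames = NULL) {
  # "neg/neg_cv000_29416.txt" -> "neg"; deeper paths give one part per folder
  parts <- strsplit(dirname(relpaths), "/", fixed = TRUE)
  n <- max(lengths(parts))
  # pad shallower paths with NA so every document has n folder levels
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))
  dv <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  # use supplied names first, then generic docvarN names for the remainder
  nms <- paste0("docvar", seq_len(n))
  k <- min(length(docvarnames), n)
  if (k > 0) nms[seq_len(k)] <- docvarnames[seq_len(k)]
  names(dv) <- nms
  dv
}

relpaths <- c("neg/neg_cv000_29416.txt",
              "neg/neg_cv001_19502.txt",
              "pos/pos_cv000_29590.txt")
dv <- docvars_from_relpaths(relpaths, docvarnames = "sentiment")
dv$sentiment
```

With a single folder level there is exactly one docvar, so supplying one name in docvarnames would not trigger the "generic names" warning shown in the error output.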

@kbenoit kbenoit added the bug label Oct 29, 2018
@kbenoit kbenoit assigned kbenoit and koheiw and unassigned kbenoit Oct 29, 2018
koheiw added a commit to quanteda/tutorials.quanteda.io that referenced this issue Nov 18, 2020

koheiw commented Nov 24, 2020

The root cause is that Sys.glob() returns only the matched file paths; it does not tell us which part of each path the "*" matched.

file <- Sys.glob(file)
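
One possible way around this, sketched here as an assumption rather than as the fix adopted in readtext, is to treat everything in the pattern before the first wildcard as a fixed prefix and strip it from each match. The directory name `movie_reviews_demo` and the file names are made up for the demonstration, and the sketch assumes "/" as the path separator.

```r
# Build a small throwaway directory tree mirroring the neg/pos layout.
base <- file.path(normalizePath(tempdir(), winslash = "/"), "movie_reviews_demo")
for (d in c("neg", "pos")) {
  dir.create(file.path(base, d), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(base, "neg", "neg_cv000.txt"))
file.create(file.path(base, "pos", "pos_cv000.txt"))

pattern <- file.path(base, "*", "*.txt")
matched <- Sys.glob(pattern)  # full paths only; no record of what "*" matched

# Assumed workaround: the fixed prefix is the pattern up to the last "/"
# that precedes the first wildcard character (*, ? or [).
first_wild <- regexpr("[*?\\[]", pattern, perl = TRUE)
prefix <- sub("[^/]*$", "", substr(pattern, 1, first_wild - 1))

# Whatever follows the prefix is the part contributed by the wildcards.
rel <- substring(matched, nchar(prefix) + 1)
folders <- dirname(rel)
folders
```

This only recovers the wildcard-matched portion as a whole; a pattern with several wildcards in different path components would still need further splitting on "/" to assign each folder level to its own docvar.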
