Review verbose behaviours #2329
Comments
We tried to make verbose messages more consistent using Lines 56 to 81 in 84ecce8
However, it is not easy to provide detailed information on operations that will be performed in C++. For example, the message below reports removing more features than actually exist. Since 20215 is only the number of possible token sequences that the pattern could match (and remove), we know the actual number of tokens removed (4584) only after the operation.
require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090
toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#>
length(types(toks2))
#> [1] 9942
sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584
Further, the repeated use of
The best approach would be to simplify the message, including only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams etc) and, maybe, a main parameter (e.g.
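The gap between the reported figure and the actual removals can be reproduced on a toy example. The sketch below uses only base R and plain character vectors rather than quanteda tokens objects; `docs` and `count_removed()` are hypothetical names, and it assumes the pre-operation figure counts candidate type sequences for the pattern "a *" with case-insensitive matching.

```r
# Toy "documents": plain character vectors standing in for tokenised texts.
docs <- list(
  d1 = c("A", "cat", "sat", "on", "a", "mat"),
  d2 = c("the", "dog", "saw", "a", "cat")
)
types <- unique(unlist(docs))

# Before the operation: every (first, second) pair of types that "a *"
# could match, known without looking at token positions.
first <- types[tolower(types) == "a"]
candidates <- expand.grid(first = first, second = types,
                          stringsAsFactors = FALSE)
n_candidates <- nrow(candidates)

# After the operation: tokens actually covered by an "a *" match.
count_removed <- function(d) {
  n <- length(d)
  if (n < 2) return(0L)
  starts <- which(tolower(d[-n]) == "a")  # an "a" followed by any token
  length(unique(c(starts, starts + 1L)))  # tokens in at least one match
}
n_removed <- sum(vapply(docs, count_removed, integer(1)))

n_candidates  # 18 candidate sequences
n_removed     # 6 tokens actually removed
```

The candidate count depends only on the vocabulary, while the removal count depends on where the tokens occur, which is why only the second can be reported accurately after the operation has run.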
How about making
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}
message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}
message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}
before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)
before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)
dfmt <- dfm(toks)
before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)
Created on 2024-01-12 with reprex v2.0.2
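The before/after pattern in the reprex could also be folded into one helper so that each function does not repeat the bookkeeping. This is only a sketch under stated assumptions: `with_verbose()`, `stats_list()`, `format_change()` and `drop_words()` are hypothetical names, the objects are plain lists of character vectors rather than quanteda tokens, and it uses `message()` where the reprex uses `cat()`.

```r
# Stand-in statistics for a tokens-like object (a named list of character
# vectors here, not a real quanteda tokens object).
stats_list <- function(x) {
  list(ndoc = length(x), ntoken = sum(lengths(x)))
}

# Build the before/after summary, formatting each count separately.
format_change <- function(operation, pre, post) {
  fmt <- function(n) format(n, big.mark = ",", trim = TRUE)
  sprintf("%s: from %s tokens (%s documents) to %s tokens (%s documents)",
          operation, fmt(pre$ntoken), fmt(pre$ndoc),
          fmt(post$ntoken), fmt(post$ndoc))
}

# Hypothetical wrapper: apply `fun` to `x` and report the change on request.
with_verbose <- function(operation, fun, x, ..., verbose = FALSE) {
  pre <- stats_list(x)
  result <- fun(x, ...)
  if (verbose) message(format_change(operation, pre, stats_list(result)))
  result
}

# Toy operation that drops the given words from every document.
drop_words <- function(x, words) lapply(x, function(d) d[!d %in% words])

toks_toy <- list(d1 = c("a", "b", "the", "c"),
                 d2 = c("the", "the", "d"),
                 d3 = "the")
out <- with_verbose("drop_words()", drop_words, toks_toy,
                    words = "the", verbose = TRUE)
```

Computing the statistics only when `verbose = TRUE` would avoid the extra pass over the object in the silent case; the sketch computes `pre` unconditionally to keep the control flow short.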
Makes sense to me!
Some functions don't even have the verbose argument, e.g. tokens_split(). We should review the functions to see which need it, and review the behaviours when verbose = TRUE to make sure they are consistent.
Not sure anyone uses it much, but for the sake of consistency it's still worth reviewing.
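One way to start such a review is to list the exported functions and check their formal arguments. The helper below is a rough sketch, not part of quanteda: `audit_verbose()` is a hypothetical name, and it is demonstrated on the utils package so that it runs without quanteda installed.

```r
# List functions exported by a package whose names match `pattern`, and flag
# whether each one has a `verbose` formal argument.
audit_verbose <- function(pkg, pattern = "") {
  ns <- asNamespace(pkg)
  nms <- sort(grep(pattern, getNamespaceExports(ns), value = TRUE))
  objs <- lapply(nms, function(nm) getExportedValue(ns, nm))
  is_fun <- vapply(objs, is.function, logical(1))
  data.frame(
    fun = nms[is_fun],
    has_verbose = vapply(objs[is_fun],
                         function(f) "verbose" %in% names(formals(f)),
                         logical(1)),
    stringsAsFactors = FALSE
  )
}

# Demonstration on utils, which ships with R.
res <- audit_verbose("utils", "^(untar|head)$")
print(res)

# With quanteda installed, the same call could be pointed at its tokens
# functions (assumption: they share the "tokens_" prefix):
# audit_verbose("quanteda", "^tokens_")
```

This only shows which functions accept the argument; whether the messages printed with verbose = TRUE are consistent still has to be checked by running them.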