Releases: quanteda/quanteda
CRAN v0.99.12
Changes since v0.99.9
New Features
- Added methods for changing the docnames of tokens and dfm objects (#987).
Bug fixes and stability enhancements
- The computation of tfidf has been more thoroughly described in the documentation for this function (#997).
- Now depends on R >= 3.4.0, to avoid showing errors in r-oldrelease builds.
CRAN v0.99.9
Changes since v0.99
New Features
- Added magrittr pipe support (#927). `%>%` can now be used with quanteda without needing to attach magrittr (or, as many users apparently believe, the entire tidyverse).
- `corpus_segment()` now behaves more logically and flexibly, and is clearly differentiated from `corpus_reshape()` in terms of its functionality. Its documentation is also vastly improved. (#908)
- Added `data_dictionary_LSD2015`, the Lexicoder Sentiment 2015 dictionary (#963).
- Significant improvements to the performance of `tokens_lookup()` and `dfm_lookup()` (#960).
- New functions `head.corpus()`, `tail.corpus()` provide fast subsetting of the first or last documents in a corpus. (#952)
Bug fixes and stability enhancements
- Fixed a problem when applying `purrr::map()` to `dfm()` (#928).
- Added documentation for `regex2fixed()` and associated functions.
- Fixed a bug in `textstat_collocations.tokens()` caused by "documents" containing only `""` as tokens. (#940)
- Fixed a bug caused by `cbind.dfm()` when features shared a name starting with `quanteda_options("base_featname")` (#946)
- Improved dictionary handling and creation, which now correctly handle nested LIWC 2015 categories. (#941)
- Number of threads now set correctly by `quanteda_options()`. (#966)
Behaviour changes
- `summary.corpus()` now generates a special data.frame, which has its own print method, rather than requiring `verbose = FALSE` to suppress output (#926).
- `textstat_collocations()` is now multi-threaded.
- `head.dfm()`, `tail.dfm()` now behave consistently with base R methods for matrix, with the added argument `nfeature`. Previously, these methods printed the subset and invisibly returned it. Now, they simply return the subset. (#952)
CRAN v0.99
New features
- Improvements and consolidation of methods for detecting multi-word expressions, now active only through `textstat_collocations()`, which computes only the `lambda` method for now, but does so accurately and efficiently. (#753, #803). This function is still under development and likely to change further.
- Added new `quanteda_options` that affect the maximum documents and features displayed by the dfm print method (#756).
- `ngram` formation is now significantly faster, including with skips (skipgrams).
- Improvements to `topfeatures()`:
- New wrapper `phrase()` converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in `tokens/dfm_select/remove`, `tokens_compound`, `tokens/dfm_lookup`, and `kwic`. `phrase()` and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838)
- `corpus.Corpus()` for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package (#849).
- New plot function `textplot_keyness()` plots term "keyness", the association of words with contrasting classes as measured by `textstat_keyness()`.
- Added corpus constructor for corpus objects (#690).
- Added dictionary constructor for dictionary objects (#690).
- Added a tokens constructor for tokens objects (#690), including updates to `tokens()` that improve the consistency and efficiency of the tokenization.
- Added new `quanteda_options()`: `language_stemmer` and `language_stopwords`, now used as the defaults in the `*_wordstem` functions and `stopwords()`, respectively. `dfm()` also uses this option when `stem = TRUE`, rather than hard-wiring in the "english" stemmer (#386).
- Added a new function `textstat_frequency()` to compile feature frequencies, possibly by groups. (#825)
- Added `nomatch` option to `tokens_lookup()` and `dfm_lookup()`, to provide tokens or feature counts for categories not matched to any dictionary key. (#496)
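The `phrase()` wrapper and the `nomatch` option listed above can be tried with a short sketch like the one below. The texts, the dictionary keys, and the `"UNMATCHED"` label are made up for illustration, and argument defaults may differ between quanteda versions.

```r
library(quanteda)

# Two toy documents (illustrative only)
toks <- tokens(c(d1 = "The United States signed a new trade agreement.",
                 d2 = "Taxes rose while spending fell."))

# Hypothetical two-key dictionary; "united states" is a multi-word value
dict <- dictionary(list(country = "united states",
                        economy = c("tax*", "spend*", "trade")))

# nomatch supplies a label for every token that matches no dictionary key
looked_up <- tokens_lookup(toks, dictionary = dict, nomatch = "UNMATCHED")

# phrase() marks a whitespace-separated pattern as a sequence of tokens,
# so tokens_compound() joins "trade agreement" into a single token
compounded <- tokens_compound(toks, pattern = phrase("trade agreement"))

print(looked_up)
print(compounded)
```

Without `phrase()`, a character pattern containing a space is treated as a single token to match, which is why the wrapper matters for multi-word expressions.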
Behaviour changes
- The functions `sequences()` and `collocations()` have been removed and replaced by `textstat_collocations()`.
- (Finally) we added "will" to the list of English stopwords (#818).
- `dfm` objects with one or both dimensions having zero length, and empty `kwic` objects now display more appropriately in their print methods (per #811).
- Pattern matches are now implemented more consistently across functions. In functions such as `*_select`, `*_remove`, `tokens_compound`, `features` has been replaced by `pattern`, and in `kwic`, `keywords` has been replaced by `pattern`. These all behave consistently with respect to `pattern`, which now has a unified single help page and parameter description. (#839) See also above new features related to `phrase()`.
- We have improved the performance of the C++ routines that handle many of the `tokens_*` functions using hashed tokens, making some of them 10x faster (#853).
- Upgrades to the `dfm_group()` function now allow "empty" documents to be created using the `fill = TRUE` option, for making documents conform to a selection (similar to how `dfm_select()` works for features, when supplied a dfm as the pattern argument). The `groups` argument now behaves consistently across the functions where it is used. (#854)
- `dictionary()` now requires its main argument to be a list, not a series of elements that can be used to build a list.
- Some changes to the behaviour of `tokens()` have improved the behaviour of `remove_hyphens = FALSE`, which now behaves more correctly regardless of the setting of `remove_punct` (#887).
- Improved `cbind.dfm()` function allows cbinding vectors, matrices, and (recyclable) scalars to dfm objects.
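As a rough illustration of the renamed `pattern` argument described above, the following sketch uses it in both a removal function and `kwic()`. The texts and the word list are invented for the example; it is not meant as a complete account of the pattern-matching rules.

```r
library(quanteda)

# Two toy documents (illustrative only)
toks <- tokens(c(d1 = "The tax cut will help the economy.",
                 d2 = "Taxes are the price of the economy we want."))

# `pattern` takes the place of the former `features` argument
toks_trimmed <- tokens_remove(toks, pattern = c("the", "of", "we", "are"))

# `pattern` takes the place of the former `keywords` argument in kwic();
# "tax*" is a glob pattern, matched case-insensitively by default
hits <- kwic(toks, pattern = "tax*")

print(toks_trimmed)
print(hits)
```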
Bug fixes and stability enhancements
- For the underlying methods behind `textstat_collocations()`, we corrected the word matching, and lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
- LIWC-formatted dictionary import is now robust to the assignment of terms to missing categories.
- `textmodel_NB(x, y, distribution = "Bernoulli")` was previously inactive even when this option was set. It has now been fully implemented and tested (#776, #780).
- Separators including rare spacing characters are now handled more robustly by the `remove_separators` argument in `tokens()`. See #796.
- Improved memory usage when computing `ntoken()` and `ntype()`. (#795)
- Improved `quanteda_options()`, which no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when the options are initialized on attaching the package.
- Fixed a bug in `textstat_readability()` that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the `FOG.NRI` and the `Linsear.Write` measures only.
- Fixed mistakes in the computation of two docfreq schemes: `"logave"` and `"inverseprob"`.
- Fixed a bug in the handling of multi-thread options where the settings using `quanteda_options()` did not actually set the number of threads. In addition, we fixed a bug that turned threading off on macOS (due to a check for a gcc version that is not used for compiling the macOS binaries), which prevented multi-threading from being used at all on that platform.
- Fixed a bug causing failure when functions that use `quanteda_options()` are called without the namespace or package being attached or loaded (#864).
- Fixed a bug in overloading the View method that caused all named objects in the RStudio/Source pane to be named "x". (#893)
CRAN v0.9.9.65
Changes since v0.9.9-50
New features
- Corpus construction using `corpus()` now works for a `tm::SimpleCorpus` object. (#680)
- Added `corpus_trim()` and `char_trim()` functions for selecting documents or subsets of documents based on sentence, paragraph, or document lengths.
- Conversion of a dfm to an stm object now passes docvars through in the `$meta` of the return object.
- New `dfm_group(x, groups = )` command, a convenience wrapper around `dfm.dfm(x, groups = )` (#725).
- Methods for extending quanteda functions to readtext objects updated to match CRAN release of readtext package.
- Corpus constructor methods for data.frame objects now conform to the "text interchange format" for corpus data.frames, automatically recognizing `doc_id` and `text` fields, which also provides interoperability with the readtext package. Corpus construction methods are now more explicitly tailored to input object classes.
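A minimal sketch of the data.frame constructor described in the last item above, assuming a data.frame whose columns follow the `doc_id` / `text` naming. The `party` column and all values are hypothetical.

```r
library(quanteda)

# A small data.frame in "text interchange format": doc_id, text, then other variables
df <- data.frame(doc_id = c("speech_a", "speech_b"),
                 text   = c("We will lower taxes.", "We will fund schools."),
                 party  = c("X", "Y"),          # hypothetical document-level variable
                 stringsAsFactors = FALSE)

# doc_id and text are recognized automatically; remaining columns become docvars
corp <- corpus(df)

print(docnames(corp))
print(docvars(corp))
```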
Bug fixes and stability enhancements
- `dfm_lookup()` behaves more robustly on different platforms, especially for keys whose values match no features (#704).
- `textstat_simil()` and `textstat_dist()` no longer take the `n` argument, as this was not sorting features in correct order.
- Fixed failure of `tokens(x, what = "character")` when `x` included Twitter characters `@` and `#` (#637).
- Fixed bug #707 where `ntype.dfm()` produced an incorrect result.
- Fixed bug #706 affecting `textstat_readability()` and `textstat_lexdiv()` for single-document returns when `drop = TRUE`.
- Improved the robustness of `corpus_reshape()`.
- `print`, `head`, and `tail` methods for `dfm` are more robust (#684).
- Fixed bug in `convert(x, to = "stm")` caused by zero-count documents and zero-count features in a dfm (#699, #700, #701). This also removes docvar rows from `$meta` when this is passed through the dfm, for zero-count documents.
- Corrected broken handling of nested Yoshikoder dictionaries in `dictionary()`. (#722)
- `dfm_compress` now preserves a dfm's docvars if collapsing only on the features margin, which means that `dfm_tolower()` and `dfm_toupper()` no longer remove the docvars.
- `fcm_compress()` now retains the fcm class, and generates an error when an asymmetric compression is attempted (#728).
- `textstat_collocations()` now returns the collocations as character, not as a factor (#736)
- Fixed a bug in `dfm_lookup(x, exclusive = FALSE)` wherein an empty dfm was returned when there was no match (#116).
- Argument passing through `dfm()` to `tokens()` is now robust, and preserves variables defined in the calling environment (#721).
- Fixed issues related to dictionaries failing when applying `str()`, `names()`, or other indexing operations, which started happening on Linux and Windows platforms following the CRAN move to 3.4.0. (#744)
- Dictionary import using the LIWC format is more robust to improperly formatted input files (#685).
- Weights applied using `dfm_weight()` now print friendlier error messages when the weight vector contains features not found in the dfm. See this Stack Overflow question for the use case that sparked this improvement.
CRAN v0.9.9.50
Update methods for new readtext format
CRAN v0.9.9-22
Minor fixes in C++ release to comply with CRAN checks on lesser-used platforms.
CRAN v0.9.9-24
New since v0.9.9-17
Fixes incompatibilities on older compiler platforms.
CRAN v0.9.9-17
Bug fixes and minor feature additions.
Changes since v0.9.9-3
Bug fixes
- Fixed a bug causing `dfm` and `tokens` to break on > 10,000 documents. (#438)
- Fixed a bug in `tokens(x, what = "character", removeSeparators = TRUE)` that returned an empty string.
- Fixed a bug in `corpus.VCorpus` if the VCorpus contains a single document. (#445)
- Fixed a bug in `dfm_compress` in which the function failed on documents that contained zero feature counts. (#467)
- Fixed a bug in `textmodel_NB` that caused the class priors `Pc` to be refactored alphabetically instead of in the order of assignment (#471), also affecting predicted classes (#476).
New features
- New textstat function `textstat_keyness()` discovers words that occur at differential rates between partitions of a dfm (using chi-squared, Fisher's exact test, and the G^2 likelihood ratio test to measure the strength of associations).
- Added 2017-Trump to the inaugural corpus datasets (`data_corpus_inaugural` and `data_char_inaugural`).
- Improved the `groups` argument in `texts()` (and in `dfm()` that uses this function), which will now coerce to a factor rather than requiring one.
- Added a dfm constructor from dfm objects, with the option of collapsing by groups.
- Added new arguments to `sequences()`: `ordered` and `max_length`, the latter to prevent memory leaks from extremely long sequences.
- `dictionary()` now accepts YAML as an input file format.
- `dfm_lookup` and `tokens_lookup` now accept a `levels` argument to determine which level of a hierarchical dictionary should be applied.
- Added `min_nchar` and `max_nchar` arguments to `dfm_select`.
- `dictionary()` can now be called on the argument of a `list()` without explicitly wrapping it in `list()`.
- `fcm` now works directly on a dfm object when `context = "documents"`.
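The `levels` argument mentioned in the list above can be illustrated with a small nested dictionary. This is only a sketch: the dictionary keys and the texts are hypothetical, and the calls are written against the current function signatures rather than those of this specific release.

```r
library(quanteda)

# Toy document-feature matrix (illustrative only)
dfmat <- dfm(tokens(c(d1 = "Unions defend welfare spending.",
                      d2 = "Markets reward lower taxes.")))

# Hypothetical two-level dictionary: one top-level key with two nested keys
dict <- dictionary(list(politics = list(left  = c("welfare", "union*"),
                                        right = c("tax*", "market*"))))

# levels = 1 applies only the top level of the hierarchy
by_top <- dfm_lookup(dfmat, dictionary = dict, levels = 1)

# levels = 2 applies only the second level
by_side <- dfm_lookup(dfmat, dictionary = dict, levels = 2)

print(featnames(by_top))
print(featnames(by_side))
```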
CRAN release v0.9.9-3
Major new update published on CRAN on 2016-01-10. This is a pre-v1.0 release that implements major API changes while still retaining nearly all of the old functions, but hidden and deprecated. See NEWS.md.
CRAN release v0.9.8.5
Added `CITATION` file
Bug Fixes
- (0.9.8.5) Fixed an incompatibility in sequences.cpp with Solaris x86 (#257)
- (0.9.8.4) Fix bug in verbose output of dfm that causes misreporting of number of features (#250)
- (0.9.8.4) Fix a bug in selectFeatures.dfm() that ignored case_insensitive = TRUE settings (#251); also correct the documentation for this function.