Releases: quanteda/quanteda
CRAN v2.1.2
Changes
- `textstat_keyness()` performance is now improved through implementation in (multi-threaded) C++.
Bug fixes and stability enhancements
- Fixed breaking tests and examples on the Solaris platform, as well as other breakages introduced by changes to the stringi package.
CRAN v2.1.1
Bug fixes and stability enhancements
- `corpus_reshape()` now allows reshaping back to documents even when segmented texts were of zero length. (#1978)
- Special handling applied for Solaris to some issues breaking on that build, relating to the caching in `summary.corpus()` / `textstat_summary()`.
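A minimal sketch of the round-trip that this fix restores; the texts are illustrative, with `d2` deliberately segmenting to zero sentences:

```r
library(quanteda)

corp <- corpus(c(d1 = "First sentence. Second sentence.", d2 = ""))

# segment to sentences; d2 contributes no sentences
sent <- corpus_reshape(corp, to = "sentences")

# reshaping back to documents now succeeds despite the zero-length document
corpus_reshape(sent, to = "documents")
```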
CRAN v2.1.0
Changes
- Added `block_size` to `quanteda_options()` to control the number of documents in blocked tokenization.
- Fixed `print.dictionary2()` to control the printing of nested levels with `max_nkey`. (#1967)
- Added `textstat_summary()` to provide detailed information about dfm, tokens, and corpus objects. It will replace `summary()` in future versions.
- Fixed a performance issue causing slowdowns when tokenizing (using the default `what = "word"`) corpora with large numbers of documents containing social media tags and URLs that needed to be preserved (such as a large corpus of Tweets).
- Updated the (default) "word" tokenizer to better preserve hashtags and usernames in non-ASCII text, and made these patterns user-configurable in `quanteda_options()`. The following are now preserved: "#政治" as well as Weibo-style hashtags such as "#英国首相#".
- `convert(x, to = "data.frame")` now outputs the first column as "doc_id" rather than "document", since "document" is a commonly occurring term in many texts. (#1918)
- Added new methods `char_select()`, `char_keep()`, and `char_remove()` for easy manipulation of character vectors.
- Added `dictionary_edit()` for easy, interactive editing of dictionaries, plus the functions `char_edit()` and `list_edit()` for editing character vectors and lists of character objects.
- Added a method to `textplot_wordcloud()` that plots objects from `textstat_keyness()`, to visualize keywords either by comparison or for the target category only.
- Improved the performance of `kwic()`. (#1840)
- Added a new `logsmooth` scheme to `dfm_weight()`.
- Added a new `textstat_summary()` method, which returns summary information about the tokens/types/features etc. in an object. It also caches summary information so that it can be retrieved on subsequent calls rather than re-computed.
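As a rough sketch of the new character helpers and the renamed data.frame column (illustrative values; assumes quanteda >= 2.1):

```r
library(quanteda)

words <- c("one", "two", "three")
char_remove(words, "t*")   # drops the glob matches "two" and "three"
char_keep(words, "t*")     # keeps only the glob matches

# convert() now labels the first column "doc_id" rather than "document"
df <- convert(dfm(tokens(c(d1 = "a b c"))), to = "data.frame")
names(df)[1]
```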
Bug fixes and stability enhancements
- Stopped returning `NA` for non-existent features when `n > nfeat(x)` in `textstat_frequency(x, n)`. (#1929)
- Fixed a problem in `dfm_lookup()` and `tokens_lookup()` in which an error was caused when no dictionary key returned a single match. (#1946)
- Fixed a bug that caused a `textstat_simil`/`textstat_dist` object converted to a data.frame to drop its `document2` labels. (#1939)
- Fixed a bug causing `dfm_match()` to fail on a dfm that included "pads" (`""`). (#1960)
- Updated the `data_dfm_lbgexample` object using more modern dfm internals.
- Updated `textstat_readability()`, `textstat_lexdiv()`, and `nscrabble()` so that empty texts are not dropped in the result. (#1976)
CRAN v2.0.1
Changes
- Moved `data_corpus_irishbudget2010` and `data_corpus_dailnoconf1991` to the quanteda.textmodels package.
- Em dashes and double dashes between words, whether surrounded by a space or not, are now converted to " - " to distinguish them from infix hyphens. (#1889)
- Verbose output for dfm and tokens creation is now corrected and more consistent. (#1894)
Bug fixes and stability enhancements
- Number removal is now both improved and fixed (#1909).
- Fixed an issue causing CRAN errors on pre-4.0 versions of R, related to the new default of `stringsAsFactors = FALSE` for data.frame objects.
- An error in the print method for dfm objects is now fixed. (#1897)
- Fixed a bug in `tokens_replace()` when the pattern was not matched. (#1895)
- Fixed the names of dimensions not being exchanged when a dfm was transposed. (#1903)
CRAN v2.0.0
quanteda 2.0 introduces some major changes, detailed here.
What's new in v2.0
- **New corpus object structure.** The internals of the corpus object have been redesigned, and are now based around a character vector with meta- and system-data in attributes. These all work with the existing extractor and replacement functions; if you were using these before, you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.
- **New metadata handling.** Corpus-level metadata is now inserted in a user metadata list via `meta()` and `meta<-()`; `metacorpus()` is kept as a synonym for `meta()`, for backwards compatibility. Additional system-level corpus information is also recorded, automatically, when an object is created. Document-level metadata is deprecated, and all document-level information is now simply a "docvar". For backward compatibility, `metadoc()` is kept and will insert document variables (docvars) with the name prefixed by an underscore.
- **Corpus objects now store default summary statistics for efficiency.** When these are present, `summary.corpus()` retrieves them rather than computing them on the fly.
- **New index operators for core objects.** The main change here is to redefine the `$` operator for corpus, tokens, and dfm objects (all objects that retain docvars), so that this operator accesses single docvars by name. Some other index operators have been redefined as well, such as `[.corpus` returning a slice of a corpus, and `[[.corpus` returning the texts from a corpus. See the full details at https://github.com/quanteda/quanteda/wiki/indexing_core_objects.
- **`*_subset()` functions.** The `subset` argument now must be logical, and the `select` argument has been removed. (The latter is part of `base::subset()` but has never made sense, either in quanteda or base.)
- **Return format from `textstat_simil()` and `textstat_dist()`.** These now default to a sparse matrix from the Matrix package, but coercion methods are provided for `as.data.frame()`, to make these functions return a data.frame just like the other textstat functions. Additional coercion methods are provided for `as.dist()`, `as.simil()`, and `as.matrix()`.
- **settings functions (and related slots and object attributes) are gone.** These are replaced by a new `meta(x, type = "object")` that records object-specific metadata, including settings such as the `n` for tokens (to record the `ngrams`).
- **All included data objects are upgraded to the new formats.** This includes the three corpus objects, the single dfm data object, and the LSD 2015 dictionary object.
- **New print methods for core objects** (corpus, tokens, dfm, dictionary) now exist, each with new global options to control the number of documents shown, as well as the length of a text snippet (corpus), the tokens (tokens), dfm cells (dfm), or keys and values (dictionary). Similar to the extended printing options for dfm objects, printing of corpus objects now allows brief summaries of the texts to be printed, with the number of documents and the length of the previews controlled by new global options.
- **All textmodels and related functions have been moved to a new package, quanteda.textmodels.** This makes them easier to maintain and update, and keeps the size of the core package down.
- **quanteda v2 implements major changes to the `tokens()` constructor.** These are designed to simplify the code and its maintenance in quanteda, to allow users to work with other (external) tokenizers, and to improve consistency across the tokens processing options. Changes include:
  - A new method `tokens.list(x, ...)` constructs a `tokens` object from a named list of characters, allowing users to tokenize texts using some other function (or package), such as `tokenize_words()`, `tokenize_sentences()`, or `tokenize_tweets()` from the tokenizers package, or the list returned by `spacyr::spacy_tokenize()`. Users can thus apply their choice of tokenizer, as long as it returns a named list of characters. With `tokens.list()`, all tokens processing (`remove_*`) options can be applied, or the list can be converted directly to a `tokens` object without processing using `as.tokens.list()`.
  - All tokens options are now intervention options, to split or remove things that by default are not split or removed. All `remove_*` options to `tokens()` now remove elements from tokens objects by calling `tokens.tokens()` after constructing the object. "Pre-processing" is now actually post-processing, using `tokens_*()` methods internally after a conservative tokenization on token boundaries. This both improves performance and improves consistency in handling special characters (e.g. Twitter characters) across different tokenizer engines. (#1503, #1446, #1801) Note that `tokens.tokens()` will remove what is found, but cannot "undo" a removal; for instance, it cannot replace punctuation characters if these have already been removed.
  - The option `remove_hyphens` is removed and deprecated, replaced by `split_hyphens`, which is `FALSE` by default, so infix (internal) hyphens are now preserved rather than split. This behaviour is implemented in both the `what = "word"` and `what = "word2"` tokenizer options.
  - The option `remove_twitter` has been removed. The new `what = "word"` is a smarter tokenizer that preserves social media tags, URLs, and email addresses. "Tags" are defined as valid social media hashtags and usernames (using Twitter rules for validity); these keep their `#` and `@` characters even if `remove_punct = TRUE`.
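A minimal sketch of plugging an external tokenizer into the v2 constructor; the texts and names are illustrative, and this assumes the tokenizers package is installed:

```r
library(quanteda)
library(tokenizers)

txt <- c(doc1 = "Hello, world!", doc2 = "Second text here.")

# tokenize externally; tokens.list() requires a *named* list of characters
toks_list <- tokenize_words(txt)
names(toks_list) <- names(txt)

# apply quanteda's post-tokenization (remove_*) options via tokens.list()
toks <- tokens(toks_list, remove_punct = TRUE)

# or convert directly, with no processing applied
toks_raw <- as.tokens(toks_list)
```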
New features
- Changed the default value of the `size` argument in `dfm_sample()` to the number of features, not the number of documents. (#1643)
- Fixed a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang).
- Added `startpos` and `endpos` arguments to `tokens_select()`, for selecting on token positions relative to the start or end of the tokens in each document. (#1475)
- Added a `convert()` method for corpus objects, to convert them into data.frame or JSON formats.
- Added a `spacy_tokenize()` method for corpus objects, to provide direct access via the spacyr package.
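A sketch of position-based selection with the new arguments (illustrative tokens; negative positions count from the end of each document):

```r
library(quanteda)

toks <- tokens(c(d1 = "a b c d e"))

# keep all tokens except the first in each document
tokens_select(toks, "*", startpos = 2)

# keep all tokens except the last in each document
tokens_select(toks, "*", endpos = -2)
```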
Behaviour changes
- Added a `force = TRUE` option and error checking for the situations of applying `dfm_weight()` or `dfm_group()` to a dfm that has already been weighted. (#1545) The function `textstat_frequency()` now allows passing this argument to `dfm_group()` via `...`. (#1646)
- `textstat_frequency()` now has a new argument for resolving ties when ranking term frequencies, defaulting to the "min" method. (#1634)
- New docvars accessor and replacement functions are available for corpus, tokens, and dfm objects via `$`. (See "New index operators for core objects" above.)
- `textstat_entropy()` now produces a data.frame that is more consistent with other `textstat` methods. (#1690)
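A brief sketch of the `$` accessor and replacement behaviour; the docvar name here is hypothetical:

```r
library(quanteda)

corp <- corpus(c(d1 = "a b", d2 = "c d"))

corp$grp <- c("x", "y")  # replacement: creates (or overwrites) a docvar
corp$grp                 # accessor: returns that single docvar by name
```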
Bug fixes and stability enhancements
- docnames are now enforced to be character (formerly, they could be numeric for some objects).
- docnames are now enforced to be strictly unique for all object classes.
- Grouping operations in `tokens_group()` and `dfm_group()` are more robust to using multiple grouping variables, and preserve these correctly as docvars in the new dfm. (#1809)
- Some fixes to documented `...` objects in two functions that were previously causing CRAN check failures on the release of 1.5.2.
Other improvements
- All of the (three) included corpus objects have been cleaned up and augmented with improved meta-data and docvars. The inaugural speech corpus, for instance, now includes the President's political party affiliation.
CRAN v1.5.2
Last 1.x.x release before major changes in v2.
New features
- Added Yule's I to `textstat_lexdiv()`.
- Added forward compatibility for newer (v2) corpus class objects.
- Added a new function `featfreq()` to compute the overall feature frequencies from a dfm.
Bug fixes
CRAN v1.5.1
New features
- Added `omit_empty` as an argument to `convert()`, to allow the user to control whether empty documents are excluded from converted dfm objects for certain formats. (#1660)
Bug fixes and stability enhancements
- Fixed a bug affecting the new `textstat_dist()` and `textstat_simil()`. (#1730)
- Fixed a bug in how `textstat_dist()` and `textstat_simil()` classed symmetric matrices.
CRAN v1.5.0
New features
- Added `flatten` and `levels` arguments to `as.list.dictionary2()` to enable more flexible conversion of dictionary objects. (#1661)
- In `corpus_sample()`, the `size` argument now works with the `by` argument, to control the size of units sampled from each group.
- Improvements to `textstat_dist()` and `textstat_simil()`; see below.
- Long tokens are not discarded automatically in the call to `tokens()`. (#1713)
Behaviour changes
- `textstat_dist()` and `textstat_simil()` now return sparse symmetric matrix objects using classes from the Matrix package. This replaces the former structure based on the `dist` class. Computation of these classes is now also based on the fast implementation in the proxyC package. When computing similarities, the new `min_simil` argument allows a user to ignore values below a specified similarity threshold. A new coercion method `as.data.frame.textstat_simildist()` now exists for converting these returns into a data.frame of pairwise comparisons. Existing methods such as `as.matrix()`, `as.dist()`, and `as.list()` work as they did before.
- We have removed the "faith", "chi-squared", and "kullback" methods from `textstat_dist()` and `textstat_simil()` because these were either not symmetric or not invariant to document or feature ordering. Finally, the `selection` argument has been deprecated in favour of a new `y` argument.
- `textstat_readability()` now defaults to `measure = "Flesch"` if no measure is supplied. This makes it consistent with `textstat_lexdiv()`, which also takes a default measure ("TTR") if none is supplied. (#1715)
- The default values for `max_nchar` and `min_nchar` in `tokens_select()` are now `NULL`, meaning they are not applied if the user does not supply values. (#1713)
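A rough sketch of the new sparse return and its coercions (illustrative documents; `min_simil` drops pairs below the threshold):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a b c", d2 = "a b d", d3 = "x y z")))

sim <- textstat_simil(dfmat, method = "cosine", min_simil = 0.5)

as.data.frame(sim)  # pairwise document comparisons as a data.frame
as.matrix(sim)      # dense matrix, as before
```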
Bug fixes and stability enhancements
- `kwic.corpus()` and `kwic.tokens()` behaviour is now aligned, meaning that dictionaries are correctly faceted by key instead of by value. (#1684)
- Improved formatting of `tokens()` verbose output. (#1683)
- Subsetting and printing of subsetted kwic objects is more robust. (#1665)
- The "Bormuth" and "DRP" measures are now fixed for `textstat_readability()`. (#1701)
CRAN v1.4.3
Bug fixes and stability enhancements
- Changed the default value of the `size` argument in `dfm_sample()` to the number of features, not the number of documents. (#1643)
- Fixed a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang).
Behaviour changes
- Added a `force = TRUE` option and error checking for the situations of applying `dfm_weight()` or `dfm_group()` to a dfm that has already been weighted. (#1545) The function `textstat_frequency()` now allows passing this argument to `dfm_group()` via `...`. (#1646)
- `textstat_frequency()` now has a new argument for resolving ties when ranking term frequencies, defaulting to the "min" method. (#1634)