Short documents and skip_grams assertion do not match #88

Open
awagner-mainz opened this issue Sep 10, 2019 · 2 comments
@awagner-mainz

As I read it, the TextReuseCorpus function has a safety check so that tokenizers are not run on documents that are too short, where "too short" means too small to generate at least two ngrams of the requested size. In addition, the tokenizers seem to have their own assertions that prevent them from running on overly short documents.

However, I have run into problems with skip-grams. First, the safety check in TextReuseCorpus lets documents pass that the assertion in tokenize_skip_ngrams then bails out on, because the latter assertion assumes a larger minimum document length. Second, I don't quite understand why the assertion would require this in the first place. If I understand correctly, the condition is n + n * k - k <= length(words), but why would I not be able to generate skip-grams from a document of the same minimum length the ngram tokenizer accepts (n < length(words))?

FWIW, I am trying to build large skip-grams, say with n = 15 and k = 3.

assert_that(n + n * k - k <= length(words))
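
To make the mismatch concrete, here is the arithmetic for the example values above (n = 15, k = 3):

n <- 15
k <- 3
n + n * k - k   # 57 words required by the skip-gram assertion (15 + 45 - 3)
n + 1           # 16 words would already satisfy the ngram check (n < length(words))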

Thanks for any pointers or insights.

@lmullen
Member

lmullen commented Sep 10, 2019

Have you tried using the skip-gram tokenizer in the tokenizers package? Those tokenizers will eventually replace the ones in this package. Note that their output format is somewhat different, so you will have to use them by passing the simplify = TRUE argument.

In general, this package is intended to let you drop in different tokenizers, so if the existing tokenizers do not meet your needs, you might consider writing a special-purpose one.
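
For example, a minimal sketch of how that could be wired up, assuming the current signatures of tokenizers::tokenize_skip_ngrams and the tokenizer argument of TextReuseCorpus (the directory path is just a placeholder):

library(textreuse)
library(tokenizers)

# Wrap the tokenizers function so it returns a plain character vector of
# tokens for a single document, which is what TextReuseCorpus expects;
# simplify = TRUE unwraps the list output for a single input string.
skip_tokenizer <- function(string, n = 15, k = 3) {
  tokenizers::tokenize_skip_ngrams(string, n = n, k = k, simplify = TRUE)
}

corpus <- TextReuseCorpus(dir = "path/to/texts",  # placeholder directory
                          tokenizer = skip_tokenizer)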

@awagner-mainz
Author

Ah, I wasn't aware of that and have not tried it. Will do and report back. Thank you!
