Short documents and skip_grams assertion do not match #88

Open
awagner-mainz opened this issue Sep 10, 2019 · 2 comments
@awagner-mainz

As I read it, the TextReuseCorpus function has a safety check so that tokenizers are not run on documents that are too short, where "too short" means too small to generate at least two ngrams of the requested size. In addition, the tokenizers seem to have their own assertions that prevent them from running on overly short documents.

However, I have run into problems with skip-grams. First, the safety check in TextReuseCorpus lets documents pass that the assertion in tokenize_skip_ngrams then bails out on, because the latter assertion assumes a larger minimum document length. Second, I don't quite understand why the assertion would require this in the first place. If I understand correctly, the condition is n + n * k - k <= length(words), but why would I not be able to generate skip-grams from a document of the same minimum length the ngram tokenizer accepts (n < length(words))?

FWIW, I am trying to build large skip-grams, say with n = 15 and k = 3.

assert_that(n + n * k - k <= length(words))
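
To make the mismatch concrete, here is the arithmetic for the example values above (n = 15, k = 3):

n <- 15
k <- 3
n + n * k - k   # 57 words required by the skip-gram assertion (15 + 45 - 3)
n + 1           # 16 words would already satisfy the ngram check (n < length(words))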

Thanks for any pointers or insights.

@lmullen
Member

lmullen commented Sep 10, 2019

Have you tried using the skip-gram tokenizer in the tokenizers package? Those tokenizers will eventually replace the ones in this package. Note that their output format is somewhat different, so you will have to use them by passing the simplify = TRUE argument.

In general, this package is intended to let you drop in different tokenizers, so if the existing tokenizers do not meet your needs, you might consider writing a special-purpose one.
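
For example, a minimal sketch of how that could be wired up, assuming the current signatures of tokenizers::tokenize_skip_ngrams and the tokenizer argument of TextReuseCorpus (the directory path is just a placeholder):

library(textreuse)
library(tokenizers)

# Wrap the tokenizers function so it returns a plain character vector of
# tokens for a single document, which is what TextReuseCorpus expects;
# simplify = TRUE unwraps the list output for a single input string.
skip_tokenizer <- function(string, n = 15, k = 3) {
  tokenizers::tokenize_skip_ngrams(string, n = n, k = k, simplify = TRUE)
}

corpus <- TextReuseCorpus(dir = "path/to/texts",  # placeholder directory
                          tokenizer = skip_tokenizer)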

@awagner-mainz
Author

Ah, I wasn't aware of that and have not tried it. Will do and report back. Thank you!
