Full text model layout features: BLOCKSTART missing, if very first block token is a new line #712
Comments
Thanks a lot @de-code for raising this error! A PR would of course be very welcome :)
Hi @kermitt2, I am trying to add a unit test for the change, but I am having a bit of trouble with the following block:

    int lastPos = tokens.size();
    // if it's a last block from a document piece, it may end earlier
    if (blockIndex == dp2.getBlockPtr()) {
        lastPos = dp2.getTokenBlockPos() + 1;
        if (lastPos > tokens.size()) {
            LOGGER.error("DocumentPointer for block " + blockIndex + " points to " +
                dp2.getTokenBlockPos() + " token, but block token size is " +
                tokens.size());
            lastPos = tokens.size();
        }
    }

With just one block, it is causing
I think
Okay, thank you. I think I may be creating the data incorrectly. I created a draft PR with what I have so far: #714
Hi @lfoppiano, is there more detail here that you are looking for to get more context? (I am not quite sure what to add at the moment)
At least for some documents, the first token of a block seems to be a line feed.
In that case the line feed is filtered out.
But when it then gets to process the next "real" token, n will no longer be 0 but 1. Therefore it will not go into the main blockstart block.

Example document

475335v1 (DOI: 10.1101/475335)
PDF
bioRxiv XML

The text "We also looked at epidemic synchrony..." (line 216) doesn't get the BLOCKSTART feature (it will be BLOCKIN), even though it is in its own block (but with a line feed as the first token, as described above).

I could try to submit a fix PR for it.
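
The behaviour described above can be illustrated with a minimal sketch. This is not the project's actual code: the class `BlockStartSketch` and the methods `buggyLabels` and `fixedLabels` are hypothetical names, and the labelling is simplified to just BLOCKSTART and BLOCKIN. It assumes the block status is decided by comparing the token index n to 0, while line-feed tokens are skipped without resetting that index. The second method shows one possible fix, which tracks whether a real token has been emitted yet instead of relying on the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockStartSketch {

    // Hypothetical reproduction of the issue: the status is derived from the
    // token index n, but line feeds are skipped while n keeps advancing.
    public static List<String> buggyLabels(List<String> blockTokens) {
        List<String> labels = new ArrayList<>();
        for (int n = 0; n < blockTokens.size(); n++) {
            String token = blockTokens.get(n);
            if (token.equals("\n")) {
                continue; // line feed is filtered out, n is already "used up"
            }
            // If token 0 was a line feed, the first real token has n == 1.
            labels.add(n == 0 ? "BLOCKSTART" : "BLOCKIN");
        }
        return labels;
    }

    // One possible fix (an assumption, not the merged change): remember
    // whether a non-line-feed token has been seen in this block yet.
    public static List<String> fixedLabels(List<String> blockTokens) {
        List<String> labels = new ArrayList<>();
        boolean firstRealToken = true;
        for (String token : blockTokens) {
            if (token.equals("\n")) {
                continue; // still filtered, but no longer affects the status
            }
            labels.add(firstRealToken ? "BLOCKSTART" : "BLOCKIN");
            firstRealToken = false;
        }
        return labels;
    }

    public static void main(String[] args) {
        // A block whose very first token is a line feed.
        List<String> block = Arrays.asList("\n", "We", "also", "looked");
        System.out.println(buggyLabels(block)); // [BLOCKIN, BLOCKIN, BLOCKIN]
        System.out.println(fixedLabels(block)); // [BLOCKSTART, BLOCKIN, BLOCKIN]
    }
}
```

For a block that does not start with a line feed, both methods produce the same labels, so the difference only shows up in the case reported here.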
/cc @kermitt2