Tagging #33

Open
irifki opened this issue Jul 18, 2017 · 3 comments

@irifki
irifki commented Jul 18, 2017

Hi,
I'm training my model to recognize a new entity type (isoforms), and my input set looks like this:

O 0 3 The The DT B-NP O
O 4 13 chemokine chemokine NN I-NP O
O 14 22 receptor receptor NN I-NP O
O 23 28 CXCR3 CXCR3 NN I-NP O
O 28 29 , , , O O
O 30 35 which which WDT B-NP O
O 36 39 has have VBZ B-VP O
O 40 45 three three CD B-NP O
O 46 51 known know VBN I-NP O
O 52 60 variants variant NNS I-NP O
O 61 62 ( ( ( O O
B-isoform 62 67 CXCR3 CXCR3 NN B-NP B-isoform
I-isoform 67 68 - - HYPH I-NP I-isoform
I-isoform 68 69 A A NN I-NP I-isoform
O 69 70 , , , O O
B-isoform 71 76 CXCR3 CXCR3 NN B-NP B-isoform
I-isoform 76 77 - - HYPH I-NP I-isoform
I-isoform 77 78 B B NN I-NP I-isoform
O 79 82 and and CC I-NP O
O 83 88 CXCR3 CXCR3 NN I-NP O
O 88 89 - - HYPH B-NP O
O 89 92 Alt Alt NN I-NP O
O 92 93 ) ) ) O O
O 93 94 , , , O O
O 95 98 has have VBZ B-VP O
O 99 103 been be VBN I-VP O
O 104 114 implicated implicate VBN I-VP O
O 115 117 in in IN B-PP O
O 118 121 the the DT B-NP O
O 122 133 recruitment recruitment NN I-NP O
O 134 136 of of IN B-PP O
O 137 141 mast mast NN B-NP O
O 142 147 cells cell NNS I-NP O
O 148 150 to to TO B-PP O
O 151 158 tissues tissue NNS B-NP O
O 159 161 in in IN B-PP O
O 162 166 many many JJ B-NP O
O 167 176 different different JJ I-NP O
O 177 184 chronic chronic JJ I-NP O
O 185 193 diseases disease NNS I-NP O
O 194 198 with with IN B-PP O
O 199 202 its its PRP$ B-NP O
O 203 211 agonists agonist NNS I-NP O
O 212 217 found find VBN B-VP O
O 218 220 in in IN B-PP O
O 221 229 elevated elevate VBN B-NP O
O 230 236 levels level NNS I-NP O
O 237 239 in in IN B-PP O
O 240 247 several several JJ B-NP O
O 248 257 pulmonary pulmonary JJ I-NP O
O 258 266 diseases disease NNS I-NP O
O 266 267 . . . O O

O 268 271 The The DT B-NP O
O 272 277 known know VBN I-NP O
O 278 283 CXCR3 CXCR3 NN I-NP O
O 284 292 agonists agonist NNS I-NP O
.........
........
B-isoform 377703 377707 PlGF PlGF NN B-NP B-isoform
I-isoform 377708 377709 - - HYPH B-NP I-isoform
I-isoform 377710 377713 224 224 CD I-NP I-isoform
O 377714 377715 . . . O O

B-isoform 377716 377719 IXi IXi NN I-NP B-isoform
O 377720 377721 . . . O O

B-isoform 377722 377728 TTLL1b TTLL1b NN B-NP B-isoform
O 377728 377729 . . . O O

It has 81712 lines in total. When I try to train the model, it gets stuck at "Start feature extraction".
Moreover, when I select a small part of the training corpus, train the model on it, and then try tagging an entry that already exists in the corpus, the result is negative.

Can you please point out if I'm doing something wrong? I've also encountered a segmentation fault twice.

Thank you!

@priancho
Member

Hi,

This issue has no description. Is it accidentally omitted?

Best wishes,
Han-cheol

@irifki
Author

irifki commented Jul 20, 2017

Hi, I've just updated my comment with the description.

@priancho
Member

Hi, irifki

Sorry for the late reply; I have been quite occupied with a current project.

Looking at your data, the whole text is recognized as one sentence.
The second and third columns are sentence offsets of each token.
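
One way to check this on a training file is to look at the begin offset of the first token in each blank-line-separated block. The sketch below is only a diagnostic, not part of NERsuite; the function name `sentence_start_offsets` and the assumption that the begin offset sits in the second column (as in the data posted above) are illustrative. If the offsets are meant to be relative to each sentence, every block should start at or near 0 rather than continuing from the previous block.

```python
from typing import Iterable, List


def sentence_start_offsets(lines: Iterable[str], begin_col: int = 1) -> List[int]:
    """Return the begin offset of the first token of each blank-line-separated block.

    begin_col is the zero-based column holding the begin offset (assumed to be
    the second column, as in the posted training data).
    """
    starts: List[int] = []
    at_block_start = True
    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line ends the current block.
            at_block_start = True
            continue
        if len(fields) <= begin_col or not fields[begin_col].isdigit():
            # Skip lines that are not token rows (e.g. elision markers).
            continue
        if at_block_start:
            starts.append(int(fields[begin_col]))
            at_block_start = False
    return starts


# Rows copied from the posted data: the second block starts at 268, not 0.
sample = [
    "O 0 3 The The DT B-NP O",
    "O 4 13 chemokine chemokine NN I-NP O",
    "",
    "O 268 271 The The DT B-NP O",
    "O 272 277 known know VBN I-NP O",
]
print(sentence_start_offsets(sample))
```

Offsets that keep growing across blocks, as in this sample, suggest the text was tokenized as a single unit before the blank lines were added.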

Unfortunately, the NERsuite toolkit doesn't provide a sentence splitting module.
If you have such a tool, can you please segment your source text into sentence-per-line text and then re-run NERsuite on it?
If you don't have a sentence splitting program, I think that CoreNLP's sentence splitting program can be a good starting point.

Best wishes,
Han-Cheol
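
For readers without a sentence splitter at hand, the sentence-per-line preprocessing step can be sketched roughly as follows. This is a naive regex-based stand-in, not CoreNLP and not anything shipped with NERsuite; the function name `split_sentences` and the sample text are made up for illustration. It will mis-split on abbreviations such as "e.g." or "Fig. 2", so a proper splitter is preferable for biomedical text.

```python
import re
from typing import List

# Naive boundary: '.', '!' or '?' followed by whitespace and then an
# uppercase letter or a digit. This is an assumption, not a robust rule.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str) -> List[str]:
    """Split raw text into sentences using the naive boundary rule above."""
    # Collapse newlines and repeated spaces so each sentence fits on one line.
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return _BOUNDARY.split(normalized)


# Made-up sample text, used only to show the output shape.
raw = "CXCR3 has three known variants. PlGF-224 is one isoform.\nTTLL1b is another."
for sentence in split_sentences(raw):
    print(sentence)  # one sentence per line, ready for tokenization
```

Writing the printed lines to a file gives the sentence-per-line input requested above, which can then be tokenized and tagged again.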
