Explanation of a set of features used in NERSuite. #30

priancho · 2016-04-26T01:49:19Z

There is no documentation of the feature set used in NERSuite, but you can check the following two source files:
nersuite/src/nersuite/FExtor.h
nersuite/src/nersuite/FExtor.cpp

With the default window size [-2, 2], NERSuite uses:

1. word features:
1-1) character n-grams (n=2-4) of the current word.
1-2) raw word n-grams (n=1-2) within the window.
1-3) number-normalized word n-grams (n=1-2) within the window. Each run of consecutive digits in a string is normalized into a single 0 (e.g., NF1234 -> NF0).
2. lemma features - same as 1-3), but using the lemma instead of the word.
3. orthographic features - boolean features such as:
3-1) whether the current word has an initial capital letter, digits, only digits, alphanumeric characters, only capital letters and digits, no lowercase letters, only capital letters, capital letter(s) in a non-initial position, two consecutive capital letters, a Greek word as a substring, a period, a hyphen, a slash, an opening square bracket, a closing square bracket, an opening round bracket, a closing round bracket, a colon, a semicolon, a percent sign, or an apostrophe.
3-2) the length of the current word (boolean feature).
3-3) the length of the current word combined with whether the word is all-capitalized (boolean feature).
4. POS features - POS n-grams (n=1-2) within the window.
5. lemma+POS features - Lemma+POS n-grams (n=1-2) within the window.
6. chunk features:
6-1) the chunk type of the current word.
6-2) the last raw word of the chunk that the current word belongs to.
6-3) the last lemma of the chunk that the current word belongs to.
6-4) whether the word "the" exists in the leftmost position of the current chunk (boolean feature).
7. dictionary features:
7-1) unlexicalized n-grams of the dictionary matching result (n=1-2) within the window.
7-2) lexicalized n-grams of the dictionary matching result (n=1-2) within the window.
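
To make the word features (1-1 to 1-3) more concrete, here is a rough Python sketch. It is not taken from FExtor.cpp: the function names, the "W[...]" feature notation, and the choice to skip n-grams that would cross the sentence boundary are all my assumptions for illustration only.

```python
import re


def normalize_numbers(token):
    """Replace each run of consecutive digits with a single '0' (NF1234 -> NF0)."""
    return re.sub(r"[0-9]+", "0", token)


def char_ngrams(token, n_min=2, n_max=4):
    """Character n-grams (n=2-4 by default) of a single token."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def window_ngrams(tokens, i, n_max=2, window=(-2, 2)):
    """Token n-grams (n=1..n_max) around position i, restricted to the window.

    Assumption: n-grams that would reach outside the sentence are skipped.
    The real implementation may handle boundaries differently.
    """
    lo, hi = window
    feats = []
    for n in range(1, n_max + 1):
        for start in range(lo, hi - n + 2):
            offsets = list(range(start, start + n))
            positions = [i + o for o in offsets]
            if any(p < 0 or p >= len(tokens) for p in positions):
                continue
            name = "W[" + ",".join(str(o) for o in offsets) + "]"
            value = "/".join(tokens[p] for p in positions)
            feats.append(name + "=" + value)
    return feats


tokens = ["identify", "NF1234", "in"]
normalized = [normalize_numbers(t) for t in tokens]
print(char_ngrams(normalized[1]))
print(window_ngrams(normalized, 1))
```

Passing the raw tokens to window_ngrams gives the features in 1-2), and passing the normalized tokens gives the ones in 1-3).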

NERSuite uses only positive dictionary features: a feature is generated only where a dictionary match occurs within the window.
For a dictionary dic1 with an entry "NF-kappa B" and the input text "As a result, we can identify NF-kappa B in ...", dictionary features are triggered for each token as follows (the feature notation here is not the same as the one in the source code):

As - (empty)
a - (empty)
result - (empty)
, - (empty)
we - (empty)
can - "Dic[2]=dic1", "Dic[2]=dic1_NF", "Dic[1,2]=O/dic1", "Dic[1,2]=O_identify/dic1_NF"
identify - "Dic[1]=dic1", "Dic[1]=dic1_NF", "Dic[0,1]=O/dic1", "Dic[0,1]=O_identify/dic1_NF", "Dic[1,2]=dic1/dic1", "Dic[1,2]=dic1_NF/dic1_-"
...
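
The example above can be approximated with the following Python sketch. Again, this is not the actual code: the function names are hypothetical, the input is assumed to be already tokenized with the hyphen as its own token, and the sketch emits every unigram and bigram in the window that touches a dictionary match, so for some tokens it may print more features than are listed above.

```python
def label_tokens(tokens, entry_tokens, dic_name):
    """Mark every token covered by an exact match of the entry with dic_name."""
    labels = ["O"] * len(tokens)
    n = len(entry_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entry_tokens:
            for j in range(i, i + n):
                labels[j] = dic_name
    return labels


def dictionary_features(tokens, labels, i, window=(-2, 2)):
    """Unlexicalized and lexicalized dictionary n-grams (n=1-2) around token i.

    Only positive features are kept: an n-gram with no matched token is dropped.
    """
    lo, hi = window
    feats = []
    for n in (1, 2):
        for start in range(lo, hi - n + 2):
            offsets = list(range(start, start + n))
            positions = [i + o for o in offsets]
            if any(p < 0 or p >= len(tokens) for p in positions):
                continue
            if all(labels[p] == "O" for p in positions):
                continue  # no dictionary match in this n-gram
            name = "Dic[" + ",".join(str(o) for o in offsets) + "]"
            unlex = "/".join(labels[p] for p in positions)
            lex = "/".join(labels[p] + "_" + tokens[p] for p in positions)
            feats.append(name + "=" + unlex)
            feats.append(name + "=" + lex)
    return feats


# Assumed tokenization of the example sentence (hyphen split off).
tokens = ["As", "a", "result", ",", "we", "can", "identify",
          "NF", "-", "kappa", "B", "in"]
labels = label_tokens(tokens, ["NF", "-", "kappa", "B"], "dic1")
for i, tok in enumerate(tokens[:7]):
    print(tok, "-", dictionary_features(tokens, labels, i) or "(empty)")
```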

Some of these features (especially the orthographic features) are redundant under the default tokenization scheme.

priancho self-assigned this Nov 20, 2018
