Although there is no documentation of the feature set used in NERSuite, you can check the following two source files for this purpose:
nersuite/src/nersuite/FExtor.h
nersuite/src/nersuite/FExtor.cpp
With the default window size [-2, 2], NERSuite uses:
1. word features:
1-1) character n-grams (n=2-4) of the current word.
1-2) raw word n-grams (n=1-2) within the window.
1-3) number-normalized word n-grams (n=1-2) within the window. Each run of consecutive digits within a string is normalized to a single 0 (e.g., NF1234 -> NF0).
2. lemma features - same as 1-3), but using the lemma instead of the word.
3. orthographic features - boolean features such as:
3-1) whether the current word has: a beginning capital letter, digits, only digits, alpha-numeric characters, only capital letters and digits, no lowercase letters, all capital letters, capital letter(s) other than the first letter, two consecutive capital letters, a Greek word as a sub-string, period, hyphen, slash, opening square bracket, closing square bracket, opening round bracket, closing round bracket, colon, semi-colon, percentage symbol, apostrophe.
3-2) the length of the current word (boolean feature).
3-3) the length of the current word combined with whether the word is all capitalized (boolean feature).
4. POS features - POS n-grams (n=1-2) within the window.
5. lemma+POS features - Lemma+POS n-grams (n=1-2) within the window.
6. chunk features:
6-1) chunk type of the current word.
6-2) the last raw word of the chunk that the current word belongs to.
6-3) the last lemma of the chunk that the current word belongs to.
6-4) whether the word "the" exists in the leftmost position of the current chunk (boolean feature).
7. dictionary features:
7-1) unlexicalized n-gram of a dictionary matching result (n=1-2) within the window.
7-2) lexicalized n-gram of a dictionary matching result (n=1-2) within the window.
NERSuite uses only positive features for the dictionary features.
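To make the word features 1-1) to 1-3) in the list above more concrete, here is a minimal Python sketch. It is not NERSuite's code (the real implementation is the C++ in FExtor.cpp). The function names, the `<pad>` token used outside sentence boundaries, and the lack of word-boundary markers in the character n-grams are all assumptions made for illustration.

```python
import re


def normalize_numbers(word):
    """Collapse every run of consecutive digits into a single '0' (cf. 1-3)."""
    return re.sub(r"[0-9]+", "0", word)


def char_ngrams(word, n_min=2, n_max=4):
    """Character n-grams of one word (cf. 1-1); no boundary markers (assumption)."""
    return [
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    ]


def window_ngrams(tokens, i, lo=-2, hi=2, n_max=2, pad="<pad>"):
    """All n-grams (n=1..n_max) that lie inside the window [lo, hi] around token i.

    Returns (offsets, items) pairs. Positions outside the sentence are filled
    with a hypothetical pad token; NERSuite may handle them differently.
    """
    def at(k):
        return tokens[k] if 0 <= k < len(tokens) else pad

    feats = []
    for n in range(1, n_max + 1):
        for start in range(lo, hi - n + 2):
            offsets = tuple(range(start, start + n))
            feats.append((offsets, tuple(at(i + o) for o in offsets)))
    return feats


tokens = ["we", "can", "identify", "NF1234", "in"]
normalized = [normalize_numbers(t) for t in tokens]
raw_feats = window_ngrams(tokens, 2)       # cf. 1-2)
norm_feats = window_ngrams(normalized, 2)  # cf. 1-3)
```

With the default window [-2, 2], each token gets 5 unigrams and 4 bigrams per feature type in this sketch. Applying the same `window_ngrams` to lemma, POS, or lemma+POS sequences would correspond to items 2, 4, and 5.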
For a dictionary dic1 having an entry "NF-kappa B" and input text "As a result, we can identify NF-kappa B in ...", dictionary features are triggered for each token as follows (the feature notation here differs from the one in the source code):
As - (empty)
a - (empty)
result - (empty)
, - (empty)
we - (empty)
can - "Dic[2]=dic1", "Dic[2]=dic1_NF", "Dic[1,2]=O/dic1", "Dic[1,2]=O_is/dic1_NF"
identify - "Dic[1]=dic1", "Dic[1]=dic1_NF", "Dic[0,1]=O/dic1", "Dic[0,1]=O_is/dic1_NF", "Dic[1,2]=dic1/dic1", "Dic[1,2]=dic1_NF/dic1_-"
...
Some of these features (especially the orthographic features) are redundant because of the default tokenization scheme.
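As an illustration of the boolean orthographic features in 3-1), the following Python sketch implements a handful of the listed checks. The feature names and the exact definitions are hypothetical. The authoritative definitions are in FExtor.cpp and may differ, for example in how non-ASCII characters are treated.

```python
import re

# A small, hypothetical subset of the checks listed under 3-1).
ORTHO_CHECKS = {
    "begins_with_capital": lambda w: w[:1].isupper(),
    "has_digit": lambda w: any(c.isdigit() for c in w),
    "only_digits": lambda w: w.isdigit(),
    "no_lowercase": lambda w: not any(c.islower() for c in w),
    "all_capitals": lambda w: w.isalpha() and w.isupper(),
    "inner_capital": lambda w: any(c.isupper() for c in w[1:]),
    "two_consecutive_capitals": lambda w: re.search(r"[A-Z]{2}", w) is not None,
    "has_hyphen": lambda w: "-" in w,
}


def orthographic_features(word):
    """Return the names of all checks that fire for this word."""
    return {name for name, check in ORTHO_CHECKS.items() if check(word)}


for token in ["NF", "1234", "-", "kappa"]:
    print(token, sorted(orthographic_features(token)))
```

If the tokenizer emits punctuation such as a hyphen as a token of its own, a check like `has_hyphen` in this sketch fires only on that punctuation token, which is one way such features can end up carrying little extra information.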