Explanation of a set of features used in NERSuite. #30

Open
priancho opened this issue Apr 26, 2016 · 0 comments
priancho commented Apr 26, 2016

Although the feature set used in NERSuite is not documented, it can be read from the following two source files:
nersuite/src/nersuite/FExtor.h
nersuite/src/nersuite/FExtor.cpp

With the default window size [-2, 2], NERSuite uses:

  1. word features:
    1-1) character n-grams (n=2-4) of the current word.
    1-2) raw word n-grams (n=1-2) within the window.
    1-3) number-normalized word n-grams (n=1-2) within the window. Each run of consecutive digits inside a string is normalized into a single 0 (e.g., NF1234 -> NF0).

  2. lemma features - same as 1-3), but using lemmas instead of words.

  3. orthographic features - boolean features such as:
    3-1) whether the current word: begins with a capital letter, contains digits, consists only of digits, contains alphanumeric characters, consists only of capital letters and digits, has no lowercase letters, consists only of capital letters, has a capital letter in a non-initial position, has two consecutive capital letters, contains a Greek word as a substring, or contains a period, hyphen, slash, opening square bracket, closing square bracket, opening round bracket, closing round bracket, colon, semicolon, percent sign, or apostrophe.
    3-2) the length of the current word (boolean feature).
    3-3) the length of the current word combined with whether the word is all capitals (boolean feature).

  4. POS features - POS n-grams (n=1-2) within the window.

  5. lemma+POS features - Lemma+POS n-grams (n=1-2) within the window.

  6. chunk features:
    6-1) the chunk type of the current word.
    6-2) the last raw word of the chunk that the current word belongs to.
    6-3) the last lemma of the chunk that the current word belongs to.
    6-4) whether the word "the" exists in the leftmost position of the current chunk (boolean feature).

  7. dictionary features:
    7-1) unlexicalized n-grams (n=1-2) of the dictionary matching results within the window.
    7-2) lexicalized n-grams (n=1-2) of the dictionary matching results within the window.
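
Groups 1, 2, 4 and 5 above all come down to the same two operations: normalizing a token and collecting n-grams inside the window. The sketch below illustrates them in Python. It is not taken from FExtor.cpp; the function names are hypothetical, and skipping n-grams that cross a sentence boundary (instead of padding) is an assumption.

```python
import re


def normalize_numbers(word):
    """Collapse each run of consecutive digits into a single 0 (NF1234 -> NF0)."""
    return re.sub(r"\d+", "0", word)


def char_ngrams(word, n_min=2, n_max=4):
    """All character n-grams of `word` for n in [n_min, n_max] (feature 1-1)."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def window_ngrams(items, pos, n_max=2, window=(-2, 2)):
    """n-grams (n = 1..n_max) whose offsets lie fully inside the window.

    `items` can hold raw words, normalized words, lemmas or POS tags.
    Returns (offsets, values) pairs. Assumption: n-grams that would reach
    outside the sentence are skipped rather than padded.
    """
    lo, hi = window
    feats = []
    for n in range(1, n_max + 1):
        for start in range(lo, hi - n + 2):
            offsets = tuple(range(start, start + n))
            idxs = [pos + o for o in offsets]
            if idxs[0] < 0 or idxs[-1] >= len(items):
                continue
            feats.append((offsets, tuple(items[i] for i in idxs)))
    return feats
```

For example, `window_ngrams([normalize_numbers(w) for w in words], pos)` would give the inputs for 1-3), and passing a list of lemmas or POS tags instead would give those for 2 and 4.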

NERSuite uses only positive features for dictionary features.
For a dictionary dic1 containing the entry "NF-kappa B" and the input text "As a result, we can identify NF-kappa B in ...", dictionary features are triggered for each token as follows (the feature notation here is not the same as the one in the source code):

As - (empty)
a - (empty)
result - (empty)
, - (empty)
we - (empty)
can - "Dic[2]=dic1", "Dic[2]=dic1_NF", "Dic[1,2]=O/dic1", "Dic[1,2]=O_identify/dic1_NF"
identify - "Dic[1]=dic1", "Dic[1]=dic1_NF", "Dic[0,1]=O/dic1", "Dic[0,1]=O_identify/dic1_NF", "Dic[1,2]=dic1/dic1", "Dic[1,2]=dic1_NF/dic1_-"
...
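
The listing above can be approximated with a short Python sketch. It is a reconstruction from the example only, not NERSuite's code: the function names, the exact-match lookup, and the rule that a bigram fires when at least one of its two positions has a dictionary match are all assumptions. Lexicalized features append each token's surface form to its label.

```python
def match_dictionary(tokens, entry_tokens, dic_name):
    """Label tokens covered by an exact match of `entry_tokens`; others get 'O'."""
    labels = ["O"] * len(tokens)
    n = len(entry_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entry_tokens:
            labels[i:i + n] = [dic_name] * n
    return labels


def dictionary_features(tokens, labels, pos, window=(-2, 2)):
    """Positive-only dictionary features for the token at `pos`."""
    lo, hi = window

    def get(offset):
        i = pos + offset
        return (labels[i], tokens[i]) if 0 <= i < len(tokens) else None

    feats = []
    # Unigrams: emitted only where the dictionary matched (assumption).
    for o in range(lo, hi + 1):
        cur = get(o)
        if cur and cur[0] != "O":
            feats.append(f"Dic[{o}]={cur[0]}")
            feats.append(f"Dic[{o}]={cur[0]}_{cur[1]}")
    # Bigrams: emitted when at least one side matched (assumption).
    for o in range(lo, hi):
        a, b = get(o), get(o + 1)
        if a and b and (a[0] != "O" or b[0] != "O"):
            feats.append(f"Dic[{o},{o + 1}]={a[0]}/{b[0]}")
            feats.append(f"Dic[{o},{o + 1}]={a[0]}_{a[1]}/{b[0]}_{b[1]}")
    return feats


tokens = ["As", "a", "result", ",", "we", "can", "identify",
          "NF", "-", "kappa", "B", "in"]
labels = match_dictionary(tokens, ["NF", "-", "kappa", "B"], "dic1")
for pos, token in enumerate(tokens[:7]):
    print(token, "-", dictionary_features(tokens, labels, pos) or "(empty)")
```

Under these assumptions the tokens "As" through "we" get no features and "can" gets the four features in the listing. For "identify" the sketch additionally emits the offset-2 unigram features, which the listing does not show; whether NERSuite emits them should be checked against FExtor.cpp.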

Some of these features (especially the orthographic ones) are redundant because of the default tokenization scheme.
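
One plausible reading of this redundancy, sketched in Python: in the dictionary example the hyphen of "NF-kappa B" is a token of its own, so a "contains hyphen" feature can only fire on the token "-" and then repeats what the raw word feature already says. The feature names, the regular expressions, and the tokenizer below are hypothetical stand-ins, not NERSuite's actual definitions.

```python
import re

# A handful of the boolean checks from 3-1), with made-up feature names.
ORTHO_PATTERNS = {
    "InitCap": r"^[A-Z]",
    "HasDigit": r"\d",
    "AllDigits": r"^\d+$",
    "AllCaps": r"^[A-Z]+$",
    "TwoCaps": r"[A-Z]{2}",
    "HasHyphen": r"-",
    "HasSlash": r"/",
}


def orthographic_features(word):
    """Names of all patterns that match `word`."""
    return [name for name, pat in ORTHO_PATTERNS.items() if re.search(pat, word)]


def split_on_punctuation(text):
    """Hypothetical tokenizer: each non-alphanumeric character is its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


for token in split_on_punctuation("NF-kappa B"):
    print(token, orthographic_features(token))
```

On the unsplit string "NF-kappa" the hyphen check would add information, but once the tokenizer has split it, no multi-character token can contain a hyphen.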

@priancho priancho self-assigned this Nov 20, 2018