spacy-lookup: Named Entity Recognition based on dictionaries

spaCy v2.0 extension and pipeline component for adding Named Entities metadata to Doc objects. Detects Named Entities using dictionaries. The extension sets the custom Doc, Token and Span attributes ._.is_entity, ._.entity_type, ._.has_entities and ._.entities.

Named Entities are matched with the Python module flashtext, which looks each term up in the dictionaries you provide.
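To make the idea of dictionary-based matching concrete, here is a minimal sketch in plain Python. It is not how spacy-lookup or flashtext is implemented: it swaps in a regular expression for flashtext's matcher, and the function name `find_entities` is hypothetical. It only shows the general technique of scanning text for known terms and reporting their character offsets.

```python
import re


def find_entities(text, keywords):
    """Return (matched_text, start_char, end_char) for each dictionary hit.

    Illustrative stand-in for a keyword matcher, not the library's code.
    """
    if not keywords:
        return []
    # Try longer terms first so a multi-word term wins over a shorter one.
    ordered = sorted(keywords, key=len, reverse=True)
    # \b keeps matches on word boundaries; re.escape guards special characters.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,  # case-insensitive matching is a choice made for this sketch
    )
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


keywords = ["python", "product manager", "java platform"]
text = "I am a product manager for a java platform and python."
hits = find_entities(text, keywords)
print(hits)
```

Character offsets like these are what a pipeline component would then map back onto token spans.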

Installation

spacy-lookup requires spaCy v2.0.16 or higher.

pip install spacy-lookup

Usage

First, you need to download a language model.

python -m spacy download en

Import the component and initialise it with the terms to look for; these are loaded into flashtext to create the match patterns. Then add the component anywhere in your pipeline.

import spacy
from spacy_lookup import Entity

nlp = spacy.load('en')
entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
nlp.add_pipe(entity, last=True)

doc = nlp(u"I am a product manager for a java and python.")
# 'product manager' is matched and merged into a single token at index 3
assert doc._.has_entities
assert not doc[0]._.is_entity
assert doc[3]._.entity_desc == 'product manager'
assert doc[3]._.is_entity

print([(token.text, token._.canonical) for token in doc if token._.is_entity])

spacy-lookup only looks at the token text, so you can use it on a blank Language instance (it should work for all available languages) or in a pipeline with a loaded model. If you load a model whose pipeline includes a tagger, parser and entity recognizer, add the entity component with last=True so that the spans are merged at the end of the pipeline.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.

Name	Type	Description
`Token._.is_entity`	bool	Whether the token is an entity.
`Token._.entity_type`	unicode	A human-readable description of the entity.
`Doc._.has_entities`	bool	Whether the document contains any entities.
`Doc._.entities`	list	`(entity, index, description)` tuples of the document's entities.
`Span._.has_entities`	bool	Whether the span contains any entities.
`Span._.entities`	list	`(entity, index, description)` tuples of the span's entities.

Settings

On initialisation of Entity, you can define the following settings:

`nlp`	`Language`	The shared `nlp` object. Used to initialise the matcher with the shared `Vocab`, and create `Doc` match patterns.
`attrs`	tuple	Attributes to set on the ._ property. Defaults to `('has_entities', 'is_entity', 'entity_type', 'entity')`.
`keywords_list`	list	Optional list of terms to look for.
`keywords_dict`	dict	Optional dictionary of terms to look for.
`keywords_file`	string	Optional name of a file containing the terms to look for.

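# Called with keyword arguments only, as in the Usage example
entity = Entity(keywords_list=['python', 'java platform'], label='ACME')
nlp.add_pipe(entity)
doc = nlp(u"I am a product manager for a java platform and python.")
# 'product manager' is not in this keyword list, so check 'java platform' (index 7)
assert doc[7]._.is_entity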
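The settings above do not say what shape `keywords_dict` and `keywords_file` should take. Assuming both are handed to flashtext unchanged (an assumption, not something stated here), flashtext's own conventions would apply: a dictionary maps each canonical name to a list of variants, and a keyword file holds one term per line, optionally written as `variant=>canonical`. The sketch below converts one form into the other using only the standard library. The helper name `keyword_file_lines`, the sample terms and the file name are hypothetical.

```python
import os
import tempfile


def keyword_file_lines(keywords_dict):
    """Flatten {canonical: [variants]} into 'variant=>canonical' lines.

    Assumes flashtext's keyword-file convention; verify against the
    version you have installed.
    """
    lines = []
    for canonical, variants in keywords_dict.items():
        for variant in variants:
            lines.append(f"{variant}=>{canonical}")
    return lines


# Hypothetical sample data: canonical name -> surface forms to match
keywords_dict = {
    "java": ["java platform", "java_2e"],
    "product management": ["product manager", "PM"],
}
lines = keyword_file_lines(keywords_dict)

# Write the lines to a temporary file that could be passed as keywords_file
path = os.path.join(tempfile.mkdtemp(), "keywords.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")
```

If the assumption holds, `Entity(keywords_dict=keywords_dict)` and `Entity(keywords_file=path)` would describe the same set of terms.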
