How to extract entities (Named Entity Recognition) in spaCy
What are named entities?
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
The default model in spaCy can already extract well-known entities from text.
Create a new file called ner_test.py and add the following code to it.
Note: you need to download the en_core_web_sm model first (python -m spacy download en_core_web_sm) to be able to run the script below.
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.")
# Print each entity's text, character offsets, and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
This code prints four values per entity: the text of the entity, its starting character offset, its ending character offset, and the entity label (i.e. the type of entity).
Now run the code.

You can notice a few things by looking at the output.
First, Dialogflow itself is an entity (a software product) but isn’t identified.
Second, in the first sentence, api.ai is labeled as an ORG (the ORG label means “organization”), although here it refers to a software product and not to a company.
Also, notice that the first word of the second sentence, Google, is not identified as an entity at all.
Why? The capitalization of the first letter is usually a hint to the statistical model that a word could be an entity. However, this rule of thumb cannot be used for the first word of a sentence, which is always capitalized for the sake of grammar. Normally spaCy can still extract the entity, but the word google is also used as a verb, which creates additional ambiguity for the model.
As you can see, the out-of-the-box entity extraction in spaCy is decent, but there is a lot of scope for improvement.
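If a label such as ORG is unfamiliar, spaCy's built-in glossary can be queried with spacy.explain. The snippet below is a small sketch. The labels other than ORG are picked here only as examples of labels the English models commonly emit; they are not taken from the output above.

```python
import spacy

# spacy.explain looks up a short description in spaCy's glossary.
# It does not need a trained model to be loaded.
# ORG comes from the article; the other labels are illustrative assumptions.
labels = ("ORG", "GPE", "PRODUCT", "DATE")
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```

This makes it easier to judge whether a label is wrong for your use case, as with api.ai being tagged ORG.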
Visualizing entity extraction
You can use the displacy visualization suite to see the labels for all the entities that were extracted by your model.
Better machine learning models can extract more entities
If you use the larger (and better) en_core_web_lg machine learning model, you can usually extract more entities.
Notice that the smaller model missed API.AI but the larger model was able to identify it as an organization.
(The difference would be much more obvious if you used domain-specific models.)
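One way to check this yourself is to run both models on the same text and compare what each returns. The sketch below assumes both models may or may not be installed, and skips any that are missing. The helper name extract_entities is made up for this example, and results can differ between model versions.

```python
import spacy

TEXT = (
    "Dialogflow, previously known as api.ai, is a chatbot framework "
    "provided by Google. Google acquired API.AI in 2016."
)

def extract_entities(model_name, text):
    """Return a set of (start_char, text, label) tuples,
    or None if the model is not installed (hypothetical helper)."""
    try:
        nlp = spacy.load(model_name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found
        return None
    return {(ent.start_char, ent.text, ent.label_) for ent in nlp(text).ents}

model_names = ("en_core_web_sm", "en_core_web_lg")
results = {name: extract_entities(name, TEXT) for name in model_names}

for name, ents in results.items():
    if ents is None:
        print(f"{name}: not installed, try: python -m spacy download {name}")
    else:
        print(name, sorted(ents))

small, large = results["en_core_web_sm"], results["en_core_web_lg"]
if small is not None and large is not None:
    # Entities (or labels) that only the larger model produced
    print("Only in en_core_web_lg:", sorted(large - small))
```

Including the character offset in each tuple keeps the two mentions of Google apart, so a missed sentence-initial Google shows up in the difference.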