How to extract entities (Named Entity Recognition) in spaCy

What are named entities?

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Source

The default model in spaCy can already extract well-known entities from text.

Create a new file called ner_test.py and add the following code to it.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

This code prints four values for each entity: the text of the entity, its start and end character positions in the document, and the entity label (i.e. the type of entity).
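To make the offsets concrete, the sketch below checks that start_char and end_char are character positions in the original text, so slicing doc.text with them returns ent.text. To keep the result deterministic, it uses a blank English pipeline and sets two entities by hand instead of relying on model predictions. The helper name describe_entities is made up for this illustration; spacy.explain returns a short description of a label.

```python
import spacy


def describe_entities(doc):
    """Return (text, start_char, end_char, label, description) per entity."""
    rows = []
    for ent in doc.ents:
        # start_char and end_char index into doc.text, not into the token list.
        assert doc.text[ent.start_char:ent.end_char] == ent.text
        rows.append((ent.text, ent.start_char, ent.end_char,
                     ent.label_, spacy.explain(ent.label_)))
    return rows


# Tokenizer-only pipeline: no trained model, so nothing is predicted here.
nlp = spacy.blank("en")
doc = nlp("Google acquired API.AI in 2016.")

# Hand-labelled spans (an assumption for this example, not model output).
doc.ents = [doc.char_span(0, 6, label="ORG"),
            doc.char_span(26, 30, label="DATE")]

rows = describe_entities(doc)
for row in rows:
    print(row)
```

The same helper should work unchanged on a doc produced by en_core_web_sm; in that case the labels come from the model rather than from hand-set spans.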

Now run the code (for example, with python ner_test.py).

A few things stand out in the output.

First, Dialogflow itself is an entity (a software product) but isn’t identified.

Second, in the first sentence api.ai refers to a software product, not to a company, yet it is labeled ORG (the ORG label means “organization”). This first occurrence of api.ai is therefore mislabeled.

Third, the first word of the second sentence, Google, is not identified as an entity at all.

Why? Capitalization of the first letter is usually a hint to the statistical model that a word could be an entity. That hint is unavailable for the first word of a sentence, which is always capitalized for grammatical reasons. Normally spaCy can still extract the entity, but the word google is also used as a verb, which creates additional ambiguity for the model.
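One way to check this explanation is to look at what the pipeline assigns to each token. The sketch below lists the part-of-speech tag and entity type of every token in the second sentence. With a trained model you can then see whether the sentence-initial Google was tagged as a proper noun or as a verb. The helper name token_report is hypothetical, and the values depend on the model version. The fallback to a blank pipeline (where both attributes are empty) is there only so the snippet runs without the model installed.

```python
import spacy


def token_report(doc):
    """Return (text, part-of-speech, entity type) for every token."""
    return [(t.text, t.pos_, t.ent_type_) for t in doc]


try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: tokenizer only, so pos_ and ent_type_ stay empty.
    nlp = spacy.blank("en")

doc = nlp("Google acquired API.AI in 2016.")
report = token_report(doc)

for text, pos, ent_type in report:
    # An empty ent_type means the token is not part of any entity.
    print(f"{text:10} {pos or '-':6} {ent_type or '-'}")
```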

As you can see, the out-of-the-box entity extraction in spaCy is decent, but there is a lot of scope for improvement.
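One possible improvement for the first problem, offered here as a sketch and not as the approach the rest of this article takes, is a rule-based entity_ruler component that labels known names directly. The example assumes spaCy v3 (where components are added by name) and uses a blank pipeline so the output depends only on the rule. The PRODUCT label and the single pattern are choices made for this illustration.

```python
import spacy

# Blank pipeline: only the tokenizer plus the rule-based component below.
nlp = spacy.blank("en")

# spaCy v3 style: add the built-in entity ruler by its registered name.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "Dialogflow"}])

doc = nlp("Dialogflow, previously known as api.ai, is a chatbot framework provided by Google.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)  # [('Dialogflow', 'PRODUCT')]
```

With a trained model, the ruler can be added before the statistical component, for example nlp.add_pipe("entity_ruler", before="ner"), so that the model predicts the remaining entities around the spans the rules have already set.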

