What are spaCy models?

When you use spaCy, you will see the word “model” come up quite often in its documentation.

So what are models?

A “model” in machine learning is the output of a machine learning algorithm run on data.

Source

It is easiest to explain using an example.

Create a file called spacy_model.py and add the following code:

import spacy

nlp = spacy.blank("en")
text = 'Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.'
doc = nlp(text)

print('Printing entities using blank model....')

for ent in doc.ents:
    print(ent)

print('Completed....')

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print("Printing entities using default model....")

for ent in doc.ents:
    print(ent)

print('Completed....')

This is the output when you run this program

Here is an explanation of what the program does:

First, we load a blank model (line 3).

Then we print all the entities in the document. As you can see, spaCy does not recognize any entities at all, because a blank pipeline contains only a tokenizer and no trained entity recognizer.

Then we load the pretrained en_core_web_sm model (line 14). With the pretrained model, spaCy is able to identify some entities in the same text.
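To see why the two pipelines behave differently, it can help to inspect their components and to print each entity's label next to its text. The following is a minimal sketch, assuming en_core_web_sm has been installed with python -m spacy download en_core_web_sm. The helper name describe_entities is hypothetical and not part of spaCy.

```python
import spacy


def describe_entities(nlp, text):
    """Return (entity text, entity label) pairs found by the given pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


text = "Google acquired API.AI in 2016."

# A blank pipeline only tokenizes; it has no trained components.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)                # no components listed
print(describe_entities(blank_nlp, text))  # no entities found

try:
    trained_nlp = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model package is not installed.
    trained_nlp = None
    print("en_core_web_sm is not installed")

if trained_nlp is not None:
    print(trained_nlp.pipe_names)          # includes a trained "ner" component
    for ent_text, label in describe_entities(trained_nlp, text):
        print(ent_text, label)
```

Printing ent.label_ alongside the text shows not only what was recognized but also what kind of entity spaCy thinks it is.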

Larger models can do better named entity recognition

As a general rule of thumb, the larger the file size of the spaCy model, the more entities it can usually identify in your text. This is a tendency rather than a guarantee.

For example, here is a comparison between en_core_web_sm, which we used above, and en_core_web_md, a larger model.

Because en_core_web_md is larger, it will usually take a little longer to load. On the other hand, we usually expect it to find more entities.
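The load-time difference can be measured directly. This is a rough sketch that assumes both models are installed; timed_load is a hypothetical helper, and the timings will vary from machine to machine. It also prints the shape of each model's word-vector table, since the medium model ships with word vectors while the small one does not, which is one reason it is larger.

```python
import time

import spacy


def timed_load(name):
    """Load a spaCy model by name and return (pipeline, seconds taken)."""
    start = time.perf_counter()
    nlp = spacy.load(name)
    elapsed = time.perf_counter() - start
    return nlp, elapsed


for name in ("en_core_web_sm", "en_core_web_md"):
    try:
        nlp, seconds = timed_load(name)
    except OSError:
        # Raised by spacy.load when the model package is not installed.
        print(f"{name} is not installed")
        continue
    # vectors.shape is (number of vectors, vector width)
    print(f"{name}: loaded in {seconds:.2f}s, vectors {nlp.vocab.vectors.shape}")
```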

Create a new file called large_model.py and add the following code to it:

import spacy
text = 'Google, headquartered in Mountain View (1600 Amphitheatre Pkwy, Mountain View, CA 940430), unveiled the new Android phone for $799 at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones.'

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print("Printing entities using small model....")
# Number the entities starting from 1
for counter, ent in enumerate(doc.ents, start=1):
    print(f'{counter} {ent}')

print('Completed....')

nlp = spacy.load("en_core_web_md")
doc = nlp(text)

print("Printing entities using medium model....")
# Number the entities starting from 1
for counter, ent in enumerate(doc.ents, start=1):
    print(f'{counter} {ent}')

print('Completed....')

Here is the output when you run this program

As you can see, en_core_web_md does better than en_core_web_sm, identifying 10 entities in the same text versus 8.

This may not seem like a big difference, but with a large quantity of text the difference is usually much more obvious.
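To see which entities account for the difference, rather than only how many there are, the two result sets can be compared directly. Here is a minimal sketch, assuming both models are installed. The names entity_set and only_in_second are hypothetical helpers, and the two short sample texts are placeholders for your own corpus.

```python
import spacy


def entity_set(nlp, texts):
    """Collect the distinct (text, label) entity pairs a pipeline finds in texts."""
    found = set()
    # nlp.pipe processes the texts as a stream instead of one call per text.
    for doc in nlp.pipe(texts):
        found.update((ent.text, ent.label_) for ent in doc.ents)
    return found


def only_in_second(first, second):
    """Return the entity pairs present in second but missing from first, sorted."""
    return sorted(second - first)


texts = [
    "Google unveiled the new Android phone for $799.",
    "Sundar Pichai said users love their new Android phones.",
]

try:
    small = entity_set(spacy.load("en_core_web_sm"), texts)
    medium = entity_set(spacy.load("en_core_web_md"), texts)
except OSError:
    # Raised by spacy.load when a model package is not installed.
    print("Install both models first, e.g. python -m spacy download en_core_web_md")
else:
    print(f"small: {len(small)} distinct entities, medium: {len(medium)}")
    print("Found only by the medium model:", only_in_second(small, medium))
    print("Found only by the small model:", only_in_second(medium, small))
```

Checking both directions is deliberate: a larger model usually finds more, but it can also miss or relabel something the smaller model caught.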