What are spaCy Machine Learning models?

When you use spaCy, you will come across the word “model” quite often.

So what are models?

A “model” in machine learning is the output of a machine learning algorithm run on data.

Source
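To make that definition concrete, here is a deliberately tiny sketch. It is not how spaCy trains its pipelines: the "algorithm" simply counts which label each word received in a few hand-labelled examples, and the "model" is the lookup table that comes out. The training data and the function names `train` and `predict` are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled training data: (word, label) pairs
training_data = [
    ("cough", "DISEASE"), ("headache", "DISEASE"), ("cough", "DISEASE"),
    ("Hong", "GPE"), ("Kong", "GPE"), ("fatigue", "DISEASE"),
]

def train(examples):
    """The 'algorithm': count the labels seen for each word, keep the most frequent."""
    counts = defaultdict(Counter)
    for word, label in examples:
        counts[word.lower()][label] += 1
    # The returned dict is the 'model': the output of the algorithm run on the data
    return {word: labels.most_common(1)[0][0] for word, labels in counts.items()}

def predict(model, words):
    """Apply the model to new text; words it never saw get no label."""
    return [(w, model[w.lower()]) for w in words if w.lower() in model]

model = train(training_data)
print(predict(model, "Patients reported cough and fatigue in Hong Kong".split()))
```

Real models generalize to words they have not seen, which a lookup table cannot do, but the split is the same: data goes into an algorithm once, and the resulting model is what you apply to new text afterwards.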

Suppose we are interested in identifying all the named entities in some text. spaCy allows us to build Machine Learning models which can perform this task.

Out of the box, spaCy also provides several pretrained ML models for this task for the English language.

It is easiest to explain using an example.

Let us consider this Abstract:

SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation.

As you can see, it is filled with a lot of medical jargon.

We will use the following:

  • a blank model (meaning it only tokenizes words, does not identify entities)
  • the basic English model en_core_web_sm
  • the transformer based English model en_core_web_trf
  • a pretrained domain specific model en_core_sci_sm from scispacy which identifies biomedical terms in general and tags them all with a single ENTITY label
  • a pretrained domain specific model en_ner_bc5cdr_md from scispacy which identifies only diseases and chemicals
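The pretrained pipelines in this list are distributed as regular Python packages, so one quick way to see which of them are available before running the comparison is to ask Python whether each package can be found. This is a minimal sketch using only the standard library; `missing_models` is a helper name made up for this example, and the blank model is left out because `spacy.blank('en')` needs no download.

```python
from importlib.util import find_spec

MODEL_NAMES = ["en_core_web_sm", "en_core_web_trf", "en_core_sci_sm", "en_ner_bc5cdr_md"]

def missing_models(names):
    """Return the names that cannot be found as installed packages."""
    return [name for name in names if find_spec(name) is None]

missing = missing_models(MODEL_NAMES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All models are installed")
```

The two en_core_web pipelines can be fetched with `python -m spacy download <name>`; the scispacy ones are installed with pip from the links given in the scispacy documentation.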

First we will run each of these models on the text and print the unique entities that it extracted.

Here is the code:

import spacy

text = '''
SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation. 
'''

model_names = ['en', 'en_core_web_sm', 'en_core_web_trf', 'en_core_sci_sm', 'en_ner_bc5cdr_md']

for model_name in model_names:
    # 'en' is a language code rather than an installed package, so it gets a blank pipeline
    nlp = spacy.blank(model_name) if model_name == 'en' else spacy.load(model_name)
    doc = nlp(text)
    # Collect the distinct entity strings the pipeline found
    uniq_ents = {e.text for e in doc.ents}
    ent_str = ' | '.join(uniq_ents)
    print(f'Model = {model_name} Num Ents = {len(uniq_ents)} Ents = {ent_str}\n')

And here is the result from running this code:

Model = en Num Ents = 0 Ents = 

Model = en_core_web_sm Num Ents = 16 Ents = Omicron | 3 years | CHM | 7.0 | COVID-19 | Formula | 10 days | (median days | 3 | Chinese | two | first | 6.0 | Hong Kong | 8.0 | 2022

Model = en_core_web_trf Num Ents = 15 Ents = 3 years | 7.0 | median days | 6.0-8.0 | 7.0-10.5 | 10 days | 3 | 8.0 | Chinese | two | first | Telemedicine Chinese Medicine Centre | Hong Kong | Hong Kong Baptist University | 2022

Model = en_core_sci_sm Num Ents = 57 Ents = chest tightness | rapid antigen test | medical resources | chills | Formula | diarrhea | total score | symptom disappearance rates | group | symptoms | intervention | infected | NCT | itching throat | Chinese medicine | years | headache | study | patients | Patients | no-CHM | volume | outbreak | case-control study | CHM group | outcomes | gender | score | evaluate | days | dry throat | CHM | individual | polymerase chain reaction | cough | positive test | fatigue | treatment | vaccine doses | effectiveness | pressure | no-CHM group | COVID-19 | interquartile range | negative | telemedical consultation | consultations | age | doses | Chinese herbal medicine | primary outcome | prescribed | data | sputum | Hong Kong | abdominal pain | Hong Kong Baptist University Telemedicine Chinese Medicine Centre database

Model = en_ner_bc5cdr_md Num Ents = 9 Ents = chest tightness | cough | headache | chills | diarrhea | fatigue | throat | abdominal pain | itching throat
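Long entity lists like these are hard to compare by eye. One way to do it is with plain set operations, sketched here for the two general-purpose models. The entity strings are copied from the printed results above; the variable names are only illustrative. Sorting the sets also gives a stable order, which printing a set directly does not guarantee.

```python
# Entity strings copied from the printed results for the two general-purpose models
sm_ents = {
    "Omicron", "3 years", "CHM", "7.0", "COVID-19", "Formula", "10 days",
    "(median days", "3", "Chinese", "two", "first", "6.0", "Hong Kong", "8.0", "2022",
}
trf_ents = {
    "3 years", "7.0", "median days", "6.0-8.0", "7.0-10.5", "10 days", "3", "8.0",
    "Chinese", "two", "first", "Telemedicine Chinese Medicine Centre", "Hong Kong",
    "Hong Kong Baptist University", "2022",
}

shared = sm_ents & trf_ents      # found by both models
only_sm = sm_ents - trf_ents     # found only by en_core_web_sm
only_trf = trf_ents - sm_ents    # found only by en_core_web_trf

print(f"Shared ({len(shared)}): {sorted(shared)}")
print(f"Only en_core_web_sm ({len(only_sm)}): {sorted(only_sm)}")
print(f"Only en_core_web_trf ({len(only_trf)}): {sorted(only_trf)}")
```

For these two lists, 10 strings are shared, 6 appear only for en_core_web_sm and 5 only for en_core_web_trf.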

Let us visualize these one by one (I skipped the blank model as it does not identify any entities). In each rendering below, the label in capitals follows the entity it applies to.

en_core_web_sm


SARS-CoV-2 Omicron ORG led to the most serious outbreak of COVID-19 in Hong Kong GPE in 2022 DATE . Under the pressure of a high volume of patients and limited medical resources, Chinese NORP herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM ORG using data extracted from the Hong Kong ORG Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two CARDINAL consultations and taken CHM ORG within 10 days DATE of the first ORDINAL positive test were included in the study ( CHM ORG group, [ Formula ORG : see text]). The matched control cases were those who did not take CHM within 10 days DATE of the first ORDINAL positive test and were based on age ([ Formula WORK_OF_ART : see text] 3 years DATE ), vaccine doses ([ Formula WORK_OF_ART : see text] 3 CARDINAL doses, or 3 CARDINAL doses), and gender (no-CHM group, [ Formula ORG : see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM ORG group (median days DATE : 7.0 CARDINAL , interquartile range: 6.0 CARDINAL -8.0) was significantly shorter than that of the no-CHM group ( 8.0 CARDINAL , 7.0 CARDINAL -10.5; [ Formula PERSON : see text]). CHM treatment significantly reduced the total score of individual symptoms ([ Formula WORK_OF_ART : see text]) and the number of the reported symptoms ([ Formula WORK_OF_ART : see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM ORG group than in the no-CHM group. 
In conclusion, CHM intervention can significantly reduce NCT and COVID-19 ORG symptoms. Chinese NORP medicine can be accurately prescribed based on a telemedical consultation.

en_core_web_trf


SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong GPE in 2022 DATE . Under the pressure of a high volume of patients and limited medical resources, Chinese NORP herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University ORG Telemedicine Chinese Medicine Centre ORG database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two CARDINAL consultations and taken CHM within 10 days DATE of the first ORDINAL positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days DATE of the first ORDINAL positive test and were based on age ([Formula: see text] 3 years DATE ), vaccine doses ([Formula: see text] 3 CARDINAL doses, or 3 CARDINAL doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group ( median days DATE : 7.0 CARDINAL , interquartile range: 6.0-8.0 CARDINAL ) was significantly shorter than that of the no-CHM group ( 8.0 CARDINAL , 7.0-10.5 CARDINAL ; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. 
Chinese NORP medicine can be accurately prescribed based on a telemedical consultation.

en_core_sci_sm


SARS-CoV-2 Omicron led to the most serious outbreak ENTITY of COVID-19 ENTITY in Hong Kong ENTITY in 2022. Under the pressure ENTITY of a high volume ENTITY of patients ENTITY and limited medical resources ENTITY , Chinese herbal medicine ENTITY ( CHM ENTITY ) has been extensively used. This is a case-control study ENTITY of the infected ENTITY patients ENTITY that aims to evaluate ENTITY the effectiveness ENTITY of CHM ENTITY using data ENTITY extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database ENTITY . Patients ENTITY with COVID-19 ENTITY confirmed by either a rapid antigen test ENTITY or a polymerase chain reaction ENTITY who had completed two consultations ENTITY and taken CHM ENTITY within 10 days ENTITY of the first positive test ENTITY were included in the study ENTITY ( CHM group ENTITY , [ Formula ENTITY : see text]). The matched control cases were those who did not take CHM ENTITY within 10 days ENTITY of the first positive test ENTITY and were based on age ENTITY ([ Formula ENTITY : see text] 3 years ENTITY ), vaccine doses ENTITY ([ Formula ENTITY : see text] 3 doses ENTITY , or 3 doses ENTITY ), and gender ENTITY ( no-CHM group ENTITY , [ Formula ENTITY : see text]). The outcomes ENTITY included the negative ENTITY conversion time ( NCT ENTITY , primary outcome ENTITY ), total score ENTITY of individual ENTITY symptoms ENTITY , number of the reported symptoms ENTITY , and individual ENTITY symptom disappearance rates ENTITY . The NCT ENTITY of the CHM ENTITY group ENTITY (median days ENTITY : 7.0, interquartile range ENTITY : 6.0-8.0) was significantly shorter than that of the no-CHM ENTITY group (8.0, 7.0-10.5; [ Formula ENTITY : see text]). CHM ENTITY treatment ENTITY significantly reduced the total score ENTITY of individual ENTITY symptoms ([ Formula ENTITY : see text]) and the number of the reported symptoms ENTITY ([ Formula ENTITY : see text]) as compared with that of the no-CHM ENTITY group. 
Additionally, the symptom disappearance rates ENTITY of symptoms ENTITY such as chills ENTITY , cough ENTITY , sputum ENTITY , dry throat ENTITY , itching throat ENTITY , headache ENTITY , chest tightness ENTITY , abdominal pain ENTITY , diarrhea ENTITY , and fatigue ENTITY were significantly higher in the CHM ENTITY group ENTITY than in the no-CHM ENTITY group ENTITY . In conclusion, CHM ENTITY intervention ENTITY can significantly reduce NCT ENTITY and COVID-19 ENTITY symptoms ENTITY . Chinese medicine ENTITY can be accurately prescribed ENTITY based on a telemedical consultation ENTITY .

en_ner_bc5cdr_md


SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills DISEASE , cough DISEASE , sputum, dry throat DISEASE , itching throat DISEASE , headache DISEASE , chest tightness DISEASE , abdominal pain DISEASE , diarrhea DISEASE , and fatigue DISEASE were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation.
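Renderings like these are typically produced with spaCy's built-in displaCy visualizer, for example `displacy.render(doc, style="ent")`. To show how the entity offsets relate to the text, here is a small plain-Python sketch that rebuilds the same "entity followed by label" layout. The `annotate` helper and the example spans are assumptions made for illustration; with spaCy, the spans would come from `ent.start_char`, `ent.end_char` and `ent.label_` of each entity in `doc.ents`.

```python
def annotate(text, spans):
    """Insert each label after the span it covers, e.g. 'Hong Kong GPE'.

    spans: non-overlapping (start_char, end_char, label) tuples.
    """
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])             # untouched text before the entity
        pieces.append(f"{text[start:end]} {label}")   # entity followed by its label
        cursor = end
    pieces.append(text[cursor:])                      # remainder after the last entity
    return "".join(pieces)

sentence = "COVID-19 in Hong Kong in 2022."
# Hypothetical spans, written by hand for this example
spans = [(12, 21, "GPE"), (25, 29, "DATE")]
print(annotate(sentence, spans))
```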

Summary

You can see a few things based on the list of entities which were extracted:

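  • Even though the en_core_web_sm model is small (in size), it is pretty good: it finds about as many unique entities as the larger en_core_web_trf model (16 versus 15). Its labels are less reliable, though. It tags CHM and COVID-19 as ORG and gives “Formula” three different labels (ORG, WORK_OF_ART and PERSON), mistakes that en_core_web_trf does not make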
  • Domain specific models will (obviously) find more entities of interest – that is often the primary reason to create domain specific ML models
  • Within domain specific models, you can have narrower models which only identify specific types of entities. For example, en_ner_bc5cdr_md only identifies diseases and chemicals. More is not always better, and sometimes extracting fewer entities is a sign of a higher quality threshold.
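If you only care about certain entity types, you can also narrow down the output of a broader model yourself by filtering on the label. A minimal sketch, assuming the entities are available as (text, label) pairs: the pairs below are taken from two different renderings above and mixed purely for illustration, and `keep_labels` is a made-up helper. With spaCy, each pair would be `(ent.text, ent.label_)`.

```python
from collections import Counter

# (text, label) pairs picked from the renderings above, mixed for illustration
entities = [
    ("Hong Kong", "GPE"), ("2022", "DATE"), ("two", "CARDINAL"),
    ("cough", "DISEASE"), ("headache", "DISEASE"), ("fatigue", "DISEASE"),
]

def keep_labels(pairs, wanted):
    """Keep only the entities whose label is in the wanted set."""
    return [(text, label) for text, label in pairs if label in wanted]

# How many entities of each type are there?
print(Counter(label for _, label in entities))
# Keep only the types that en_ner_bc5cdr_md is described as identifying
print(keep_labels(entities, {"DISEASE", "CHEMICAL"}))
```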