When you use spaCy, you will notice that its documentation uses the word “model” quite often.
So what are models?
A “model” in machine learning is the output of a machine learning algorithm run on data.
Source
Suppose we are interested in identifying all the named entities in some text. spaCy lets us build machine learning models that perform this task, and out of the box it provides several pretrained models for this task in English.
It is easiest to explain using an example.
Let us consider this abstract:
SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation.
As you can see, it is filled with a lot of medical jargon.
We will use the following:
- a blank model (meaning it only tokenizes words, does not identify entities)
- the basic English model en_core_web_sm
- the transformer-based English model en_core_web_trf
- a pretrained domain-specific model en_core_sci_sm from scispacy, which marks biomedical terms of all kinds with a single generic ENTITY label
- a pretrained domain-specific model en_ner_bc5cdr_md from scispacy, which identifies only diseases and chemicals
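To make the first item concrete, here is a minimal sketch of what a blank pipeline does. The sample sentence is a shortened, made-up variant of the abstract, used only for illustration. A blank English pipeline has no trained components, so it splits the text into tokens and reports no entities.

```python
import spacy

# A blank pipeline knows the tokenization rules for English but has no
# trained components such as a tagger, parser, or entity recognizer.
nlp = spacy.blank("en")
doc = nlp("Patients with COVID-19 took Chinese herbal medicine (CHM).")

print("Pipeline components:", nlp.pipe_names)
print("First tokens:", [token.text for token in doc][:5])
print("Entities:", doc.ents)
```

Because no model package is loaded, this runs without downloading anything beyond spaCy itself.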
First we will run each model on the text and print the unique entities it extracted.
Here is the code:
import spacy
text = '''
SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation.
'''
# 'en' is a language code used for the blank pipeline; the others are installed pipeline packages.
model_names = ['en', 'en_core_web_sm', 'en_core_web_trf', 'en_core_sci_sm', 'en_ner_bc5cdr_md']
for model_name in model_names:
    # A blank pipeline only tokenizes; spacy.load() loads a trained pipeline.
    nlp = spacy.blank(model_name) if model_name == 'en' else spacy.load(model_name)
    doc = nlp(text)
    # Deduplicate entity texts. Sets are unordered, so the printed order can vary between runs.
    uniq_ents = {e.text for e in doc.ents}
    ent_str = ' | '.join(uniq_ents)
    print(f'Model = {model_name} Num Ents = {len(uniq_ents)} Ents = {ent_str}\n')
And here is the result from running this code:
Model = en Num Ents = 0 Ents =
Model = en_core_web_sm Num Ents = 16 Ents = Omicron | 3 years | CHM | 7.0 | COVID-19 | Formula | 10 days | (median days | 3 | Chinese | two | first | 6.0 | Hong Kong | 8.0 | 2022
Model = en_core_web_trf Num Ents = 15 Ents = 3 years | 7.0 | median days | 6.0-8.0 | 7.0-10.5 | 10 days | 3 | 8.0 | Chinese | two | first | Telemedicine Chinese Medicine Centre | Hong Kong | Hong Kong Baptist University | 2022
Model = en_core_sci_sm Num Ents = 57 Ents = chest tightness | rapid antigen test | medical resources | chills | Formula | diarrhea | total score | symptom disappearance rates | group | symptoms | intervention | infected | NCT | itching throat | Chinese medicine | years | headache | study | patients | Patients | no-CHM | volume | outbreak | case-control study | CHM group | outcomes | gender | score | evaluate | days | dry throat | CHM | individual | polymerase chain reaction | cough | positive test | fatigue | treatment | vaccine doses | effectiveness | pressure | no-CHM group | COVID-19 | interquartile range | negative | telemedical consultation | consultations | age | doses | Chinese herbal medicine | primary outcome | prescribed | data | sputum | Hong Kong | abdominal pain | Hong Kong Baptist University Telemedicine Chinese Medicine Centre database
Model = en_ner_bc5cdr_md Num Ents = 9 Ents = chest tightness | cough | headache | chills | diarrhea | fatigue | throat | abdominal pain | itching throat
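The two general-purpose models report similar counts (16 and 15), but the counts alone do not show how much they agree. As a small sketch, the entity strings below are copied by hand from the output above, and plain set operations show the overlap. The variable names are only illustrative.

```python
# Entity strings copied from the printed results for the two general-purpose models.
sm_ents = {
    "Omicron", "3 years", "CHM", "7.0", "COVID-19", "Formula", "10 days",
    "(median days", "3", "Chinese", "two", "first", "6.0", "Hong Kong",
    "8.0", "2022",
}
trf_ents = {
    "3 years", "7.0", "median days", "6.0-8.0", "7.0-10.5", "10 days", "3",
    "8.0", "Chinese", "two", "first", "Telemedicine Chinese Medicine Centre",
    "Hong Kong", "Hong Kong Baptist University", "2022",
}

shared = sm_ents & trf_ents      # found by both models
only_sm = sm_ents - trf_ents     # found only by en_core_web_sm
only_trf = trf_ents - sm_ents    # found only by en_core_web_trf

print(f"shared = {len(shared)}, only sm = {len(only_sm)}, only trf = {len(only_trf)}")
print("only sm :", " | ".join(sorted(only_sm)))
print("only trf:", " | ".join(sorted(only_trf)))
```

The strings unique to en_core_web_sm include Formula and CHM, while those unique to en_core_web_trf include the two organization names, which is worth keeping in mind when reading the listings that follow.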
Let us visualize these one by one. (I skipped the blank model because it does not identify any entities.)
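The listings that follow look like output from displaCy, spaCy's built-in visualizer. Assuming that is how they were produced, here is a minimal sketch of rendering entities as HTML. To keep it runnable without downloading a model, it uses a blank pipeline and attaches two entities by hand. The helper labeled_span is hypothetical and exists only for this example. With a trained pipeline you would pass its doc to displacy.render directly.

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
sentence = ("SARS-CoV-2 Omicron led to the most serious outbreak of "
            "COVID-19 in Hong Kong in 2022.")
doc = nlp(sentence)


def labeled_span(doc, phrase, label):
    """Hypothetical helper: build a labeled Span for the first occurrence of phrase."""
    start = doc.text.index(phrase)
    span = doc.char_span(start, start + len(phrase), label=label)
    if span is None:
        raise ValueError(f"{phrase!r} does not align with token boundaries")
    return span


# Attach entities manually, standing in for what a trained model would predict.
doc.ents = [labeled_span(doc, "Hong Kong", "GPE"), labeled_span(doc, "2022", "DATE")]

# style="ent" highlights entity spans; jupyter=False returns the markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:200])
```

In a Jupyter notebook, leaving out jupyter=False lets displaCy display the highlighted text inline instead of returning a string.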
en_core_web_sm
Each entity is shown inline as {text | LABEL}.
SARS-CoV-2 {Omicron | ORG} led to the most serious outbreak of COVID-19 in {Hong Kong | GPE} in {2022 | DATE}. Under the pressure of a high volume of patients and limited medical resources, {Chinese | NORP} herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of {CHM | ORG} using data extracted from the {Hong Kong | ORG} Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed {two | CARDINAL} consultations and taken {CHM | ORG} within {10 days | DATE} of the {first | ORDINAL} positive test were included in the study ({CHM | ORG} group, [{Formula | ORG}: see text]). The matched control cases were those who did not take CHM within {10 days | DATE} of the {first | ORDINAL} positive test and were based on age ([{Formula | WORK_OF_ART}: see text] {3 years | DATE}), vaccine doses ([{Formula | WORK_OF_ART}: see text] {3 | CARDINAL} doses, or {3 | CARDINAL} doses), and gender (no-CHM group, [{Formula | ORG}: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the {CHM | ORG} group {(median days | DATE}: {7.0 | CARDINAL}, interquartile range: {6.0 | CARDINAL}-8.0) was significantly shorter than that of the no-CHM group ({8.0 | CARDINAL}, {7.0 | CARDINAL}-10.5; [{Formula | PERSON}: see text]). CHM treatment significantly reduced the total score of individual symptoms ([{Formula | WORK_OF_ART}: see text]) and the number of the reported symptoms ([{Formula | WORK_OF_ART}: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the {CHM | ORG} group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and {COVID-19 | ORG} symptoms. {Chinese | NORP} medicine can be accurately prescribed based on a telemedical consultation.
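In the en_core_web_sm listing above, the same string sometimes receives different labels in different sentences. A small tally makes that visible. The (text, label) pairs below are copied by hand from that listing for two strings only. In a live session they could instead come from [(e.text, e.label_) for e in doc.ents]. The function name labels_per_text is made up for this sketch.

```python
from collections import Counter, defaultdict

# (entity text, label) pairs transcribed from the en_core_web_sm listing.
pairs = [
    ("Hong Kong", "GPE"), ("Hong Kong", "ORG"),
    ("Formula", "ORG"), ("Formula", "WORK_OF_ART"), ("Formula", "WORK_OF_ART"),
    ("Formula", "ORG"), ("Formula", "PERSON"), ("Formula", "WORK_OF_ART"),
    ("Formula", "WORK_OF_ART"),
]


def labels_per_text(pairs):
    """Count how often each label was assigned to each entity string."""
    tally = defaultdict(Counter)
    for text, label in pairs:
        tally[text][label] += 1
    return tally


tally = labels_per_text(pairs)
for text, counts in tally.items():
    if len(counts) > 1:  # the same string received more than one label
        print(text, dict(counts))
```

A string that collects several unrelated labels, as Formula does here, is a hint that the model is guessing rather than recognizing a real entity.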
en_core_web_trf
Each entity is shown inline as {text | LABEL}.
SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in {Hong Kong | GPE} in {2022 | DATE}. Under the pressure of a high volume of patients and limited medical resources, {Chinese | NORP} herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the {Hong Kong Baptist University | ORG} {Telemedicine Chinese Medicine Centre | ORG} database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed {two | CARDINAL} consultations and taken CHM within {10 days | DATE} of the {first | ORDINAL} positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within {10 days | DATE} of the {first | ORDINAL} positive test and were based on age ([Formula: see text] {3 years | DATE}), vaccine doses ([Formula: see text] {3 | CARDINAL} doses, or {3 | CARDINAL} doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group ({median days | DATE}: {7.0 | CARDINAL}, interquartile range: {6.0-8.0 | CARDINAL}) was significantly shorter than that of the no-CHM group ({8.0 | CARDINAL}, {7.0-10.5 | CARDINAL}; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as chills, cough, sputum, dry throat, itching throat, headache, chest tightness, abdominal pain, diarrhea, and fatigue were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. {Chinese | NORP} medicine can be accurately prescribed based on a telemedical consultation.
en_core_sci_sm
Each span in braces is an entity. This model assigns every entity the same generic label, ENTITY, so the label is not repeated.
SARS-CoV-2 Omicron led to the most serious {outbreak} of {COVID-19} in {Hong Kong} in 2022. Under the {pressure} of a high {volume} of {patients} and limited {medical resources}, {Chinese herbal medicine} ({CHM}) has been extensively used. This is a {case-control study} of the {infected} {patients} that aims to {evaluate} the {effectiveness} of {CHM} using {data} extracted from the {Hong Kong Baptist University Telemedicine Chinese Medicine Centre database}. {Patients} with {COVID-19} confirmed by either a {rapid antigen test} or a {polymerase chain reaction} who had completed two {consultations} and taken {CHM} within 10 {days} of the first {positive test} were included in the {study} ({CHM group}, [{Formula}: see text]). The matched control cases were those who did not take {CHM} within 10 {days} of the first {positive test} and were based on {age} ([{Formula}: see text] 3 {years}), {vaccine doses} ([{Formula}: see text] 3 {doses}, or 3 {doses}), and {gender} ({no-CHM group}, [{Formula}: see text]). The {outcomes} included the {negative} conversion time ({NCT}, {primary outcome}), {total score} of {individual} {symptoms}, number of the reported {symptoms}, and {individual} {symptom disappearance rates}. The {NCT} of the {CHM} {group} (median {days}: 7.0, {interquartile range}: 6.0-8.0) was significantly shorter than that of the {no-CHM} group (8.0, 7.0-10.5; [{Formula}: see text]). {CHM} {treatment} significantly reduced the total {score} of {individual} symptoms ([{Formula}: see text]) and the number of the reported {symptoms} ([{Formula}: see text]) as compared with that of the {no-CHM} group. Additionally, the {symptom disappearance rates} of {symptoms} such as {chills}, {cough}, {sputum}, {dry throat}, {itching throat}, {headache}, {chest tightness}, {abdominal pain}, {diarrhea}, and {fatigue} were significantly higher in the {CHM} {group} than in the {no-CHM} {group}. In conclusion, {CHM} {intervention} can significantly reduce {NCT} and {COVID-19} {symptoms}. {Chinese medicine} can be accurately {prescribed} based on a {telemedical consultation}.
en_ner_bc5cdr_md
Each entity is shown inline as {text | LABEL}.
SARS-CoV-2 Omicron led to the most serious outbreak of COVID-19 in Hong Kong in 2022. Under the pressure of a high volume of patients and limited medical resources, Chinese herbal medicine (CHM) has been extensively used. This is a case-control study of the infected patients that aims to evaluate the effectiveness of CHM using data extracted from the Hong Kong Baptist University Telemedicine Chinese Medicine Centre database. Patients with COVID-19 confirmed by either a rapid antigen test or a polymerase chain reaction who had completed two consultations and taken CHM within 10 days of the first positive test were included in the study (CHM group, [Formula: see text]). The matched control cases were those who did not take CHM within 10 days of the first positive test and were based on age ([Formula: see text] 3 years), vaccine doses ([Formula: see text] 3 doses, or 3 doses), and gender (no-CHM group, [Formula: see text]). The outcomes included the negative conversion time (NCT, primary outcome), total score of individual symptoms, number of the reported symptoms, and individual symptom disappearance rates. The NCT of the CHM group (median days: 7.0, interquartile range: 6.0-8.0) was significantly shorter than that of the no-CHM group (8.0, 7.0-10.5; [Formula: see text]). CHM treatment significantly reduced the total score of individual symptoms ([Formula: see text]) and the number of the reported symptoms ([Formula: see text]) as compared with that of the no-CHM group. Additionally, the symptom disappearance rates of symptoms such as {chills | DISEASE}, {cough | DISEASE}, sputum, dry {throat | DISEASE}, {itching throat | DISEASE}, {headache | DISEASE}, {chest tightness | DISEASE}, {abdominal pain | DISEASE}, {diarrhea | DISEASE}, and {fatigue | DISEASE} were significantly higher in the CHM group than in the no-CHM group. In conclusion, CHM intervention can significantly reduce NCT and COVID-19 symptoms. Chinese medicine can be accurately prescribed based on a telemedical consultation.
Summary
You can see a few things based on the list of entities which were extracted:
- Even though the en_core_web_sm model is small (in size), it extracts about as many entities as a larger model like en_core_web_trf (16 versus 15 here). Its labels are less reliable, though: it tags Formula as ORG, WORK_OF_ART, and PERSON, and CHM as ORG, whereas en_core_web_trf skips both and picks out Hong Kong Baptist University as an organization.
- Domain-specific models will (obviously) find more entities of interest (en_core_sci_sm found 57 here). That is often the primary reason to create domain-specific ML models.
- Within domain-specific models, you can have narrower models which identify only specific types of entities. For example, en_ner_bc5cdr_md identifies only diseases and chemicals. More is not always better, and sometimes extracting fewer entities is a sign of a higher quality threshold.
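When a model returns more labels than you need, you can also narrow the results yourself by filtering on the entity label. The sketch below assumes nothing about any particular model: it uses a blank pipeline, a made-up sentence, and hand-assigned entities so that it runs without downloads. The helper name entities_by_label is hypothetical.

```python
import spacy
from spacy.tokens import Span


def entities_by_label(doc, wanted_labels):
    """Return the text of every entity whose label is in wanted_labels."""
    return [ent.text for ent in doc.ents if ent.label_ in wanted_labels]


nlp = spacy.blank("en")
doc = nlp("Symptoms such as cough and headache improved in Hong Kong.")

# Hand-assigned entities (token offsets) standing in for a trained model's predictions.
doc.ents = [
    Span(doc, 3, 4, label="DISEASE"),   # cough
    Span(doc, 5, 6, label="DISEASE"),   # headache
    Span(doc, 8, 10, label="GPE"),      # Hong Kong
]

diseases = entities_by_label(doc, {"DISEASE"})
print(diseases)
```

The same helper works on a doc produced by any loaded pipeline, since it only reads doc.ents and each entity's label_.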