How to use custom stop words in spaCy

Stop words are very common words, such as “a” and “the”, that carry little information about the text you are analyzing.

spaCy lets you check whether a given token is a stop word through the token's is_stop attribute. Let us see how to do that.
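Before checking individual tokens, it can help to look at the built-in list itself. Here is a minimal sketch, assuming the English stop words can be imported as STOP_WORDS from spacy.lang.en.stop_words, which is where current spaCy releases keep them. It does not need a trained model.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of lowercase strings
print(len(STOP_WORDS))

# Membership tests show which words spaCy treats as stop words by default
print("the" in STOP_WORDS)
print("by" in STOP_WORDS)
print("google" in STOP_WORDS)
```

By default, "the" and "by" are on the list and "google" is not, which is why those two words make a good example for customization.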

First, create a file called stop_words.py and add the following code to it.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.")

# Rebuild the sentence, wrapping each stop word in square brackets
full_text = ''
for tok in doc:
    tok_text = tok.text
    if tok.is_stop:
        tok_text = f'[{tok.text}]'
    full_text += tok_text + ' '

print(full_text)

As you can see, I am simply iterating over all the tokens, and if the token is a stop word, I enclose it in square brackets.

This is the output when you run this script
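The bracketing loop can also be wrapped in a small reusable function. This is only a sketch: the name mark_stop_words is hypothetical, and it uses spacy.blank("en") (a tokenizer-only English pipeline) instead of en_core_web_sm so that it runs without downloading a model. Because it uses join, the result has no trailing space.

```python
import spacy


def mark_stop_words(doc):
    """Return the tokens of doc joined by spaces, with stop words in square brackets."""
    return " ".join(f"[{tok.text}]" if tok.is_stop else tok.text for tok in doc)


# Assumption: a blank English pipeline is enough here, since is_stop
# is a lexical attribute and does not depend on a trained model
nlp = spacy.blank("en")
print(mark_stop_words(nlp("This is a test")))  # [This] [is] [a] test
```

The same function works unchanged on a doc produced by a pipeline loaded with spacy.load.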

spaCy allows you to modify the list of stop words.

Create a file called custom_stopwords.py and add the following code:

import spacy

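# Look up the English language class so its default stop words can be edited
cls = spacy.util.get_lang_class('en')
# Stop words are stored in lowercase; token text is lowercased before the lookup
cls.Defaults.stop_words.remove('by')   # 'by' is no longer a stop word
cls.Defaults.stop_words.add('google')  # 'google' (and 'Google') is now a stop word
# Load the model only after the stop word list has been changed
nlp = spacy.load("en_core_web_sm")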
doc = nlp(
    "Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.")

full_text = ''
for tok in doc:
    tok_text = tok.text
    if tok.is_stop:
        tok_text = f'[{tok.text}]'
    full_text += tok_text + ' '

print(full_text)

Notice that I have added the word ‘google’ as a stop word, and removed the word ‘by’ from the existing list.

Very important: you must modify the list of stop words before you load the model.

This is what the output looks like now
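If the pipeline is already loaded, one alternative is to set the is_stop flag directly on the vocabulary entries instead of editing the defaults. The sketch below is not part of the scripts above. It again uses spacy.blank("en") as a stand-in pipeline, and it assumes you list every casing you care about, because the flag is set on one exact string at a time.

```python
import spacy

# Assumption: a blank English pipeline stands in for an already loaded model
nlp = spacy.blank("en")

# Flags live on individual vocabulary entries, so each casing is set separately
for word in ("google", "Google", "GOOGLE"):
    nlp.vocab[word].is_stop = True

for word in ("by", "By", "BY"):
    nlp.vocab[word].is_stop = False

doc = nlp("made by Google")
print([(tok.text, tok.is_stop) for tok in doc])
```

Unlike the approach of changing the defaults before loading, these changes apply only to this nlp object and only to the spellings listed.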