How to split a document into individual words in spaCy
The default pipeline in spaCy converts a document (such as a paragraph of text) into a list of sentences, where the sentences are themselves composed of tokens.
Tokens are not words; they are a little more complex, as the documentation explains:
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token.
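To see that rule in action, here is a minimal sketch (it assumes the en_core_web_sm model used throughout this article is already installed):
import spacy
nlp = spacy.load("en_core_web_sm")
# 'U.K.' survives tokenization as a single token,
# while the sentence-final full stop is split off
print([tok.text for tok in nlp('I moved to the U.K. last year.')])
The printed list should look something like ['I', 'moved', 'to', 'the', 'U.K.', 'last', 'year', '.'].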
In the previous article, I explained how you can iterate over the tokens in each sentence.
But in fact, you can also iterate over the tokens of the whole document directly.
Note: you need to download the en_core_web_sm model first (run python -m spacy download en_core_web_sm) to be able to run the script below.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('This is the first sentence. This is the second sentence.')
# Iterate over the tokens of the whole document directly,
# without going through the sentences first
for tok in doc:
    print(tok)
This is roughly what the output looks like when you run the script; notice that it is identical to the output in the previous article.
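This
is
the
first
sentence
.
This
is
the
second
sentence
.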

But tokens are also complex objects and contain quite a lot of information.
You can see this by using IntelliSense on an individual token in your editor.
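For instance, here is a small sketch that prints a few of those attributes for each token (the part-of-speech and dependency values come from the loaded model, so yours may differ slightly):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('This is the first sentence.')
for tok in doc:
    # text: the raw token text; pos_: coarse part-of-speech tag
    # dep_: syntactic dependency label; is_punct: True for punctuation
    print(tok.text, tok.pos_, tok.dep_, tok.is_punct)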

For example, let us print the token’s lemma along with its text.
Here is the code, in a new file called text_lemmas.py:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('This is an example of how lemmas and tokens differ in spaCy')
# Print each token's surface text alongside its lemma (base form)
for tok in doc:
    print(f'Token text: {tok.text} Lemma: {tok.lemma_}')
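Running the script should produce output roughly like this (lemmas come from the model's lemmatizer, so they can vary slightly between model versions):
Token text: This Lemma: this
Token text: is Lemma: be
Token text: an Lemma: an
Token text: example Lemma: example
Token text: of Lemma: of
Token text: how Lemma: how
Token text: lemmas Lemma: lemma
Token text: and Lemma: and
Token text: tokens Lemma: token
Token text: differ Lemma: differ
Token text: in Lemma: in
Token text: spaCy Lemma: spaCy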
You can see that the text and the lemma are not the same for all tokens (for example, the lemma of ‘is’ is ‘be’).

As you can see, you get access to a LOT of features with a very small amount of code in spaCy. This is why it is the best NLP library by some distance.