
How to split a document into individual words in spaCy

This article is part of the Learn spaCy series

The default pipeline in spaCy converts a document (such as a paragraph of text) into a Doc object, which you can read as a list of sentences where the sentences are themselves composed of tokens.

Tokens are not quite the same thing as words. They are a little more complex, as the documentation explains:

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token.

Source
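To see these rules in action, here is a minimal sketch. It assumes a blank English pipeline (spacy.blank("en")), which runs only the tokenizer and needs no model download; the example sentence is my own choice for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because we only
# need the tokenizer and not the trained components.
nlp = spacy.blank("en")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Collect the text of every token so we can inspect how the text was split.
tokens = [tok.text for tok in doc]

for text in tokens:
    print(text)
```

In the printed list, "U.K." stays together as a single token, while the final full stop and the "$" sign are split off into tokens of their own.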

In the previous article, I explained how you can iterate over the tokens in each sentence.

But in fact, you can also iterate over the tokens directly:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('This is the first sentence. This is the second sentence.')
for tok in doc:
    print(tok)

When you run the script, each token is printed on its own line, and the full stops appear as separate tokens. Notice that the output is identical to the output in the previous article.
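Because the token stream includes punctuation, getting a plain list of words takes one extra step. The sketch below is one possible approach, not the only one: it filters on the token flags is_punct and is_space, and it assumes a blank English pipeline since no trained model is needed for these flags.

```python
import spacy

# Assumption: tokenizer-only pipeline; is_punct and is_space do not
# depend on a trained model.
nlp = spacy.blank("en")
doc = nlp('This is the first sentence. This is the second sentence.')

# Keep only tokens that are neither punctuation nor whitespace.
words = [tok.text for tok in doc if not tok.is_punct and not tok.is_space]

print(words)
```

The document has 12 tokens, but only 10 of them survive the filter, because the two full stops are dropped.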

But tokens are also complex objects and contain quite a lot of information.

You can see this by using IntelliSense (your editor's autocomplete) on an individual token.
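If you do not have autocomplete available, you can explore a token from code instead. This is a rough sketch that prints a handful of attributes I picked for illustration (i, idx, text, is_alpha, is_punct) and then uses Python's built-in dir() to list the public names on a token. It again assumes a blank English pipeline.

```python
import spacy

# Assumption: tokenizer-only pipeline, which is enough for these attributes.
nlp = spacy.blank("en")
doc = nlp("This is the first sentence.")

# For each token: its position in the doc, its character offset in the
# original text, its text, and two boolean flags.
rows = [(tok.i, tok.idx, tok.text, tok.is_alpha, tok.is_punct) for tok in doc]
for row in rows:
    print(row)

# dir() lists everything available on the token; drop the private names.
public_names = [name for name in dir(doc[0]) if not name.startswith("_")]
print(len(public_names), "public attributes and methods on a token")
```

Attributes that rely on trained components, such as lemma_, show up in this list too, but they are only filled in when you load a full pipeline like en_core_web_sm.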

For example, let us print the token’s lemma along with its text.

Here is the code, in a new file called text_lemmas.py:

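import spacy

# Load the small English pipeline, which includes a lemmatizer
nlp = spacy.load("en_core_web_sm")
doc = nlp('This is an example of how lemmas and tokens differ in spaCy')
for tok in doc:
    # tok.text is the original text, tok.lemma_ is its base form
    print(f'Token text: {tok.text} Lemma: {tok.lemma_}')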

You can see that the text and the lemma are not the same for all tokens. For example, the lemma of ‘is’ is ‘be’.
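To list only the tokens whose lemma differs from their text, you could filter the loop. The sketch below is one way to do it: changed_lemmas is a hypothetical helper name of my own, the comparison ignores case (an assumption on my part, so that a capitalised word does not count as a difference), and the script prints an install hint instead of crashing when en_core_web_sm is missing.

```python
import spacy


def changed_lemmas(doc):
    """Return (text, lemma) pairs where the lemma differs from the text.

    Hypothetical helper for illustration; the comparison ignores case.
    """
    return [
        (tok.text, tok.lemma_)
        for tok in doc
        if tok.lemma_.lower() != tok.text.lower()
    ]


try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model package is not installed.
    nlp = None
    print("Model missing. Install it with: python -m spacy download en_core_web_sm")

changed = []
if nlp is not None:
    doc = nlp('This is an example of how lemmas and tokens differ in spaCy')
    changed = changed_lemmas(doc)

for text, lemma in changed:
    print(f'{text} -> {lemma}')
```

The exact pairs printed depend on the model version you have installed, so treat the output as something to inspect rather than a fixed result.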

As you can see, you get access to a lot of features with a very small amount of code in spaCy. This is a big part of why I consider it the best NLP library by a distance.
