
Automated Conversation Testing for Dialogflow

Website Name Change

I have changed the name of this website from Mining Business Data to BotFlo. I am offering a 40% discount on both my Dialogflow ES and Dialogflow CX courses until the end of April 2021 for people who can help me spread the word about my new website.

In one of my articles, I mention using Automated Conversation Testing (ACT) as a way to improve your bot’s reliability. In other words, when you make a change anywhere in your bot, it would be a great help to know that you didn’t accidentally break something else in the bot which was working correctly.


The straightforward way (note: I don’t mean easy) to do ACT right now is to use Dialogflow’s REST API.

Each time you issue a query to your Dialogflow agent via the REST API, you get back the following information (some fields deleted to make it easier to read):
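A trimmed, illustrative example of such a response (the field layout follows the ES /query response format, but every ID, name, and value below is made up):

```json
{
  "id": "b3404f00-0000-0000-0000-000000000000",
  "timestamp": "2021-04-01T12:00:00.000Z",
  "result": {
    "source": "agent",
    "resolvedQuery": "what are your opening hours",
    "metadata": {
      "intentId": "7f2a9f2e-0000-0000-0000-000000000000",
      "intentName": "faq.opening.hours"
    },
    "fulfillment": {
      "speech": "We are open 9am to 5pm, Monday to Friday."
    },
    "score": 0.92
  },
  "status": { "code": 200, "errorType": "success" }
}
```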

The three fields of interest to us are:

  • intentId
  • intentName
  • score

Simple FAQ Bot

Let us consider a simple FAQ bot. It has no contexts. You ask a question, you receive an answer (provided the agent understood the question pattern).

To test this bot, all you really need to do is to create a spreadsheet file with a list of known questions and their known mapped intents.

You run a test by issuing a /query to your agent with each known question and checking that the expected intent is mapped each time.
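A minimal sketch of such a test query. The endpoint URL, protocol version, and token handling below follow the legacy ES v1 /query API and are assumptions, not working configuration; `extract_fields` pulls out the three fields discussed above:

```python
import json
import urllib.request

API_URL = "https://api.dialogflow.com/v1/query?v=20150910"  # legacy v1 endpoint; version date is an assumption
CLIENT_TOKEN = "YOUR_CLIENT_ACCESS_TOKEN"  # placeholder

def extract_fields(response_json):
    """Pull the three fields of interest out of a /query response."""
    result = response_json["result"]
    return (result["metadata"]["intentId"],
            result["metadata"]["intentName"],
            result["score"])

def run_query(question, session_id="act-test-1"):
    """Send one known question to the agent; returns (intentId, intentName, score)."""
    payload = json.dumps({"query": question, "lang": "en",
                          "sessionId": session_id}).encode()
    req = urllib.request.Request(
        API_URL, data=payload,
        headers={"Authorization": f"Bearer {CLIENT_TOKEN}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_fields(json.load(resp))
```

Comparing `run_query(question)` against the spreadsheet row for that question is then a simple tuple comparison.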

For the mapping, you can use either the IntentID or the IntentName. Both have their pros and cons.

Using the IntentId as the identifier

When you use the IntentId as the identifier, the advantage is that, being a GUID, it is straightforward to compare programmatically.

The problem with the IntentId is that it can change when doing a restore, which means you have no guarantee that you will have the same IntentId for the same conceptual intent over time.

Using the IntentName as the identifier

The advantage of using the IntentName is two-fold. Manual checks are easier to do. And usually restores don’t affect the intent name.

The big disadvantage of using the IntentName is that you can easily change the intent name from within the Dialogflow console, and since all your automated testing lives outside the console, you will not be warned when you do this. (How much this matters depends on how robust you want your ACT to be. You could, if you really wanted to, periodically fetch all the intent names from the agent and send a warning email whenever a name change is detected.)
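The warning-email idea reduces to a snapshot diff. A sketch, assuming you periodically fetch `{intentId: intentName}` pairs from the agent (the fetching itself is not shown; all IDs and names are made up):

```python
def detect_renames(previous, current):
    """Compare two {intentId: intentName} snapshots of the agent.

    Returns a list of (intentId, old_name, new_name) tuples for every intent
    whose ID appears in both snapshots but whose name has changed.
    """
    return [(intent_id, old_name, current[intent_id])
            for intent_id, old_name in previous.items()
            if intent_id in current and current[intent_id] != old_name]
```

A non-empty return value is the trigger for the warning email; intents added or deleted between snapshots are ignored here, since those are deliberate edits rather than silent renames.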

Handling contexts

Since Dialogflow’s REST API allows us to set a context as part of issuing a /query request, we can also handle contexts although it will certainly be a little more complex. You will need to save the current contexts as one of the columns in your spreadsheet.

That is, given the context that has been assumed to be set, you send a query to the agent, see what the response is, and save the mapped intent details to your spreadsheet. A full conversation is effectively a series of tuples of the type (input context, query, mapped intent).
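A sketch of replaying such a conversation. The context names, intent names, and the `run_step` callable are all illustrative; in practice `run_step` would issue a /query request that sets the input contexts and return the mapped intent name:

```python
# Each conversation step is a tuple: (input_contexts, query, expected_intent).
# Every name below is made up for illustration.
CONVERSATION = [
    ([],                "I want to book a room",  "room.booking.start"),
    (["await-dates"],   "next Friday to Sunday",  "room.booking.dates"),
    (["await-confirm"], "yes, confirm it",        "room.booking.confirm"),
]

def replay(conversation, run_step):
    """Replay a conversation step by step; return the steps whose mapped
    intent differs from the expected one."""
    failures = []
    for contexts, query, expected in conversation:
        mapped = run_step(contexts, query)  # returns the mapped intent name
        if mapped != expected:
            failures.append((contexts, query, expected, mapped))
    return failures
```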


Using the score

The score provides us with an additional variable to use during our automated testing.

As you know, the score always has to be above the ML Threshold if you wish to avoid triggering the Fallback Intent (and consequently derailing the conversation).

You can check to see if the score gets lower as you keep updating your agent. Adding the score in this fashion provides a nice little check to your ACT efforts.


Putting it all together

Use a combination of IntentID, IntentName and Score to do automated conversation tests for your chatbot. You can run a few conversations by hand and save the relevant fields into a CSV file.

Next, you build a tool that takes the CSV file as input, sends each “query” column value to your agent via the REST API, and checks the following:

  • is the intentId different?
  • is the intentName different?
  • is the score lower than last time?

If any of these is true, it should highlight or flag the relevant line in the CSV file. Once you get into the habit of using this tool, you can be much more confident of the changes you are making to your chatbot.
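The core of such a tool can be sketched as below. The CSV column names are assumptions, and `query_agent` stands in for the real REST call that returns the current `(intentId, intentName, score)` for a query:

```python
import csv
import io

def load_rows(csv_text):
    """Read the saved baseline (columns: query,intentId,intentName,score)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def flag_rows(rows, query_agent):
    """Re-run each saved query and flag rows where the intentId or intentName
    changed, or the score dropped below its previously recorded value.

    `query_agent(query)` returns the current (intentId, intentName, score).
    """
    flagged = []
    for row in rows:
        intent_id, intent_name, score = query_agent(row["query"])
        reasons = []
        if intent_id != row["intentId"]:
            reasons.append("intentId changed")
        if intent_name != row["intentName"]:
            reasons.append("intentName changed")
        if score < float(row["score"]):
            reasons.append("score dropped")
        if reasons:
            flagged.append((row["query"], reasons))
    return flagged
```

An empty return value means every saved query still maps to the same intent with at least the same confidence, which is exactly the regression guarantee the article describes.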

