The Cognilytica Voice Assistant Benchmark

This post was featured in our Cognilytica Newsletter, with additional details.

Out with Chatbots, In with Conversational Interfaces

Much of the human brain is devoted to generating, sensing, processing, and understanding speech and communication in its various forms. As humans built their first civilizations, they were also expanding spoken and written forms of communication. Since we didn’t begin communicating by swiping and typing, it’s no surprise that what humans most want from interactions with machines is what we want from interactions with people: natural conversation. Put more plainly, humans prefer to communicate using the natural language of humans, not the language of computers.

As such, along with the development of modern computers, technologists have been attempting to improve human-machine interaction through conversational interfaces, popularly known as chatbots. At their core, chatbots are automated software applications that accept input in written or spoken form in “natural” human language and provide output, also in natural language, in written or spoken form. The whole “trick” of a chatbot is to make it seem that you’re conversing with a real human being, even though the interaction is with a software application. How successful this illusion is depends on the sophistication of the system and on how much the human cares whether they are talking to a machine. Long story short, people want to talk to their computers, and if you’ve seen any sci-fi movie from Star Trek to Star Wars to 2001: A Space Odyssey to The Hitchhiker’s Guide to the Galaxy, it’s clear that the imagined world of the future is one of people talking to machines.

Chatbots are best suited for applications where a conversational style of interaction with humans is required, instead of other ways of interacting with an application such as desktop, mobile, or web-based user interfaces. These scenarios include interaction via phone, text and SMS-based messaging, web-based interfaces, and interaction with devices where physical input is not possible or convenient, such as when driving or operating equipment, and in general wherever hands-free interaction is preferred. Voice is best suited for short back-and-forth interactions, and text is best suited for longer communications or situations that require a visual interface. We spent quite a bit of time exploring chatbots in our “Chatbots — Are They Useful?” AI Today podcast as well as our “Assistant-Enabled Commerce” AI Today podcast. Give those a listen to learn more about the current state of the chatbot ecosystem and its use cases.

In terms of spoken interaction, chatbots are starting to take off with devices and assistants such as Alexa, Google Home, Siri, Cortana, and an increasing number of new entrants into a space Cognilytica calls “voice assistants”. We’re also seeing increased text-based chatbot interaction in instant messaging applications such as Facebook Messenger, Kik, Slack, Telegram, Discord, and others. These bots help integrate with other applications, as in the case of Slack, or become part of the context of a conversation, so you can embed information from elsewhere while you’re having a conversation about something else.

Is the Term “Chatbot” Meaningful?

Diving deeper into chatbots, it becomes clear that two things are really at play in making what we know as chatbots useful: the part that processes and produces natural, human-like conversation in text or spoken form, and the back-end systems that try to understand what you are saying. The term “chatbot” lumps these two things together. It combines natural language processing (NLP) capabilities with the intelligent back-end system needed to carry on a useful conversation or perform a useful activity. However, these are clearly two different things. You can put an NLP interface on top of anything, from a chess game to a website, but that NLP interface provides nothing uniquely intelligent; it simply offers a different mode of interaction in place of a GUI or another form of interface. Similarly, you could have a very smart back-end system capable of understanding intent and engaging in a wide range of activities without an NLP interface in front of it.

So, the “chatbot” label really isn’t indicative of much. You can have a chatbot with a really poor NLP interface and you’ll have a poor overall experience. Similarly, you can have a great NLP interface capable of understanding all manner of spoken words in different accents and with different word usages and combinations, paired with a poor back-end system, and you’ll also get a poor user experience. In fact, it seems that’s where we currently stand. As we’ll explore below, we have a class of voice assistant devices that are particularly good at processing speech and interacting with humans through voice and text, but the back ends that provide the actual actions and understanding of that communication are not nearly as good as you might hope or expect them to be.

For this reason, Cognilytica thinks the term “chatbot” is of limited use. Instead of lumping two different things together, we’re splitting them apart. The conversational interface is the set of NLP capabilities for interpreting human speech in voice or text form, combined with an understanding of conversational and grammatical structure. The intelligent processing is the back-end, AI-powered system that tries to understand what you are talking about, what you are trying to achieve, and the best means to provide an answer or action based on that desire. Great conversational interfaces can be put on a wide range of systems, and indeed, we’re seeing them everywhere. Likewise, great intelligent processing can happen with or without conversational interfaces. In fact, that’s the whole thrust of the AI movement: to apply intelligence across a wide range of interactions.
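To make the split concrete, here is a minimal Python sketch of the two layers as independent, swappable pieces. Everything in it is an illustrative assumption rather than a description of how any real voice assistant is built: the `Intent` shape, the `KeywordInterface` and `CannedBackend` classes, and the `respond` helper are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


@dataclass
class Intent:
    """Hypothetical structured hand-off between the two layers."""
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


class ConversationalInterface(Protocol):
    """Front end: turns human language into a structured intent."""
    def parse(self, utterance: str) -> Intent: ...


class IntelligentBackend(Protocol):
    """Back end: decides what to answer or do for a given intent."""
    def handle(self, intent: Intent) -> str: ...


class KeywordInterface:
    """Toy front end that recognizes intents by keyword matching."""
    def parse(self, utterance: str) -> Intent:
        text = utterance.lower()
        if "weather" in text:
            return Intent("get_weather")
        return Intent("unknown", {"text": utterance})


class CannedBackend:
    """Toy back end with fixed answers; this is where the 'smarts' would live."""
    def handle(self, intent: Intent) -> str:
        if intent.name == "get_weather":
            return "It is sunny."
        return "Sorry, I don't know."


def respond(interface: ConversationalInterface,
            backend: IntelligentBackend,
            utterance: str) -> str:
    """Pipe an utterance through the interface, then through the back end."""
    return backend.handle(interface.parse(utterance))


print(respond(KeywordInterface(), CannedBackend(), "What's the weather like?"))
```

Because `respond` depends only on the two protocols, either layer can be improved or replaced without touching the other, which is the point of treating them as separate things.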

What’s the Purpose of a Voice Assistant?

Voice assistants are voice-based conversational interfaces paired with intelligent back ends. Of course, it’s valuable to ask: what is their primary use? In some cases, people are simply using these voice assistants to play music, tell them the weather, make phone calls, play games, answer basic questions, play podcasts, provide alarms, and perform other similar activities. In pretty much all of those use cases, an intelligent back end is not even needed. The conversational interface is the smart part, but we’ve been playing music, making phone calls, getting basic facts from the web, finding out the weather, and getting alarms just fine without talking to our phones or computers. All that’s being provided here is convenience. But don’t we want more from our voice assistants? Do we just want a conversational interface to our existing not-so-smart capabilities, or do we want an intelligent voice assistant that can be a useful business and personal companion?

In our most recent AI Today podcast, “Cognilytica Tests Voice Assistants”, we were surprised to see how not-so-smart the current iteration of voice assistants is. We tested Google Home, Amazon Alexa, and Apple Siri on a range of questions that tested the devices’ ability to answer factual and reasoning questions, and they all came up surprisingly short. Now, if you were expecting these voice assistants to just be conversational interfaces to regular non-intelligent back ends, then you wouldn’t be surprised that the devices couldn’t tell you how much a ton of peas weighs. But if you were expecting them to be actually intelligent, or perhaps to show off the intelligence capabilities of major AI players, then you would be extremely dissatisfied.

Introducing The Voice Assistant Benchmark

Since Cognilytica is focused on the application of AI to the practical needs of businesses, and because we believe voice assistants can be useful to those businesses, we are dissatisfied with the current state of the voice assistant market. Rather than simply complain, we’ve decided to do something about it. We’ve introduced something we’re calling the Voice Assistant Intelligence Benchmark, or more simply the Voice Assistant Benchmark (so all the words can fit into a nice graphic). The purpose of the Voice Assistant Benchmark is to accomplish the following goals:

  • Determine the underlying intelligence of voice assistant platforms
  • Identify categories of conversations and interactions that can determine intelligence capabilities
  • Provide a means for others to independently verify the benchmark
  • Encourage others to collaborate with us to continue to expand, publish, and promote benchmark findings
  • Work with or otherwise motivate technologists and vendors to improve their voice assistants with regard to intelligence capabilities

The benchmark works by identifying categories of questions and interactions with devices that aim to determine how “smart” the voice assistant is, and how well it is able to understand not just the words of the speaker, but the meaning of those words and the true intent the speaker has in mind.

This benchmark does not focus on testing the capabilities of the conversational interface itself with regard to speaker clarity, accents, volume, background noise, and other such factors. Those are matters of the conversational interface technology and determine the inputs to the NLP system. We care about what happens once the NLP has done its processing and its outputs are provided as input to an intelligent back-end system. We want to know: just how intelligent is that back end?
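As a rough illustration of the category-based approach described above, the Python sketch below tallies the fraction of questions answered correctly per assistant and category. The record layout, the pass/fail scoring, the category labels, and the placeholder assistant names are all assumptions made for this example; they are not the benchmark’s actual scoring scheme, and the sample records are not real results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Result:
    """One benchmark observation (hypothetical layout)."""
    assistant: str   # placeholder name, not a real product result
    category: str    # e.g. "factual" or "reasoning"
    question: str
    correct: bool    # assumed pass/fail judgment by a human tester


def summarize(results: Iterable[Result]) -> Dict[Tuple[str, str], float]:
    """Return the fraction answered correctly for each (assistant, category)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, asked]
    for r in results:
        key = (r.assistant, r.category)
        totals[key][0] += int(r.correct)
        totals[key][1] += 1
    return {key: correct / asked for key, (correct, asked) in totals.items()}


# Placeholder records for illustration only; these are not measured results.
sample = [
    Result("assistant_a", "reasoning", "How much does a ton of peas weigh?", False),
    Result("assistant_a", "reasoning", "Placeholder reasoning question", True),
    Result("assistant_a", "factual", "Placeholder factual question", True),
]

for (assistant, category), rate in sorted(summarize(sample).items()):
    print(f"{assistant} / {category}: {rate:.0%} correct")
```

Keeping the raw per-question records, rather than only the totals, is what would let someone else pose the same questions and compare their tallies.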

Next Steps: Engage with Us. Follow the Benchmark. Stay Tuned for More Updates.

We are excited to be a part of the solution and of the progress toward making voice assistants more useful and more intelligent, and we want you to be a part of that too. Check out the Voice Assistant Benchmark and the questions we have listed. Are we missing any important benchmark questions or categories? Are there additional ways we can and should test the intelligence of these devices? We’ll be publishing regular updates to the benchmark questions as well as updated rankings of how well the voice assistants stack up. The questions and the benchmark have been written in such a way that anyone can pose the same questions, get the same results, and come to the same conclusions. If you don’t believe our rankings, you can test them yourself! Work with us and help us make these voice assistants better, and stay tuned for more updates! Like this work? Then help us spread the word!