
Cognilytica Voice Assistant Benchmark

by rschmelzer

Below are the latest benchmark results along with the month and year each benchmark was performed. Click on a benchmark to see full details for that release, to purchase or download the benchmark report, and to view videos showing highlights of the responses generated:

Benchmark Date | Benchmark Version | Voice Assistants Tested (listed alphabetically by vendor company)
August 2019    | Version 2.0       | Amazon Alexa, Apple Siri, Google Home, Microsoft Cortana
July 2018      | Version 1.0       | Amazon Alexa, Apple Siri, Google Home, Microsoft Cortana

 

About the Voice Assistant Benchmark

As we’ve written about in previous research, much of the human brain is devoted to generating, sensing, processing, and understanding speech and communications in its various forms. Technologists have long attempted to improve human-machine interaction through conversational interfaces. AI and cognitive technologies are making natural language, conversational interfaces more realistic than ever before.

Conversational interface-based devices are quickly gaining popularity with devices and technology like Amazon Alexa, Google Home, Apple Siri, Microsoft Cortana, and an increasing number of new entrants into the space. Cognilytica calls these devices “voice assistants”, rather than the less useful term “smart speakers”. A smart speaker conjures up a primarily output-oriented device that aims to replace keyboard or button interaction with voice commands. Yet that seems a particularly trivial application given the significant investments and competitive posture of these device manufacturers.

The real play is something bigger than just a speaker you can control with your voice. The power is not in the speaker, but in the cloud-based technology that powers the device. These devices are really low-cost input and output hardware that serve as a gateway to the much more powerful infrastructure sitting in the major tech companies’ data centers. Rather than being passive devices, intelligent conversational assistants can proactively act on your behalf, performing tasks that require interaction with other humans and, perhaps soon, other conversational assistants.

Testing Cloud-based Conversational Intelligence Capabilities of Edge Voice Assistants

Voice assistants are voice-based conversational interfaces paired with intelligent cloud-based back-ends. The device itself provides basic Natural Language Processing (NLP) and Natural Language Generation (NLG) capabilities, while the cloud-based back-end provides the intelligence that gives these devices their real capabilities. Can the conversational agents understand when you’re comparing two things? Do they understand implicit, unspoken things that require common sense or cultural knowledge? For example, a conversational agent scheduling a hair appointment should know that you shouldn’t schedule a haircut a few days after your last haircut, or schedule a root canal dentist appointment right before a dinner party. These are things humans can do because we have knowledge, intelligence, and common sense.

Cognilytica is focused on the application of AI to the practical needs of businesses, and we believe voice assistants can be useful to those businesses. As such, we need to understand the current state of the voice assistant market.

What this Benchmark Aims to Test:

  • Determine the underlying intelligence of voice assistant platforms
  • Identify categories of conversations and interactions that can determine intelligence capabilities
  • Provide a means for others to independently verify the benchmark
  • Encourage others to collaborate with us to continue to expand, publish, and promote benchmark findings
  • Work with or otherwise motivate technologists and vendors to improve their voice assistants with regards to intelligence capabilities

What this Benchmark Does NOT Test:

  • Natural language processing (NLP) capabilities. We will assume that all devices are capable of understanding human speech.
  • Natural language generation (NLG) capabilities. We will assume that all devices are capable of responding to queries in spoken natural language.
  • Discernment of accents, handling differences in volume, or any other sound or speech based dynamics. We are not testing how well the microphones work, the echo cancellation systems, or speech training capabilities.
  • Range of skills or capabilities. We’re not testing whether or not the devices can book a ride share, schedule an appointment, or perform any of a myriad of other tasks. We will assume that the developer ecosystem will continue to build on that functionality with a range of capabilities.

We care about what happens when the NLP does its processing and those outputs are provided as input to an intelligent back-end system. We want to know — just how intelligent is the AI back-end?

Yes, We Know Voice Assistants Aren’t Smart… But the Bar is Moving.

Some of you might be thinking that it’s obvious these voice assistants don’t have intelligence capabilities. Surely, you’re thinking, anyone who has spent any amount of time focused on AI, NLP, or other areas of cognitive technology knows that these devices lack critical intelligence capabilities. Are we just naive to test these devices against what seems to be an obvious lack of capabilities? It doesn’t take an AI expert to know these devices aren’t intelligent – just ask any five year old who has spent any amount of time with Alexa, Siri, Google Home, or Cortana.

However, these devices are clearly intended to be more intelligent than they currently are. If you look at the example use cases from Amazon and Google, they show their devices being used not just to play music or execute skills, but as companions for critical business tasks. This requires at least a minimum level of intelligence to perform without frustrating the user.

Purpose of Benchmark: Measure the Current State of Intelligence in Voice Assistants

All the voice assistant manufacturers are continuing to iterate on the capabilities of their cloud-based AI infrastructure, which means that the level of intelligence of these devices is changing on an almost daily basis. What might have been a “dumb” question to ask a mere few months ago might now be easily addressed by the device. If you’re basing your interactions on what you thought you knew about the intelligence of these devices, your assumptions will quickly be obsolete.

If you’re building Voice-based Skills or Capabilities on Voice Assistant Platforms, you NEED to Pay Attention

If you’re an enterprise end-user building skills or capabilities on top of these devices, or a vendor building add-on capabilities, then you definitely need to know not only what these devices are capable of currently but how they are changing over time. You might be forced to build intelligence into your own skills to compensate for the vendor’s lack of capabilities in that area. Or you might spend a lot of time building that capability only to realize that the vendor is now already providing that base level of functionality. Even if you think you’re an AI expert with nothing to learn from this benchmark, you are mistaken. This is a moving industry, and today’s assumptions are tomorrow’s mistakes. Follow this benchmark as we continue to iterate.


Benchmark Methodology

The benchmark works by identifying categories of questions and interactions that aim to determine how “intelligent” the voice assistant is: how well it understands not just the words of the speaker, but the meaning of those words and the true intent behind them. The benchmark measures these capabilities by envisioning specific business use cases and then determining what base level of intelligence is needed to address those use cases. While the questions asked might be very specific to a particular piece of knowledge or information, each question is actually testing a more general capability, such as the ability to compare two things or to determine the real intent of the user asking the question.

In the current iteration of the benchmark, there are 10 questions each in 10 categories, for a total of 100 questions asked, with an additional 10 calibration questions to make sure that the benchmark setup is working as intended. The responses from the voice assistants are then categorized into one of four categories as detailed below:

Response Category | Classification Detail
Category 0        | Did not understand the question, or provided a link to a default search for the question asked, requiring the human to do all the work.
Category 1        | Provided an irrelevant or incorrect answer.
Category 2        | Provided a relevant response, but with a long list of results or a reference to an online site that requires the human to determine the proper answer. Not a default search, but a “guess” conversational response that makes the human do some of the work.
Category 3        | Provided a correct answer conversationally (did not default to a search that requires the human to do work to determine the correct answer).

THIS IS NOT A RANKING!

A benchmark is used as a way of measuring performance against an ideal metric, not as a way of ranking technology implementations against each other. The goal of this benchmark is that ALL vendors should (eventually) score ALL Threes in ALL categories. Until then, we’ll measure how well they are performing against the benchmark, not against each other!

Rather than create an absolute score for each tested voice assistant, the benchmark shows how many responses fell into each response category listed above. Category 3 responses are considered the most “intelligent” and Category 0 the least “intelligent”, with Category 1 and Category 2 being sub-optimal responses. Total scores are not as important as understanding how each voice assistant scores within a particular question category, because some voice assistants respond better to particular categories of questions than others. See the commentary for more information on our analysis of the results per category.
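
As a concrete illustration of this scoring scheme, here is a minimal Python sketch of how graded responses could be tallied into the four response categories for each question category. The question category names and grades shown are made-up examples for illustration only, not actual benchmark data or tooling.

    # Minimal sketch of the scoring scheme described above: each response is graded
    # 0-3 and the benchmark reports counts per response category for each question
    # category, rather than a single absolute score. The category names and grades
    # below are hypothetical examples, not actual benchmark results.
    from collections import Counter

    # (question_category, response_category) pairs for one voice assistant
    graded_responses = [
        ("comparison", 3), ("comparison", 2), ("comparison", 0),
        ("common sense", 1), ("common sense", 3), ("common sense", 3),
    ]

    tallies = {}  # question category -> Counter of response categories
    for question_category, response_category in graded_responses:
        tallies.setdefault(question_category, Counter())[response_category] += 1

    for question_category, counts in tallies.items():
        print(question_category, [counts[c] for c in range(4)])

A full run would tally all 100 graded responses the same way, yielding the per-category counts that are then discussed in the results commentary.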

To avoid problems relating to accents, differences in user voice, and other human-introduced errors, each question is asked using a computer-generated voice. Each benchmark result specifies which computer-generated voice was used so that others who wish to can replicate the results.
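
As an illustration of how a computer-generated voice can be produced for replication, the sketch below uses the open-source gTTS text-to-speech library to render a question as an audio file that can then be played at the device. The library choice, voice settings, and sample question are assumptions for this example, not the specific voice or tooling used in any benchmark run.

    # Minimal sketch (not the benchmark's actual tooling): render a benchmark
    # question as computer-generated speech so a run can be replicated with a
    # consistent synthetic voice. Assumes the open-source gTTS library
    # (pip install gtts); the sample question is hypothetical.
    from gtts import gTTS

    question = "Which is heavier, a pound of feathers or a pound of bricks?"
    tts = gTTS(text=question, lang="en")   # same synthetic voice settings for every run
    tts.save("question_001.mp3")           # play this file at the voice assistant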

Benchmark Frequency

Since we know that the vendors are continuing to iterate on their back-end, cloud-based AI functionality, we know that the results of a particular iteration of the benchmark test might quickly become invalid. Our intent is therefore to re-test all the vendors with the benchmark on a regular basis. The benchmark question categories and questions will change regularly to prevent vendors from “hard coding” answers to the benchmark questions.

Open, Verifiable, Transparent. Your Input Needed.

The Cognilytica Voice Assistant Benchmark is meant to be open, self-verifiable, and transparent. You should be able to independently verify each of the benchmark results we have posted here for the voice assistants we have listed. You can also test your own voice assistant based on proprietary technology, or ones not listed here. If you are interested in having a voice assistant added to our regular quarterly benchmark, please contact us.

Likewise, we are constantly iterating on the list of benchmark questions asked. We would like to get your feedback on the questions we are asking, as well as suggestions for further questions that should be asked. Please reach out to us with feedback on what we should be asking or how to modify the list of questions for an upcoming benchmark version.


