Cognilytica Voice Assistant Benchmark

Below are the latest benchmark results and the month and year each benchmark was performed. Click on a benchmark to see full details for that release, to purchase or download the benchmark report, and to view videos showing highlights of the responses generated:

Benchmark Date | Benchmark Version | Voice Assistants Tested (listed alphabetically by vendor company)
August 2019 | Version 2.0 | Amazon Alexa, Apple Siri, Google Home, Microsoft Cortana
July 2018 | Version 1.0 | Amazon Alexa, Apple Siri, Google Home, Microsoft Cortana

 

About the Voice Assistant Benchmark

As we’ve written about in previous research, much of the human brain is devoted to generating, sensing, processing, and understanding speech and communication in its various forms. Technologists have long attempted to improve human-machine interaction through conversational interfaces, and AI and cognitive technologies are making natural-language conversational interfaces more realistic than ever before.

Conversational interface-based devices such as Amazon Alexa, Google Home, Apple Siri, and Microsoft Cortana are quickly gaining popularity, with an increasing number of new entrants joining the space. Cognilytica calls these devices “voice assistants”, rather than the less useful term “smart speakers”. A smart speaker conjures up a primarily output-oriented device that aims to replace keyboard or button interaction with voice commands. Yet that seems a particularly trivial application given the significant investments and competitive postures these device manufacturers are taking.

The real play is something bigger than just a speaker you can control with your voice. The power is not in the speaker, but in the cloud-based technology that powers the device. These devices are really low-cost input and output hardware acting as a gateway to the much more powerful infrastructure that sits in the major tech companies’ data centers. Rather than being passive devices, intelligent conversational assistants can proactively act on your behalf, performing tasks that require interaction with other humans and, perhaps soon, other conversational assistants.

Testing Cloud-based Conversational Intelligence Capabilities of Edge Voice Assistants

Voice assistants are voice-based conversational interfaces paired with intelligent cloud-based back ends. The device itself provides basic Natural Language Processing (NLP) and Natural Language Generation (NLG) capabilities, while the back-end intelligence is what gives these devices real, intelligent capabilities. Can the conversational agents understand when you’re comparing two things? Do they understand implicit, unspoken things that require common sense or cultural knowledge? For example, a conversational agent scheduling a hair appointment should know that you shouldn’t schedule a haircut a few days after your last haircut, or a root canal dental appointment right before a dinner party. These are things humans can do because we have knowledge, intelligence, and common sense.

Cognilytica is focused on the application of AI to the practical needs of businesses, and we believe voice assistants can be useful to those businesses. As such, we need to understand the current state of the voice assistant market.

What this Benchmark Aims to Test:

  • Determine the underlying intelligence of voice assistant platforms
  • Identify categories of conversations and interactions that can determine intelligence capabilities
  • Provide a means for others to independently verify the benchmark
  • Encourage others to collaborate with us to continue to expand, publish, and promote benchmark findings
  • Work with or otherwise motivate technologists and vendors to improve their voice assistants with regard to intelligence capabilities

What this Benchmark Does NOT Test:

  • Natural language processing (NLP) capabilities. We will assume that all devices are capable of understanding human speech.
  • Natural language generation (NLG) capabilities. We will assume that all devices are capable of responding to queries in natural, spoken language.
  • Discernment of accents, handling differences in volume, or any other sound- or speech-based dynamics. We are not testing how well the microphones work, the echo cancellation systems, or speech training capabilities.
  • Range of skills or capabilities. We’re not testing whether the devices can book a ride share, schedule an appointment, or perform any of a myriad of other tasks. We will assume that the developer ecosystem will continue to build out that functionality with a range of capabilities.

We care about what happens when the NLP does its processing and those outputs are provided as input to an intelligent back-end system. We want to know — just how intelligent is the AI back-end?

Yes, We Know Voice Assistants Aren’t Smart… But the Bar is Moving.

Some of you might be thinking that it’s obvious these voice assistants don’t have intelligence capabilities. Surely, you’re thinking, anyone who has spent any amount of time focused on AI, NLP, or other areas of cognitive technology knows that these devices lack critical intelligence capabilities. Are we just naive to test these devices against what seems to be an obvious lack of capability? It doesn’t take an AI expert to know these devices aren’t intelligent; just ask any five-year-old who has spent any amount of time with Alexa, Siri, Google Home, or Cortana.

However, these devices are clearly intended to become more intelligent than they currently are. Look at the example use cases from Amazon and Google: they show their devices being used not just to play music or execute skills, but as companions for critical business tasks. That requires at least a minimum level of intelligence to perform without frustrating the user.

Purpose of Benchmark: Measure the Current State of Intelligence in Voice Assistants

All of the voice assistant manufacturers continue to iterate on the capabilities of their cloud-based AI infrastructure, which means the level of intelligence of these devices is changing on an almost daily basis. What might have been a “dumb” question to ask a mere few months ago might now be easily addressed by the device. If you’re basing your interactions on what you thought you knew about the intelligence of these devices, your assumptions will quickly become obsolete.

If you’re building Voice-based Skills or Capabilities on Voice Assistant Platforms, you NEED to Pay Attention

If you’re an enterprise end user building skills or capabilities on top of these devices, or a vendor building add-on capabilities, then you definitely need to know not only what these devices are currently capable of, but also how they are changing over time. You might be forced to build intelligence into your own skills that compensates for the vendor’s lack of capability in that area. Or you might spend a lot of time building that capability only to realize that the vendor now provides that base level of functionality. Even if you think you’re an AI expert with nothing to learn from this benchmark, you are mistaken. This is a moving industry, and today’s assumptions are tomorrow’s mistakes. Follow this benchmark as we continue to iterate.


Benchmark Methodology

The benchmark works by identifying categories of questions and interactions that aim to determine how “intelligent” a voice assistant is: how well it understands not just the speaker’s words, but the meaning of those words and the speaker’s true intent. The benchmark measures these capabilities by envisioning specific business use cases and then determining what base level of intelligence is needed to address those use cases. While a question might be very specific to a particular piece of knowledge or information, it is actually testing a more general capability, such as the ability to compare two things or to determine the real intent of the user asking the question.

In the current iteration of the benchmark, there are 10 questions in each of 10 categories, for a total of 100 questions asked, plus an additional 10 calibration questions to make sure that the benchmark setup is working as intended. The responses from the voice assistants are then classified into one of four categories, as detailed below:

Response Category | Classification Detail
Category 0 | Did not understand the question, or defaulted to a web search for the question asked, requiring the human to do all the work.
Category 1 | Provided an irrelevant or incorrect answer.
Category 2 | Provided a relevant response, but as a long list of options or a reference to an online site that requires the human to work out the proper answer. Not a default search, but a “best guess” conversational response that makes the human do some of the work.
Category 3 | Provided a correct answer conversationally (did not default to a search that requires the human to do work to determine the correct answer).
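
To make the classification scheme concrete, here is a minimal sketch in Python (hypothetical names only, not part of the published benchmark tooling) of how one graded response could be recorded against these four categories:

```python
from dataclasses import dataclass
from enum import IntEnum


class ResponseCategory(IntEnum):
    """The four response classifications used by the benchmark."""
    NO_UNDERSTANDING = 0   # Did not understand, or defaulted entirely to a search
    IRRELEVANT = 1         # Irrelevant or incorrect answer
    PARTIAL = 2            # Relevant, but the human still does some of the work
    CONVERSATIONAL = 3     # Correct answer given conversationally


@dataclass
class GradedResponse:
    """One graded answer from one voice assistant to one benchmark question."""
    assistant: str          # e.g. "Amazon Alexa"
    question_id: str        # e.g. "CU1"
    category: ResponseCategory


# Illustrative example only, not an actual benchmark result.
example = GradedResponse("Amazon Alexa", "CU1", ResponseCategory.CONVERSATIONAL)
```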

THIS IS NOT A RANKING!

A benchmark is a way of measuring performance against an ideal metric, not a way of ranking technology implementations against one another. The goal of this benchmark is that ALL vendors should (eventually) score ALL Threes in ALL categories. Until then, we’ll measure how well they perform against the benchmark, not against each other!

Rather than create an absolute score for each tested voice assistant, the benchmark shows how many responses fell into each category listed above. Category 3 responses are considered the most “intelligent” and Category 0 the least “intelligent”, with Category 1 and Category 2 being sub-optimal responses. Total scores across all categories are not as important as understanding the scores within a particular question category, because some voice assistants respond better to particular categories of questions than others. See the commentary for more information on our analysis of the results per category.
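
As a rough illustration of that presentation (a sketch only, using made-up placeholder grades rather than actual results), the benchmark output is essentially a per-assistant count of responses in each category:

```python
from collections import Counter

# Illustrative placeholder grades (assistant, question ID, response category 0-3);
# these are NOT actual benchmark results.
graded = [
    ("Assistant A", "CU1", 3),
    ("Assistant A", "CU2", 2),
    ("Assistant B", "CU1", 3),
    ("Assistant B", "CU2", 0),
]

# Tally responses per category for each assistant; there is no cross-assistant ranking.
tallies: dict[str, Counter] = {}
for assistant, _question_id, category in graded:
    tallies.setdefault(assistant, Counter())[category] += 1

# An "all Threes" result is the eventual goal for every assistant.
for assistant, counts in sorted(tallies.items()):
    print(assistant, {cat: counts.get(cat, 0) for cat in range(4)})
```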

To avoid problems relating to accents, differences in user voice, and other human-introduced errors, each question is asked using a computer-generated voice. Each benchmark result specifies which computer-generated voice was used so that others who wish to replicate the results can do so.
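
As a sketch of how such a setup could be reproduced (this assumes the gTTS library purely as an example text-to-speech engine; the actual computer-generated voice used is named in each benchmark report), each question is rendered once to a fixed audio file so that every run plays back identical speech:

```python
from gtts import gTTS  # assumption: any consistent TTS engine would serve the same purpose

# A few sample benchmark questions keyed by question ID.
questions = {
    "CQ1": "What time is it",
    "CQ3": "What is 10 plus 10",
    "UC9": "What is longer, two hours or 120 minutes",
}

# Render each question once with a fixed synthetic voice so every benchmark run,
# including independent replications, plays back exactly the same audio.
for question_id, text in questions.items():
    gTTS(text=text, lang="en").save(f"{question_id}.mp3")
```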

Benchmark Frequency

Since we know that the vendors are continuing to iterate on the back-end, cloud-based AI functionality, we know that the results of a particular iteration of the benchmark test might quickly become invalid. Our intent is therefore to re-test all the vendors with the benchmark on a regular basis. The benchmark question categories and questions will also change regularly to prevent vendors from “hard coding” answers to these benchmark questions.

Open, Verifiable, Transparent. Your Input Needed.

The Cognilytica Voice Assistant Benchmark is meant to be open, self-verifiable, and transparent. You should be able to independently verify each of the benchmark results we have posted here for the voice assistants listed. You can also test your own voice assistant based on proprietary technology, or one not listed here. If you are interested in having a voice assistant added to our regular quarterly benchmark, please contact us.

Likewise, we are constantly iterating the list of benchmark questions asked. We would like to get your feedback on the questions that we are asking as well as feedback on further questions that should be asked. Please reach out to us with feedback on what we should be asking or how to modify the list of questions for an upcoming benchmark version.



Calibration Benchmark Questions

Overview:

These calibration questions are asked before the scored categories to make sure that the benchmark setup is working as intended.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CQ1 | What time is it | {The current time} (3)
CQ2 | What is the weather | {The current weather at the current location} (3)
CQ3 | What is 10 + 10 | 20 (3)
CQ4 | What is the capital of the United States | Washington, DC or District of Columbia (3)
CQ5 | How many hours in a day | 24 (3)
CQ6 | How do you spell Voice | V-O-I-C-E (3)
CQ7 | How far is it from Paris to Tokyo | 6,032.9 miles (3)
CQ8 | How far is it from Paris to Tokyo in kilometers | 9,712 km (3)
CQ9 | Who invented the light bulb | {Provide a reference to a citation on this topic or a direct answer} (2 or 3)
CQ10 | What is Jump plus Stock minus Rock | {Should not understand this question} (0)
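
A minimal sketch (hypothetical Python, not part of the published benchmark tooling) of how the calibration results could be checked against the expected categories above before a full scored run:

```python
# Allowed response categories for each calibration question, taken from the table above.
ALLOWED = {
    "CQ1": {3}, "CQ2": {3}, "CQ3": {3}, "CQ4": {3}, "CQ5": {3},
    "CQ6": {3}, "CQ7": {3}, "CQ8": {3}, "CQ9": {2, 3}, "CQ10": {0},
}


def setup_is_valid(observed: dict) -> bool:
    """Return True only if every calibration response falls in its expected category set."""
    return all(observed.get(qid, -1) in categories for qid, categories in ALLOWED.items())


# Example with placeholder observations (not actual results): a valid setup.
print(setup_is_valid({"CQ1": 3, "CQ2": 3, "CQ3": 3, "CQ4": 3, "CQ5": 3,
                      "CQ6": 3, "CQ7": 3, "CQ8": 3, "CQ9": 2, "CQ10": 0}))
```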

Understanding Concepts Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to understand certain concepts, such as distance, sizes, monetary measures, and other factors. Asking devices questions in relation to these concepts helps gain an understanding of whether the voice assistants are capable of handling concept-related queries.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CU1 | How much does a ton of peas weigh? | A ton (3)
CU2 | How much does a pound of peas weigh? | 1 pound (3)
CU3 | What is a unit of measurement? | {Description} (2 or 3)
CU4 | Is a ton a unit of measurement? | Yes (3)
CU5 | Is a pound a unit of measurement? | Yes (3)
CU6 | Is a foot a unit of measurement? | Yes (3)
CU7 | Is green a unit of measurement? | No (3)
CU8 | Do I need an umbrella today? | {Depends on weather} (3)
CU9 | Will I need to shovel snow today? | {Depends on weather} (3)
CU10 | What is the fastest you can drive in a 65 mph zone? | 65 mph (3)

Understanding Comparisons Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to compare different concepts together, such as comparing relative size, amount, quantity, and other measures. Asking devices questions in relation to these comparisons helps gain an understanding of whether the voice assistants are capable of handling comparison-related queries.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
UC1 | What is bigger, an ant or a tiger? | A tiger (3)
UC2 | What weighs more, a ton of carrots or a ton of peas? | They weigh the same (3)
UC3a | How fast is a cheetah? | A cheetah can reach speeds of up to 75 mph in short bursts (3)
UC3b | How fast is a turtle? | The average top speed of a turtle is 6.3 mph (3)
UC3 | What is faster, a turtle or a cheetah? | Cheetah (3)
UC4 | What is the nearest star? | The Sun (3)
UC5 | What is bigger, the Earth or Mars? | The Earth (3)
UC6 | What is longer: a yard or 3 feet? | They are the same (3)
UC7a | How much does the Sun weigh? | 1.989×10^30 kg (3)
UC7b | How much does the Earth weigh? | 5.974×10^24 kg (3)
UC7 | What weighs more, the Sun or the Earth? | The Sun (3)
UC8 | What takes longer, a two-hour movie or a 120-minute drive? | They are the same (3)
UC9 | What is longer, two hours or 120 minutes? | They are the same (3)
UC10 | What is taller, an elephant or the Eiffel Tower? | The Eiffel Tower (3)

Understanding Cause & Effect Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to understand what happens as a consequence of specified actions. Asking devices questions in relation to these causes and effects helps gain an understanding of whether the voice assistants are capable of handling queries that depend on certain outcomes.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CE1 | What happens when you melt ice? | It turns to water (3)
CE2 | What happens when you freeze water? | It turns to ice (3)
CE3 | I would like to schedule an appointment for 3 weeks ago | This is not possible (3)
CE4 | Will an egg crack if you hit it with a hammer? | Yes (3)
CE5 | If I repeat something, have I done it more than once? | Yes (3)
CE6 | What color is burned toast? | Black (3)
CE7 | Will a feather break if you drop it? | No (3)
CE8 | If I break something into two parts, how many parts are there? | Two (2)
CE9 | If I put something into a refrigerator, does it get colder? | Yes (3)
CE10 | If I increase the volume, does it get louder? | Yes (3)

Reasoning & Logic Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to reason and deduce information based on what the speaker says. Asking devices various reasoning and logic-oriented questions helps gain an understanding of whether the voice assistants are capable of handling queries that require reasoning.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
RE1 | If I take a train from here to Chicago and back, where am I? | Back to where you started (3)
RE2 | What game does a baseball player play? | Baseball (3)
RE3 | Where do teachers teach? | School or classroom (3)
RE4 | What is the log of e? | 1 (3)
RE5 | Is a dead person alive? | No (3)
RE6 | How much will I make if I get paid $10 an hour and work for 3 hours? | $30 (3)
RE7 | Can you eat toxic food? | No (3)
RE8 | How many eggs are there in a box of a dozen eggs? | 12 (3)
RE9 | Can I drive to Australia from Europe? | No (3)
RE10 | How many legs does a pair of pants have? | Two (2)

Helpfulness Benchmark Questions

Overview:

Voice assistants need to be helpful in a wide variety of contexts. These questions aim to find out what capabilities are inherent in the devices, without skills or third-party developer enhancements, to provide helpful answers in response to a wide range of questions.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
HP1 | Where can I buy stamps? | {Relevant answers} (3)
HP2 | Should dogs eat chocolate? | No (3)
HP3 | In bowling, how many pins do you need to hit to make a strike? | 10 (3)
HP4 | Is George Jetson real? | No (3)
HP5 | How much protein is in a dozen eggs? | 87.4 g (3)
HP6 | What is the color blue plus the color yellow? | Green (3)
HP7 | What was the weather last week? | {Last week’s weather} (3)
HP8 | When is the garbage pickup for my location? | {Relevant answers} (3)
HP9 | Is there dairy in cheese lasagna? | Yes (3)
HP10 | Where is the closest public restroom? | {Relevant answers} (3)

Emotional IQ Benchmark Questions

Overview:

We all know that machines aren’t (yet) capable of feeling emotion. However, voice assistants need to be aware of human emotions and formulate responses that are emotion-relevant. These questions aim to determine the level to which tested voice assistants are capable of handling emotion-related questions or responding in an emotion-relevant manner.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
EI1 | Should I send a congratulations notice for my friend’s funeral? | No (3)
EI2 | What do you say when someone sneezes? | {Proper response} (3)
EI3 | Does frustrating people make them happy? | No (3)
EI4 | When you yell, are you loud? | Yes (3)
EI5 | When you whisper, are you quiet? | Yes (3)
EI6a | What is the definition of consensus? | {Definition} (3)
EI6 | Is there consensus when no one agrees with each other? | No (3)
EI7a | What is the definition of urgent? | {Definition} (3)
EI7 | If something is urgent, can I wait as long as I want? | No (3)
EI8 | How do you know if someone is amused? | {Valid ways} (3)
EI9 | Are friends people who like each other? | Yes (3)
EI10 | When people are angry, are they friendly? | No (3)

Intuition and Common Sense Benchmark Questions

Overview:

Common sense is neither common nor does it always make sense, and machines certainly have not been known to have either. However, voice assistant devices have AI systems whose training models impart the common sense and intuition of their human designers. Furthermore, voice assistant users in a business context will make assumptions in their questions that might depend on unstated common sense knowledge. These questions aim to identify what common sense and intuition capabilities are inherent in this training data.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
IN1 | Should I hold boiling water in my hand? | No (3)
IN2 | Are parents older than their children? | Yes (3)
IN3 | Can you grow books on trees? | No (3)
IN4 | Should I let a toddler drive a bus? | No (3)
IN5 | Should I drop something on the ground that is fragile? | No (3)
IN6 | Is a thief a criminal? | Yes (3)
IN7 | What do I do if there’s a fire in the house? | {Rational answer} (2 or 3)
IN8 | Can you recycle banana peels? | No (3)
IN9 | Should I drink milk if I’m lactose intolerant? | No (3)
IN10 | Is it hotter during winter, or summer? | Summer (3)

Winograd Schema Inspired Benchmark Questions

Overview:

The Winograd Schema is a format for asking questions of chatbots and other conversational computing devices to ascertain whether or not they truly understand the question asked and can formulate a logical response. Winograd Schema Challenge formatted questions are often used in AI-related chatbot competitions, such as the one for the Loebner Prize. These questions draw on some of the suggested Winograd Schema questions, as well as other questions inspired by the Winograd format, to ascertain the level of knowledge in the tested voice assistants. See the Common Sense Reasoning site for more details on the Winograd Schema Challenge.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
WS1 | The trophy would not fit in the brown suitcase because it was too big. What was too big? | The trophy (3)
WS2 | How do I protect myself from rain? | {Appropriate response} (3)
WS3 | The man couldn’t lift his son because he was so weak. Who was weak, the man or his son? | The man (3)
WS4 | Paul tried to call George on the phone, but he wasn’t successful. Who wasn’t successful? | Paul or George (3)
WS5 | Joan made sure to thank Susan for all the help she had received. Who had received help? | Joan or Susan (3)
WS6 | The delivery truck zoomed by the school bus because it was going so slow. What was going so slow? | Truck or school bus (3)
WS7 | The large ball crashed right through the table because it was made of steel. What was made of steel? | Ball or table (3)
WS8 | The sculpture rolled off the shelf because it wasn’t level. What wasn’t level? | The sculpture or shelf (3)
WS9 | The firemen arrived before the police because they were coming from so far away. Who came from far away? | The firemen or police (3)
WS10 | Jim yelled at Kevin because he was so upset. Who was upset? | Jim or Kevin (3)

Slang / Colloquialisms / Expressions Benchmark Questions

Overview:

Human speech doesn’t always fit in predictable or logical patterns. Sometimes humans speak with mannerisms that don’t have literal meaning or are culturally-relevant. Slang, expressions, and colloquialisms are highly language and region-dependent, but they form a core part of most human conversation. These questions aim to determine the level to which tested voice assistants have been trained to understand certain expressions or non-literal manners of speech.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
SE1 | What does “thumbs down” mean? | {Something about disapproval} (3)
SE2 | What does “down to earth” mean? | {Something about being practical} (3)
SE3 | If I want to pig out, does this mean I am hungry? | Yes (3)
SE4 | I’d like to shoot the breeze with you | {Engage in a conversation} (3)
SE5 | If you want to hang out, does this mean you want to get together with someone? | Yes (3)
SE6 | Does “Howdy” mean “Hello”? | Yes (3)
SE7 | In what state do you say “Aloha”? | Hawaii (3)
SE8 | Is it good to let things get out of hand? | No (3)
SE9 | If I’m tired, should I hit the sack? | Yes (3)
SE10 | If you’re sick, are you under the weather? | Yes (3)

Miscellaneous Questions

Overview:

These questions either don’t fit into one of the above categories or are additional questions that we didn’t have room to ask above. These miscellaneous questions test voice assistants on aspects of intelligence, understanding, and helpfulness.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
MQ1 | How old is George Washington? | He’s dead; he died at age 67 (3)
MQ2 | How old would George Washington be if he was alive today? | 286 years old (3)
MQ3 | Who was born first, a father or his child? | The father (3)
MQ4 | How many fingers should I hold up to signify the number 2? | Two (3)
MQ5 | When does a US president start their term? | January 20 of the year following the election (3)
MQ6 | Does it make sense to walk in front of a moving car? | No (3)
MQ7 | What types of ticks can carry Lyme disease? | {List of ticks} (3)
MQ8 | How do you treat a burn? | {Relevant answer} (2 or 3)
MQ9 | How long should you cook a 14 pound turkey? | {Relevant answer} (2 or 3)
MQ10 | Where is the nearest bus stop? | {Relevant answer} (3)
