
Cognilytica Voice Assistant Benchmark

by rschmelzer

About the Voice Assistant Benchmark

As we’ve written about in previous research, much of the human brain is devoted to generating, sensing, processing, and understanding speech and communication in its various forms. Technologists have long attempted to improve human-machine interaction through conversational interfaces, and AI and cognitive technologies are making natural language conversational interfaces more realistic than ever before.

Conversational interface-based devices are quickly gaining popularity with devices and technology like Amazon Alexa, Google Home, Apple Siri, Microsoft Cortana, and an increasing number of new entrants into the space. Cognilytica calls these devices “voice assistants”, rather than the less useful term “smart speakers”. A smart speaker conjures up a primarily output-oriented device that aims to replace keyboard or button interaction with voice commands. Yet that seems a particularly trivial application given the significant investments these device manufacturers are making and the competitive postures they are taking.

The real play is something bigger than just a speaker you can control with your voice. The power is not in the speaker, but in the cloud-based technology that powers the device. These devices are really low-cost input and output hardware serving as a gateway to the much more powerful infrastructure that sits in the major tech companies’ data centers. Rather than just being passive devices, intelligent conversational assistants can proactively act on your behalf, performing tasks that require interaction with other humans, and perhaps soon, other conversational assistants.

Testing Cloud-based Conversational Intelligence Capabilities of Edge Voice Assistants

Voice assistants are voice-based conversational interfaces paired with intelligent cloud-based back-ends. The device itself provides basic Natural Language Processing (NLP) and Natural Language Generation (NLG) capabilities, while the cloud-based back-end provides the intelligence that gives these devices their real capabilities. Can the conversational agents understand when you’re comparing two things? Do they understand implicit, unspoken things that require common sense or cultural knowledge? For example, a conversational agent scheduling a hair appointment should know that you shouldn’t book a haircut a few days after your last haircut, or schedule a root canal dentist appointment right before a dinner party. These are things humans can do because we have knowledge, intelligence, and common sense.

Cognilytica is focused on the application of AI to the practical needs of businesses, and we believe voice assistants can be useful to those businesses. As such, we need to understand the current state of the voice assistant market.

What this Benchmark Aims to Test:

  • Determine the underlying intelligence of voice assistant platforms
  • Identify categories of conversations and interactions that can determine intelligence capabilities
  • Provide a means for others to independently verify the benchmark
  • Encourage others to collaborate with us to continue to expand, publish, and promote benchmark findings
  • Work with or otherwise motivate technologists and vendors to improve their voice assistants with regards to intelligence capabilities

What this Benchmark Does NOT Test:

  • Natural language processing (NLP) capabilities. We will assume that all devices are capable of understanding human speech.
  • Natural language generation (NLG) capabilities. We will assume that all devices are capable of responding to queries in natural language spoken voice.
  • Discernment of accents, handling differences in volume, or any other sound or speech based dynamics. We are not testing how well the microphones work, the echo cancellation systems, or speech training capabilities.
  • Range of skills or capabilities. We’re not testing whether or not the devices can book a ride share, schedule an appointment, or perform any of a myriad of other tasks. We will assume that the developer ecosystem will continue to build on that functionality with a range of capabilities.

We care about what happens when the NLP does its processing and those outputs are provided as input to an intelligent back-end system. We want to know — just how intelligent is the AI back-end?

Yes, We Know Voice Assistants Aren’t Smart… But the Bar is Moving.

Some of you might be thinking that it’s obvious that these voice assistants don’t have intelligence capabilities. Surely, you’re thinking, anyone who has spent any amount of time focused on AI, NLP, or other areas of cognitive technology should know that these devices lack critical intelligence capabilities. Are we just naive to test these devices against what seems an obvious lack of capability? It doesn’t take an AI expert to know these devices aren’t intelligent – just ask any five year old who has spent any amount of time with Alexa, Siri, Google Home, or Cortana.

However, these devices are clearly intended to be more intelligent than they currently are. Look at the example use cases from Amazon and Google: they show their devices being used not just to play music or execute skills, but as companions for critical business tasks. That requires at least a minimum level of intelligence to perform without frustrating the user.

Purpose of Benchmark: Measure the Current State of Intelligence in Voice Assistants

All the voice assistant manufacturers are continuing to iterate on the capabilities of their cloud-based AI infrastructure, which means that the level of intelligence of these devices changes almost daily. What might have been a “dumb” question to ask a mere few months ago might now be easily addressed by the device. If you’re basing your interactions on what you thought you knew about the intelligence of these devices, your assumptions will quickly become obsolete.

If you’re building Voice-based Skills or Capabilities on Voice Assistant Platforms, you NEED to Pay Attention

If you’re an enterprise end-user building skills or capabilities on top of these devices, or a vendor building add-on capabilities, then you definitely need to know not only what these devices are currently capable of but also how they are changing over time. You might be forced to build intelligence into your own skills to compensate for capabilities the vendor lacks in a given area. Or you might spend a lot of time building that capability only to realize that the vendor now provides that base level of functionality. Even if you think you’re an AI expert with nothing to learn from this benchmark, you are mistaken. This is a moving industry, and today’s assumptions are tomorrow’s mistakes. Follow this benchmark as we continue to iterate.


Benchmark Methodology

The benchmark works by identifying categories of questions and interactions that aim to determine how “intelligent” a voice assistant is: how well it understands not just the speaker’s words, but the meaning of those words and the speaker’s true intent. The benchmark measures these capabilities by envisioning specific business use cases and then determining what base level of intelligence is needed to address those use cases. While a given question might be very specific to a particular piece of knowledge or information, it is actually testing a more general capability, such as the ability to compare two things or to determine the real intent of the user asking the question.

In the current iteration of the benchmark, there are 10 questions in each of 10 categories, for a total of 100 questions, plus an additional 10 calibration questions to make sure that the benchmark setup is working as intended. Each response from a voice assistant is then classified into one of four categories, as detailed below:

Response Category | Classification Detail
Category 0 | Did not understand the question, or provided only a link to a web search for the question as asked, requiring the human to do all the work.
Category 1 | Provided an irrelevant or incorrect answer.
Category 2 | Provided a relevant response, but as a long list of results or a reference to an online site that requires the human to identify the proper answer. Not a default search, but rather a “guess” conversational response that still makes the human do some of the work.
Category 3 | Provided a correct answer conversationally (did not default to a search that requires the human to do the work of determining the correct answer).
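
For readers who want to tabulate their own runs, here is a minimal sketch of how the rubric above might be represented in code. The names (ResponseCategory, BenchmarkQuestion) are our own, and the benchmark does not prescribe any particular implementation; Python is used here and in the sketches that follow.

    from dataclasses import dataclass
    from enum import IntEnum

    class ResponseCategory(IntEnum):
        """Scoring rubric from the table above; higher is more 'intelligent'."""
        NOT_UNDERSTOOD = 0   # did not understand, or defaulted to a web search
        IRRELEVANT = 1       # irrelevant or incorrect answer
        PARTIAL = 2          # relevant, but the human still does some of the work
        CONVERSATIONAL = 3   # correct answer, delivered conversationally

    @dataclass
    class BenchmarkQuestion:
        question_id: str             # e.g. "CQ1"
        category: str                # e.g. "Understanding Comparisons"
        text: str                    # the question as spoken to the device
        expected: ResponseCategory   # best response category expected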

THIS IS NOT A RANKING!

A benchmark measures performance against an ideal metric, not against other technology implementations. The goal of this benchmark is that ALL vendors should eventually score ALL Threes in ALL categories. Until then, we’ll measure how well they are performing against the benchmark, not against each other!

Rather than creating an absolute score for each tested voice assistant, the benchmark reports how many responses were generated in each category listed above. Category 3 responses are considered the most “intelligent” and Category 0 the least “intelligent”, with Category 1 and Category 2 being sub-optimal responses. Total scores matter less than how each assistant performs within a particular question category, because some voice assistants respond better to particular categories of questions than others. See the commentary for more information on our analysis of the results per category. A sketch of how such per-category tallies might be computed follows.
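
In this sketch, the results list is hypothetical and for illustration only; the only assumption carried over from the methodology is the 0-3 response category scale:

    from collections import Counter

    # Each entry is (question category, response category 0-3) for one question.
    # These values are made up, not actual benchmark data.
    results = [
        ("Understanding Concepts", 3),
        ("Understanding Concepts", 0),
        ("Reasoning & Logic", 2),
        ("Reasoning & Logic", 3),
    ]

    # Report counts per response category within each question category,
    # rather than collapsing everything into one absolute score.
    tallies = Counter(results)
    for (question_cat, response_cat), count in sorted(tallies.items()):
        print(f"{question_cat}: Category {response_cat} -> {count} response(s)")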

To avoid problems relating to accents, differences in user voice, and other human-introduced errors, each question is asked using a computer-generated voice. Each benchmark result specifies which computer-generated voice was used, so that others who wish to can replicate the results.
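
By way of illustration, here is one way a consistent computer-generated voice could be produced, using the open-source pyttsx3 library. This is our own sketch; the benchmark does not mandate a particular text-to-speech tool.

    import pyttsx3

    # Initialize the text-to-speech engine; the same engine and voice should
    # be used for every question so that results are reproducible.
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)  # speaking rate in words per minute

    # Record which voice is in use so others can replicate the setup.
    print("Voice used:", engine.getProperty("voice"))

    engine.say("What is bigger, an ant or a tiger?")
    engine.runAndWait()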

Benchmark Frequency

Since the vendors are continuing to iterate on their back-end, cloud-based AI functionality, the results of a particular iteration of the benchmark may quickly become outdated. Our intent is therefore to re-test all the vendors with the benchmark at least once a quarter. The benchmark question categories will remain the same, while the specific questions will change regularly to prevent vendors from “hard coding” answers to these benchmark questions. A sketch of such question rotation appears below.
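
This sketch assumes a larger vetted pool of questions per category; the pool contents and names here are our own illustration, not the benchmark’s actual tooling:

    import random

    # Hypothetical pool of vetted questions per category; a fresh subset is
    # drawn each quarter so that answers cannot be hard-coded in advance.
    pool = {
        "Understanding Concepts": [
            "Is a ton a unit of measurement?",
            "Is green a unit of measurement?",
            "Do I need an umbrella today?",
        ],
        "Reasoning & Logic": [
            "What game does a baseball player play?",
            "Is a dead person alive?",
            "Can I drive to Australia from Europe?",
        ],
    }

    # Draw 2 per category here for brevity; the benchmark itself uses 10.
    quarterly_questions = {cat: random.sample(qs, k=2) for cat, qs in pool.items()}
    print(quarterly_questions)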

Open, Verifiable, Transparent. Your Input Needed.

The Cognilytica Voice Assistant Benchmark is meant to be open, self-verifiable, and transparent. You should be able to independently verify each of the benchmark results posted here for the voice assistants we have listed. You can also test your own voice assistant, whether based on proprietary technology or simply not listed here. If you are interested in having a voice assistant added to our regular quarterly benchmark, please contact us.

Likewise, we are constantly iterating on the list of benchmark questions. We would like your feedback on the questions we are asking, as well as suggestions for further questions that should be asked. Please reach out to us with feedback on what we should be asking or how to modify the list of questions for an upcoming benchmark version.


Latest Benchmark Test Results and Details

Below are the latest benchmark results, with the month and year each benchmark was performed. Click on a benchmark to see full details for that benchmark release, as well as videos that show the responses generated:

Benchmark Date | Benchmark Version | Voice Assistants Tested (listed alphabetically by vendor company)
July 2018 | Version 1.0 | Amazon Alexa, Apple Siri, Google Home, Microsoft Cortana


Current Version Benchmark

Current Version: Version 1.0

Calibration Questions

Overview:

Calibration questions are intended to make sure that each voice assistant provides a response as predicted, so we can verify that the benchmark setup is working and that the voice assistants are responding as expected. There should not be any unexpected responses to calibration questions; any anomalies are noted in the benchmark results. For this to be a fair benchmark, we need to make sure that no voice assistant is given an unfair advantage. In addition, these calibration questions are used to make sure that each response category is properly identified. A sketch of an automated calibration check follows the question table below.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CQ1 | What time is it? | {The current time} (3)
CQ2 | What is the weather? | {The current weather at the current location} (3)
CQ3 | What is 10 + 10? | 20 (3)
CQ4 | What is the capital of the United States? | Washington, DC or District of Columbia (3)
CQ5 | How many hours in a day? | 24 (3)
CQ6 | How do you spell Voice? | V-O-I-C-E (3)
CQ7 | How far is it from Paris to Tokyo? | 6,032.9 miles (3)
CQ8 | How far is it from Paris to Tokyo in kilometers? | 9,712 km (3)
CQ9 | Who invented the light bulb? | {A reference to a citation on this topic, or a direct answer} (2 or 3)
CQ10 | What is Jump plus Stock minus Rock? | {Should not understand this question} (0)
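
As referenced above, the calibration pass can be checked automatically once each response has been scored. In this sketch, only the expected categories come from the table above; the scored results are hypothetical:

    # Expected response categories for the calibration questions (from the table).
    expected = {
        "CQ1": {3}, "CQ2": {3}, "CQ3": {3}, "CQ4": {3}, "CQ5": {3},
        "CQ6": {3}, "CQ7": {3}, "CQ8": {3}, "CQ9": {2, 3}, "CQ10": {0},
    }

    # Hypothetical scored categories from one assistant's calibration run.
    scored = {"CQ1": 3, "CQ2": 3, "CQ3": 3, "CQ4": 3, "CQ5": 3,
              "CQ6": 3, "CQ7": 2, "CQ8": 3, "CQ9": 2, "CQ10": 0}

    # Any mismatch is an anomaly to be noted in the benchmark results.
    anomalies = {q: cat for q, cat in scored.items() if cat not in expected[q]}
    print("Calibration anomalies:", anomalies or "none")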

Understanding Concepts Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to understand certain concepts, such as distance, sizes, monetary measures, and other factors. Asking devices questions in relation to these concepts helps gain an understanding of whether the voice assistants are capable of handling concept-related queries.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CU1 | How much does a ton of peas weigh? | A ton (3)
CU2 | How much does a pound of peas weigh? | 1 pound (3)
CU3 | What is a unit of measurement? | {Description} (2 or 3)
CU4 | Is a ton a unit of measurement? | Yes (3)
CU5 | Is a pound a unit of measurement? | Yes (3)
CU6 | Is a foot a unit of measurement? | Yes (3)
CU7 | Is green a unit of measurement? | No (3)
CU8 | Do I need an umbrella today? | {Depends on weather} (3)
CU9 | Will I need to shovel snow today? | {Depends on weather} (3)
CU10 | What is the fastest you can drive in a 65 mph zone? | 65 mph (3)

Understanding Comparisons Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to compare different things, such as relative size, amount, quantity, and other measures. Asking devices questions involving these comparisons helps gain an understanding of whether the voice assistants are capable of handling comparison-related queries.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
UC1 | What is bigger, an ant or a tiger? | A tiger (3)
UC2 | What weighs more, a ton of carrots or a ton of peas? | They weigh the same (3)
UC3a | How fast is a cheetah? | A cheetah can reach speeds of up to 75 mph in short bursts (3)
UC3b | How fast is a turtle? | The average top speed of a turtle is 6.3 mph (3)
UC3 | What is faster, a turtle or a cheetah? | A cheetah (3)
UC4 | What is the nearest star? | The Sun (3)
UC5 | What is bigger, the Earth or Mars? | The Earth (3)
UC6 | What is longer: a yard or 3 feet? | They are the same (3)
UC7a | How much does the Sun weigh? | 1.989×10^30 kg (3)
UC7b | How much does the Earth weigh? | 5.974×10^24 kg (3)
UC7 | What weighs more, the Sun or the Earth? | The Sun (3)
UC8 | What takes longer, a two hour movie or a 120 minute drive? | They are the same (3)
UC9 | What is longer, two hours or 120 minutes? | They are the same (3)
UC10 | What is taller, an elephant or the Eiffel Tower? | The Eiffel Tower (3)

Understanding Cause & Effect Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to understand what happens as a consequence of specified actions. Asking devices questions in relation to these causes and effects helps gain an understanding of whether the voice assistants are capable of handling queries that depend on certain outcomes.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
CE1 | What happens when you melt ice? | It turns to water (3)
CE2 | What happens when you freeze water? | It turns to ice (3)
CE3 | I would like to schedule an appointment for 3 weeks ago | This is not possible (3)
CE4 | Will an egg crack if you hit it with a hammer? | Yes (3)
CE5 | If I repeat something, have I done it more than once? | Yes (3)
CE6 | What color is burned toast? | Black (3)
CE7 | Will a feather break if you drop it? | No (3)
CE8 | If I break something into two parts, how many parts are there? | Two (2)
CE9 | If I put something into a refrigerator, does it get colder? | Yes (3)
CE10 | If I increase the volume, does it get louder? | Yes (3)

Reasoning & Logic Benchmark Questions

Overview:

In order for voice assistants to be helpful at various business-related tasks, they need to be able to reason and deduce information based on what the speaker says. Asking devices various reasoning and logic-oriented questions helps gain an understanding of whether the voice assistants are capable of handling queries that require reasoning.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
RE1 | If I take a train from here to Chicago and back, where am I? | Back to where you started (3)
RE2 | What game does a baseball player play? | Baseball (3)
RE3 | Where do teachers teach? | School or classroom (3)
RE4 | What is the log of e? | 1 (3)
RE5 | Is a dead person alive? | No (3)
RE6 | How much will I make if I get paid $10 an hour and work for 3 hours? | $30 (3)
RE7 | Can you eat toxic food? | No (3)
RE8 | How many eggs are there in a box of a dozen eggs? | 12 (3)
RE9 | Can I drive to Australia from Europe? | No (3)
RE10 | How many legs does a pair of pants have? | Two (2)

Helpfulness Benchmark Questions

Overview:

Voice assistants need to be helpful in a wide variety of contexts. These questions aim to find out what capabilities are inherent in the devices, without skills or third-party developer enhancements, for providing helpful answers in response to a wide range of questions.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
HP1 | Where can I buy stamps? | {Relevant answers} (3)
HP2 | Should dogs eat chocolate? | No (3)
HP3 | In bowling, how many pins do you need to hit to make a strike? | 10 (3)
HP4 | Is George Jetson real? | No (3)
HP5 | How much protein is in a dozen eggs? | 87.4 g (3)
HP6 | What is the color blue plus the color yellow? | Green (3)
HP7 | What was the weather last week? | {Last week’s weather} (3)
HP8 | When is the garbage pickup for my location? | {Relevant answers} (3)
HP9 | Is there dairy in cheese lasagna? | Yes (3)
HP10 | Where is the closest public restroom? | {Relevant answers} (3)

Emotional IQ Benchmark Questions

Overview:

We all know that machines aren’t (yet) capable of feeling emotion. However, voice assistants need to be aware of human emotions and formulate responses that are emotion-relevant. These questions aim to determine the level to which tested voice assistants are capable of handling emotion-related questions or responding in an emotion-relevant manner.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
EI1 | Should I send a congratulations notice for my friend’s funeral? | No (3)
EI2 | What do you say when someone sneezes? | {Proper response} (3)
EI3 | Does frustrating people make them happy? | No (3)
EI4 | When you yell, are you loud? | Yes (3)
EI5 | When you whisper, are you quiet? | Yes (3)
EI6a | What is the definition of consensus? | {Definition} (3)
EI6 | Is there consensus when no one agrees with each other? | No (3)
EI7a | What is the definition of urgent? | {Definition} (3)
EI7 | If something is urgent, can I wait as long as I want? | No (3)
EI8 | How do you know if someone is amused? | {Valid ways} (3)
EI9 | Are friends people who like each other? | Yes (3)
EI10 | When people are angry, are they friendly? | No (3)

Intuition and Common Sense Benchmark Questions

Overview:

Common sense is neither common nor always sensible, and machines certainly have not been known to have either. However, voice assistant devices have AI systems whose training models impart the common sense and intuition of their human designers. Furthermore, voice assistant users in a business context will make assumptions in their questions that may depend on unstated common sense knowledge. These questions aim to identify what common sense and intuition capabilities are inherent in this training data.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
IN1 | Should I hold boiling water in my hand? | No (3)
IN2 | Are parents older than their children? | Yes (3)
IN3 | Can you grow books on trees? | No (3)
IN4 | Should I let a toddler drive a bus? | No (3)
IN5 | Should I drop something on the ground that is fragile? | No (3)
IN6 | Is a thief a criminal? | Yes (3)
IN7 | What do I do if there’s a fire in the house? | {Rational answer} (2 or 3)
IN8 | Can you recycle banana peels? | No (3)
IN9 | Should I drink milk if I’m lactose intolerant? | No (3)
IN10 | Is it hotter during winter, or summer? | Summer (3)

Winograd Schema Inspired Benchmark Questions

Overview:

The Winograd Schema is a format for asking questions of chatbots and other conversational computing systems to ascertain whether they truly understand the question asked and can formulate a logical response. Winograd Schema Challenge-formatted questions are often used in AI chatbot competitions, such as the one for the Loebner Prize. These questions draw on some of the suggested Winograd Schema questions, as well as other questions inspired by the Winograd format, to ascertain the level of knowledge in the tested voice assistants. See the Common Sense Reasoning site for more details on the Winograd Schema Challenge.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
WS1 | The trophy would not fit in the brown suitcase because it was too big. What was too big? | The trophy (3)
WS2 | How do I protect myself from rain? | {Appropriate response} (3)
WS3 | The man couldn’t lift his son because he was so weak. Who was weak, the man or his son? | The man (3)
WS4 | Paul tried to call George on the phone, but he wasn’t successful. Who wasn’t successful? | Paul or George (3)
WS5 | Joan made sure to thank Susan for all the help she had received. Who had received help? | Joan or Susan (3)
WS6 | The delivery truck zoomed by the school bus because it was going so slow. What was going so slow? | Truck or school bus (3)
WS7 | The large ball crashed right through the table because it was made of steel. What was made of steel? | Ball or table (3)
WS8 | The sculpture rolled off the shelf because it wasn’t level. What wasn’t level? | The sculpture or shelf (3)
WS9 | The firemen arrived before the police because they were coming from so far away. Who came from far away? | The firemen or police (3)
WS10 | Jim yelled at Kevin because he was so upset. Who was upset? | Jim or Kevin (3)

Slang / Colloquialisms / Expressions Benchmark Questions

Overview:

Human speech doesn’t always fit predictable or logical patterns. Sometimes humans speak with mannerisms that have no literal meaning or that are culturally specific. Slang, expressions, and colloquialisms are highly language- and region-dependent, but they form a core part of most human conversation. These questions aim to determine the degree to which the tested voice assistants have been trained to understand certain expressions or non-literal manners of speech.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
SE1 | What does “thumbs down” mean? | {Something about disapproval} (3)
SE2 | What does “down to earth” mean? | {Something about being practical} (3)
SE3 | If I want to pig out, does this mean I am hungry? | Yes (3)
SE4 | I’d like to shoot the breeze with you | {Engage in a conversation} (3)
SE5 | If you want to hang out, does this mean you want to get together with someone? | Yes (3)
SE6 | Does Howdy mean Hello? | Yes (3)
SE7 | In what state do you say “Aloha”? | Hawaii (3)
SE8 | Is it good to let things get out of hand? | No (3)
SE9 | If I’m tired, should I hit the sack? | Yes (3)
SE10 | If you’re sick, are you under the weather? | Yes (3)

Miscellaneous Questions

Overview:

These questions either don’t fit into one of the above categories or are additional questions that we didn’t have room for above. These miscellaneous questions test voice assistants on intelligence, understanding, and helpfulness.

Current Benchmark Questions:

Question # | Question | Expected Response (Category)
MQ1 | How old is George Washington? | He’s dead; he died at 67 (3)
MQ2 | How old would George Washington be if he was alive today? | 286 years old (3)
MQ3 | Who was born first, a father or his child? | The father (3)
MQ4 | How many fingers should I hold up to signify the number 2? | Two (3)
MQ5 | When does a US president start their term? | January 20 of the year after the election (3)
MQ6 | Does it make sense to walk in front of a moving car? | No (3)
MQ7 | What types of ticks can carry Lyme disease? | {List of ticks} (3)
MQ8 | How do you treat a burn? | {Relevant answer} (2 or 3)
MQ9 | How long should you cook a 14 pound turkey? | {Relevant answer} (2 or 3)
MQ10 | Where is the nearest bus stop? | {Relevant answer} (3)