July 3, 2019 in Blog

The Mars Agency Case Study

TL;DR The Mars Agency cut speech recognition errors in their voice app by more than 80% before launch, tuning it with the help of Bespoken against a comprehensive set of utterances and accents. Read about how they did it below.

The Mars Agency – Leaders In Voice Commerce

The Mars Agency is a global marketing practice, specializing in marketing to shoppers, consumers, and retailers across the ever-expanding omni-commerce environment. Proud of its independence and growth-for-clients focus, The Mars Agency operates internationally across the Americas, Europe and Asia through its network of 13 offices. For more than 45 years, Mars has been a leading force for shopper innovation, helping clients identify and apply emerging technologies to create revolutionary commerce experiences. Mars has developed best-in-class capability and expertise across various technologies, with particular focus on Artificial Intelligence and Voice, and leveraged strategic partnerships with Amazon, Google and IBM to launch unique solutions like:

  • SmartAisle™: The world’s first voice-powered shopping assistant for brick & mortar retail.
  • Marilyn℠: The first end-to-end AI-enabled predictive commerce intelligence platform for marketing to shoppers.

The Challenge – Ensuring A Great Customer Experience

Mars worked with a global, Fortune 500 cosmetics brand to bring the revolutionary SmartAisle™ voice solution in-store to consumers who want to improve their skin care. The app was created to provide detailed information on skin care, allowing users to find and purchase the product that best suits their needs via a simple conversation.

To ensure the success of the project, the Mars Agency needed to be certain that the user experience was consistently delightful. To accomplish this, they faced the complex challenge of anticipating how customers would interact with their app, which includes accommodating various accents, colloquial expressions, phonetic variations, and even background sounds.

With this in mind, the Mars team formulated a plan with their client to test the application before it launched. This plan was not only difficult, but also limited by budget and time constraints:

  • Step 1: Find a representative group of English-speaking users with common accents to test the app.
  • Step 2: Have each person interact several times with the app, repeating a list of utterances to cover the app’s functionalities.
  • Step 3: Collect and analyze the results. Identify code and interaction model improvements. Assign backend changes to the development team.
  • Step 4: Development team applies the changes from Step 3.
  • Step 5: Start over with Step 2 to verify the detected errors were corrected and no new errors were generated in the modifications made.

Keith Porter, Senior Product Manager for The Mars Agency, asked himself: Is there another way? And indeed, there is.

Bespoken’s Usability Performance Testing To The Rescue

Keith contacted us and shared both the success story of SmartAisle™ for the sale of whiskey and his need to test the voice app they were constructing for the aforementioned cosmetics brand:

We’ve been building skills for a while now and are looking to take our development process to the next level by integrating automated testing. Bespoken seems to be the obvious choice…

After the first meeting, it was clear they needed Automated Usability Performance Testing. Let’s quickly review what it is:

With Usability Performance Testing you can send a slew of generated or recorded audio utterances to your voice app. When executed we’ll identify which slot values or intent names were misinterpreted. This allows you to improve your interaction model and increase the success rate of speech recognition. The audio interactions you send can contain accent variations or different background noises, allowing you to test almost any possible scenario your users might encounter.
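To make the idea concrete, here is a minimal sketch of how the results of such a test run could be scored. It is not Bespoken's implementation: the `Interaction` record, the intent names, and the product names are all hypothetical, chosen only to show how a success rate and a list of misrecognized slot values can be derived from expected versus actual results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interaction:
    """One audio utterance sent to the voice app (hypothetical record layout)."""
    utterance: str
    expected_intent: str
    actual_intent: str
    expected_slot: Optional[str] = None
    actual_slot: Optional[str] = None

    @property
    def passed(self) -> bool:
        # An interaction passes only if both the intent and the slot value match.
        return (self.expected_intent == self.actual_intent
                and self.expected_slot == self.actual_slot)


def success_rate(results: list) -> float:
    """Fraction of interactions that were recognized correctly."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def misrecognized_slots(results: list) -> Counter:
    """Count failures per expected slot value to show where recognition struggles."""
    return Counter(
        r.expected_slot
        for r in results
        if r.expected_slot is not None and r.expected_slot != r.actual_slot
    )


# Illustrative data only; these are not results from the Mars Agency project.
results = [
    Interaction("tell me about the night cream", "ProductInfoIntent",
                "ProductInfoIntent", "night cream", "night cream"),
    Interaction("tell me about the eye serum", "ProductInfoIntent",
                "ProductInfoIntent", "eye serum", "ice cream"),
    Interaction("buy the eye serum", "PurchaseIntent",
                "ProductInfoIntent", "eye serum", "eye serum"),
    Interaction("help", "HelpIntent", "HelpIntent"),
]

print(f"Success rate: {success_rate(results):.0%}")
print("Misrecognized slot values:", dict(misrecognized_slots(results)))
```

Tracking failures per slot value, rather than only an overall rate, is what points to the specific product names or phrases that need attention in the interaction model.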

With Usability Performance Testing, the scope and work plan for The Mars Agency looked like this instead:

  • Time scope: 3 weeks for the entire testing project.
  • Define the intents and slot values to test. For example, the Mars Agency wanted to verify the skin care product names were correctly recognized. For this reason, the product names were heavily tested.
  • Define the utterances and phrase variations to use.
  • Perform three rounds of tests.
  • Use two types of audio interactions:
    • Generated audio: More than 1500 interactions per test round were sent to the voice app.
    • Recorded audio: 900 real-life interactions per test round were sent to the voice app.
  • The tests with generated audio used Amazon Polly and Google WaveNet voices with Chinese and Spanish accents.
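As a rough illustration of how a plan like the one above turns into test cases, the sketch below expands phrase templates, slot values, and voices into a test matrix, and works out the minimum interaction volume implied by the figures in the plan. The templates, product names, and voice labels are hypothetical placeholders; the article does not list the actual utterances or voices, and no text-to-speech service is called here.

```python
from itertools import product

# Hypothetical placeholders, not the utterances or voices used in the project.
UTTERANCE_TEMPLATES = ["tell me about {product}", "I want to buy {product}"]
PRODUCTS = ["night cream", "eye serum", "face wash"]
VOICES = ["tts-voice-spanish-accent", "tts-voice-chinese-accent"]


def build_test_matrix(templates, products, voices):
    """Return one test case per (template, product, voice) combination."""
    return [
        {
            "voice": voice,
            "text": template.format(product=prod),
            "expected_slot": prod,
        }
        for template, prod, voice in product(templates, products, voices)
    ]


matrix = build_test_matrix(UTTERANCE_TEMPLATES, PRODUCTS, VOICES)
print(len(matrix))  # 2 templates x 3 products x 2 voices = 12 test cases

# Lower bound on volume from the plan: more than 1500 generated plus
# 900 recorded interactions per round, over three rounds.
GENERATED_PER_ROUND_MIN = 1500
RECORDED_PER_ROUND = 900
ROUNDS = 3
min_total_interactions = (GENERATED_PER_ROUND_MIN + RECORDED_PER_ROUND) * ROUNDS
print(min_total_interactions)  # 7200
```

Because every combination is generated rather than hand-written, adding one more accent or product name grows the matrix automatically, which is what makes this volume practical within a three-week window.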

The tests were carried out under these premises in cycles consisting of:


The first execution of Usability Performance Testing produced an initial performance grade and established a baseline for improving the interaction model and code. After each subsequent test cycle, the success rate increased, as seen in the next image (audio interactions only):

Several types of improvements were made to the intent model:

  • Adding synonyms/sounds-alike phrases
  • Correcting typos and alternative spellings
  • Disambiguation of utterances defined for different intents
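The last item in the list above, disambiguation, lends itself to a simple automated check. The sketch below assumes the interaction model has been reduced to a plain mapping of intent names to sample utterances (a hypothetical simplification, not the actual model format) and flags any utterance defined under more than one intent.

```python
from collections import defaultdict


def normalize(utterance: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(utterance.lower().split())


def find_conflicts(intents: dict) -> dict:
    """Map each utterance defined under several intents to those intent names."""
    owners = defaultdict(set)
    for intent_name, samples in intents.items():
        for sample in samples:
            owners[normalize(sample)].add(intent_name)
    return {
        utterance: sorted(names)
        for utterance, names in owners.items()
        if len(names) > 1
    }


# Hypothetical model: intent names and samples are made up for illustration.
model = {
    "ProductInfoIntent": ["tell me about {product}", "What is {product}"],
    "PurchaseIntent": ["buy {product}", "what is  {product}"],
    "HelpIntent": ["help"],
}

for utterance, names in find_conflicts(model).items():
    print(f"'{utterance}' is defined for: {', '.join(names)}")
```

A check like this only catches exact duplicates after normalization. Phrases that merely sound alike still need to be found through audio testing.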

Achieving a 96% recognition success rate in such a short time was greatly appreciated by The Mars Agency:

96% is pretty awesome news. I cannot reiterate enough how helpful this has been. Our confidence going into this launch is significantly higher because of this testing.

And the boost in confidence is just one of the several benefits. Among others, Usability Performance Testing allowed Mars to:

  • Execute an extensive and comprehensive set of tests in an automated fashion, quickly and repeatably, saving time and money.
  • Detect problematic verbiage easily, reducing the time needed to improve the interaction model, which in turn yielded better speech recognition and happier, more engaged users.
  • Decrease the chances of getting bad reviews.

How to get started with Usability Performance Testing

Launch your voice app with confidence! Like The Mars Agency, you can detect ASR/NLU-related errors in advance, minimizing the chances of getting bad reviews. Remember that 1-star reviews can undermine the confidence of stakeholders and negatively affect your reputation.

To get started, follow these 3 simple steps by filling out this form:

  • Tell us what you want to test
    • Sequence (one-shot, in-session, etc.)
    • Intent and/or slot value to test
    • Scenarios – types of speakers, phrases, etc.
  • Send us the interaction model (privacy policy applies)
  • Get the results
