December 6, 2019

The Next Model for Voice: How Platforms And Builders Can Work Together

TL;DR The next wave of Voice is not about Apps but Domains - defined by users and the queries and commands they speak. Third-parties and platforms can work together in this ecosystem with trust and well-defined rules.

There are a lot of great conversations happening within the voice-first community nowadays. We hear about discoverability, the role of voice assistants versus voice apps, as well as the future of third parties on the major platforms. Many of these are topics that have been of interest for some time.

However, I believe we have arrived at a point where these are not just rhetorical questions – thanks to all the hard work of many, many people across diverse enterprises and backgrounds, we have reached a “local peak”. Perhaps we can’t see the final destination, but we can take stock of where we have been, and we can see a clear path in front of us to take us to the next horizon.

These moments of cresting the hill are exhilarating:

On a clear day, you can see the next five years…

And as the Founder of Bespoken, it’s been thrilling to be along for this ride. We like to think we have made our own contribution to it, and we want to help shape the path forward in this still nascent but burgeoning industry.

With this in mind, I propose the following conceptual model for voice platforms and experiences. This is about 50% forecast, 50% nudge – our view of the world that we expect and hope to see.

A New Model For Voice

The next phase of voice is grounded in three simple concepts. These are the pillars of voice, and they are the essential pieces to understand when trying to analyze where one fits into this new ecosystem.

Domain

This comes first because it is the point of integration that brings everything together. A domain represents a set of related queries and commands that a user can make of a voice-enabled device. Example domains are “weather”, “recipes”, “thermostat”, “timer”, etc. These are all the things that devices are known for performing today.

A domain represents an area of expertise, as well as potentially an area of control. Domains are meant to be discrete – they do not encompass all possible things a user might say, but ones related to a particular topic or service.

They are the natural point of integration for third parties because they allow for:

  • Independent authorities to provide their insight (The Weather Channel responding to weather requests, a sports site responding to basketball queries, etc.).
  • Specific products owned or used by a consumer to handle relevant queries (Chevy responding to car-related queries for owners of Chevy vehicles, etc.).
  • Hyper-targeting of queries for domains that specifically reference a brand or product.
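
To make the idea concrete, a domain can be thought of as nothing more than a named set of related queries plus the provider that handles them. Here is a minimal sketch – the type and field names are our own invention, purely for illustration:

```typescript
// Illustrative only: a domain as a named set of related queries a provider can handle.
interface Domain {
  name: string;             // e.g. "weather", "recipes", "thermostat"
  exampleQueries: string[]; // the kinds of things users actually say
  provider: string;         // who fulfills it (first party or a third party)
}

const weather: Domain = {
  name: "weather",
  exampleQueries: [
    "what's the weather tomorrow",
    "will it rain this weekend",
    "how cold is it outside",
  ],
  provider: "The Weather Channel", // an independent authority, per the list above
};
```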

Voice Assistant

A voice assistant provides a number of things:

  • Ability to query across one or many domains
  • Ability to fulfill commands across one or many domains
  • The ability to execute complex multi-turn dialogs
  • A unique persona [OPTIONAL]

The last two points are what distinguish an assistant from a domain – the ability to engage in some form of conversation, as well as the introduction of a unique personality. It is worth keeping in mind, though, that the persona is not essential – it comes down to a design decision on the part of the implementer.

Platform

The platforms comprise:

  • Ability to respond to queries and commands across many domains
  • Core speech recognition and natural language understanding capabilities
  • APIs for enabling voice capabilities on hardware devices
  • APIs for third-party fulfillment of queries (aka skills)
  • Well-defined contracts for how the platform interacts with third parties and arbitrates between first-party and third-party fulfillment
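
As a rough illustration of that last point, the contract between a platform and a third-party provider might look something like the sketch below. The interface and method names are hypothetical – our shorthand for the idea, not any vendor's published API:

```typescript
// Hypothetical sketch of a platform-to-provider fulfillment contract.
interface DomainQuery {
  domain: string;                // e.g. "weather"
  intent: string;                // e.g. "GetForecast"
  slots: Record<string, string>; // resolved slot values from the utterance
  userId: string;
}

interface DomainProvider {
  // Can this provider handle the query at all?
  canFulfill(query: DomainQuery): Promise<boolean>;
  // Produce the actual response: text to be spoken, plus optional display content.
  fulfill(query: DomainQuery): Promise<{ speech: string; display?: string }>;
}
```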

The creation of the APIs described above is obviously a massive undertaking, and in the market today we see only a handful of players able to take it on at scale, and even fewer able to do it effectively. These features also imply large teams standing behind them – to evangelize the capabilities and to educate and support builders on the platforms. Given the resources and strategic commitment required, it's no surprise that Amazon and Google dominate.

And of course, many will refer to a platform as a voice assistant – we don’t object to this label, but within this document, we use these different terms to distinguish between offerings that are vastly different in scope, even if they also have many things in common.

What It Means

Forget About The App Model

You might ask – why this emphasis on domains? And what is the distinction between an app and a domain? This may be the most important point I have to make. The difference is that:

Characteristic | App | Domain
Boundaries | An app is defined by its builder | A domain is defined by its users
Features | The app builder decides which features and use cases they support | Users define the use cases; the domain provider can handle them well or poorly
Invocation | Apps are explicitly launched by users | Domain providers are implicitly selected by the platform, based on the query, user preferences, and core platform algorithms

This distinction, between app and domain, is critical to understand how voice experiences differ from mobile and web ones, and how providers need to behave in these ecosystems.

And though we have relied on an app-centric model to date, it is time to let go of this. The app analogy has taken us as far as it can – we now have had enough time with voice to see it for what it is, and take it on its own terms.

Burn the ships – there is no turning back

And the app model is simply broken. Why? Because….

Discoverability Is Moot

One of the biggest impediments to the proliferation of the app model is the oft-cited issue of discoverability. But what is discoverability really? It is essentially memorization, and no amount of force-feeding on the part of Amazon and Google is going to cause people to miraculously, instantaneously recall tens if not hundreds of three-word phrases. Most people are lucky to remember where they left their keys, much less the name of your oh-so-compelling-but-nonetheless-irretrievable skill. Of course, there are other approaches such as alerting and contextual prompts, but alerting (at least for now) is far too disruptive for voice-first experiences. And contextual prompts (“you asked about a recipe for lasagna – would you like to watch an Italian cooking class from Food Network?”) are unlikely to move the needle – at least not without annoying the ever-loving love out of everyone.

Rather, the real challenge is how third-parties play while NOT requiring explicit invocation or memorization on the part of users. We already see this being well-executed in the music domain, where a variety of music services are seamlessly integrated into the platforms. When I request a song, the fact that it is delivered by Spotify is not something I need to explicitly specify or think about – it is done automatically based on preferences established at setup. Other domains should work in a similar way. The outcome?

First Party Intents, Third-Party Fulfillment

This probably should have been the model from the outset. The "Alexa, ask so-and-so to do such-and-such" pattern has always struck me (and, I'm sure, everyone else) as odd, and it has not improved or grown more natural with age. In fact, as the number of use cases and the size of the speech model have expanded for Alexa and Google Assistant, it seems to me the accuracy with which third-party skills are selected has, if anything, degraded.

But as mentioned above, third-party invocation did draw a direct analogy to apps, which was an easy conceptual leap for devs as they came to this new platform. Now, though, we can focus on a more natural point of integration for third parties – handling some set of queries in which they have expertise, which are automatically fed to them by the platforms. Who should get what? That is very important, indeed, and establishing fair rules for it is critical – I propose a more detailed contract below. But first, what is to become of interactions that cannot be reduced to single-shot queries or commands?

Conversation Is For People, Context Is For Machines

Another notion I am eager to see put to rest is that of extended conversations between people and robots. Some may be disappointed by this, and for them, I am sure scientists at Amazon, Google and elsewhere are hard at work trying to make this a reality in the future. In the meantime, if you want to talk to someone, call your Mom. If you want to build a voice experience:

  • One-shot is preferred
  • Where one-shot is impossible, quick, contextual follow-ups are the next best option
  • As a last resort, attempt an extended multi-turn dialog

Is this to say that extended interactions are a complete failure? Absolutely not! But describing them as conversations misses the point. How are they not conversations?

  • They are not open-ended
  • They lack important context – such as body language, intonation, emphasis, and past interactions
  • They have very poor understanding – both in terms of speech and intent recognition

All of these things are likely to improve radically over a five to ten-year time horizon. But 12-18 months? Not so much. Almost certainly not enough to change what is feasible for most implementers.

So where does this leave us? Well, I’ve heard many, many folks in the last few years casually dismiss IVR as if it were a four-letter word, but there is a reason phone systems were built upon menu systems with explicit direction and constraints for users – because they work better than the alternative (note – better, not perfect – obviously these systems leave much to be desired). Inviting users to say whatever they want and then consistently misunderstanding or misdirecting them is a disastrous UX. So think in terms of “queries, context and directed dialog” – not conversation.
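
To put that guidance in concrete terms: answer in one shot when the utterance contains everything you need, and ask one pointed follow-up when it does not. A minimal sketch of that priority order – the intent and slot names are invented for illustration:

```typescript
// Illustrative directed-dialog handler: prefer one-shot answers,
// fall back to a single contextual follow-up rather than open conversation.
interface Turn {
  intent: string;
  slots: Record<string, string | undefined>;
}

function handleTransferRequest(turn: Turn): { speech: string; expectReply: boolean } {
  const { amount, toAccount } = turn.slots;

  // One-shot: everything we need was in the utterance.
  if (amount && toAccount) {
    return { speech: `Transferring ${amount} to ${toAccount}.`, expectReply: false };
  }

  // Contextual follow-up: ask for exactly the one missing piece, nothing open-ended.
  if (!amount) {
    return { speech: "How much would you like to transfer?", expectReply: true };
  }
  return { speech: "Which account should I send it to?", expectReply: true };
}
```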

A Contract For 1P-3P Interactions

Another reason why the app model was foisted upon us was this: it’s scary to think of Amazon, Google and Apple holding all the cards. Why should a developer rely on them to share users’ queries on a consistent, predictable, and fair basis? That type of relationship demands a lot of trust.

And the platform providers have not done enough to earn that trust. However, at this point, two things are clear:

  • Voice is real, and will be a major interface for users for years to come
  • Explicit invocation is broken

So, in the absence of an alternative, a bit of faith is required. But if the platforms want a real ecosystem, there are steps they can take now to get builders to buy in:

Allow Users To Explicitly Choose Their Providers

The approach taken in the music domain is again instructive – we can select which provider we prefer to fulfill music requests. The same can be done for other popular domains. And for users who do not make an explicit selection….

Intents And Algorithms Should Be Transparent; Decisions Should Be Explained

Much of what is proposed here already seems to be embodied in Alexa's CanFulfillIntentRequest feature. This allows developers to specify which top-level intents and slots they can respond to. But if you survey the development community, no one is satisfied with how this works – it is opaque, unverifiable, and done entirely "just-in-time" (i.e., there is no way to reason ahead of time about what will happen for any specific request – the only way to know is to try it).
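
For reference, a skill's answer to a CanFulfillIntentRequest looks roughly like the sketch below – this is simplified, the slot name is made up, and Amazon's documentation has the authoritative format:

```typescript
// Simplified sketch of a CanFulfillIntentRequest response shape.
const canFulfillResponse = {
  version: "1.0",
  response: {
    canFulfillIntent: {
      canFulfill: "YES", // YES, NO, or MAYBE for the intent as a whole
      slots: {
        City: { canUnderstand: "YES", canFulfill: "YES" }, // per-slot declarations
      },
    },
  },
};
```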

For this to work at scale, platforms must indicate to domain providers:

  • Which intents are most popular within a given domain
  • How the algorithm works when explicit preferences are not available
  • When an intent is sent to a domain provider, WHY it was sent – what explicit factors played into it
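
To make that last proposal concrete, the kind of explanation that could accompany a routed intent might look something like this sketch – every field name here is hypothetical, our own invention rather than any platform's API:

```typescript
// Hypothetical routing explanation a platform could send alongside a delegated intent.
interface RoutingExplanation {
  domain: string;          // e.g. "weather"
  intent: string;          // e.g. "GetForecast"
  selectedProvider: string;
  reason: "user_preference" | "popularity_ranking" | "default_algorithm";
  details: string;         // human-readable rationale
}

const example: RoutingExplanation = {
  domain: "weather",
  intent: "GetForecast",
  selectedProvider: "The Weather Channel",
  reason: "user_preference",
  details: "User selected this provider for the weather domain during device setup.",
};
```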

I can imagine that some will be skeptical that any of this information will ever be provided. But Samsung is already doing this in part. And the opaqueness of many of the algorithms has just gotten silly – these are not state secrets, and it’s a generally understood weakness of AI that it CANNOT be effectively reasoned about or explained. It’s an obstacle for users and third parties alike, and Amazon and Google will be well-served to address it so the entire ecosystem can thrive.

Or perhaps they just want to own the whole thing? It’s possible, but if the major platforms want to play kingmaker for one of the smaller competitors, this might be the best way to do it. Because third parties WANT to play and be part of the voice revolution. If Amazon and Google try to keep it all to themselves, that is a sure way to invite a massive new competitor onto the scene. Someone like Houndify could leapfrog from vital-but-second-tier player to behemoth.

Why Should You Believe Me?

Most of what is written above hinges on just a couple of key observations:

  • Users do not remember invocation names
  • Multi-turn dialogs sort-of work – in some cases they are useful and appropriate. But for the most part they annoy users and should be avoided.

If you accept these observations, everything else I've laid out follows fairly naturally. Of course, someone might figure out (or, unbeknownst to me, may already have figured out) how to (a) improve users' memories, (b) remind them of phrases and experiences without annoying the love out of them, and/or (c) miraculously, markedly improve the state of the art of speech recognition. But assuming none of the above occurs in the next 12-18 months, I believe most of what I have written is inevitable. At least, it is if we want a vibrant ecosystem for third parties.

Now What?

Given all this, what can practitioners do today to prepare themselves for this next wave?

Be The Master Of Your Domain

If you are a brand or app builder trying to figure out how to play, my suggestion – see what domain you most effectively fit into. And see if you can own it. If you work in financial services, don’t think about rebuilding your mobile app for Alexa. Instead, think about how to most effectively answer users’ most frequently asked questions. Make yourself invaluable to current users. And think about how you might be invaluable to prospective users. Can you help with retirement planning? Financial literacy? Credit card applications? A query- and domain-centric mindset will lead to a better place than an app-centric one.

And if you are a retail company, a mom-and-pop shop, a small business, etc., think about a micro-domain – you don’t need to be the know-it-all on ice cream, or shoes, or faucets. But you can be the source of expertise on what you do. And once that is built out, you can “own” that particular piece of voice real estate – in a world of only top-level intents, SEO is that much more of a winner-take-all, do-or-die scenario.
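
Concretely, that can start as something as modest as mapping the handful of questions your customers actually ask to genuinely good answers. A toy sketch, with made-up content for a hypothetical faucet shop:

```typescript
// Toy example of a micro-domain: the questions one business actually gets, answered well.
const faucetShopFaq: Record<string, string> = {
  "how do I fix a dripping faucet":
    "Most drips come from a worn cartridge; we stock replacements for all major brands.",
  "what finish resists water spots best":
    "Brushed nickel and matte black hide spots far better than polished chrome.",
  "do you install what you sell":
    "Yes – installation is available within the metro area, usually within a week.",
};
```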

Of course, it’s a truism that no one is ever going to know your brand as well as you do – certainly not Alexa or Google – so it behooves you to be there for users to guide them where the core platforms fall short.

Don’t Try To Be Master Of The Universe

Conversely, I hear a lot of discussion about brands launching their own assistants, from Bank of America to Capital One to BBC. Now, some of these enterprises may find this an effective strategy, and may have the resources to do it. But even for the most deep-pocketed enterprises, I fear this may be a decision they rue. Why?

Because if you build a self-contained assistant, one that sidesteps the platforms, then you:

  • Limit your footprint (e.g., only available to website visitors and mobile app users)
  • Limit your capabilities (need to do your own ASR and NLU)
  • Force users to learn your language and idioms

Now builders might say, “My mobile presence is massive. Much bigger than Alexa’s.” And that may be true today. But consider this….

The Big Picture

The mantra for this article is that it's time to take voice on its own terms. And those terms are staggering – just look at all the voice-enabled devices:

Sure Things | Maybe Someday
Smart Speakers | Glasses
Phones | Microwaves
Earbuds | Refrigerators
Watches | Mirrors
Televisions | Toilets
Cars | Jackets
Smart Phones | Rings

The devices in column one are inevitable and in some cases are already essential. Column two? Many may seem silly but some nonetheless will prove indispensable.

And these are JUST the devices with voice-capabilities embedded – the march of voice continues to be the march of IoT. Voice is our point of control for the ubiquitous computing power that exists around us. If you imagine a world in which the average cell phone owner has just ONE of each of the above items, the coming wave of voice-enabled devices looks like a tsunami. And if you factor in the devices under their control (thermostats, lights, power switches, appliances, etc.), it becomes even more staggering.

And the very good news is third parties have a huge role to play – the big guys need to provide the platforms and the device access, but they cannot do all the fulfillment. The future of the ecosystem is everyone playing nicely together in this new query-centric, domain-centric world, in which first and third-parties work together seamlessly.

For the platforms, it’s the chance to employ, at massive scale, the wisdom of the crowd – the wisdom of every brand, app builder, API and website on earth. What an amazing achievement it will be.

For third parties, it’s the opportunity to meet users, wherever they are, whatever they are doing – properly done, they will be just a short trip of the tongue away.

About Us

We are excited about this brave new world, and we love helping organizations marry their knowledge, expertise and capabilities to these amazing platforms with our testing, tuning and monitoring tools.
