The Amazon Echo, powered by the Alexa voice assistant, opens up many possibilities for developers and content creators. The first time people try it, it is obvious how powerful and accessible it is, and it is easy to imagine creative ways to take advantage of the platform. But there are subtleties and tricks to building for Alexa, and mastering them can be the difference between a great skill and a mediocre one. To help you become an Alexa ninja, here is our handy quick reference of what custom Alexa skills can do.

Quick Reference Card

Below, we summarize what Alexa skills can do depending on which state they are in:
| | Disabled | In-Session | Out-Of-Session | Playing Audio |
|---|---|---|---|---|
| How To Interact | “Alexa, enable My Skill” | Wait for the prompt and respond; built-in intents | “Alexa, open My Skill”; “Alexa, tell My Skill to Play” | “Alexa, open My Skill”; “Alexa, tell My Skill to Play”; built-in intents |
| Time To Respond | N/A | ~5.5 seconds | N/A | N/A |
| Number Of Reprompts | N/A | 1 | N/A | N/A |
| Supports Text to Speech (TTS) | N/A | Yes | N/A | No |
| Supports Cards | N/A | Yes | N/A | No |
| Supports SSML | N/A | Yes | N/A | No |
| Playback Duration | N/A | 90 seconds | N/A | Unlimited |
| External Audio Format | N/A | MP3 (via SSML); HTTPS required | N/A | MP3, M4A, HLS, PLS, M3U; HTTPS required |
| External Audio Quality | N/A | 16 kHz, 48 kbps | N/A | 16 kbps to 384 kbps |
| Supports Built-in Intents | No | Yes | No | Yes |

Alexa Skill States

What follows is a detailed description of each of the states and capabilities.

Disabled State

If people are going to use your skill, they first need to enable it. This can be done via the Alexa app or website (the Rise Above skill is one example), or by voice, by speaking to your Echo device like so: “Alexa, enable My Skill”.

Once the skill is enabled, users can begin to interact with it. They just say: “Alexa, open My Skill”.

In-Session State

In-session means that someone has opened up your skill. Two types of content can be presented to the user:
  • Alexa Text-To-Speech (TTS)
  • Short-form audio content
The TTS is the Alexa voice we are all familiar with. It is very easy to work with: as a developer, you just type what you want Alexa to say. The short-form audio content can be an MP3 file of “produced” audio. You can listen to the intro to Rise Above for a nice example of mixing produced audio and TTS. We think produced audio sounds awesome!

The short-form audio does have limitations: it is low-quality (16 kHz, 48 kbps for audio nerds) and can only be up to 90 seconds long. For help with encoding your audio to these specific standards, check out our BST Encode tool.

The great thing about being in-session is that once the audio plays, the user has about 5.5 seconds to respond. If they do not respond in that time, you can re-prompt them once (“Remember, you can say X, Y or Z”), though only with TTS (no produced audio is allowed in the re-prompt!). You can also leverage Amazon’s excellent built-in intents and slots to easily gather common types of information from people (names of movies, places, musicians, etc.).

Lastly, responses can include cards, which are displayed in the Alexa app. Cards can display text and images, but they do not allow hyperlinks except for the purpose of linking accounts.

However, if your user does not respond at all, they then go Out Of Session. You also have the option to end the session yourself if the skill has completed the interaction.
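To make the in-session pieces concrete, here is a minimal sketch of a response that mixes produced audio with TTS, adds a one-time TTS re-prompt and a card, and keeps the session open. It builds the response as a plain Python dictionary in the shape of the Alexa custom skill response JSON, without any SDK. The function name `build_in_session_response` and the example URL are our own assumptions for illustration, and the sketch does not check the audio encoding or the 90-second limit.

```python
from xml.sax.saxutils import escape


def build_in_session_response(audio_url, speech_text, reprompt_text,
                              card_title, card_text):
    """Build a skill response that keeps the session open (illustrative helper)."""
    if not audio_url.startswith("https://"):
        raise ValueError("short-form audio must be served over HTTPS")

    # SSML mixes produced audio (<audio>) with the Alexa voice (plain text).
    ssml = '<speak><audio src="{}"/> {}</speak>'.format(
        escape(audio_url, {'"': "&quot;"}), escape(speech_text)
    )
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": ssml},
            # Spoken once if the user stays silent; plain TTS only.
            "reprompt": {
                "outputSpeech": {"type": "PlainText", "text": reprompt_text}
            },
            # Shown in the Alexa app.
            "card": {"type": "Simple", "title": card_title, "content": card_text},
            # False keeps the skill in-session, waiting for a reply.
            "shouldEndSession": False,
        },
    }
```

Setting `shouldEndSession` to `True` instead is how a skill would end the session once the interaction is complete.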

Out-Of-Session State

When your skill is out of session, a user needs to re-open it to interact with it. They can do this by simply saying: “Alexa, open My Skill”. Or they can jump into a particular part of your skill by saying: “Alexa, tell My Skill to tell a joke”.
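On the skill side, these two ways of re-entering look different: a plain “open” arrives as a `LaunchRequest`, while “tell My Skill to ...” arrives as an `IntentRequest` that already carries an intent. A rough sketch of telling them apart follows. The helper `route_request` and the intent name `TellJokeIntent` are hypothetical, and the events are trimmed to only the fields the sketch reads.

```python
def route_request(event):
    """Return a label for how the user entered the skill (illustrative helper)."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        # "Alexa, open My Skill": no intent yet, so greet and prompt.
        return "welcome"
    if request["type"] == "IntentRequest":
        # "Alexa, tell My Skill to tell a joke": the intent comes along.
        return request["intent"]["name"]
    if request["type"] == "SessionEndedRequest":
        # The session closed, for example because the user never responded.
        return "goodbye"
    return "unknown"


# Example events, trimmed to the fields used above.
open_event = {"request": {"type": "LaunchRequest"}}
joke_event = {
    "request": {"type": "IntentRequest", "intent": {"name": "TellJokeIntent"}}
}
```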

Playing Audio State

The AudioPlayer allows playback of long-form audio. The long-form audio can be high-quality (up to 384 kbps) and can be a stream or fixed-length content. This is how to play back music, podcasts or other content that is longer than the 90-second window provided when in-session. However, this power comes with some constraints:
  • The user cannot be prompted during long-form audio playback – i.e., you can’t ask them a question and wait for a response
  • The Alexa voice cannot be used for Text-To-Speech, except upfront before the long-form audio begins playing
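As a sketch of what starting long-form playback can look like, the response below carries an `AudioPlayer.Play` directive, an optional bit of upfront TTS, no re-prompt, and ends the session, in line with the constraints above. The helper name `build_play_response` and its parameters are our own for illustration. The `token` is an identifier the skill chooses for the stream.

```python
def build_play_response(stream_url, token, intro_text=None, offset_ms=0):
    """Build a response that starts long-form playback (illustrative helper)."""
    if not stream_url.startswith("https://"):
        raise ValueError("long-form audio must be served over HTTPS")

    response = {
        "directives": [
            {
                "type": "AudioPlayer.Play",
                # Replace whatever is currently playing or queued.
                "playBehavior": "REPLACE_ALL",
                "audioItem": {
                    "stream": {
                        "url": stream_url,
                        "token": token,
                        "offsetInMilliseconds": offset_ms,
                    }
                },
            }
        ],
        # No prompting during playback, so the session ends here.
        "shouldEndSession": True,
    }
    if intro_text:
        # TTS is only possible upfront, before the audio starts.
        response["outputSpeech"] = {"type": "PlainText", "text": intro_text}
    return {"version": "1.0", "response": response}
```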
There are also some special capabilities, such as the AudioPlayer-specific top-level built-in intents. They let users interact with your skill directly without invoking its name: users can say things like “Alexa, Play Next” or “Alexa, Repeat”. These top-level intents avoid the long-winded syntax typically needed when out-of-session (e.g., “Alexa, tell Spotify to play Lola by the Kinks”).

We use these special top-level intents to do some neat stuff. If you take a look at the We Study Billionaires podcast skill, we re-purpose the “AMAZON.NextIntent” to allow listeners to jump from a podcast snippet to the full podcast. Very cool, right?

And keep in mind that even if the user says “Stop” and has not used your skill for a very long time, as long as your skill was the last one to use the AudioPlayer, it will receive any AudioPlayer intents! So if someone says “Play Next” two days after last using your skill, and they did not use Amazon Music or Spotify in the intervening time, your skill gets the intent. Also neat, right?
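Here is one hedged sketch of handling these top-level intents. It treats “Play Next” as “advance through a list of stream URLs” and answers pause, stop and cancel with an `AudioPlayer.Stop` directive. The helper names, the playlist wrap-around and the use of the list index as the stream token are our own assumptions, not how any particular skill does it. A skill could just as well map `AMAZON.NextIntent` to something else, such as jumping from a snippet to a full episode.

```python
def play_directive(url, token):
    """AudioPlayer.Play directive for one stream (illustrative helper)."""
    return {
        "type": "AudioPlayer.Play",
        "playBehavior": "REPLACE_ALL",
        "audioItem": {
            "stream": {"url": url, "token": token, "offsetInMilliseconds": 0}
        },
    }


def handle_audio_intent(intent_name, playlist, current_index):
    """Return (directives, new_index) for a top-level built-in intent."""
    if intent_name == "AMAZON.NextIntent":
        # Assumption: wrap around to the first item after the last one.
        new_index = (current_index + 1) % len(playlist)
        return [play_directive(playlist[new_index], str(new_index))], new_index
    if intent_name in ("AMAZON.PauseIntent", "AMAZON.StopIntent",
                       "AMAZON.CancelIntent"):
        return [{"type": "AudioPlayer.Stop"}], current_index
    # Anything else: send no directives and leave playback as it is.
    return [], current_index
```

Because these intents can arrive long after the last session, a real skill would need to persist `current_index` somewhere outside the session; that storage is left out of this sketch.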

Summary

We hope this is a super-helpful summary of Alexa’s capabilities. We did not even touch on Smart Home Skills or Flash Briefing Skills, which provide more tailored APIs for specific purposes. Don’t forget about them! Our goal at Bespoken is to make Alexa development as easy as possible, so if you have questions or comments, talk to us on Gitter. And stay updated through GitHub.