The next generation of technology interfaces is taking us beyond the inherent limitations of swipes, taps, and clicks with the more nuanced and natural signals of human speech. The design techniques and metaphors of graphical user interfaces (GUIs) simply don’t apply in this new era of voice interaction. The new wave of voice user interface (VUI) design has to be based on conversation—the communication system we learned first and know best.
Conversation is a highly complex but systematic medium, with defining principles far more subtle and compelling than grade-school admonishments like “Never start a sentence with ‘And’,” or “Don’t interrupt.” When we interact with other humans, we take the complexity of conversation for granted; it’s already second nature. But when we set out to design a spoken dialog with a device, not understanding the true, inner workings of conversation will result in a frustrating user experience. And because voice is such a personal marker of an individual’s social identity, the stakes are especially high: users of poorly designed VUIs report feeling “foolish,” “silly,” and manipulated by technology, and so they avoid repeat usage. But it doesn’t have to be this way.
Here are six rules based on principles of everyday conversation that will not only keep a dialog on track, but will help VUI designers navigate this new era of effortless, human-centered UIs.
1. Give your VUI a personality
You may think your simple voice application doesn’t need a personality. But it’s not about “needing” one. Compare these two calendar apps, the first with the assistant's character, or “persona,” left to chance and the second with persona by design:
This VUI’s persona was left to chance
This VUI’s persona was designed
All voices project a persona whether you plan for one or not. VUIs that are supposedly designed without a persona, as in the first example, consistently score low on personality attributes like "friendly" and "helpful," while scoring high on the "boring" scale.
Thanks to a few hundred thousand years of evolution, we humans can’t help but evaluate speech in terms of personality traits—even if it’s a smartwatch or GPS system that’s talking. This is not opinion. It’s a fact well-documented by sociolinguists (e.g. Labov 1964) that even minimal speech samples will conjure impressions of the speaker’s character. We’ve evolved to be expert at summing up folks based on how they sound.
In one compelling study (Giles & Powesland 1975), teachers were asked to assess eight fictional students based on three things: a sample of written work, a photograph, and a recorded speech sample. The results are surprising—favorable impressions generated by the speech sample overrode negative impressions of both the written work and photograph. Conversely, unfavorable impressions of a student’s speech overrode favorable impressions from the other two sources. Other studies have shown that we rely on speech to evaluate other people in terms of friendliness, honesty, trustworthiness, intelligence, level of education, punctuality, generosity, being romantic, being “privileged,” and suitability for employment. In short: speech is powerful!
The Takeaway: Don’t leave your VUI persona to chance. From the very beginning, create the ideal employee to represent your brand. What are they like? How should they sound? And most importantly, how do they behave? Use this persona as an anchor to ground your user experience and give it a familiar consistency. For more information on persona design, visit Actions on Google and check out this codelab on Crafting a Character.
2. Move the conversation forward
In everyday conversation, there are a lot of questions that seem to require yes or no answers. But they’re actually asking for much more information. Here are two examples:
A question like "Do you know who's coming to the party?" is not a request for a simple "yes" or "no" answer
Responding to "Can you play a song for me?" with a "yes" or "no" doesn't meet conversational expectations
You’re probably wondering why these speakers seem so uncooperative. It’s because they’ve broken a core rule of conversation called the Maxim of Quantity. According to this principle of conversational behavior, a speaker provides the listener with as much information as is needed to advance the purpose of the interaction. So even if a speaker addresses the literal intent of a question, the interaction won’t feel satisfying unless they move the conversation forward informatively. In these examples, we never find out who all is coming to the party, nor do we hear a satisfactory reason for not being indulged with a tune.
Just as these speakers leave us wanting more, a virtual assistant can, too. Compare these two ways of handling the same situation, a price surge that the user isn’t too happy about:
This VUI prompts an end to the interaction, failing to move the conversation forward
This VUI offers more options for the user, keeping the conversation progressing like a natural human dialogue
Clearly, the persona of the second VUI is more competent and likeable.
But it’s not just the assistant who is socially intelligent enough to move the conversation forward—it’s also your users. And this instinct can’t be suppressed. Here’s an example of a user trying to move the conversation forward, just as if he were conversing with a person:
This user naturally expects the VUI to deduce a number from some personal information
Now, if the recognition grammar has been designed to expect only numbers, like “two,” this user will be dinged with an error prompt, just for being informative. If responses like this one can’t be handled by the recognizer, consider getting the conversation back on track with an easy-breezy conversational reprompt, like “Sorry, how many people was that?” (rising intonation). Research shows that, in the case of misrecognition errors, users often just need a simple reprompt. No need to draw attention to the error with robotic, heavy-handed industry clichés like “I’m sorry I didn’t understand. Please speak the number of people in your party now. You can say, for example, ‘two.’”
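To make the idea concrete, here is a minimal sketch of a handler that looks for a number anywhere in the user's answer instead of insisting on a bare “two,” and that falls back to the short reprompt when it finds nothing. Everything in it is an assumption for illustration: the function names, the tiny number vocabulary, and the confirmation wording are hypothetical, and a real recognizer would work from audio with a much richer grammar.

```python
import re

# Hypothetical vocabulary for illustration; a real grammar would be far richer.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def parse_party_size(utterance):
    """Return the first number found anywhere in the utterance, or None."""
    for token in re.findall(r"[a-z]+|\d+", utterance.lower()):
        if token.isdigit():
            return int(token)
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None


def next_prompt(utterance):
    """Confirm the number if one was heard; otherwise reprompt briefly."""
    size = parse_party_size(utterance)
    if size is None:
        # No lecture about the error, just a light conversational reprompt.
        return "Sorry, how many people was that?"
    return f"OK, a table for {size}."
```

With this sketch, an informative answer such as “There will be two of us” is accepted, while an answer with no number in it gets the brief reprompt instead of a heavy-handed error message. It does not try to deduce a number from personal information; that would take a real language-understanding component.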
The Takeaway: Look beyond the literal when designing a conversational flow. Try to anticipate moments when your VUI can keep the conversation going by offering more information and recognizing informative answers from users. Also realize that what the industry considers a “recognition error” is actually the fallout of our very human impulse to furnish cooperative, informative contributions.
3. Be brief, be relevant
Speech, unlike writing, is inextricably bound to the passage of time. The longer someone holds the floor, the more brainwork they’re imposing on the listener. We can only mentally process so much information until it becomes an undue burden on our short-term memory. Listening is often considered a “passive” skill in contrast to speaking, which is thought to be more “active” and “productive.” In reality, listening involves a lot of work. So it’s important for your VUI to give the listener a break from listening and let them have their turn, too. Compare these two examples. The first VUI overwhelms the listener, while the second is more concise:
This VUI overwhelms the listener with flight information
This VUI remains brief and to-the-point, providing a more natural and pleasant interaction
Unlike the “permanence” of writing, speech is transitory, immediately fleeting. The speech signal is also linear, making irrelevant information especially irksome in VUIs, because unlike GUIs, there’s no way to skim over the material. By obliging users to wade through the uninformative, poorly designed VUIs are a drain on their valuable time. I’d argue that irrelevant verbiage is the number one reason people loathe customer service apps. Lots of VUI designers and developers foist irrelevant messages on the public in the form of promotional messages, upsells, and needless instructions. You’re no doubt familiar with the obvious instruction to “enter your ten-digit phone number, starting with the area code.”
Users’ perception of benefit is key. People do not appreciate taking extra time or jumping through hoops to find things out or to get things done. Research has shown that if a VUI does not offer a clear advantage over alternative ways of accomplishing the same task, users will avoid it. After all, what’s the point? Successful VUI design therefore offers the benefits of relevance and expedience.
The Takeaway: Keep messages short and relevant. Let users take their turn. Don’t go into heavy-handed details until or unless the user will clearly benefit.
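One way to put brevity into practice is to cap how much a single turn says and then hand the floor back with a question. The sketch below is only an illustration under assumptions: the function name, the default cap of two options, and the prompt wording are all hypothetical, and the options are assumed to arrive as short, ready-to-speak phrases.

```python
def summarize_options(options, max_spoken=2):
    """Speak at most max_spoken options, then hand the turn back to the user."""
    if not options:
        return "I couldn't find any matches."
    spoken = options[:max_spoken]
    prompt = "I found " + " and ".join(spoken) + "."
    if len(options) > len(spoken):
        # Offer the rest instead of reading everything out in one long turn.
        prompt += " Want to hear more?"
    return prompt
```

Given three flight options, this sketch speaks the first two and asks whether the user wants more, so the listener gets a turn before short-term memory is overloaded.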
4. Leverage context
To be relevant, we have to attend to context: A good conversational participant keeps track of the dialog, has a memory of previous turns and of previous interactions, and evidences awareness of the user’s circumstances—for example, that they’re in a foreign country, that there’s a severe storm on its way, or that they’ve already tried three times today to make some sort of settings change.
Likewise, VUI designs should leverage the user’s context as much as possible. If a user entrusts the interface with their information, it should respond based on, for example, what they’ve done, what they already know, and what’s been said earlier in the dialog. Obvious failures to attend to context will effectively undermine the perception of an intelligent assistant. A well-known example that’s universally disliked is a VUI's request to "Please listen carefully as our menu options have recently changed." Here's another version:
Intending to be helpful, this message is time-consuming and irrelevant
This sort of message, which is the centerpiece of a genre I call “VUI kitsch,” is irksome because it brashly ignores the user’s context. Just think of all the assumptions it is happy to make. It assumes that the user has called before. It assumes the user who has called before heard a different design (different “options”); that is, they didn’t call just a few minutes ago. It often assumes that “recently” is understood to mean “several months ago.” It assumes users who heard the different design actually remember it! Because the prompt disregards the user’s context, it ends up being irrelevant to practically everyone who hears it, and we’ve already talked about how and why irrelevance in the VUI world feels like punishment.
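Those assumptions can be checked rather than taken for granted. As a rough sketch, suppose the system remembers which version of the menu each caller last heard; then a change notice can be limited to the callers for whom it is actually relevant. The record type, the version numbers, and the prompt wording below are hypothetical, invented only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional

CURRENT_MENU_VERSION = 3  # hypothetical version number of the live menu


@dataclass
class CallerHistory:
    """What the system remembers about one caller (hypothetical record)."""
    last_menu_version_heard: Optional[int] = None  # None means a first-time caller


def should_announce_menu_change(history):
    """Announce a change only to callers who actually heard an older menu."""
    heard = history.last_menu_version_heard
    return heard is not None and heard < CURRENT_MENU_VERSION


def opening_prompt(history):
    """Build the opening prompt, adding the notice only when it is relevant."""
    parts = []
    if should_announce_menu_change(history):
        parts.append("Our menu has changed since your last call.")
    parts.append("What can I help you with?")
    return " ".join(parts)
```

In this sketch, first-time callers and callers who already heard the current menu go straight to the question. Whether even returning callers benefit from the notice is a separate design decision; the point is only that the decision is driven by the user's context.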
The Takeaway: We talk a lot in this industry about personalization, artificial intelligence, and data-driven innovation. But designs that simply keep track of the conversation and remain “aware” of the user’s context will effectively advance the perception of human intelligence.
5. Direct the user’s focus through word order and stress
The VUI’s awareness of what’s been said is also critical to determining how individual messages should be structured. Otherwise, failure to “keep track” burdens the listener’s comprehension process and causes vague discomfort. Listen to these two examples of different VUIs responding to a user's request to book a flight on a date that doesn't exist:
This recording puts new information before the old, breaking the end-focus conventions of normal conversation
This recording puts the new information where it should be: at the end
Why does the first recording sound weird and robotic, while the second seems conversational? The explanation is the End-Focus Principle. According to this rule of conversation—greatly simplified here—language users have unconscious expectations about how information is laid out in an utterance. “New” information by default comes at or near the end of the sentence and is stressed, while “old” information precedes it. In the examples you’ve just heard, what’s “new” is the info “30 days,” so it feels right at the end, and stressed appropriately. In the version that sounds strange, the old information, the topic “June,” has been miscast as if it were new information for the listener. In order to sound natural, it shouldn’t be stressed or come at the end of the sentence.
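Here is a small sketch of how a prompt builder might respect end focus when the user asks for a date that doesn't exist. The function name and the exact wording are hypothetical (they are not a transcript of the recordings above); the only point is the ordering: the month the user already mentioned goes first, and the new fact, the number of days, goes last. The month name comes from Python's standard calendar module, which returns English names under the default locale.

```python
import calendar


def invalid_date_response(year, month, day):
    """Return a correction with the new information last, or None if the day exists."""
    days_in_month = calendar.monthrange(year, month)[1]
    if day <= days_in_month:
        return None
    month_name = calendar.month_name[month]
    # Old information (the month the user named) comes first;
    # new information (the number of days) comes last, where stress falls.
    return f"Actually, {month_name} has only {days_in_month} days."
```

A template like “There are only 30 days in June” would carry the same facts but put the old information in the position where listeners expect the new.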
End-focus violations cause undue friction in the interaction. But when a design respects users’ expectations of how information should be structured, the user experience not only feels more intuitive; it also offers users the added benefit of confirming that the VUI heard them accurately.
Stating known information first lets the user know they were heard correctly, bolstering trust in the technology
This example shows that by putting the old info first (“the PM of India”), the user will know right away that the recognizer heard “the PM of India,” as opposed to, say, “the PM of Italy,” in which case there would be no need to pay attention to the (wrong) name that follows.
The Takeaway: To focus the user’s attention on what’s important, leverage their expectations of word order and stress placement. Unless your VUI’s persona is based on Yoda from Star Wars, put known information before new information when possible.
6. Don’t teach “commands”—speaking is intuitive
One of my pet peeves is emblematic of amateur VUI design: “teaching” users how to speak. Here are two examples:
Teaching the user how to communicate, these instructions are modeled after prompts typical of touchtone interfaces
These messages imply that you need to be taught how to use English; otherwise, the VUI wouldn’t be giving you these instructions. This style of prompting (“To VERB, say/do X,” “For NOUN, say/do X,” etc.) is a vestige of customer-service touchtone applications: “For technical support, press 1. For payments and billing, press 2…” And in fact, these messages are informative...but only in the world of touchtone. That’s because no one grew up knowing that “1” means “tech support”; we have no intuition about what meaning a developer or designer has assigned to the hash key. But in the world of VUI, this prompting style sounds absurd. It reveals a failure to understand that the whole point, the real benefit, of offering the public a VUI is that speech is intuitive; it doesn’t need to be taught.
Compare those touchtone-style prompts with this agreeably conversational alternative...
This prompt leverages conversational structures to impart familiarity, comfort, and naturalness
It’s hard to imagine someone having difficulty with such simple, straightforward prompts. And if they did, you’d address the issue in a reprompt. We’ve already talked about being brief, being relevant, and leveraging context. So again, deal with errors only when necessary.
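One possible way to "deal with errors only when necessary" is to add detail gradually: start with an open question, follow a first miss with a brief reprompt, and offer more guidance only if the trouble continues. The sketch below assumes a hypothetical banking-style task, and the prompt wording and function name are made up for illustration.

```python
# Hypothetical prompts, ordered from least to most explicit.
PROMPTS = [
    "What would you like to do?",  # initial open question
    "Sorry, what was that?",       # brief reprompt after a first miss
    "Sorry. Are you calling about a balance, a payment, or something else?",
]


def prompt_for_attempt(failed_attempts):
    """Pick a prompt based on how many recognition failures have occurred so far."""
    index = max(0, min(failed_attempts, len(PROMPTS) - 1))
    return PROMPTS[index]
```

Even the most explicit prompt in this sketch is still phrased as a question rather than as an instruction in the "To VERB, say X" style.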
The Takeaway: Avoid "teaching commands" in a VUI. If you have to explain a command, something’s wrong; go back to the drawing board. Instead of spoonfeeding commands, why not ask a question and make it clear the user can take their turn—sound familiar? That’s conversation!
With the advent of chatbots, assistants, and apps to wow the masses, we now have the opportunity to spread the word that conversation is the key to a successful user experience. And it doesn’t just mean sounding folksy, saying “you’re” instead of “you are,” saying “Oh” and “Thanks,” or eliciting opinions about ice cream. Conversation is one of Nature’s greatest masterpieces and our most powerful means of communicating through sound. We’d be foolish not to model our interactions after rules as old as the human race itself. The first step is becoming aware, technically, of what conversation is really all about.
Resources and Recommended Reading:
Voice User Interface Design by Michael Cohen, James P. Giangola and Jennifer Balogh
The Social Stratification of English in New York City by William Labov (PDF)
Speech Style and Social Evaluation by Howard Giles and Peter Powesland
The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places by Clifford Nass and Byron Reeves
“Logic and Conversation” by H. P. Grice (PDF)
A Concise Grammar of Contemporary English by Randolph Quirk and Sidney Greenbaum
James Giangola is a linguist and Creative Lead with Google’s Conversation Design team. He is a co-inventor and linguist on the U.S. patent “VUIs with Personality,” and co-author of the book Voice User Interface Design. With over ten years of experience as a classroom language teacher, he is also author of The Pronunciation of Brazilian Portuguese and u aufabetu brazileiru regularizadu, a proposal for a transitional "teaching" alphabet that would bring literacy to Brazilian learners, young and old, in just months rather than years.