Designing voice interfaces with Ben Sauer

A transcript of Episode 207 of UX Podcast. James and Jonas talk to Ben Sauer and discuss the complex world of designing for voice.



JONAS: You’re listening to the UX Podcast coming to you from Stockholm, Sweden. We’re your hosts, Jonas Söderström.

JAMES: And James Royal-Lawson.

JONAS: We have listeners in 182 countries from Colombia to Thailand.

JAMES: Jonas, you’re really helping me think about speaking slowly. I speak terribly fast and you speak quite nicely and clearly and slowly. So I will try and do the same. Today, we are joined by Ben Sauer to talk about voice interfaces and designing voice interfaces.

JONAS: Ben spent about six years working at Clearleft and joined us on episode 98 back in 2015 when he attended From Business to Buttons Conference.

JAMES: Currently, Ben is the director of Conversation Design at Babylon Health, a healthcare startup which makes extensive use of digital interfaces to provide healthcare to customers.

JONAS: And during the previous years, Ben has spent a lot of his time researching and working with the design of voice user interfaces.

JAMES: And how hard can it be? Let’s talk to Ben.


So, Ben, designing voice interfaces; we talk all the time, surely that’s a really easy thing.

BEN: [Laughter] Yes, that is the classic assumption. So the way to think about this – and I’m not the only person who said this – is we’ve been interacting with computers in the mass market for roughly 30 years, right? Although computers have been around a bit longer than that. We’ve been interacting with our voice for potentially hundreds of thousands of years, so our expectations about how it works and how well it should work are vastly different. And that’s why, traditionally, voice interfaces have tended to be so disappointing to people, because our expectations of the interactions are so high. And the complexity behind the conversation is invisible to us as human beings because we’re doing it every day. We just do it. Those are those billions of neurons up here that are making that happen.

JAMES: Yes, we’ve been doing conversations since the day we were born, we’ve been building that network layer in our heads to cope and deal with this.

BEN: Yes. Let’s talk about the timeless nature of this. I think it was Aristotle who said there is only one condition in which we may not need slaves or subordinates and that condition would be that machines could take our commands. So the dream of us speaking to machines is a very old one and it seems like with Alexa and Siri and so on, just in the last few years, that ancient dream is starting to come true. But as we’ve all experienced, it’s kind of broken, it kind of doesn’t work very well.

JAMES: It’s an illusion.

BEN: Yes, exactly. I say we’re creating smart fakes, we’re not creating real conversations.

JONAS: Actually, we talk about voice user interface as one thing but it’s actually two things. It’s talking to and being talked to, which are two quite different –

BEN: It is. And in fact, if we talk about multimodal, that can mean that you’re using one but not the other. So definitely those mean two very different things and there are very different capabilities and problems attached to both. So you’re absolutely right.

JAMES: So one of your design decisions or even strategic decisions would be, are we going to create an interface which can interpret our commands or are we going to create an interface that can ask us questions and understand the answers?

BEN: Yes, there’s also some interesting work being done in this. Multimodal isn’t anything new, people have been sort of playing around with it for a long time. I don’t know if you’ve ever seen the demo entitled, Put That There from MIT in which a man points at the screen and says things and it all works kind of together to move objects around a screen. That’s from the ‘70s, it’s really old.

But yes, so for example, I went to a talk by KLM, the Dutch Airline and they said that they’re starting to practice something that they’re calling touch-free design. So it’s the idea of voice input, screen output. So when you’re holding your mobile device, people are getting so comfortable at speaking, that actually we can just remove the touch input but you’re still getting all of the app on screen as you go through a booking flow or conversation or something they’re providing.

JAMES: That would be similar-ish to what we’ve seen with Google’s voice interface on the mobile. You speak into your device and it searches for you, it gives you some kind of information on screen.

BEN: Yes, absolutely. So it’s going to get complicated. We’re back in an era that’s broadly similar to when we went mobile. One thing we’re starting to learn, a lot of people have noticed this is, it’s like when we started to say mobile first. If you design it voice-only first and you get that right, it tends to work well in other mediums. But if you do it the other way around, it’s really broken. If you take all the content in your FAQs that you wrote for screen right now. For example, let’s say you’re a content strategist and say, I’m going to pick all this out and use them as answers in a voice interface or a conversation interface. And you very quickly find, “Oh, this is terrible” when something reads out.

In the organization I work in at the moment, which is called Babylon Health, we’ve started to say, are we thinking in designing voice first? Even if we’re not necessarily delivering a voice interface, just starting with that thought in mind can really help us.

JAMES: If you’ve been working with chatbots, so textual conversation interfaces. Does that also fail to be a first step? Is voice absolutely the definite first step?

BEN: Yes, absolutely. And I would say, even if you’re designing a chatbot, you should be doing it in voice even if you’re not going to deliver a voice interface because the motions you go through of improvising and doing read-throughs actually make it feel and sound natural. When we speak and when we write, we produce different outputs. And so if you want to aim for natural, you should really be doing read-throughs and improv and things like that to make sure it’s as natural as possible.

JONAS: I have a split relationship with the voice interface, as it’s the classic horror stuff, the disembodied voice, that staple of horror movies, isn’t it?

BEN: Yes.

JONAS: And not only HAL but all kinds of spirits and ghosts and things. On the other hand, the first time I drove a car in the US with GPS and I missed an exit on the highway and I panicked. And then comes on the voice from the GPS with the most reassuring female motherly voice saying, “Recalculating. Drive 7 miles.” I was so relieved, and I just fell in love with the GPS right there. So it can be reassuring but kind of scary too.

BEN: Yes. I have this paradox in mind that relates to that, which I’ve been thinking about a lot and I think I tweeted it a while ago. Voice as an interface has the highest potential for trust because it’s tapping into our ancient systems, the things we’ve evolved to relate to other humans. But it also has the highest risk of breaking, it’s the most error-prone. So it’s a paradox. You could achieve an amazing level of trust with the voice interface. But because machines aren’t there and fully understanding us yet, you can break it in a second. And I’ve noticed this in testing.

If I test a long voice flow – if you even encounter a small error up front, you can see the kind of lights go out in people’s eyes, they kind of glaze over and they’re like this is kind of – And so you have this paradoxical relationship that I assume will change over time as the capabilities of the machines increase.

JAMES: Go back to what we said about we’re just faking it, kind of thing. If you do a good job of making it sound convincing, then our ancient brains who are used to language really are convinced. And then it’s the Wizard of Oz thing, isn’t it? When something goes wrong the curtain gets pulled back and we suddenly realize it’s a machine. So we fall back then into dealing with a machine. And we try to adapt our behavior to now what is clearly a computer.

BEN: Very much so. And in fact, even if you don’t encounter an error, it starts earlier than that. So for example, I work with some NLP language machine learning data scientist types. And at some points we’ve been testing open-ended voice inputs. So saying maybe 20 to 30 words. We want the user to give that kind of input. But when you actually test it, they just won’t speak naturally to it. They just won’t, because expectations of it being broken are being brought into the room. It’s almost like the broken nature of it at the moment creates this kind of – everybody has a robotic Alexa voice, you’ve all probably heard it.

Actually, when the technology can accept more natural conversation, we’ve got a design problem which is to say, no, hey user, you can speak more naturally now.

JAMES: Exactly. Transitioning users from one expectation of an interface to another.

BEN: Yes, absolutely.

JONAS: The Wizard of Oz, James brought that up and that is a classic example, but you’re actually using that as a method for designing, aren’t you?

BEN: Yes, I do. It’s something I teach, it’s something I highly advocate and if you’re not familiar with the Wizard of Oz method I’d just quickly recap. It’s the idea that you play the machine and the user is no wiser as to whether the system is truly intelligent or not. You’re just pushing buttons of lines you’ve written. So the way I do this, I’m a real fan of rapid low-effort prototyping in order to improve the design. So along with Abi Jones at Google I developed this method where you put your imagined dialog into just a text file, actually. Like something really, really simple and basic.

And then I wrote a little piece of software called Say Wizard. And what Say Wizard does is it maps all the lines in your dialog to keys on your keyboard and then you can have somebody come in for a test, give them a scenario and then you’re just pushing the QWERTY keys or whatever they are on your keyboard and reading out your lines. And the beauty of that is if the dialog goes wrong or if they say something unexpected, the only thing you’re editing is a text file in order to test with the next person or try something new again.

And really, you know what I say in my talks is like you can hopefully get the kind of happy path 90% right before you’ve written a line of code.
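
As a rough illustration of the idea Ben describes (lines of a dialog script mapped to keyboard keys), here is a minimal Python sketch. It is not the actual Say Wizard software: the function names, the key order and the sample script are all assumptions made for illustration, and speaking each line aloud through text-to-speech is left out.

```python
# Hypothetical sketch, not the real Say Wizard: assign one keyboard key
# to each non-empty line of a plain-text dialog script.
KEYS = "qwertyuiopasdfghjklzxcvbnm"  # assumed key order, top row first


def map_lines_to_keys(script_text: str) -> dict[str, str]:
    """Map each non-empty script line to the next free key."""
    lines = [line.strip() for line in script_text.splitlines() if line.strip()]
    if len(lines) > len(KEYS):
        raise ValueError("more dialog lines than available keys")
    return dict(zip(KEYS, lines))


def line_for_key(mapping: dict[str, str], key: str) -> str:
    """Return the line bound to a key press, or an empty string if unbound."""
    return mapping.get(key.lower(), "")


# Invented sample dialog for a booking scenario.
script = """
Hi, where would you like to travel to?
And when would you like to leave?
Sorry, I didn't catch that. Could you say it again?
"""

mapping = map_lines_to_keys(script)
for key, line in mapping.items():
    print(f"[{key}] {line}")
```

Because the dialog lives in plain text, changing a prompt between test sessions only means editing the script and rebuilding the mapping.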

JAMES: Which is a pretty high success rate from a prototype and testing.

BEN: It is. What that doesn’t capture is the unending kind of well of despair that is error recovery. So we spend most of our time in voice actually dealing with all the ways in which the conversation can break and all the ways in which the machine might misinterpret or misunderstand. I think what’s important to know is that our language is so deeply conceptual and implicit. The things that we take for granted are really not if you break them down.

There’s so much sort of shared context that we have with other humans even in this conversation. I don’t need to give a specific instruction: when I say Wednesday, you will probably infer next Wednesday, right? And a machine cannot make that assumption. So you have to go through layers of smart design and confirmation and error recovery to be absolutely sure that we’re capturing the input we want. And this is far less of a problem on screen.
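
To make the "Wednesday" example concrete, here is a small Python sketch of one way a system might resolve a bare day name and then confirm its guess rather than assume it. The function names and the wording of the prompt are assumptions for illustration, and "the next occurrence after today" is only one possible reading of what a speaker means.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def next_weekday(day_name: str, today: date) -> date:
    """Return the next occurrence of day_name strictly after today."""
    target = WEEKDAYS.index(day_name.lower())
    days_ahead = (target - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def confirmation_prompt(day_name: str, today: date) -> str:
    """Build a prompt that checks the machine's guess with the user."""
    guess = next_weekday(day_name, today)
    return f"Just to check, did you mean {day_name.capitalize()}, {guess.isoformat()}?"


# Monday 11 March 2019 is used as an arbitrary "today".
print(confirmation_prompt("wednesday", date(2019, 3, 11)))
```

The confirmation step is the point of the sketch: the guess may well be wrong, so the user gets a chance to correct it before anything is booked.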

JAMES: I guess what falls into the realm of errors is also task divergence. And what I mean by that is your interface is designed for a particular task. But when I’m talking to your interface, I’m human so I kind of wander, oh by the way, well, can’t you just – why are you doing that? So all these things that we kind of throw in to normal conversations, that becomes effectively an error pattern.

BEN: So voice is an unbounded input. I think it’s an interesting transition for screen designers, because screen designers are often a little bit OCD. They’re often a little bit, like, well, I define the buttons, user, and I tell you where you can go and I make it nice and tidy. And of course, a conversation is just this enormously messy spaghetti pile of assumptions and mental models and all the different places it could go. So you have to kind of abandon your sense of control over what’s going on.

The principle I talk about most is managing expectations. How do you convey to a user that it can’t do everything and kind of gracefully get them back on track when they ask for something off the rails that you’re providing?

JAMES: Dialog nudging.

BEN: Yes. And you can do it gracefully without it feeling like an error. But it’s difficult. It’s very easy to build a happy path voice interface, absurdly easy. It’s much easier than a screen interface. It can just be as simple as writing a text file. But to actually make it robust – and of course, it is the most error prone mode of input that we have, right? I sometimes like to say a Scottish accent never broke my trackpad.

JONAS: This kind of interface is so complicated and so error-prone and so extremely messy. What’s causing the shift towards voice? What’s the reason for going there?

BEN: So I guess like many technologies that are buzzwords at the moment, machine learning has powered most of what we’re seeing in the voice space. So just for context, before, we had, like, phone systems that we spoke to. They had what was called a fixed dialog or fixed grammars. So it could only recognize a very limited number of words that were spoken in a very particular way. In fact, the very first voice interface, going back to the ‘50s, could only understand 10 digits spoken by one person.

And so now we have machine learning and deep learning. We can say, right, if you’re Google or if you’re Amazon, you can take a thousand examples of a word, audio clips of a word. Let’s say we’re in Sweden so Surströmming, something like that, which is a fish for the non-Swedish audience, a fish dish.

JAMES: A fermented rotten one.

BEN: Yes. And then all of a sudden, through deep learning, by analyzing all the ways in which we say the word and all the different samples of it, it can then get a much higher recognition rate if somebody new says that word. So that’s the one layer. That’s called automatic speech recognition. And then the second layer down, so once you’ve actually just understood what the words are, like, you know, writing them out in text, for example. That’s what the machine is capturing as a first layer. Then the second layer is actually applying meaning to that and then routing the user in the appropriate direction.

So that’s where natural language processing comes in. So when I say something ambiguous like, “Alexa, play Live and Let Die” Alexa can then infer that because I listened to the Wings version of that last week I mean Wings. And because I’m saying it in my kitchen in a speaker, I want to listen to a song and I don’t want to play the James Bond film on my TV screen. So you can see that example; the unimaginable ambiguity or the endless ambiguity in the things that we say that a computer then has to deal with.
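
The second layer Ben describes, turning recognized text into a meaning by using context, could be sketched roughly as below. This is a toy, rule-based illustration only: real assistants use learned models, and the `Context` fields, the catalogue and the `interpret` function are all hypothetical. The wake word is assumed to have been stripped already.

```python
from dataclasses import dataclass


@dataclass
class Context:
    device: str           # e.g. "kitchen_speaker" or "living_room_tv" (assumed names)
    recent_artists: list  # hypothetical listening history


# Hypothetical catalogue: one title, several possible interpretations.
CATALOGUE = {
    "live and let die": [
        {"type": "song", "artist": "Wings"},
        {"type": "song", "artist": "Guns N' Roses"},
        {"type": "film", "artist": None},
    ]
}


def interpret(transcript: str, ctx: Context) -> dict:
    """Take text from speech recognition and pick the most plausible meaning."""
    text = transcript.lower().strip()
    if not text.startswith("play "):
        return {"intent": "unknown"}
    title = text[len("play "):]
    candidates = CATALOGUE.get(title, [])
    # A speaker has no screen, so a film is an unlikely request there.
    if "speaker" in ctx.device:
        candidates = [c for c in candidates if c["type"] != "film"]
    # Prefer an artist the user has listened to recently.
    for candidate in candidates:
        if candidate["artist"] in ctx.recent_artists:
            return {"intent": "play", "title": title, **candidate}
    if candidates:
        return {"intent": "play", "title": title, **candidates[0]}
    return {"intent": "unknown"}


kitchen = Context(device="kitchen_speaker", recent_artists=["Wings"])
print(interpret("Play Live and Let Die", kitchen))
```

Even in this toy version, the same words lead to different results depending on the device and the listening history, which is the ambiguity being described.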

JAMES: I think any of us that live in families and have a voice interface somewhere would have come across the whole hierarchy problem as well, that “I don’t want the kids to override my music choice.” Or “When you’re angry, you shouldn’t really delete all my photos of my wife or whatever, because I don’t really mean it in the heat of the moment.” That kind of understanding of the context is incredibly difficult to interpret.

BEN: Exactly. We have no real sense day-to-day of the enormous subtlety that our brain is sort of dealing with and calculating instantly on. And actually, that one you mentioned – let’s say deleting something. This is actually where if voice interfaces are going to be truly useful and companion-like to us, then they really need to do a much better job of understanding that context, understanding who we are and all the security implications that are attached to that. And the controls we have over that with the big platforms at the moment are incredibly basic because it’s such a hard problem, understandably so.

JAMES: Is this what we will call, then, affordance in a voice interface? When there’s things like deleting that we would make sure that you really are sure, kind of thing. Or maybe you can recover it. There’s an undo. We have to build a similar kind of thing into voice interfaces.

BEN: Absolutely. Just to put it really bluntly, if we’re talking about a voice-only interface, there is no affordance. Looking at website navigation is like standing on a highway looking at the signs above the highway telling you where to go. Then the voice interface is like standing in the desert spinning 360 degrees around with no clue where to go. Because there’s two layers to that. How do I speak to it? What words and commands will it accept? And then there’s also, what can it do? So there’s the how do I talk to it and then there’s discovery. So the affordance issue is kind of double.

I think with regards to not deleting all my photos, I guess that is, I suppose, more to do with confirmation, I would say. As in, we would in a screen interface be very careful about an action like that and we would make it very undoable. So if you’re asking, “What are the main questions you need to ask when you’re designing voice?”, it’s “What would a person do?” Really, that’s kind of the main question. So if you were commanding somebody in your office to delete all your photos, they would look at you quizzically and go, “Really?” And that’s the same expectation we’d have.

JAMES: If I said to Jonas; Jonas, can you delete all the pictures of my wife? Jonas might well go, “Okay, I’ll do it.” But he wouldn’t do it.

JONAS: I would like.

JAMES: Exactly. The compassion in you or the human understanding would go, “James really would regret this tomorrow.”

BEN: Yes. Or once he’s through the divorce process, yes.


JAMES: My wife is going to hear this episode as an example.


BEN: Sorry.

JAMES: But yes, there’s so many human aspects to a voice interface.

BEN: Basically you’re tripping into all these social expectations that we mostly are not dealing with in screen world.

JAMES: I had another thought as well; while we’re talking about designing voice interfaces and the way that we have to test and prototype them, how do you go about actually building them?

BEN: So it’s mostly done through the means of writing dialogs. Let me take you through some of the basics that I try to teach. So there’s, one, understanding what your use case is, which is a sort of user research/strategic question. That’s step one. Step two is sort of standard UX practices, like user research would tell you, well, how do people talk about that thing and what are their expectations of a conversation in this situation? If you were designing a coffee voice app you would go and sit in a café and listen to how people order coffee. And then you would start the process of imagining how those dialogs go.

So it might be like storyboarding, role play, writing scripts. And then you do some sort of early stage testing where you just kind of play around a bit, like I mentioned with the Wizard of Oz testing. You’re not trying to solve the whole thing but you kind of see if you’re on the right track. And then once you’ve got some sense of whether you’re in the right direction, you would then go into flows. So you’re literally just designing flow diagrams at various levels of zoom. So you might have the overview flow diagram which sort of tells you here are the use cases. So, like, I want to book something, I want to change it, I want to cancel it or I want to check it. Booking a train ticket, I need to do those four things.

And so you need to make sure the conversations work for those and then you need to start designing the endless amounts of error recovery in those very complicated conversations. And then you’re passing that on to developers and testing real artefacts. And I would say, like, most of the voice interfaces that people have designed for smart speakers, they really haven’t had the rigor that we would consider to be appropriate in UX world. It’s very easy to just make something that works with a happy path and be done with it and launch something.

So, for example, I sometimes turn on something called sleep sounds for my kids, which is just rain noises and things like that to kind of help them go to sleep. And it’s incredibly rigid. Like, you can only say certain things to it. If I say, “Open sleep sounds.” And then I say, “What kind of sounds do you have?” There’s no answer to that. It has to be spoken to in an incredibly rigid way, I have to look at a screen to find out what sounds it can produce.

So your job, really, as a UX’er, I suppose in this space, is to make sure you’re building in lots of the error recovery and flexibility that we would have expectations around in a human conversation.

JONAS: As a designer here, you’re designing phrases and flows of dialog. What about emotions? What about tone of voice?

BEN: Yes, so I kind of skipped that one, actually. I guess I spend a lot of time in the early stages thinking about the persona. Now, over the way in which the voice interface expresses itself, most designers have fairly limited control. You’re not necessarily choosing an original human voice to use, you cannot control the nuances of the expression very well, so mostly what you’re designing around is word choice.

Google does allow you to choose from a subset of voices for their actions, so that’s quite useful. And then you have a language called SSML, it’s a speech markup language, like HTML. So you can sort of add pauses or emphasis. But it’s quite –

JONAS: That’s very rudimentary. Just emphasis is –

BEN: Yes, it’s very rudimentary.
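
For readers who have not seen SSML, the sketch below builds a short fragment with one emphasized word and one pause. The `speak`, `emphasis` and `break` elements come from the W3C SSML specification; the `build_ssml` helper and the sample sentence are invented for illustration, and each voice platform supports its own subset of SSML, so the result may need adapting.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, emphasised: str, after: str, pause_ms: int = 500) -> str:
    """Build an SSML string with one emphasized word followed by a pause."""
    speak = ET.Element("speak")
    speak.text = before + " "
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = emphasised
    emphasis.tail = " "
    pause = ET.SubElement(speak, "break", time=f"{pause_ms}ms")
    pause.tail = " " + after
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your train is", "delayed", "by ten minutes.")
print(ssml)
```

Pauses and emphasis are about as far as this kind of markup goes, which is why it is described as rudimentary.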

JONAS: Like we started to design websites with just strong and emphasis for typography, substituting for all kinds of amazing typographic detail. Somewhere, at some point, we need to be able to direct the voices with other kinds of commands, I guess. It’s the Anton Chekhov play where you have a phrase which says, “Cheerfully through tears.” Which is from Three Sisters, I think. I would love to hear the voice saying, “My train is delayed” or something like that. That being expressed cheerfully through tears.

BEN: Yes, that’s a lovely rich example. So what Amazon are actually experimenting with at the moment is using deep learning to analyze contextual tone of voice or subject tone of voice if you like, maybe that’s a better way of saying it. So for example, give it a thousand hours of newsreaders speaking the news and then have it express a couple of paragraphs of text differently using the expressions it has learned from all the samples of newsreaders.

So in the future, we could say, right, I’m doing a cheerful food delivery app. This is the kind of domain or tone of voice or subtleties that I need. And then when you write your dialog, the machine is learning what kind of expression it ought to give your dialog as well as maybe some manual control over that as well.

JAMES: To finish off – we have to finish off shortly – but when you said that about the markup language; and we’ve also mentioned Google, we mentioned Siri, we mentioned Alexa and there’s Cortana as well. How much are you forced to design for each of these individual platforms? Do we have a HTML for voice?

BEN: I’m shaking my head. No. There are some broad principles that are the same but the platforms do tend to work differently. And I think somebody at Google said it is a bit like the early days of the browser.


BEN: You have to kind of think about each one in a slightly different way, you can’t use the same tools yet to kind of export your designs into them. That’s starting to happen, we’re starting to see an ecosystem of tools that will help with that. But it’s very early days. So what I train people on is like, look, you’re going to have to do some reconnaissance around the platform you’re designing for because not only are they different but they’re changing all the time. The capabilities are improving at quite a pace. So you have to kind of stay on top of it.

JAMES: I think the analogy comparing it to the early days of the browser and also, maybe, the early days of mobile. We still suffer from that. Companies choose to do the iOS App or the Android app or the web app and they’re all slightly different with different challenges. We seem to be destined to never really be truly free from that life as a designer.

BEN: Kevin Kelly calls it – like the sort of technium, doesn’t he? He has this idea of a metaphorical world or drive that just keeps pushing the progress and we’re all kind of caught in the vortex.

JAMES: Caught in the vortex. I don’t know if it should be my job title or my epitaph. Thank you very much, Ben, for joining us and talking about voice interfaces.

BEN: It’s a great pleasure. It’s been nice to talk to you again.


JONAS: Thank you for spending your time with us. Links and notes from this episode can be found on If you can’t find them in your pod-playing tool of choice, there should also be a transcript available there too.

JAMES: And if you want something to listen to next, then we recommend Episode 121 which is Agentive technology with Chris Noessel.

JONAS: Remember to keep moving.

JAMES: See you on the other side.


JAMES: Knock-knock.

JONAS: Who’s there?

JAMES: Crock and Dial.

JONAS: Crock and Dial, who?

JAMES: Crocodile Dundee.

This is a transcript of a conversation between James Royal-Lawson and Jonas Söderström recorded in March 2019 and published as Episode 207 of UX Podcast. 

Transcript kindly provided by Qualtranscribe.