Multimodal design

A transcript of Episode 251 of UX Podcast. James Royal-Lawson and Per Axbom are joined by Cheryl Platz to discuss designing for multiple inputs and outputs and the challenges of creating multimodal, cross-device experiences.

This transcript has been machine generated and checked by Daniel Meagher.

Transcript

Per Axbom
Thank you to everyone who is helping us with our transcripts. You’re doing a great job helping us make sure they’re published together with the podcast. If you’d also like to help out with publishing the podcast, just email us at hej@uxpodcast.com.

Computer voice
UX Podcast Episode 251.

[Music]

Per Axbom
Hello, I’m Per Axbom.

James Royal-Lawson
And I’m James Royal-Lawson.

Per Axbom
This is UX Podcast. We’re in Stockholm, Sweden, and you’re listening in 197 countries and territories around the world, from Germany to Belize.

James Royal-Lawson
Cheryl Platz is a designer, speaker and actor. She designs complex interfaces for screens and machines, and is known for her cutting-edge work on Amazon Alexa, The Sims series of games and Microsoft’s Azure platform.

Per Axbom
Cheryl is now also an author, having just released her book “Design Beyond Devices: Creating Multimodal, Cross-Device Experiences”. In her book, we learn how to go beyond screens, keyboards and touchscreens by letting your customers’ humanity drive the experience.

James Royal-Lawson
We’re delighted to have Cheryl on the show teaching all of us how to stay relevant and keep up with a future that is already here.

Per Axbom
Stay tuned after our chat with Cheryl, for our post interview reflections.

Per Axbom
I’ve been reflecting back on episodes that you and I have done over the years, and I remember us really early on talking about smart speakers, and you were upset about, I mean, how will it work with families? How will that work when all of us are speaking? Who has the loudest voice? Or who does it listen to first? Things like that. And those types of problems I haven’t really seen or heard anyone deal with until I read Cheryl’s book. And I think one of the sentences in the beginning of the book that really, really resonated with me is: more than ever, we need to question everything, because the world we were trained for no longer exists. And it feels like this is why you wrote the book.

Cheryl Platz
Yes, I have such fond memories of my training at Carnegie Mellon in human-computer interaction, and I’m very, very grateful for it. But my senior project was on Palm Pilots, which could not be farther from what we do today. Aside from, I guess, there are still styluses around. But nothing about multiple users at the same time, nothing about hands-free interaction, nothing about… the gap is so large. But even people who trained five years ago, I mean, the Echo was just sort of coming into the market then, and people weren’t sure if it was a fad.

I’m pretty sure most schools weren’t teaching it yet. And even if they were, it was still pretty much… despite the entertaining ad that introduced the Echo, the Echo didn’t work like that. As you say, they didn’t support families, and it still doesn’t really natively support a natural conversation. I thought there was some really good work at Interaction 18; somebody, I believe from the United Kingdom, had gone through and done a research study where they worked with transcripts from families talking to the Echo, and it was just a mess, right? Just incomplete conversations and spaghetti strings of context. Life is messy, life is full of people, even office contexts, right?

This year, everything has changed. We can’t assume that people are in a quiet environment or have a one-on-one environment. And so I’m glad it resonates with you, because many times in my career I’ve felt like I’m in uncharted waters. And I know how it feels to just kind of think, well, what the heck do I do with this? No one told me what to do for this. Is there a solution? And the only way I’ve gotten through is approaching these problems and saying, well, what would I do if it was solvable? And this book is the result of that approach: approaching these tough problems and saying, okay, I’m going to assume it’s solvable, and we’ll just try. And some things have worked, and some things are a little bit more hypothesis, and some things I’ve got evidence on, but I hope this is useful for folks as we head into a very uncharted age.

James Royal-Lawson
I think maybe that’s it, maybe our industries are destined to be doomed. You know, I always have this feeling that what we learned is no longer what counts anymore. I think we’re probably always going to be trapped in that way of working.

Cheryl Platz
And it was frustrating at Amazon, because I was at Amazon in 2015 and 2016, working on Echo when it was still pretty early. But, you know, we were working on the Echo Show at the time, so we knew we were going multimodal, full force. And I was trying to get people thinking about some of the concepts in this book already. Like, hey, we kind of need a spectrum, we need a shared language for talking about when we use different types of multimodality and what they are, like when we choose to do full voice versus sort of partial voice and then fall over to screen.

But people were running so fast, they weren’t ready for that. It’s very hard to take a step back and do a systems design take on all of this. And I get that too, which is another reason why the book’s out there. It’s like, I get it, it’s really complicated, and you’re probably not staffed to do this work. So let me do some of the heavy lifting for you, because I’ve been there too.

James Royal-Lawson
So, let’s just take a step back a moment and answer the question, which I know you will have been asked several times before: what is multimodal, or multimodality?

Cheryl Platz
Yes, yes. And it has so many meanings, which is funny. It’s always been kind of clear in my head, and then when I went out and looked, as many designers have encountered multiple times in their career, it was not as clear as I thought. At its most basic and non-formal, multimodality, as this book operates, is a system that supports multiple modes, or modalities, of interaction, both output and input. Usually it’s both, both output and input.

Technically, you could have a multimodal system that only has a screen and, you know, supports touch and voice. And you could also have the converse, where a system speaks to you, but you have to type to it. But especially when you get into things like voice, there’s a lot of research that indicates that our brains interpret speech as, you know, another human talking to us, and so not speaking back and forth violates a social contract. So that’s a whole other thing.

So usually, if voice is involved, you’re probably doing some kind of visual and voice. Technically, the Echo is multimodal because there’s an LED, but that’s not always… you can’t always assume someone’s looking at it. So that’s one part of what I call the multimodal interaction model: you need to take into account, if you’re totally hands free, what you can and can’t assume. Like, I can’t assume in a hands-free system that I’m looking at a device, and I can’t assume that it’s within arm’s reach.

So that stuff can be bonus. Like, the Echo has a button on it where you can make it stop the alarm noise. So if it’s near me, I could do that, but I can’t assume that’s the only way people are going to interact with the system. And same with the LEDs: it can give you reinforcement, but that can’t be the only way we impart information to you. So multimodality is that bidirectional, and sort of, not infinite, but a system with many possibilities.

James Royal-Lawson
Yeah, so there could be many ways of inputting, and there can be many ways of outputting, but they don’t all have to be in use at the same time, or in the same period of time.

Cheryl Platz
Yes, and I’m glad, that’s a really good question too, because there are different types of multimodality, different approaches. So you could do sequential multimodality, where you’re kind of like, okay, I am working with keyboard and mouse now, and I’m going to switch modes, and now we are going to talk to the machine and it’s just talking, and then I’m going to go back to keyboard and mouse. You could do simultaneous multimodality, which we were trying to experiment with on Windows Automotive, where you say “I want to go to there” and at the same time you’re literally touching a point on the screen which corresponds to “there”. That’s a lot more complicated, and it’s a lot more costly, and you kind of need to know what the cost-benefit is for doing that. It’s like a holy grail, but I understand it’s costly, so we need to know why we’re doing it.

Then there’s sort of orchestrated multimodality, where you make it really easy for people to do whatever they want in the moment. The sequential model usually isn’t super flexible in how it lets people move. The way I see that a lot of the time is, like early Echo, there were parts of the out-of-box experience you had to do from an app. So it was multimodal, because they didn’t have a good way to do WiFi passwords. And they may still do that, I haven’t set up an Echo in a while. So that’s a sequential thing: you can do voice, and then you have to go to the app, and then you come back. An orchestrated experience would be where they have multiple ways to get you through the out-of-box experience. So you could do voice or you could do touch, and it’s up to the customer to choose what’s right for them in the moment.

Per Axbom
And that has to do with accessibility. I think you point out in the book, I mean, if you require one specific input, that’s worse for accessibility, but if you allow more than one input, that’s better for accessibility.

Cheryl Platz
Yes. Such a great point. Because I had this kind of crisis when I started working on a product like smart speakers: there’s this moment where you’re like, “we’re doing such great stuff for humanity, we’re including folks who have been excluded for a while”. And then you start hearing cries of voice first and voice only, and it’s like, well wait, aren’t we just leaving the previous folks out, the folks that screens and touch included? So anytime you’re requiring only one modality, you’re leaving someone behind.

So the more you can get to an orchestrated experience, the more inclusive you’re going to be. I understand it’s complicated, I totally get it. It’s a lot. That’s part of why my book is so long. But it’s important, and if you can get there, it changes lives. Especially if your system deals with something that people deal with every day, it can really make a difference to provide that flexibility.

James Royal-Lawson
Is there a point where, or is there a definition of when, inputs can be or must be active inputs versus passive inputs? I’m just wondering about the whole thing of when I’m knowingly giving instructions, inputs, to something, and when it’s passive data collection, which is also an important part of the modal and multimodal experience. Did I make that question complicated? Too complicated?

Cheryl Platz
Yeah, rephrase it for me, please.

James Royal-Lawson
Right. Okay. What was that again? Look, so basically, maybe if your speaker is listening and it can tell there are multiple people in the room, then it’s understanding, it’s listening, it’s input, but it’s not an active kind of command. It’s passive information it’s picking up, that it would then maybe use to help the situation?

Cheryl Platz
Yeah, that’s a whole bucket of questions about trust and feasibility and desirability, and that’s technically possible. And a lot of people believe, and I believe there’s some proof, that some phones do this already, depending on your settings. Like, “Oh, I talked about french fries, and now Facebook’s just recommending burger joints to me”, and when people have that conversation, it’s rarely in a positive light. And a lot of the current voice assistants have some form of fail-safe in place to prevent ambient listening, usually the wake word. But I do always encourage people, when they ask me, to challenge the manufacturers of these devices; like, be familiar with the privacy controls. The wake words could go away at some point.

And it could be billed as this great removal of friction from the process. But then, how do you know that the devices aren’t listening and trying to take action on ambient cues that they’ve heard? If we ever get to that point, customers will have a great additional cognitive burden learning about what those devices are doing with the potential ambient understanding that they’re receiving. I like the wake word. I don’t know if that makes me old fashioned. We’ll see how folks relate if we ever get to the point where, you know, it’s just a switch and you could turn off the wake word; they could do that. They haven’t really moved away from that, and it’s kind of the special sauce that enabled the Echo to be welcomed into people’s homes.

I know a lot of the development that Amazon put into the Echo originally was: how can we do some processing locally, so we’re not just arbitrarily sending all people’s speech into the cloud? But yeah, if you’re in a trusted environment, you could, if you consent and you’re okay with it. There’s a lot you could potentially gather from ambient listening, if you could sort out who was speaking and if you had really clear microphone arrays, because the further you get, the messier things get.

The interesting thing, and this kind of goes back to Per’s point earlier: there’s been some form of speaker recognition for a while on the Echo, and I believe on Google, but we haven’t got to the point where these voice-interaction-enabled devices are really fluidly detecting who’s speaking at any one time. Certainly not within a single conversation.

It’s mostly switching profiles, and it’s not even really good at that. Like, our Alexa has never been able to tell the difference between me and my husband, and he’s a trained Shakespearean actor and a baritone, and I’m like a soprano. I don’t know what’s going on with that, but okay. So it’s weird, because I know some of the technology exists to do the separation, and to maybe be able to deal with a multi-user scenario really fluidly and understand: “Well, Mom said she wanted to watch Moulin Rouge! and Dad said he wanted to watch Transformers”, you know, how are we going to disambiguate those two choices? The technology is technically there, it’s not perfect, but I haven’t seen much play with that rich multi-user interaction.

James Royal-Lawson
Because I’ve read about it, I’ve read about how your voice is effectively a fingerprint, isn’t it? I mean, you…

Cheryl Platz
Yes.

James Royal-Lawson
So you’d think, if you’ve come to the point of saying, well, your voice is a fingerprint, then you’d be able to kind of get that out of the sound recording somehow, into “Ah, Cheryl’s speaking”.

Cheryl Platz
Getting to full voice printing, that was another holy grail for us in Windows Automotive. That’s still a little processor intensive for a lot of the smart speakers today, and we were designing those automotive interfaces back in, like, 2013. So voice as a truly biometric identity: none of these devices have that level of disambiguation yet. Instead, they have this voice profile or voice login thing where it’s differentiating basically on pitch and, you know, possibly prosody and other qualities of the voice, but it’s not biometric quality. If there were another woman with the same pitch, it would probably not be able to distinguish us, and it can’t even distinguish me and my husband, so the FBI would not use it for voice printing. No, no James Bond and Alexa.

Per Axbom
That makes me think of, I mean, I was on a call earlier today, and someone was complaining about how he has more and more smart devices. I mean, there are microphones in everything now. And when he uses his wake word, several devices, of course, turn on and want to please him. And that in itself, of course, is becoming a problem: how do the devices know which one he is talking to? Because you have the same wake phrases, usually.

James Royal-Lawson
I can just imagine minions now. Yeah. Looking at the same point, the Claw.

Cheryl Platz
It reminds me of when I was working on the Alexa team, and they were preparing for the first Super Bowl ad, the one with Alec Baldwin. In the ad, he was supposed to order wheels of pecorino cheese or something, and we all had an Echo on our desk. And if you forgot to mute it, bad things happened, because they were testing the commercial to see, a) that the commands were supported, and b) whether the commercial was properly muting itself. So I would come back to my desk sometimes, or go home, and my shopping list would have like 400 wheels of pecorino cheese. Okay, that’s Amazon problems. So luckily it wasn’t ordering those, it was just on my shopping list.

But yeah, the device proliferation thing is a real challenge. I mean, I kind of almost see this as a design ethics problem when we talk about waste and consumption. Because in 2015 and 2014, people didn’t have far-field microphones in their homes, so yeah, we had to sell speakers. Now we’re at the point where, as you say, almost everything, your laptop, your phone, your smart speaker… there’s one in every room, you might have multiples, and they’re all far-field devices. So is it ethical to continue to sell things with far-field microphones? Is there a way we can leverage the far-field microphones that already exist? Can we come up with an interoperability pattern, so that these things can work together?

There’s the wake word. I will say, it was smart of the Amazon folks from the beginning to offer multiple wake words, so you could kind of hack a light sort of disambiguation. I’ve never understood why we didn’t do at least a little bit with distance, like the volume of the voice or the proximity. If you can beamform, and the Alexa can, you know, it’s got that blue ring and then the light blue ring, so it knows where you are, it probably knows how close you are to the mic too. And if it knows how close you are to the mic, other devices probably know that too. So the one you’re closest to should be the one that responds.

I don’t know that all devices have that level of sophistication, but if you have distance, and you’re aware of all the microphones, then it should be just the one that’s closest to you that makes the response, and all the other ones should just politely sit and wait their turn. Or if you have cameras, then gaze is a really important thing. But you see how you have to start thinking about context, and that’s why the second chapter of my book is just all about capturing customer context. Because in a multimodal world, it’s so, so important to look more broadly than just, like, a persona.

Like, you know, Bill Buxton has talked about the concept of placeonas, and it’s like, you need to understand the customer space deeply. How many of these devices are they likely to have? And where are they likely to have them? How noisy are their environments? What is their emotional state? And what is their emotional relationship with the device, especially if it’s voice enabled? What’s in their hands? And how does that change over time? All those things are really important when you talk about interacting with microphones in the home, or gesture devices in the home, or touch devices in the home or at work.

And I think we had a couple of comfortable years in the industry where we could assume it was mobile, and, you know, people just kind of tuned out their surroundings for the most part. Or we could assume it was the office, and that was a pretty constant environment. But as we’ve learned in the pandemic, as we get this camera view into people’s lives, home is a very, very variable experience.
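A rough, purely illustrative sketch of the proximity-based arbitration Cheryl describes above (the closest device responds, the others wait their turn): it assumes each device that hears the wake word can report an estimated distance to the speaker and, optionally, whether it detects the speaker’s gaze. The device names and data shapes here are hypothetical, not how any real assistant works.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WakeEvent:
    device_id: str                 # e.g. "kitchen-speaker" (hypothetical name)
    estimated_distance_m: float    # from beamforming / signal strength
    speaker_gazing: bool = False   # only available on camera-equipped devices

def choose_responder(events: list[WakeEvent]) -> Optional[str]:
    """Pick the single device that should answer a shared wake word.

    If any device detects gaze, the closest gazed-at device wins;
    otherwise the closest microphone responds and the rest stay silent.
    """
    if not events:
        return None
    gazed_at = [e for e in events if e.speaker_gazing]
    candidates = gazed_at or events
    return min(candidates, key=lambda e: e.estimated_distance_m).device_id

# Example: three devices hear the wake word at the same moment.
events = [
    WakeEvent("kitchen-speaker", 4.2),
    WakeEvent("living-room-hub", 1.1, speaker_gazing=True),
    WakeEvent("hallway-speaker", 2.8),
]
print(choose_responder(events))  # -> "living-room-hub"
```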

James Royal-Lawson
So, I mean, you mentioned a few moments ago interoperability, and it made me think, can we actually achieve multimodal without interoperability? I mean, the example you gave with different speakers in different places. Surely we’ve got to have something, somewhere, some standards for how these all connect.

Cheryl Platz
We can achieve multimodality with individual devices, certainly. You know, like the Google… I always lose track of what it’s named now, the Google Nest Home Hub, but they’ve got gesture, they’ve got voice, they’ve got touch. They’re cranking on all cylinders, to some extent. And that’s just a single device. But can we achieve multimodality in an unobtrusive way, where six devices aren’t responding to you all the time? That’s the question.

James Royal-Lawson
And I was just thinking as well about cross-platform multimodal interoperability, because me and Per have joked over the years about the fact that I’m in the Google world and Per is in the Apple world. And if you are in one or the other, it’s a lifestyle choice, you know, it’s not practical to even try and blend these…

Per Axbom
And the more you invest, I mean, then you can’t go back.

Cheryl Platz
Yeah, and I agree. It’s a tough situation. I agree that right now it’s not really feasible to do a multi-platform home when it comes to these multimodal devices. Obviously I worked at Amazon, so most of the devices in my home were purchased from Amazon. So we’re an Amazon household right now, and as I’ve thought about bringing some Google devices into the home, I’m just like, “I can’t, that’s just… what are we going to do?” I can’t get my head around it, and I work on this stuff.

And it’s sticky too: there’s this huge cost barrier once you’ve invested in hardware. So as an industry, we have to grapple with that. Is it good for our industry that everybody’s stuck? Is that truly good? I mean, obviously Amazon’s happy that they’ve got people who are stuck, and Google’s happy that they’ve got people who are stuck, and Apple’s happy that they’ve got people who are stuck, but you know.

James Royal-Lawson
How are we as customers? That’s the question, yeah. How do we feel as customers when we get to the point of feeling trapped?

Cheryl Platz
And, you know, if that stagnation is keeping out other competitors… those are all U.S. companies, and that has sometimes ended badly. So there’s a real question of, do you need to improve interoperability to avoid antitrust concerns? I mean, Google’s got some problems coming up on the horizon, and that’s a real question. And would other folks follow suit? These are big questions. I’ve always wished that there would be more openness and interoperability in the space. Partially also because of this big paradox around bias. The way each of these systems was spun up, it was in secrecy. So the initial sampling for the natural language corpuses for the voice interaction was basically employees and their families, which…

James Royal-Lawson
American families, American employees as well. Yeah.

Cheryl Platz
Yeah. Basically. Occasionally you’ve got some folks who have migrated on visas, but they tend to be from specific countries, and the gender diversity is not there, the national diversity is not there.

James Royal-Lawson
Education levels are all very high.

Cheryl Platz
Yeah. So you hear it in customers’ stories that the accuracy of the natural language systems doesn’t reflect the customers, and there’s not really a fiscal incentive to do a massive overhaul of those systems. So I’ve long been wishing that there would be an entry for smaller players to come in and build more open systems from scratch. Mozilla has been working on that with Project Voice.

And I hope that continues, but you can’t just plug their voice recognition engine into anything right now. It’s out there for independent projects and Raspberry Pi kind of stuff, I believe. But how might we give people the ability to feel more represented in a world where the big players haven’t committed to opening up, or expanding, or being more interoperable? I’ve seen some token efforts, maybe they’re more than token, but, like, Google’s trying to get folks with Down syndrome engaging with them and trying to get them into their natural language program.

But, you know, I don’t see a lot of effort to directly engage with the Black community, and there’s a whole other language structure, and regional intonations and things, and not engaging directly with them, maybe it’s implicit bias, or unconscious bias. I feel like that deserves as much attention, especially in the States; it’s tough. But, you know, the big players would say, “Oh, is that making us money?” Well, if you were more interoperable, or more open, maybe smaller players could come in and help cover that gap. A healthier ecosystem could help do that.

James Royal-Lawson
Yeah. I’m thinking about, in the book, you talk about the interruption matrix, which was one of the things that you mentioned later on in the book, about when it’s good to interrupt a conversation. That’s not the right word, is it? Well, activity, that was what I was looking for. And when you’re building an interruption matrix about when it is and isn’t a good time, it’s not just a personal interruption matrix; ultimately, it’s a cultural one. Because what might be culturally okay in one particular society might be really, really not something you do in another. Maybe interoperability is something that could help fill the cultural gaps.

Cheryl Platz
Yes, that too. And I will say I was happy to see, when we were working on the Amazon Echo, that Amazon would spin up voice teams in the countries in which they were launching. And they weren’t shipping designers over, they were hiring local folks to get some of the regional perspective. But still, there’s so much. Especially when you get into natural user interfaces, when you’re moving beyond typing and mouse clicking, everything becomes cultural, everything from what hand gestures are appropriate to what language we use and the intonation we use.

And some cultures don’t like accepting directions from women. There’s a story in the book “Wired for Speech” about that, and BMW, so there’s a lot to unpack there. We won’t do that here. But yes, interrupting people is cultural. Like, if we were in a region that was predominantly Muslim, there are times of day when you certainly wouldn’t want to be dinging with a notification interruption; you would want to be aware and very respectful that just because they’re not interacting with you doesn’t mean they’re available, like during prayer intervals. And you don’t get that perspective if you’re not bringing diverse voices to the table and making sure that you’re interacting with the folks in the regions that you’re deploying to.

Per Axbom
And something I’ve experienced quite a lot with my Apple TV is that I actually have devices set to different languages, so some of my devices are set to Swedish and some to English. And of course, now the Apple TV remote has voice input, which I appreciate because it’s hard to type with the remote, but I have several Apple TVs and they’re on different languages for some reason. And it doesn’t tell me which language it is on.

So often when I say something in English, I get back what the interpretation would be of how it would sound in Swedish, and get a movie for that. So I’m not prepared for the response, which is really, really confusing. Then I have to spell it out. And sometimes I just go back to the keyboard, because it just makes it easier, because I can’t make it sound like it wants me to sound.

James Royal-Lawson
I can tell you, first of all, that in my car I have Android Auto, and my phone, my Google phone, is in English. But generally, when I’m navigating, when I’m driving to places, I’m driving in Stockholm, so I say the place I want to go. It won’t let me type because I’m driving, it blocks that for me, so I have to use the voice interface. So I say the place, and I say it in Swedish, because I can speak Swedish and I’m in Sweden, I’m navigating in Stockholm. And it has absolutely no idea. I cannot use Google Maps while I’m driving, because my phone’s in English and I’m in Stockholm.

Cheryl Platz
Totally, this problem came up when we were on Windows Automotive, and as you say, James and Per, that was exactly what I was thinking about. Europe is such a beautiful set of challenges, with so many cultures and languages all jammed up next to each other, and obviously technologies developed in America don’t lead with that, even though we have so many Spanish speakers; there’s a lot to unpack there, too. But the automotive case is the best example of why you would want a bilingual system. You might also want it just because you’re a bilingual household, but the way to get at Americans is to talk about that: maybe you’re in an area that doesn’t have, you know, single-language street signs.

And it’s a challenging problem when you’re switching between multiple languages inside a single utterance, because that’s additional computational power. You have to keep both corpuses in the system at once, and I’m not defending the systems, but I’m hoping we can get there. We were trying to figure out how we would do that in a car, but it was still a little early at the time. I believe we’ve got to have the technology soon, and in Per’s example, I believe that Alexa is getting close, if not there already, to starting to support bilingual interactions to some extent. But I don’t know if it’s bilingual in a single utterance, like the car example.

Or if it’s like, I can speak to it in either one or the other. But the way spoken language works, I don’t see why, in your example, where you had spoken English to a Swedish system that was expecting Swedish, if you were able to say “I speak English and Swedish”, and you speak something in English and it gets really low confidence results, why doesn’t it just take an extra second and run that utterance against the English set? I’m not deeply engaged in the speech science here, but to me the computer science seems like: you ran one NLP check, and if the numbers came back bad, you take the same utterance and do another NLP check, and if that’s better, you’re bilingual. You just seamlessly switched over to the other language. That’s it.
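A minimal sketch of the two-pass fallback Cheryl outlines, assuming a hypothetical recogniser per language that returns an interpretation plus a confidence score. The threshold and the toy recognisers below are made up for illustration and do not reflect how any real assistant is implemented.

```python
from typing import Callable, Tuple

# A recogniser takes raw audio and returns (interpretation, confidence 0..1).
Recogniser = Callable[[bytes], Tuple[str, float]]

CONFIDENCE_THRESHOLD = 0.6  # arbitrary cut-off for "we understood this"

def interpret(audio: bytes, recognisers: dict[str, Recogniser],
              primary: str = "sv-SE", secondary: str = "en-GB"):
    """Run the primary-language recogniser first; if confidence is low,
    spend the extra second re-running the same utterance against the
    user's other declared language and keep whichever result is better."""
    text, confidence = recognisers[primary](audio)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text, primary
    alt_text, alt_confidence = recognisers[secondary](audio)
    return (alt_text, secondary) if alt_confidence > confidence else (text, primary)

# Toy stand-ins for real speech engines, just to show the control flow.
recognisers = {
    "sv-SE": lambda audio: ("mulan rosa", 0.31),    # low-confidence Swedish guess
    "en-GB": lambda audio: ("Moulin Rouge", 0.87),  # the English pass does better
}
print(interpret(b"...", recognisers))  # -> ('Moulin Rouge', 'en-GB')
```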

Per Axbom
That’s so weird. I never thought about that, that it’s single language. Why is it? Why do I only get to set one language per system? Why don’t I get to select more?

Cheryl Platz
It should just be a multi-select box: if you’ve got multiple languages, it should support them.

James Royal-Lawson
Yes, but this is again about, like, I’d really want to have a little box of information, of context, that says, “Look, I’m James, I’m 47 years old, I live in Stockholm, I speak these two languages. I have a wife and two children, one’s a boy and one’s a girl…” and, you know, just give a full bag of basic context, so that these systems would have something to chew on and make life a lot better.

Per Axbom
And you would expect them to know all about you already.

James Royal-Lawson
From all the data I’ve given them for free. Yeah.

Cheryl Platz
And it’s this dichotomy where we’re like, “God, we’re stuck in a surveillance culture.” It’s a really bad surveillance culture, because they surveil us all the time and yet they don’t know anything about us. Like, how did that happen? How did we get a surveillance culture where they don’t know any of the important things?

James Royal-Lawson
So that is actually probably the biggest question. It’s kind of like we’re giving so much data all the time, but it’s being used to give us adverts and, you know, news feeds we don’t really want to see.

Cheryl Platz
And not anything of value, which, if you wanted to survive, would be the first thing to do: offer value. And that’s the tough part with the central profile: could that data be used in any way against you? Ideally, it would be super secure, and you could just choose who you share it with. And I agree, I hate re-entering my data. I deal with a disability, and every time a doctor asks me to onboard to their portal… I know this is probably not a thing you have to deal with, because you have lovely, universal health care. We don’t, and so every doctor has a different portal you have to sign up with, and when they say, “hey, do you want to sign up for our portal?”, like the last time that happened, I nearly had a panic attack. I’m like, “I don’t want to enter my medical history”.

It’s the same thing, don’t ask me. So identity is one of the multimodal traps I point out, along with privacy, in a chapter where it’s like: when you start to expand your system to this complexity, there are questions like, what languages does someone speak? And what do they need to keep private, especially if you’re in a multi-user household? Those are really tricky questions that you are going to come up against, and I don’t have easy answers. But the point is, you need to be aware that those are pitfalls, and ask those tough questions early in the design process when you’re building these complex systems.

Per Axbom
So for the listeners, I mean, what are the most common obstacles, or mistakes, or failures? And what are the techniques that you should be employing when you get to these devices? Because most people are used to working with screens, and you do prototypes in one way, but now it’s a full-body experience in a completely different way. As you promised in the start of the book, we don’t know, we haven’t been taught this. What do we need to know?

Cheryl Platz
Well, some of the traps I talk about, one of them is around ergonomics: just being so excited about voice, and adding it to an existing system, that we’re not thinking about forcing people back and forth. It’s the same with different types of physical controllers, or touch and mouse, or gesture, and any of these things. The human body has limits, repetitive use, all of these things. If you’re not taking into account the impact on the physical body when you move into a multimodal space, you can cause genuine harm.

You know, even when you’re just working and playing Wii games. There’s a game called Cooking Mama, and I genuinely got a shoulder injury from playing that game. And I am very sensitive to moves between the keyboard and mouse, but that’s not even super multimodal. Switching back and forth between gesture and keyboard would be a lot of movement on the wrist and could potentially be stressful for folks. So it’s important, and if you had two different types of physical controller, that could be potentially physically stressful for folks.

So that’s something to keep in mind. I feel like that’s one of the things that designers have had the luxury of ignoring for so long. They’re like, “everybody’s using a keyboard and mouse, or everybody’s using a phone. I’m not a human factors person. I don’t need to deal with this.” Well, we do now in this world. Another one of the traps is around context. You know, we talk about chatbots, which extend very easily to voice interaction. And when we speak to a device, we expect it to adhere to the social contract, because our brains are like, “it’s a conversation, that means it’s a person”. Brains, so helpful.

So we expect our conversational partners to maintain the context of the conversation, because otherwise they’re not paying attention, and that is rude. So it seems rude when I’m having a conversation with a bot and I’m like, “how’s the weather in Orlando?”, and then we have that conversation, and I’m like, “how about Seattle?”, and it’s like, “I’m sorry, what was your question?” That seems rude, even though as a developer I’m like, well, you didn’t tell me what you wanted to know about Seattle. But we were talking about weather, and it’s a really simple fix there: you just remember what the last question was. So if they don’t give you a new question, you just use the last question and the new parameter they gave you.
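A toy illustration of that simple fix, assuming a hypothetical weather bot that only remembers the last intent; real assistants manage dialogue context far more elaborately than this.

```python
class WeatherBot:
    """Keeps the last intent so a follow-up like 'how about Seattle?'
    reuses the previous question with the new parameter."""

    def __init__(self):
        self.last_intent = None  # e.g. "get_weather"

    def handle(self, intent, city):
        if intent is None:
            intent = self.last_intent        # carry the context forward
        if intent is None:
            return "Sorry, what was your question?"
        self.last_intent = intent            # remember for the next turn
        return f"Here's the {intent.replace('_', ' ')} for {city}."

bot = WeatherBot()
print(bot.handle("get_weather", "Orlando"))  # explicit question
print(bot.handle(None, "Seattle"))           # "How about Seattle?" reuses context
```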

But you need to have those conversations. You need to talk early on about what parts of the context of your conversation, or your customer’s engagement all up, are important to maintain, and for how long. I mean, the longer you maintain context, the more likely it can be used against your customers. So you have to be careful; too much context can be dangerous. So context is a complicated concept, but it’s important to think about what your customers are going to be expecting from you, and what they will believe is appropriate in the situation. I mentioned identity and privacy.

Another one, which I feel is still early days for us, and has been a soapbox for me at so many places where I’ve worked, is discoverability. Right now, in multimodal systems, there’s a lot of new stuff, new ways to interact with the system. We’re so used to little toasts and bubbles to teach people how to use new, traditional keyboard-and-mouse software, and we don’t have that. We have a voice interface, and designers are like, “What? How do I do it?”

And I still hope that we’re going to get more sophisticated than where we are now. Alexa, for example, that I mentioned: I give a lot of Alexa examples because my household is Alexa, not meaning to say that they’re the best or the worst by any means. But there was a feature we had talked about when I was working there that’s finally made it partially into the system. I had been proposing a tag-on kind of thing, where, like, we don’t want to interrupt you before your task is complete to tell you about something; that’s getting between you and your goal, that’s obstruction, don’t do it.

But at the end of a successfully completed task, maybe you would be open to some new information. So they finally got that in. So sometimes I’ll ask for the weather, you know, for Seattle for today, and she’ll be like, “do you want to know it for this weekend?” Or she’ll say, “did you know you can ask for the weather…”, or she’ll tell me about a new feature, which is cool. That follow-on thing is essentially one of those little pop-ups, one of those little teaching cards.

The problem is, it’s not supported by intelligent awareness of what features I’ve already used, so it’s all really repetitive. And that’s the thing: I’d always hoped it would be paired with some level of contextual awareness of the customer’s journey. But you have to have that conversation at the architectural level and commit to storing some of that: what learning data is important, and are we tracking it? And then only teach things that they haven’t achieved yet, you know.
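A small sketch of what that could look like, assuming we store a per-customer set of features they have already used and only offer tips for the rest. The feature names and tips here are made up for illustration.

```python
# Tips we could append after a successfully completed task.
FEATURE_TIPS = {
    "weekend_forecast": "Did you know you can ask for this weekend's weather?",
    "weather_alerts":   "You can ask me to notify you about severe weather.",
}

def pick_follow_on_tip(used_features):
    """Only teach something the customer hasn't already done;
    stay quiet once there's nothing new to offer."""
    for feature, tip in FEATURE_TIPS.items():
        if feature not in used_features:
            return tip
    return None

print(pick_follow_on_tip({"weekend_forecast"}))  # suggests the alerts feature
print(pick_follow_on_tip(set(FEATURE_TIPS)))     # -> None, nothing left to teach
```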

James Royal-Lawson
I mean, that’s something that goes back to the human interaction stuff we talked about: how you’ve got beginners, and then as you use the system more, you become more expert. And you should reveal help and assistance that is applicable to the level of the user of the system, the interface. And, you know, we don’t do that well enough in digital interfaces, so it’s, I suppose, in that sense, not surprising that we don’t do it very well in voice ones either.

Per Axbom
Isn’t that why we killed the Microsoft Paperclip? Because it was just so annoying, I know this.

Cheryl Platz
Yeah, it paid no attention to that arc. So when we were doing Cortana, the discovery arc I was working on was: after you’ve done a touch scenario, giving you a way to say, passively, just on the screen, “Hey, did you know you can do this in voice? Use this utterance to do exactly what you just did.” It’s not a pop-up in your face. It’s on the same screen that’s confirming what you just completed, but we know it’s something that might be useful to you, because you just completed that task, that specific task. As opposed to, “it looks like you might be trying to write a document”, like that.

Okay, well, just let me do my thing. Just don’t get in my way. Clippy, don’t do that. Clippy’s having a bit of a resurgence; somebody was giving away Clippy stickers in my last year at Microsoft, like, what’s happening? Maybe that was the first sign that everything was going to go topsy-turvy, once Clippy started becoming hip again. I should have known that we were in the Upside Down.

James Royal-Lawson
You see, it’s the 80s, they’re back. They are popular again. People are too young, this is the problem, they don’t remember it. So suddenly Clippy is trendy again. No, they just didn’t live it. Oh!… the pain…

Per Axbom
And I think the last piece of advice for everyone, of course, is to read your book. I mean, I got a lot out of it, and especially now that I’m working in healthcare, there are so many devices there and so much happening with sensors and stuff that it’s so important to get it right.

Cheryl Platz
Such an interesting space. I can’t wait to hear more about how you apply some of this stuff in your work. Maybe on a future podcast I listen to, but…

Per Axbom
Definitely, yeah.

James Royal-Lawson
It’s been really good fun talking to you today Cheryl.

Cheryl Platz
Thank you, James. Thank you, Per. It was a delight to come and talk to you.

[Music]

James Royal-Lawson
First up, in the heat of the interview, I realized that I called the little green men from Toy Story minions. I mixed up my animated film references in one sentence, sorry.

Per Axbom
Not okay, James, not okay.

James Royal-Lawson
Let’s do better.

Per Axbom
I have to say that even though I myself avoid smart speakers a lot, and I know you, James, you don’t have smart speakers either. Nope. For me, I’m more wary of how they can be misused, even for reviews and stuff, so that’s probably why I avoid them. But it’s also one of the reasons I’m hugely appreciative of Cheryl’s attention to human messiness in her book, where she also talks about how people are not predictable, and how they always exist within a space that can be noisy or quiet, or a space where they’re holding stuff, or other things have their attention, or their capabilities are constrained.

And so being able to observe and understand and map all that, that’s the essence of doing UX work. That’s what’s necessary to make technology adapt to humans, and not the other way around. So for me, that’s the absolutely imperative stuff in an ethical design approach, and that’s really being communicated.

James Royal-Lawson
Shared digital experiences are really tough, because that’s what we’re talking about, isn’t it? Traditionally, in computer interaction, you have a user, singular, and a computer, singular. Much of what we’re talking about now is multiple users and multiple interfaces, and switching between interfaces, switching between users, and doing this seamlessly. Of course it’s complicated.

Per Axbom
Yeah, it’s bound to mess up.

James Royal-Lawson
Yeah, doomed to fail.

Per Axbom
Maybe we’re at that point where it’s too complex to get it right.

James Royal-Lawson
Yeah, well, I suppose the whole thing about racing along, obviously when you’re racing and running you don’t have the time to stop and think about getting it right. And some of the frameworks and tools in Cheryl’s book are there to help you try and get it right, given the experience she’s had working with multiple different voice platforms. But I reflected on the time aspect. She mentioned a couple of times that something she’d worked on has eventually made it into the product, and listening back and thinking about it, she’s referring to things maybe multiple years in the past that she’d helped research or design, which are only now starting to trickle through into products. Which is kind of a good thing, that it’s taken that amount of time to incubate and develop, but when it comes out, it’s still not quite good enough. So it’s interesting how much time we need to try and get this right and to process it.

Per Axbom
Exactly. Patience and humility are really important.

James Royal-Lawson
Yeah, definitely.

Per Axbom
We have some recommended listening for you all, and actually for me as well, because it’s Episode 207, “Designing Voice Interfaces” with Ben Sauer, and I’m not even in that episode. So I have to admit I haven’t listened back to it.

James Royal-Lawson
You see, that just means that you’ve got a get-out-of-jail card, because you can’t have contradicted yourself or messed up what you said in Ben’s interview when we were talking with Cheryl, whereas I could have fallen all over the place, contradicting myself and making a mess of things. So I should listen back to it as well, or read the transcript. Show notes, including a transcript, can be found on uxpodcast.com if you can’t access them directly from wherever you are listening right now. So click follow or subscribe if you aren’t already doing so, and join us again for our next episode.

Per Axbom
And if you’d like to contribute to funding UX Podcast, then visit uxpodcast.com/support. Remember to keep moving.

James Royal-Lawson
See you on the other side.

[Music]

James Royal-Lawson
So Per, why did half a chicken cross the road?

Per Axbom
I don’t know James, why did half a chicken cross the road?

James Royal-Lawson
To get to its other side.

James Royal-Lawson
I love the macabre mental picture of that chicken. Did you picture the left half or the right half, or its head or its feet? I don’t know, I just wondered. But as another thing connected to smart speakers, the amount of articles and things written about smart speakers and jokes is incredible. There are whole teams of writers working on the jokes for the speakers. It must be the biggest use case, I think, for these speakers.

Per Axbom
It probably is. That’s amazing.

James Royal-Lawson
Crazy


This is a transcript of a conversation between James Royal-Lawson, Per Axbom and Cheryl Platz recorded in November 2020 and published as episode 251 of UX Podcast.