(Via OrangeCone) Gartner have a new report out on The Evolving User Interface (link goes to summary, real report is $$$) that mentions speech recognition.
Despite delivery of continuous speech recognition in 1997 and subsequent advances that now
deliver accuracy superior to the average person’s typing, speech recognition has not been widely
adopted. Broad adoption is hindered by personal or emotional factors, social barriers and low
value combined with high complexity for most users. Most people are accustomed to a writing
style in which one watches what is written and corrects errors on the fly. In addition, talking to the
computer is often seen as a social faux pas or raises concerns about personal privacy.
Actually, it takes about 6 months to get proficient enough at using speech recognition to approach a sensible level of throughput, taking into account correction time. And broad adoption is not so much hindered by “personal or emotional factors” or by “social barriers” but by the fact that using speech recognition productively, as a replacement for typing and mousing, is really, really hard. This is mostly because most, if not all, desktop computer applications are designed to work with a mouse and keyboard, and using them through speech recognition is like typing in mittens. I’d love to see a proper word processing package based on speech, rather than a GUI word processor with speech laid over the top.
At the same time, Gartner’s point of view reproduces the classic technological determinist view that if a technology fails, it’s because people made it fail, or weren’t ready for it, not that the technology is bad or wrong or inappropriate or some middle ground. Small-vocabulary speech recognition works, and works well, in a small number of quite specific situations. Open-ended speech recognition on the desktop or on the mobile is just a bad idea. Computers do not understand speech in the same way that people do and they never¹ will.
For most users, there is no pressing need or compelling advantage over existing interfaces. Only
when users learn to speak as part of the creative process, unlearn the correct-as-you-go habit
and become familiar with the conventions of a speech-centric interaction model (navigating,
formatting and so on) will speech recognition become widely adopted. Speech recognition will not
become the dominant mode of text entry for at least a decade.
“At least a decade”? Yep. But people have been saying that for at least three decades now.
Many people speak as part of the creative process and have done for years. Lawyers and many other people have dictated to secretaries and tape recorders successfully for decades. Dictating to a person, even one who is at a remove from the dictator, is different to dictating to a computer (because (sing it if you know it!) Computers do not understand speech in the same way that people do).
Finally, the correct-as-you-go habit is a pretty good one. Why would we want to unlearn it? We even do it as we speak—why would we want to unlearn years of verbal habits to fit with an interface?
Mobile and workflow applications are more-promising candidates for gaining value from speech
recognition, however. These applications use speech recognition for application control and will
feed speech recognition to the desktop over time as users become comfortable with the voice
interface paradigm.
Mobile apps will use speech for application control? Maybe. But why, when simple app controls map so sweetly to a single-handed control like a scroll-wheel and when anything more complicated will be faster to accomplish with manual interaction? Besides which, if you’re going to be carrying around something with enough processing power to do good speech recognition, is there any reason to think that you need to be using it on the move when your hands and eyes will be busy?
Also, speech on the desktop for application control will be a “surprise and delight” feature but not something that I think people will use. Why? My intuition is that changing modes from manual to speech is hard.
¹ OK, for a given value of “never”. In this case “never” is sometime after the first consumer-grade quantum computer is on sale. And maybe not even then if no-one figures out how to build semantics into a computer.