Chapter 3
Designing Effective Speech Applications
Speech applications are like conversations between the user and the computer. Conversations are characterized by turn-taking, shifts in initiative, and verbal and non-verbal feedback to indicate understanding.
A major benefit of incorporating speech in an application is that speech is natural: people find speaking easy, and conversation is a skill most master early in life and then practice frequently. At a deeper level, naturalness refers to the many subtle ways people cooperate with one another to ensure successful communication.
An effective speech application is one that simulates some of these core aspects of human-human conversation. Since language use is deeply ingrained in human behavior, successful speech interfaces should be based on an understanding of the different ways that people use language to communicate. Speech applications should adopt language conventions that help people know what they should say next and that avoid conversational patterns that violate standards of polite, cooperative behavior.
This chapter discusses when a speech interface is and is not appropriate, and then provides some concrete design ideas for creating effective speech applications that adhere to conversational conventions.
3.1 When to Use Speech
A crucial factor in determining the success of a speech application is whether or not there is a clear benefit to using speech. Since speech is such a natural medium for communication, users' expectations of a speech application tend to be extremely high. This means speech is best used when the need is clear - for example, when the user's hands and eyes are busy - or when speech enables something that cannot otherwise be done, such as accessing electronic mail or an on-line calendar over the telephone.
Speech applications are most successful when users are motivated to cooperate. For example, telephone companies have successfully used speech recognition to automate collect calls. People making a collect call want their call to go through, so they answer prompts carefully. People accepting collect calls are also motivated to cooperate, since they do not want to pay for unwanted calls or miss important calls from their friends and family. Automated collect calling systems save the company money and benefit users. Telephone companies report that callers prefer talking to the computer because they are sometimes embarrassed by their need to call collect and they feel that the computer makes the transaction more private.
Speech is well suited to some tasks, but not to others. The following tables list characteristics that can help you determine when speech input and output are appropriate choices.
Table 3-1 When is speech input appropriate? Use When... Avoid When...
Table 3-2 When is speech output appropriate? Use When... Avoid When...
Including speech in an application because it is a novelty means it probably will not get used. Including it because there is some compelling reason increases the likelihood for success.
3.2 Design for Speech
After you determine that speech is an appropriate interface technique, consider how speech will be integrated into the application. Generally, a successful speech application is designed with speech in mind. It is rarely effective to add speech to an existing graphical application or to translate a graphical application directly into a speech-only one. Doing so is akin to translating a command-line-driven program directly into a graphical user interface. The program may work, but the most effective graphical programs are designed with the graphical environment in mind from the outset.
Graphical applications do not translate well into speech for several reasons. First, graphical applications do not always reflect the vocabulary, or even the basic concepts, that people use when talking to one another in the domain of the application. Consider a calendar application, for example. Most graphical calendar programs use an explicit visual representation of days, months, and years. There is no concept of relative dates (e.g., "the day after Labor Day" or "a week from tomorrow") built into the interface. When people speak to each other about scheduling, however, they make extensive use of relative dates. A speech interface to a calendar, whether speech-only or multi-modal, is therefore more likely to be effective if it allows users to speak about dates in both relative and absolute terms. By basing the speech interface design exactly on the graphical interface design, relative dates would not be included in the design, and the usability of the calendar application would be compromised.
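As a minimal sketch of accepting both relative and absolute dates, the following resolves a handful of spoken phrases against a reference day. The class name `SpokenDates`, the method `resolve`, and the small phrase set are assumptions made for illustration; they are not part of the Java Speech API, and a real application would derive its phrases from the grammar it actually supports.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAdjusters;
import java.util.Locale;

/** Hypothetical helper: maps a few spoken date phrases to calendar dates. */
final class SpokenDates {

    /** Returns the date the phrase refers to, or null if the phrase is not handled. */
    static LocalDate resolve(String phrase, LocalDate today) {
        String p = phrase.trim().toLowerCase(Locale.ROOT);
        switch (p) {
            case "today":
                return today;
            case "tomorrow":
                return today.plusDays(1);
            case "a week from tomorrow":
                return today.plusDays(8);
            case "next monday":
                // First Monday strictly after the reference day.
                return today.with(TemporalAdjusters.next(DayOfWeek.MONDAY));
            default:
                try {
                    // Absolute dates are accepted in ISO form, e.g. 1998-09-07.
                    return LocalDate.parse(p);
                } catch (DateTimeParseException e) {
                    return null;
                }
        }
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(1998, 9, 7);
        System.out.println(resolve("tomorrow", today));             // 1998-09-08
        System.out.println(resolve("a week from tomorrow", today)); // 1998-09-15
    }
}
```

Passing the reference day as a parameter, rather than reading the system clock inside `resolve`, keeps the mapping easy to test.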
Information organization is another important consideration. Presentations that work well in the graphical environment can fail completely in the speech environment. Reading exactly what is displayed on the screen is rarely effective. Likewise, users find it awkward to speak exactly what is printed on the display.
Consider the way in which many e-mail applications present message headers. An inbox usually consists of a chronological, sometimes numbered, list of headers containing information such as sender, subject, date, time, and size:
You can scan this list and find a subject of interest or identify a message from a particular person. Imagine if someone read this information out loud to you, exactly as printed. It would take a long time! And the day, date, time, and size information, which you can easily ignore in the graphical representation, becomes quite prominent. It doesn't sound very natural, either. By the time you hear the fifth header, you may also have forgotten that there was an earlier message with the same subject.
An effective speech interface for an e-mail application would probably not read the date, time, and size information from the message header unless the user requests it. Better still would be an alternate organization scheme which groups messages into categories, perhaps by subject or sender (e.g., "You have two messages about 'Boston rumors'" or "You have two messages from Arlene Rexford"), so that the header list contains fewer individual items. Reading the items in a more natural spoken form would also be helpful. For example, instead of "Three. Hilary Binda. Change of address." the system might say "Message 3 from Hilary Binda is about Change of address."
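The grouping idea can be sketched as follows. This is an illustration only: the class `SpokenInbox`, the two-column header layout (sender, subject), and the exact sentence wording are assumptions, and the sample data is invented. Date, time, and size are deliberately left out of the spoken summary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns message headers into short spoken summaries. */
final class SpokenInbox {

    /** Each header is {sender, subject}. Returns one sentence per sender, in first-seen order. */
    static List<String> summarizeBySender(String[][] headers) {
        Map<String, List<String>> bySender = new LinkedHashMap<>();
        for (String[] h : headers) {
            bySender.computeIfAbsent(h[0], k -> new ArrayList<>()).add(h[1]);
        }
        List<String> sentences = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : bySender.entrySet()) {
            List<String> subjects = e.getValue();
            if (subjects.size() == 1) {
                sentences.add("You have a message from " + e.getKey()
                        + " about " + subjects.get(0) + ".");
            } else {
                // Several messages collapse into one item to shorten the list.
                sentences.add("You have " + subjects.size()
                        + " messages from " + e.getKey() + ".");
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[][] headers = {
            {"Arlene Rexford", "Boston rumors"},
            {"Hilary Binda", "Change of address"},
            {"Arlene Rexford", "Lunch on Friday"}
        };
        for (String s : summarizeBySender(headers)) {
            System.out.println(s);
        }
    }
}
```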
On the speech input side, users often find speaking menu commands awkward and unnatural. In one e-mail program, a menu called "Move" contains a list of mail box names. Translating this interface to speech would force the user to say something like "Move. Weekly Reports." A more natural interface would allow the user to say "File this in my Weekly Reports folder." The natural version is a little longer, but it is probably something the user could remember to say without looking at the screen.
3.3 Challenges
Even if you design an application with speech in mind from the outset, you face substantial challenges before your application is robust and easy to use. Understanding these challenges and assessing the various trade-offs that must be made during the design process will help to produce the most effective interface.
3.3.1 Transience: What did you say?
Speech is transient. Once you hear it or say it, it's gone. By contrast, graphics are persistent. A graphical interface typically stays on the screen until the user performs some action.
Listening to speech taxes users' short-term memory. Because speech is transient, users can remember only a limited number of items in a list and they may forget important information provided at the beginning of a long sentence. Likewise, while speaking to a dictation system, users often forget the exact words they have just spoken.
Users' limited ability to remember transient information has substantial implications for the speech interface design. In general, transience means that speech is not a good medium for delivering large amounts of information.
The transient nature of speech can also provide benefits. Because people can look and listen at the same time, speech is ideal for grabbing attention or for providing an alternate mechanism for feedback. Imagine receiving a notification about the arrival of an e-mail message while working on a spreadsheet. Speech might give the user the opportunity to ask for the sender or the subject of the message. The information can be delivered without forcing the user to switch contexts.
3.3.2 Invisibility: What can I say?
Speech is invisible. The lack of visibility makes it challenging to communicate the functional boundaries of an application to the user. In a graphical application, menus and other screen elements make most or all of the functionality of an application visible to a user. By contrast, in a speech application it is much more difficult to indicate to the user what actions they may perform, and what words and phrases they must say to perform those actions.
3.3.3 Asymmetry
Speech is asymmetric. People can produce speech easily and quickly, but they cannot listen nearly as easily and quickly. This asymmetry means people can speak faster than they can type, but listen much more slowly than they can read.
The asymmetry has design implications for what information to speak and how much to speak. A speech interface designer must balance the need to convey lots of instructions to users with users' limited ability to absorb spoken information.
3.3.4 Speech synthesis quality
Given that today's synthesizers still do not sound entirely natural, the choice to use synthesized output, recorded output, or no speech output is often a difficult one. Although recorded speech is much easier and more pleasant for users to listen to, it is difficult to use when the information being presented is dynamic. For example, recorded speech could not be used to read people their e-mail messages over the telephone. Using recorded speech is best for prompts that don't change, with synthesized speech being used for dynamic text.
Mixing recorded and synthesized speech, however, is not generally a good idea. Although users report not liking the sound of synthesized speech, they are, in fact, able to adapt to the synthesizer better when it is not mixed with recorded speech. Listening is considerably easier when the voice is consistent.
As a rule of thumb, use recorded speech when all the text to be spoken is known in advance, or when it is important to convey a particular personality to the user. Use synthesized speech when the text to be spoken is not known in advance, or when storage space is limited. Recorded audio requires substantially more disk space than synthesized speech.
3.3.5 Speech recognition performance
Speech recognizers are not perfect listeners. They make mistakes. A big challenge in designing speech applications, therefore, is working with imperfect speech recognition technology. While this technology improves constantly, it is unlikely that, in the foreseeable future, it will approach the robustness of computers in science fiction movies.
An application designer should understand the types of errors that speech recognizers make and the common causes of these errors. Refer to Table 2-1 in the previous chapter for a list of common errors and their causes.
Unfortunately, recognition errors cause the user to form an incorrect model of how the system works. For example, if the user says "Read the next message," and the recognizer hears "Repeat the message," the application will repeat the current message, leading the user to believe that "Read the next message" is not a valid way to ask for the next message. If the user then says "Next," and the recognizer returns a rejection error, the user now eliminates "Next" as a valid option for moving forward. Unless there is a display that lists all the valid commands, users cannot know if the words they have spoken should work; therefore, if they don't work, users assume they are invalid.
Some recognition systems adapt to users over time, but good recognition performance still requires cooperative users who are willing and able to adapt their speaking patterns to the needs of the recognition system. This is why providing users with a clear motivation to make speech work for them is essential.
3.3.6 Recognition: flexibility vs. accuracy
A flexible system allows users to speak the same commands in many different ways. The more flexibility an application provides for user input, the more likely errors are to occur. In designing a command-and-control style interface, therefore, the application designer must find a balance between flexibility and recognition accuracy. For example, a calendar application may allow the user to ask about tomorrow's appointments in ways such as:
This may be quite natural in theory, but, if recognition performance is poor, users will not accept the application. On the other hand, applications that provide a small, fixed set of commands also may not be accepted, even if the command phrases are designed to sound natural (e.g., "Lookup tomorrow"). Users tend to forget the exact wording of fixed commands. What seems natural for one user may feel awkward for another. Section 3.6, "Involving Users," describes a technique for collecting data from users in order to determine the most common ways that people talk about a subject. In this way, applications can offer some flexibility without causing recognition performance to degrade dramatically.
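One way to picture the middle ground is a small table that maps a few common phrasings onto a single command. The sketch below assumes a hypothetical `CommandPhrases` class and invented phrasings; in practice the phrasings would come from user data, and the matching would be done by the recognizer's grammar rather than by string lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical table mapping several accepted phrasings to one canonical command. */
final class CommandPhrases {
    private final Map<String, String> phraseToCommand = new HashMap<>();

    /** Registers the phrasings that should trigger the given command. */
    void allow(String command, String... phrasings) {
        for (String p : phrasings) {
            phraseToCommand.put(normalize(p), command);
        }
    }

    /** Returns the canonical command, or null if the phrase is outside the supported set. */
    String lookup(String recognizedText) {
        return phraseToCommand.get(normalize(recognizedText));
    }

    /** Lowercases, drops punctuation, and collapses whitespace. */
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        CommandPhrases phrases = new CommandPhrases();
        phrases.allow("LOOKUP_TOMORROW",
                "Lookup tomorrow",
                "What do I have tomorrow",
                "What's on my calendar tomorrow");
        System.out.println(phrases.lookup("what do I have tomorrow?")); // LOOKUP_TOMORROW
        System.out.println(phrases.lookup("Delete everything"));        // null
    }
}
```

Each added phrasing widens what users may say but also gives the recognizer more candidates to confuse, so the list is best kept to the phrasings users actually produce.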
3.4 Design Issues for Speech-Only Applications
A speech-only system is one in which speech input and output are the only options available to the user. Most speech-only systems operate over the telephone.
3.4.1 Feedback & Latency
In conversations, timing is critical. People read meaning into pauses. Unfortunately, processing delays in speech applications often cause pauses in places where they do not naturally belong. For example, users may reply to a prompt and then not hear an immediate response. This leads them to believe that they were not heard, so they speak again. This results in either missing the application's response when it does come (because the user is speaking at the same time) or causing a recognition error.
Giving users adequate feedback is especially important in speech-only interfaces. Processing delays, coupled with the lack of peripheral cues to help the user determine the state of the application, make consistent feedback a key factor in achieving user satisfaction.
When designing feedback, recall that speech is a slow output channel. This speed issue must be balanced with a user's need to know several vital facts:
Verification should be commensurate with the cost of performing an action. Implicitly verify commands that present data and explicitly verify commands that destroy data or trigger actions. For example, it would be important to give the user plenty of feedback before authorizing a large payment, while it would not be as vital to ensure that a date is correct before checking a weather forecast. In the case of the payment, the feedback should be explicit (e.g., "Do you want to make a payment of $1,000 to Boston Electric? Say yes or no."). The feedback for the forecast query can be implicit (e.g., "Tomorrow's weather forecast for Boston is...."). In this case, the word "Tomorrow" serves as feedback that the date was correctly (or incorrectly) recognized. If correct, the interaction moves forward with minimal wasted time.
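The rule above can be sketched as a small policy function. The names `VerificationPolicy`, `Kind`, and `respond` are hypothetical, and the three-way classification of commands is an assumption drawn from the wording of the rule, not an API of any speech toolkit.

```java
/** Hypothetical policy: choose implicit or explicit verification by command cost. */
final class VerificationPolicy {

    enum Kind { PRESENTS_DATA, DESTROYS_DATA, TRIGGERS_ACTION }

    /** Commands that destroy data or trigger actions are verified explicitly. */
    static boolean needsExplicitVerification(Kind kind) {
        return kind != Kind.PRESENTS_DATA;
    }

    /**
     * Builds the start of the system's reply.
     * For explicit verification, 'recognized' is a verb phrase describing the action.
     * For implicit verification, 'recognized' is a noun phrase echoing what was heard.
     */
    static String respond(Kind kind, String recognized) {
        if (needsExplicitVerification(kind)) {
            return "Do you want to " + recognized + "? Say yes or no.";
        }
        // Implicit: the echoed phrase itself tells the user what was recognized.
        return recognized + " is...";
    }

    public static void main(String[] args) {
        System.out.println(respond(Kind.TRIGGERS_ACTION,
                "make a payment of $1,000 to Boston Electric"));
        System.out.println(respond(Kind.PRESENTS_DATA,
                "Tomorrow's weather forecast for Boston"));
    }
}
```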
3.4.2 Prompting
Well designed prompts lead users smoothly through a successful interaction with a speech-only application. Many factors must be considered when designing prompts, but the most important is assessing the trade-off between flexibility and performance. The more you constrain what the user can say to an application, the less likely they are to encounter recognition errors. On the other hand, allowing users to enter information flexibly can often speed the interaction (if recognition succeeds), feel more natural, and avoid forcing users to memorize commands. Here are some tips for creating useful prompts.
- Use explicit prompts when the user input must be tightly constrained. For example, after recording a message, the prompt might be "Say cancel, send, or review." This sort of prompt directs the user to say just one of those three keywords.
- Use implicit prompts when the application is able to accept more flexible input. These prompts rely on conversational conventions to constrain the user input. For example, if the user says "Send mail to Bill," and "Bill" is ambiguous, the system prompt might be "Did you mean Bill Smith or Bill Jones?" Users are likely to respond with input such as "Smith" or "I meant Bill Jones." While other responses are possible, conversational convention makes it less likely that they would say "Bill Jones is the one I want."
- When possible, taper prompts to make them shorter. Tapering can be accomplished in one of two ways. If an application is presenting a set of data such as current quotes for a stock portfolio, drop out unnecessary words once a pattern is established. For example:
"As of 15 minutes ago, Sun Microsystems was trading at 45 up 1/2,
Motorola was at 83 up 1/8, and
IBM was at 106 down 1/4."
Tapering can also happen over time. That is, if you need to tell the user the same information more than once, make it shorter each time. For example, you may wish to remind users about the correct way to record a message. The first time they record a message in a session, the instructions might be lengthy. The next time they might be shorter, and the third time just a quick reminder. For example:
"Start recording after the tone. Pause for several seconds when done."
"Record after the tone, then pause."
"Record then pause."
- Use incremental prompts to speed interaction for expert users and provide help for less experienced users. This technique involves starting with a short prompt. If the user does not respond within a time-out period, the application prompts again with more detailed instructions. For example, the initial prompt might be: "Which service?" If the user says nothing, then the prompt could be expanded to: "Say banking, address book, or yellow pages."
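Tapering over time and incremental prompting both come down to choosing a prompt from a counter. The following is a minimal sketch using the example prompts from the tips above; the class `Prompts` and its method names are hypothetical, and the application is assumed to track how often an instruction has been heard and how many time-outs have occurred.

```java
/** Hypothetical prompt selector illustrating tapered and incremental prompts. */
final class Prompts {

    private static final String[] RECORDING_INSTRUCTIONS = {
        "Start recording after the tone. Pause for several seconds when done.",
        "Record after the tone, then pause.",
        "Record then pause."
    };

    /** Tapering over time: the more often the user has heard it, the shorter it gets. */
    static String recordingInstructions(int timesAlreadyHeard) {
        int last = RECORDING_INSTRUCTIONS.length - 1;
        int index = Math.min(Math.max(timesAlreadyHeard, 0), last);
        return RECORDING_INSTRUCTIONS[index];
    }

    /** Incremental prompting: short prompt first, detailed prompt after a time-out. */
    static String servicePrompt(int timeoutsSoFar) {
        return timeoutsSoFar <= 0
                ? "Which service?"
                : "Say banking, address book, or yellow pages.";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println(recordingInstructions(i));
        }
        System.out.println(servicePrompt(0));
        System.out.println(servicePrompt(1));
    }
}
```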
3.4.3 Handling Errors
How a system handles recognition errors can dramatically affect the quality of a user's experience. If either the application or the user detects an error, an effective speech user interface should provide one or more mechanisms for correcting the error. While this seems obvious, correcting a speech input error is not always easy! If the user speaks a word or phrase again, the same error is likely to recur.
Techniques for handling rejection errors are somewhat different from those for handling misrecognitions and misfires. Perhaps the most important advice when handling rejection errors is not to repeat the same error message if the user experiences more than one rejection error in a row. Users find repetition to be hostile. Instead, try to provide progressive assistance. The first message might simply be "What?" If another error occurs, then perhaps, "Sorry. Please rephrase" will get the user to say something different. A third message might provide a tip on how to speak, "Still no luck. Speak clearly, but don't overemphasize."
Another technique is to reprompt with a more explicit prompt (such as a yes/no question) and switch to a more constrained grammar. If possible, provide an alternate input modality. For example, prompt the user to press a key on the telephone pad as an alternative to speaking.
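Progressive assistance can be sketched as a counter of consecutive rejections that selects an increasingly helpful message and finally offers another input modality. The class `RejectionHandler` is hypothetical, and the wording of the keypad fallback is an assumption added for illustration; the first three messages are the examples given above.

```java
/** Hypothetical handler that escalates help after consecutive rejection errors. */
final class RejectionHandler {

    private static final String[] MESSAGES = {
        "What?",
        "Sorry. Please rephrase.",
        "Still no luck. Speak clearly, but don't overemphasize."
    };

    // Assumed fallback wording: a constrained yes/no question with keypad input.
    private static final String KEYPAD_FALLBACK =
        "Please press 1 for yes or 2 for no.";

    private int consecutiveRejections = 0;

    /** Call when the recognizer rejects the input; returns the message to speak. */
    String onRejection() {
        String message = consecutiveRejections < MESSAGES.length
                ? MESSAGES[consecutiveRejections]
                : KEYPAD_FALLBACK;
        consecutiveRejections++;
        return message;
    }

    /** Call when input is accepted, so the next rejection starts from the first message. */
    void onAccepted() {
        consecutiveRejections = 0;
    }

    public static void main(String[] args) {
        RejectionHandler handler = new RejectionHandler();
        for (int i = 0; i < 4; i++) {
            System.out.println(handler.onRejection());
        }
    }
}
```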
As mentioned above, misrecognitions and misfires are harder to detect, and therefore harder to handle. One good strategy is to filter recognition results for unlikely user input. For example, a scheduling application might assume that an error has occurred if the user appears to want to schedule a meeting for 3 a.m.
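A plausibility filter of this kind might look like the sketch below. The class `MeetingTimeFilter` is hypothetical, and the 7:00 to 20:00 window is an arbitrary assumption; an application would choose limits that fit its users. A result that fails the check would be a candidate for explicit verification rather than silent acceptance.

```java
import java.time.LocalTime;

/** Hypothetical filter that flags recognized meeting times users are unlikely to mean. */
final class MeetingTimeFilter {

    // Assumed bounds for a plausible meeting start time.
    static final LocalTime EARLIEST = LocalTime.of(7, 0);
    static final LocalTime LATEST = LocalTime.of(20, 0);

    /** Returns true when the start time falls inside the plausible window (inclusive). */
    static boolean looksPlausible(LocalTime start) {
        return !start.isBefore(EARLIEST) && !start.isAfter(LATEST);
    }

    public static void main(String[] args) {
        System.out.println(looksPlausible(LocalTime.of(3, 0)));  // false
        System.out.println(looksPlausible(LocalTime.of(15, 0))); // true
    }
}
```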
Flexible correction mechanisms that allow a user to correct a portion of the input are helpful. For example, if the user asks for a weather forecast for Boston for Tuesday, the system might respond "Tomorrow's weather for Boston is..." A flexible correction mechanism would allow the user to just correct the day: "No, I said Tuesday."
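The following sketch shows one way a partial correction could be applied to a query held as named slots. Everything here is an assumption for illustration: the class `SlotCorrection`, the slot names `city` and `day`, the fixed list of day words, and the simple pattern used to strip "No, I said" from the utterance. A real application would rely on its grammar to identify which slot the corrected value fills.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: applies "No, I said X" to the one slot that X can fill. */
final class SlotCorrection {

    private static final List<String> DAY_WORDS = Arrays.asList(
        "today", "tomorrow", "monday", "tuesday", "wednesday",
        "thursday", "friday", "saturday", "sunday");

    /** Returns a corrected copy of the slots; the other slots are left unchanged. */
    static Map<String, String> correct(Map<String, String> slots, String utterance) {
        // Strip a leading "No" and an optional "I said", plus trailing punctuation.
        String value = utterance.trim()
                .replaceFirst("(?i)^no\\b,?\\s*(i said\\s+)?", "")
                .replaceAll("[.!]$", "")
                .trim();
        Map<String, String> updated = new LinkedHashMap<>(slots);
        if (DAY_WORDS.contains(value.toLowerCase(Locale.ROOT))) {
            updated.put("day", value);
        } else {
            updated.put("city", value);
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> heard = new LinkedHashMap<>();
        heard.put("city", "Boston");
        heard.put("day", "Tomorrow"); // misrecognized; the user said Tuesday
        System.out.println(correct(heard, "No, I said Tuesday."));
    }
}
```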
3.5 Design Issues for Multi-Modal Applications
Multi-modal applications include other input and output modalities along with speech. For example, speech integrated with a desktop application would be multi-modal, as would speech augmenting the controls of a personal note taker or a radio. While many of the design issues for a multi-modal application are the same as for a speech-only one, some specific issues are unique to applications that provide users with multiple input mechanisms, particularly graphical interfaces driven by keyboard and mouse.
3.5.1 Feedback & Latency
As in speech-only systems, performance delays can cause confusion for users. Fortunately, a graphic display can show the user the state of the recognizer (processing or waiting for input) which a speech-only interface cannot. If a screen is available, displaying the results of the recognizer makes it obvious if the recognizer has heard and if the results were accurate.
As mentioned earlier, the transient nature of speech sometimes causes people to forget what they just said. When dictating, particularly when dictating large amounts of text, this problem is compounded by recognition errors. When a user looks at dictated text and sees it is different from what they recall saying, making a correction is not always easy since they will not necessarily remember what they said or even what they were thinking. Access to a recording of the original speech is extremely helpful in aiding users in the correction of dictated text.
The decision of whether or not to show unfinalized results is a problem in continuous dictation applications. Unfinalized results are words that the recognizer is hypothesizing that the user has said, but for which it has not yet committed a decision. As the user says more, these words may change. Unfinalized text can be hidden from the user, displayed in the text stream in reverse video (or some other highlighted fashion), or shown in a separate window. Eventually, the recognizer makes its best guess and finalizes the words. An application designer makes a trade-off between showing users words that may change and having a delay before the recognizer is able to provide the finalized results. Showing the unfinalized results can be confusing, but not showing any words can lead the user to believe that the system has not heard them.
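The display options can be modeled with a small text buffer that keeps finalized words apart from the current hypothesis. This is a sketch under stated assumptions: `DictationBuffer` and its methods are hypothetical and do not correspond to Java Speech API classes, and square brackets stand in for reverse video or other highlighting.

```java
/** Hypothetical buffer separating finalized dictation text from unfinalized words. */
final class DictationBuffer {

    private final StringBuilder finalized = new StringBuilder();
    private String pending = "";

    /** Replaces the current unfinalized hypothesis; it may change as the user says more. */
    void updatePending(String hypothesis) {
        pending = hypothesis;
    }

    /** Appends text the recognizer has committed to, and clears the hypothesis. */
    void commit(String text) {
        if (finalized.length() > 0) {
            finalized.append(' ');
        }
        finalized.append(text);
        pending = "";
    }

    /** Renders the text; unfinalized words appear in brackets when showPending is true. */
    String render(boolean showPending) {
        if (!showPending || pending.isEmpty()) {
            return finalized.toString();
        }
        return finalized.length() == 0
                ? "[" + pending + "]"
                : finalized + " [" + pending + "]";
    }

    public static void main(String[] args) {
        DictationBuffer buffer = new DictationBuffer();
        buffer.commit("Dear Arlene");
        buffer.updatePending("I will sea");
        System.out.println(buffer.render(true));  // Dear Arlene [I will sea]
        buffer.commit("I will see you");
        System.out.println(buffer.render(true));  // Dear Arlene I will see you
    }
}
```

Calling `render(false)` corresponds to hiding unfinalized text, which avoids showing words that may change at the cost of a visible delay.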
3.5.2 Prompting
Prompts in multi-modal systems can be spoken or printed. Deciding on an appropriate strategy depends greatly on the content and context of the application. If privacy is an issue, it is probably better not to have the computer speak out loud. On the other hand, even a little bit of spoken output can enable eyes-free interaction and can provide the user with the sense of having a conversational partner rather than speaking to an inanimate object.
With a screen available, explicit prompts usually involve providing the user with a list of valid spoken commands. These lists can become cumbersome unless they are organized hierarchically.
Another strategy is to let users speak any text they see on the screen, whether it is menu text or button text or field names. In applications that support more than simple spoken commands, one strategy is to list examples of what the user can say next, rather than a complete laundry list of every possible utterance.
3.5.3 Handling Errors
Multi-modal speech systems that display recognition results make it easier for users to detect errors. If a rejection error occurs, no text will appear in the area where recognition results are displayed. If the recognizer makes a misrecognition or misfire error, the user can see what the recognizer thinks was said and correct any errors.
Even with feedback displayed, an application should not assume that users will always catch errors. Filtering for unexpected input is still helpful, as is allowing the user to switch to a different input modality if recognition is not working reliably.
3.6 Involving Users
Involving users in the design process throughout the lifecycle of a speech application is crucial. A natural, effective interface can only be achieved by understanding how and where and why target users will interact with the application.
3.6.1 Natural Dialog Studies
At the very early stages of design, users can help to define application functionality and, critical to speech interface design, provide input on how humans carry out conversations in the domain of the application. This information can be collected by performing a natural dialog study, which involves asking target users to talk with each other while working through a scenario. For example, if you are designing a telephone-based e-mail program, you might work with pairs of study participants. Put the participants in two separate rooms. Give one participant a telephone and a computer with an e-mail program. Give the other only a telephone. Have the participant with only the telephone call the participant with the computer and ask to have his or her mail read aloud. Leave the task open ended, but add a few guidelines such as "be sure to answer all messages that require a response."
In some natural dialog studies it is advantageous to include a subject matter expert. For example, if you wish to automate a telephone-based financial service, study participants might call up and speak with an expert customer service representative from the financial service company.
Natural dialog studies are an effective technique for collecting vocabulary, establishing commonly used grammatical patterns, and providing ideas for prompt and feedback design. When a subject matter expert is involved, prompt and feedback design can be based on phrases and responses the expert uses when speaking with customers.
In general, natural dialog studies are quick and inexpensive. It is not necessary to include large numbers of participants.
3.6.2 Wizard-of-Oz Studies
Once a preliminary application design is complete, but before the speech application is implemented, a wizard-of-oz study can help test and refine the interface. In these studies, a human wizard - usually using software tools - simulates the speech interface. Major usability problems are often uncovered with these types of simulations. (The term "Wizard of Oz" comes from the classic movie in which the wizard controls an impressive display while hidden behind a curtain.)
Continuing the e-mail example, a wizard-of-oz study might involve bringing in study participants and telling them that the computer is going to read them their e-mail. When they call a telephone number, the human wizard answers, but manipulates the computer so that a synthesized voice speaks to the participant. As the participant asks to navigate through the mailbox, hear messages, or reply to messages, the wizard carries out the operations and has the computer speak the responses.
Since computer tools are usually necessary to carry out a convincing simulation, wizard-of-oz studies are more time-consuming and complicated to run than natural dialog studies. If a prototype of the final application can be built quickly, it may be more cost-effective to move directly to a usability study.
3.6.3 Usability Studies
A usability study assesses how well users are able to carry out the primary tasks that an application is designed to support. Conducting such a study requires at least a preliminary software implementation. The application need not be complete, but some of the core functionality must be working. Usability studies can be conducted either in a laboratory or in the field. Study participants are typically presented with one or more tasks that they must figure out how to accomplish using the application.
With speech applications, usability studies are particularly important for uncovering problems due to recognition errors, which are difficult to simulate effectively in a wizard-of-oz study, but are a leading cause of usability problems. The effectiveness of an application's error recovery functionality must be tested in the environments in which real users will use the application.
Conducting usability tests of speech applications can be a bit tricky. Two standard techniques used in tests of graphical applications -- facilitated discussions and speak-aloud protocols -- cannot be used effectively for speech applications. A facilitated discussion involves having a facilitator in the room with the study participant. Any human-human conversation, however, can interfere with the human-computer conversation, causing recognition errors. Speak-aloud protocols involve asking the study participant to verbalize their thoughts as they work with the software. Obviously this is not desirable when dealing with a speech recognizer. It is best, therefore, to have study participants work in isolation, speaking only into a telephone or microphone. A tester should not intervene unless the participant becomes completely stuck. A follow-up interview can be used to collect the participant's comments and reactions.
3.7 Summary
An effective speech application is one that uses speech to enhance a user's performance of a task or enable an activity that cannot be done without it. Designing an application with speech in mind from the outset is a key success factor. Basing the dialog design on a natural dialog study helps ensure that the input grammar will match the phrasing actually used by people when speaking in the domain of the application. A natural dialog study also helps ensure that prompts and feedback follow conversational conventions that users expect in a cooperative interaction. Once an application is designed, wizard-of-oz and usability studies provide opportunities to test interaction techniques and refine application behavior based on feedback from prototypical users.
3.8 For More Information
The following sources provide additional information on speech user interface design.
- Fraser, N.M. and G.N. Gilbert, "Simulating Speech Systems," Computer Speech and Language, Vol. 5, Academic Press Limited, 1991.
- Raman, T.V. Auditory User Interfaces: Towards the Speaking Computer. Kluwer Academic Publishers, Boston, MA, 1997.
- Roe, D.B. and J.G. Wilpon, editors. Voice Communication Between Humans and Machines. National Academy Press, Washington D.C., 1994.
- Schmandt, C. Voice Communication with Computers: Conversational Systems. Van Nostrand Reinhold, New York, 1994.
- Yankelovich, N., G.A. Levow, and M. Marx, "Designing SpeechActs: Issues in Speech User Interfaces," CHI '95 Conference on Human Factors in Computing Systems, Denver, CO, May 7-11, 1995.
Java™ Speech API Programmer's Guide
Copyright © 1997-1998
Sun Microsystems, Inc.
All rights reserved
Send comments or corrections to javaspeech-comments@sun.com