Chapter 6
Speech Recognition: javax.speech.recognition
A speech recognizer is a speech engine that converts speech to text. The javax.speech.recognition package defines the Recognizer interface to support speech recognition plus a set of supporting classes and interfaces. The basic functional capabilities of speech recognizers, some of the uses of speech recognition and some of the limitations of speech recognizers are described in Section 2.2.
As a type of speech engine, much of the functionality of a Recognizer is inherited from the Engine interface in the javax.speech package and from other classes and interfaces in that package. The javax.speech package and generic speech engine functionality are described in Chapter 4.
The Java Speech API is designed to keep simple speech applications simple, and to make advanced speech applications possible for non-specialist developers. This chapter covers both the simple and advanced capabilities of the javax.speech.recognition package. Where appropriate, some of the more advanced sections are marked so that you can choose to skip them. We begin with a simple code example, and then review the speech recognition capabilities of the API in more detail through the following sections:
- "Hello World!": a simple example of speech recognition
6.1 "Hello World!"
The following example shows a simple application that uses speech recognition. For this application we need to define a grammar of everything the user can say, and we need to write the Java software that performs the recognition task.
A grammar is provided by an application to a speech recognizer to define the words that a user can say, and the patterns in which those words can be spoken. In this example, we define a grammar that allows a user to say "Hello World" or a variant. The grammar is defined using the Java Speech Grammar Format. This format is documented in the Java Speech Grammar Format Specification.
Place this grammar into a file.
    grammar javax.speech.demo;
    public <sentence> = hello world | good morning | hello mighty computer;
This trivial grammar has a single public rule called "sentence". A rule defines what may be spoken by a user. A public rule is one that may be activated for recognition.
The following code shows how to create a recognizer, load the grammar, and then wait for the user to say something that matches the grammar. When it gets a match, it deallocates the engine and exits.
    import javax.speech.*;
    import javax.speech.recognition.*;
    import java.io.FileReader;
    import java.util.Locale;

    public class HelloWorld extends ResultAdapter {
        static Recognizer rec;

        // Receives RESULT_ACCEPTED event: print it, clean up, exit
        public void resultAccepted(ResultEvent e) {
            Result r = (Result)(e.getSource());
            ResultToken tokens[] = r.getBestTokens();

            for (int i = 0; i < tokens.length; i++)
                System.out.print(tokens[i].getSpokenText() + " ");
            System.out.println();

            // Deallocate the recognizer and exit
            try {
                rec.deallocate();
            } catch (Exception ex) {
                // deallocate can report an engine failure; we are exiting anyway
                ex.printStackTrace();
            }
            System.exit(0);
        }

        public static void main(String args[]) {
            try {
                // Create a recognizer that supports English.
                rec = Central.createRecognizer(
                        new EngineModeDesc(Locale.ENGLISH));

                // Start up the recognizer
                rec.allocate();

                // Load the grammar from a file, and enable it
                FileReader reader = new FileReader(args[0]);
                RuleGrammar gram = rec.loadJSGF(reader);
                gram.setEnabled(true);

                // Add the listener to get results
                rec.addResultListener(new HelloWorld());

                // Commit the grammar
                rec.commitChanges();

                // Request focus and start listening
                rec.requestFocus();
                rec.resume();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
This example illustrates the basic steps which all speech recognition applications must perform. Let's examine each step in detail.
- Create: The Central class of the javax.speech package is used to obtain a speech recognizer by calling the createRecognizer method. The EngineModeDesc argument provides the information needed to locate an appropriate recognizer. In this example we requested a recognizer that understands English (since the grammar is written for English).
- Load and enable grammars: The loadJSGF method reads in a JSGF document from a reader created for the file that contains the javax.speech.demo grammar. (Alternatively, the loadJSGF method can load a grammar from a URL.) Next, the grammar is enabled. Once the recognizer receives focus (see below), an enabled grammar is activated for recognition: that is, the recognizer compares incoming audio to the active grammars and listens for speech that matches those grammars.
- Attach a ResultListener: The HelloWorld class extends the ResultAdapter class, which is a trivial implementation of the ResultListener interface. An instance of the HelloWorld class is attached to the Recognizer to receive result events. These events indicate progress as the recognition of speech takes place. In this implementation, we process the RESULT_ACCEPTED event, which is provided when the recognizer completes recognition of input speech that matches an active grammar.
- Commit changes: Any changes to grammars and to the enabled status of a grammar need to be committed to take effect (that includes creation of a new grammar). The reasons for this are described in Section 6.4.2.
- Request focus and resume: For recognition of the grammar to occur, the recognizer must be in the RESUMED state and must have the speech focus. The requestFocus and resume methods achieve this.
- Process result: Once the main method is completed, the application waits until the user speaks. When the user speaks something that matches the loaded grammar, the recognizer issues a RESULT_ACCEPTED event to the listener we attached to the recognizer. The source of this event is a Result object that contains information about what the recognizer heard. The getBestTokens method returns an array of ResultTokens, each of which represents a single spoken word. These words are printed.
6.2 Recognizer as an Engine
The basic functionality provided by a Recognizer includes grammar management and the production of results when a user says things that match active grammars. The Recognizer interface extends the Engine interface to provide this functionality.
The following list summarizes the functionality that the javax.speech.recognition package inherits from the javax.speech package and outlines some of the ways in which that functionality is specialized.
- The properties of a speech engine defined by the EngineModeDesc class apply to recognizers. The RecognizerModeDesc class adds information about dictation capabilities of a recognizer and about users who have trained the engine. Both EngineModeDesc and RecognizerModeDesc are described in Section 4.2.
- Recognizers are searched, selected and created through the Central class in the javax.speech package as described in Section 4.3. That section explains default creation of a recognizer, recognizer selection according to defined properties, and advanced selection and creation mechanisms.
- Recognizers inherit the basic state systems of an engine from the Engine interface, including the four allocation states, the pause and resume state, the state monitoring methods and the state update events. The engine state systems are described in Section 4.4. The two state systems added by recognizers are described in Section 6.3.
- Recognizers produce all the standard engine events (see Section 4.5). The javax.speech.recognition package also extends the EngineListener interface as RecognizerListener to provide events that are specific to recognizers.
- Other functionality inherited from the Engine interface includes the runtime properties (see Section 4.6.1 and Section 6.8), audio management (see Section 4.6.2) and vocabulary management (see Section 4.6.3).
6.3 Recognizer State Systems
6.3.1 Inherited States
As mentioned above, a Recognizer inherits the basic state systems defined in the javax.speech package, particularly through the Engine interface. The basic engine state systems are described in Section 4.4. In this section the two state systems added for recognizers are described. These two state systems represent the status of recognition processing of audio input against grammars, and the recognizer focus.
As a summary, the following state system functionality is inherited from the javax.speech package.
- The basic engine state system represents the current allocation state of the engine: whether resources have been obtained for the engine. The four allocation states are ALLOCATED, DEALLOCATED, ALLOCATING_RESOURCES and DEALLOCATING_RESOURCES.
- The PAUSED and RESUMED states are sub-states of the ALLOCATED state. The paused and resumed states of a recognizer indicate whether audio input is on or off. Pausing a recognizer is analogous to turning off the input microphone: input audio is lost. Section 4.4.7 describes the effect of pausing and resuming a recognizer in more detail.
- The getEngineState method of the Engine interface returns a long value representing the current engine state. The value has a bit set for each of the current states of the recognizer. For example, an ALLOCATED recognizer in the RESUMED state will have both the ALLOCATED and RESUMED bits set.
- The testEngineState and waitEngineState methods are convenience methods for monitoring engine state. The test method tests for presence in a specified state. The wait method blocks until a specific state is reached.
- An EngineEvent is issued to EngineListeners each time an engine changes state. The event class includes the new and old engine states.
The recognizer adds two sub-state systems to the ALLOCATED state, in addition to the inherited pause and resume sub-state system. The two new sub-state systems represent the current activities of the recognizer's internal processing (the LISTENING, PROCESSING and SUSPENDED states) and the current recognizer focus (the FOCUS_ON and FOCUS_OFF states).
These new sub-state systems are parallel states to the PAUSED and RESUMED states and operate nearly independently as shown in Figure 6-1 (an extension of Figure 4-2).
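The bit-per-state encoding summarized above can be explored without a speech engine. The following is a minimal sketch, not the javax.speech implementation: the class EngineStateSketch, its test and change helpers and all bit positions are assumptions made for illustration, and the test helper assumes that a state test succeeds only when every requested bit is set.

```java
// Minimal model of engine-state bit masks. The bit positions are placeholders
// for this sketch; they are not the values defined by javax.speech.
final class EngineStateSketch {
    static final long ALLOCATED  = 1L << 0;
    static final long PAUSED     = 1L << 1;
    static final long RESUMED    = 1L << 2;
    static final long LISTENING  = 1L << 3;
    static final long PROCESSING = 1L << 4;
    static final long SUSPENDED  = 1L << 5;
    static final long FOCUS_ON   = 1L << 6;
    static final long FOCUS_OFF  = 1L << 7;

    // One mask per sub-state system of the ALLOCATED state.
    static final long PAUSE_SYSTEM       = PAUSED | RESUMED;
    static final long RECOGNITION_SYSTEM = LISTENING | PROCESSING | SUSPENDED;
    static final long FOCUS_SYSTEM       = FOCUS_ON | FOCUS_OFF;

    /** Returns true when every bit in wanted is set in current. */
    static boolean test(long current, long wanted) {
        return (current & wanted) == wanted;
    }

    /** Replaces the state of one sub-state system, leaving the others untouched. */
    static long change(long current, long system, long newState) {
        return (current & ~system) | newState;
    }

    public static void main(String[] args) {
        long state = ALLOCATED | RESUMED | LISTENING | FOCUS_OFF;
        System.out.println(test(state, ALLOCATED | RESUMED));   // true

        // Gaining focus changes only the focus sub-state system.
        state = change(state, FOCUS_SYSTEM, FOCUS_ON);
        System.out.println(test(state, FOCUS_ON | LISTENING));  // true
        System.out.println(test(state, FOCUS_OFF));             // false
    }
}
```

Because each sub-state system occupies its own bits in this model, a change of focus leaves the pause and recognition bits untouched, which is one way to picture the "parallel states" of Figure 6-1.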
6.3.2 Recognizer Focus
The FOCUS_ON and FOCUS_OFF states indicate whether this instance of the Recognizer currently has the speech focus. Recognizer focus is a major determining factor in grammar activation, which, in turn, determines what the recognizer is listening for at any time. The role of recognizer focus in activation and deactivation of grammars is described in Section 6.4.3.
A change in engine focus is indicated by a RecognizerEvent (which extends EngineEvent) being issued to RecognizerListeners. A FOCUS_LOST event indicates a change in state from FOCUS_ON to FOCUS_OFF. A FOCUS_GAINED event indicates a change in state from FOCUS_OFF to FOCUS_ON.
When a Recognizer has focus, the FOCUS_ON bit is set in the engine state. When a Recognizer does not have focus, the FOCUS_OFF bit is set. The following code example monitors engine state:
    Recognizer rec; // created and allocated elsewhere

    if (rec.testEngineState(Recognizer.FOCUS_ON)) {
        // we have focus so release it
        rec.releaseFocus();
    }
    // wait until we lose it
    rec.waitEngineState(Recognizer.FOCUS_OFF);
Recognizer focus is relevant to computing environments in which more than one application is using an underlying recognizer. For example, in a desktop environment a user might be running a single speech recognition product (the underlying engine), but have multiple applications using the speech recognizer as a resource. These applications may be a mixture of Java and non-Java applications. Focus is not usually relevant in a telephony environment or in other speech application contexts in which there is only a single application processing the audio input stream.
The recognizer's focus should track the application to which the user is currently talking. When a user indicates a wish to talk to an application (e.g., by selecting the application window, or explicitly saying "switch to application X"), the application requests speech focus by calling the requestFocus method of the Recognizer.
When speech focus is no longer required (e.g., the application has been iconized) the application should call the releaseFocus method to free up focus for other applications.
Both methods are asynchronous - the methods may return before the focus is gained or lost - since focus change may be deferred. For example, if a recognizer is in the middle of recognizing some speech, it will typically defer the focus change until the result is completed. The focus events and the engine state monitoring methods can be used to determine when focus is actually gained or lost.
The focus policy is determined by the underlying recognition engine - it is not prescribed by the javax.speech.recognition package. In most operating environments it is reasonable to assume a policy in which the last application to request focus gets the focus.
Well-behaved applications adhere to the following convention to maximize recognition performance, to minimize their impact upon other applications and to maintain a satisfactory user interface experience. An application should only request focus when it is confident that the user's speech focus (attention) is directed towards it, and it should release focus when it is not required.
6.3.3 Recognition States
The most important (and most complex) state system of a recognizer represents the current recognition activity of the recognizer. An ALLOCATED Recognizer is always in one of the following three states:
- LISTENING state: The Recognizer is listening to incoming audio for speech that may match an active grammar but has not detected speech yet. A recognizer remains in this state while listening to silence and when audio input runs out because the engine is paused.
- PROCESSING state: The Recognizer is processing incoming speech that may match an active grammar. While in this state, the recognizer is producing a result.
- SUSPENDED state: The Recognizer is temporarily suspended while grammars are updated. While suspended, audio input is buffered for processing once the recognizer returns to the LISTENING and PROCESSING states.
This sub-state system is shown in Figure 6-1. The typical state cycle of a recognizer is triggered by user speech. The recognizer starts in the LISTENING state, moves to the PROCESSING state while a user speaks, moves to the SUSPENDED state once recognition of that speech is completed and while grammars are updated in response to user input, and finally returns to the LISTENING state.
In this first event cycle a Result is typically produced that represents what the recognizer heard. Each Result has a state system and the Result state system is closely coupled to this Recognizer state system. The Result state system is discussed in Section 6.7. Many applications (including the "Hello World!" example) do not care about the recognition state but do care about the simpler Result state system.
The other typical event cycle also starts in the LISTENING state. Upon receipt of a non-speech event (e.g., keyboard event, mouse click, timer event) the recognizer is suspended temporarily while grammars are updated in response to the event, and then the recognizer returns to listening.
Applications in which grammars are affected by more than speech events need to be aware of the recognition state system.
The following sections explain these event cycles in more detail and discuss why speech input events are different in some respects from other event types.
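As an orientation aid, the two typical cycles can be written down as a small state machine. This is a sketch under stated assumptions rather than recognizer code: the class RecognitionCycleSketch and its method names are hypothetical, the event names are the ones this chapter uses for the corresponding transitions, and the model rejects any transition outside the two typical cycles even though a real recognizer may permit others.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of the two typical recognition cycles. It is not a recognizer:
// the class and method names are invented for this sketch.
final class RecognitionCycleSketch {
    enum State { LISTENING, PROCESSING, SUSPENDED }

    private State state = State.LISTENING;
    private final List<String> events = new ArrayList<>();

    State state() { return state; }

    /** Events issued so far, in order. */
    List<String> events() { return events; }

    /** User starts speaking: LISTENING -> PROCESSING. */
    void speechDetected() {
        move(State.LISTENING, State.PROCESSING, "RECOGNIZER_PROCESSING");
    }

    /** Recognition of the speech completes: PROCESSING -> SUSPENDED. */
    void recognitionCompleted() {
        move(State.PROCESSING, State.SUSPENDED, "RECOGNIZER_SUSPENDED");
    }

    /** Application suspends after a non-speech event: LISTENING -> SUSPENDED. */
    void suspend() {
        move(State.LISTENING, State.SUSPENDED, "RECOGNIZER_SUSPENDED");
    }

    /** Grammar updates are applied: SUSPENDED -> LISTENING. */
    void changesCommitted() {
        move(State.SUSPENDED, State.LISTENING, "CHANGES_COMMITTED");
    }

    private void move(State from, State to, String event) {
        if (state != from) {
            throw new IllegalStateException("expected " + from + " but was " + state);
        }
        state = to;
        events.add(event);
    }

    public static void main(String[] args) {
        RecognitionCycleSketch rec = new RecognitionCycleSketch();

        // Speech cycle: LISTENING -> PROCESSING -> SUSPENDED -> LISTENING
        rec.speechDetected();
        rec.recognitionCompleted();
        rec.changesCommitted();

        // Non-speech cycle: LISTENING -> SUSPENDED -> LISTENING
        rec.suspend();
        rec.changesCommitted();

        System.out.println(rec.events());
    }
}
```

Both cycles pass through SUSPENDED and return to LISTENING, which is why the suspended state can be treated as a short-lived window for grammar updates.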
6.3.3.1 Speech Events vs. Other Events
A keyboard event, a mouse event, a timer event and a socket event are all instantaneous in time - there is a defined instant at which they occur. The same is not true of speech for two reasons.
Firstly, speech is a temporal activity. Speaking a sentence takes time. For example, a short command such as "reload this web page" will take a second or two to speak, so it is not instantaneous. At the start of the speech the recognizer changes state, and as soon as possible after the end of the speech the recognizer produces a result containing the spoken words.
Secondly, recognizers cannot always recognize words immediately when they are spoken and cannot determine immediately when a user has stopped speaking. The reasons for these technical constraints upon recognition are outside the scope of this guide, but knowing about them is helpful in using a recognizer. (Incidentally, the same principles are generally true of human perception of speech.)
A simple example of why recognizers cannot always respond immediately is listening to a currency amount. If the user says "two dollars" or says "two dollars, fifty cents" with a short pause after the word "dollars", the recognizer can't know immediately whether the user has finished speaking after the "dollars". What a recognizer must do is wait a short period - usually less than a second - to see if the user continues speaking. A second is a long time for a computer, and complications can arise if the user clicks a mouse or does something else in that waiting period. (Section 6.8 explains the time-out parameters that affect this delay.)
A further complication is introduced by the input audio buffering described in Section 6.3.
Putting all this together, there is a requirement for the recognizers to explicitly represent internal state through the LISTENING, PROCESSING and SUSPENDED states.
6.3.3.2 Speech Input Event Cycle
The typical recognition state cycle for a Recognizer occurs as speech input occurs. Technically speaking, this cycle represents the recognition of a single Result. The result state system and result events are described in detail in Section 6.7. The cycle described here is a clockwise trip through the LISTENING, PROCESSING and SUSPENDED states of an ALLOCATED recognizer as shown in Figure 6-1.
The Recognizer starts in the LISTENING state with a certain set of grammars enabled and active. When incoming audio is detected that may match an active grammar, the Recognizer transitions from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event.
The Recognizer then creates a new Result object and issues a RESULT_CREATED event (a ResultEvent) to provide the result to the application. At this point the result is usually empty: it does not contain any recognized words. As recognition proceeds, words are added to the result along with other useful information.
The Recognizer remains in the PROCESSING state until it completes recognition of the result. While in the PROCESSING state the Result may be updated with new information.
The recognizer indicates completion of recognition by issuing a RECOGNIZER_SUSPENDED event to transition from the PROCESSING state to the SUSPENDED state. Once in that state, the recognizer issues a result finalization event to ResultListeners (RESULT_ACCEPTED or RESULT_REJECTED event) to indicate that all information about the result is finalized (words, grammars, audio etc.).
The Recognizer remains in the SUSPENDED state until processing of the result finalization event is completed. Applications will often make grammar changes during the result finalization because the result causes a change in application state or context.
In the SUSPENDED state the Recognizer buffers incoming audio. This buffering allows a user to continue speaking without speech data being lost. Once the Recognizer returns to the LISTENING state the buffered audio is processed to give the user the perception of real-time processing.
Once the result finalization event has been issued to all listeners, the Recognizer automatically commits all grammar changes and issues a CHANGES_COMMITTED event to return to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to GrammarListeners of changed grammars.) The commit applies all grammar changes made at any point up to the end of result finalization, such as changes made in the result finalization events.
The Recognizer is now back in the LISTENING state listening for speech that matches the new grammars.
In this event cycle the first two recognizer state transitions (marked by RECOGNIZER_PROCESSING and RECOGNIZER_SUSPENDED events) are triggered by user actions: starting and stopping speaking. The third state transition (CHANGES_COMMITTED event) is triggered programmatically some time after the RECOGNIZER_SUSPENDED event.
The SUSPENDED state serves as a temporary state in which recognizer configuration can be updated without losing audio data.
6.3.3.3 Non-Speech Event Cycle
For applications that deal only with spoken input the state cycle described above handles most normal speech interactions. For applications that handle other asynchronous input, additional state transitions are possible. Other types of asynchronous input include graphical user interface events (e.g., AWTEvent), timer events, multi-threading events, socket events and so on.
The cycle described here is a temporary transition from the LISTENING state to the SUSPENDED state and back, as shown in Figure 6-1.
When a non-speech event occurs which changes the application state or application data it may be necessary to update the recognizer's grammars. The suspend and commitChanges methods of a Recognizer are used to handle non-speech asynchronous events. The typical cycle for updating grammars in response to a non-speech asynchronous event is as follows.
Assume that the Recognizer is in the LISTENING state (the user is not currently speaking). As soon as the event is received, the application calls suspend to indicate that it is about to change grammars. In response, the recognizer issues a RECOGNIZER_SUSPENDED event and transitions from the LISTENING state to the SUSPENDED state.
With the Recognizer in the SUSPENDED state, the application makes all necessary changes to the grammars. (The grammar changes affected by this event cycle and the pending commit are described in Section 6.4.2.)
Once all grammar changes are completed the application calls the commitChanges method. In response, the recognizer applies the new grammars and issues a CHANGES_COMMITTED event to transition from the SUSPENDED state back to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to the GrammarListeners of all changed grammars.)
Finally, the Recognizer resumes recognition of the buffered audio and then live audio with the new grammars.
The suspend and commit process is designed to provide a number of features to application developers which help give users the perception of a responsive recognition system.
Because audio is buffered from the time of the asynchronous event to the time at which the CHANGES_COMMITTED event occurs, the audio is processed as if the new grammars were applied exactly at the time of the asynchronous event. The user has the perception of real-time processing.
Although audio is buffered in the SUSPENDED state, applications should make grammar changes and call commitChanges as quickly as possible. This minimizes the amount of data in the audio buffer and hence the amount of time it takes for the recognizer to "catch up". It also minimizes the possibility of a buffer overrun.
Technically speaking, an application is not required to call suspend prior to calling commitChanges. If the suspend call is omitted the Recognizer behaves as if suspend had been called immediately prior to calling commitChanges. However, an application that does not call suspend risks a commit occurring unexpectedly while it updates grammars, with the effect of leaving grammars in an inconsistent state.
6.3.4 Interactions of State Systems
The three sub-state systems of an allocated recognizer (shown in Figure 6-1) normally operate independently. There are, however, some indirect interactions.
When a recognizer is paused, audio input is stopped. However, recognizers have a buffer between audio input and the internal process that matches audio against grammars, so recognition can continue temporarily after a recognizer is paused. In other words, a PAUSED recognizer may be in the PROCESSING state.
Eventually the audio buffer will empty. If the recognizer is in the PROCESSING state at that time then the result it is working on is immediately finalized and the recognizer transitions to the SUSPENDED state. Since a well-behaved application treats the SUSPENDED state as a temporary state, the recognizer will eventually leave the SUSPENDED state by committing grammar changes and will return to the LISTENING state.
The PAUSED/RESUMED state of an engine is shared by multiple applications, so it is possible for a recognizer to be paused and resumed because of the actions of another application. Thus, an application should always leave its grammars in a state that would be appropriate for a RESUMED recognizer.
The focus state of a recognizer is independent of the PAUSED and RESUMED states. For instance, it is possible for a paused Recognizer to have FOCUS_ON. When the recognizer is resumed, it will have the focus and its grammars will be activated for recognition.
The focus state of a recognizer is very loosely coupled with the recognition state. An application that has no GLOBAL grammars (described in Section 6.4.3) will not receive any recognition results unless it has recognition focus.
6.4 Recognition Grammars
A grammar defines what a recognizer should listen for in incoming speech. Any grammar defines the set of tokens a user can say (a token is typically a single word) and the patterns in which those words are spoken.
The Java Speech API supports two types of grammars: rule grammars and dictation grammars. These grammars differ in how patterns of words are defined. They also differ in their programmatic use: a rule grammar is defined by an application, whereas a dictation grammar is defined by a recognizer and is built into the recognizer.
A rule grammar is provided by an application to a recognizer to define a set of rules that indicates what a user may say. Rules are defined by tokens, by references to other rules and by logical combinations of tokens and rule references. Rule grammars can be defined to capture a wide range of spoken input from users by the progressive combination of simple grammars and rules.
A dictation grammar is built into a recognizer. It defines a set of words (possibly tens of thousands of words) which may be spoken in a relatively unrestricted way. Dictation grammars are closest to the goal of unrestricted natural speech input to computers. Although dictation grammars are more flexible than rule grammars, recognition of rule grammars is typically faster and more accurate.
Support for a dictation grammar is optional for a recognizer. As Section 4.2 explains, an application that requires dictation functionality can request it when creating a recognizer.
A recognizer may have many rule grammars loaded at any time. However, the current Recognizer interface restricts a recognizer to a single dictation grammar. The technical reasons for this restriction are outside the scope of this guide.
6.4.1 Grammar Interface
The Grammar interface is the root interface that is extended by all grammars. The grammar functionality that is shared by all grammars is presented through this interface.
The RuleGrammar interface is an extension of the Grammar interface to support rule grammars. The DictationGrammar interface is an extension of the Grammar interface to support dictation grammars.
The following are the capabilities presented by the Grammar interface:
- Grammar naming: Every grammar loaded into a recognizer must have a unique name. The getName method returns that name. Grammar names allow references to be made between grammars. The grammar naming convention is described in the Java Speech Grammar Format Specification. Briefly, the grammar naming convention is very similar to the class naming convention for the Java programming language. For example, a grammar from Acme Corp. for dates might be called "com.acme.speech.dates".
- Enabling and disabling: Grammars may be enabled or disabled using the setEnabled method. When a grammar is enabled and when specified activation conditions are met, the grammar is activated. Once a grammar is active a recognizer will listen to incoming audio for speech that matches that grammar. Enabling and activation are described in more detail below (Section 6.4.3).
- Activation mode: This is the property of a grammar that determines which conditions need to be met for a grammar to be activated. The activation mode is managed through the getActivationMode and setActivationMode methods (described in Section 6.4.3). The three available activation modes are defined as constants of the Grammar interface: RECOGNIZER_FOCUS, RECOGNIZER_MODAL and GLOBAL.
- Activation: The isActive method returns a boolean value that indicates whether a Grammar is currently active for recognition.
- GrammarListener: The addGrammarListener and removeGrammarListener methods allow a GrammarListener to be attached to and removed from a Grammar. The GrammarEvents issued to the listener indicate when grammar changes have been committed and whenever the grammar activation state changes.
- ResultListener: The addResultListener and removeResultListener methods allow a ResultListener to be attached to and removed from a Grammar. This listener receives notification of all events for any result that matches the grammar.
6.4.2 Committing Changes
The Java Speech API supports dynamic grammars; that is, it supports the ability for an application to modify grammars at runtime. In the case of rule grammars any aspect of any grammar can be changed at any time.
After making any change to a grammar through the Grammar, RuleGrammar or DictationGrammar interfaces an application must commit the changes. This applies to changes in definitions of rules in a RuleGrammar, to changing context for a DictationGrammar, to changing the enabled state, or to changing the activation mode. (It does not apply to adding or removing a GrammarListener or ResultListener.)
Changes are committed by calling the commitChanges method of the Recognizer. The commit is required for changes to affect the recognition process: that is, the processing of incoming audio.
The commit changes mechanism has two important properties:
- Updates to grammar definitions and the enabled property take effect atomically (all changes take effect at once). There are no intermediate states in which some, but not all, changes have been applied.
- The commitChanges method is a method of Recognizer so all changes to all grammars are committed at once. Again, there are no intermediate states in which some, but not all, changes have been applied.
There is one instance in which changes are committed without an explicit call to the commitChanges method. Whenever a recognition result is finalized (completed), an event is issued to ResultListeners (it is either a RESULT_ACCEPTED or RESULT_REJECTED event). Once processing of that event is completed changes are normally committed. This supports the common situation in which changes are often made to grammars in response to something a user says.
The event-driven commit is closely linked to the underlying state system of a Recognizer. The state system for recognizers is described in detail in Section 6.3.
6.4.3 Grammar Activation
A grammar is active when the recognizer is matching incoming audio against that grammar to determine whether the user is saying anything that matches that grammar. When a grammar is inactive it is not being used in the recognition process.
Applications do not directly activate and deactivate grammars. Instead they are provided with methods for (1) enabling and disabling a grammar, (2) setting the activation mode for each grammar, and (3) requesting and releasing the speech focus of a recognizer (as described in Section 6.3.2).
The enabled state of a grammar is set with the setEnabled method and tested with the isEnabled method. For programmers familiar with AWT or Swing, enabling a speech grammar is similar to enabling a graphical component.
Once enabled, certain conditions must be met for a grammar to be activated. The activation mode indicates when an application wants the grammar to be active. There are three activation modes: RECOGNIZER_FOCUS, RECOGNIZER_MODAL and GLOBAL. For each mode a certain set of activation conditions must be met for the grammar to be activated for recognition. The activation mode is managed with the setActivationMode and getActivationMode methods.
The enabled flag and the activation mode are both parameters of a grammar that need to be committed to take effect. As Section 6.4.2 described, changes need to be committed to affect the recognition processes.
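The distinction between a requested setting and a committed one can be pictured with a toy model. The following sketch is an illustration only: GrammarCommitSketch, GrammarModel and their members are hypothetical names, the initial "disabled" value is a choice of the sketch, and nothing here claims to describe how a real recognizer stores pending changes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of "change, then commit". The types are invented for this
// sketch and only mirror the idea; they are not javax.speech.recognition types.
final class GrammarCommitSketch {
    enum Mode { RECOGNIZER_FOCUS, RECOGNIZER_MODAL, GLOBAL }

    static final class GrammarModel {
        // Values requested by the application (initial values are assumptions).
        private boolean pendingEnabled = false;
        private Mode pendingMode = Mode.RECOGNIZER_FOCUS;
        // Values the recognition process currently uses.
        private boolean committedEnabled = false;
        private Mode committedMode = Mode.RECOGNIZER_FOCUS;

        void setEnabled(boolean enabled) { pendingEnabled = enabled; }
        void setActivationMode(Mode mode) { pendingMode = mode; }
        boolean committedEnabled() { return committedEnabled; }
        Mode committedMode() { return committedMode; }
    }

    private final List<GrammarModel> grammars = new ArrayList<>();

    GrammarModel newGrammar() {
        GrammarModel g = new GrammarModel();
        grammars.add(g);
        return g;
    }

    /** Applies the pending settings of every grammar in a single step. */
    void commitChanges() {
        for (GrammarModel g : grammars) {
            g.committedEnabled = g.pendingEnabled;
            g.committedMode = g.pendingMode;
        }
    }

    public static void main(String[] args) {
        GrammarCommitSketch rec = new GrammarCommitSketch();
        GrammarModel menu = rec.newGrammar();

        menu.setEnabled(true);
        menu.setActivationMode(Mode.GLOBAL);
        // Not committed yet: recognition still sees the old settings.
        System.out.println(menu.committedEnabled() + " " + menu.committedMode());

        rec.commitChanges();
        // Both settings become visible to recognition together.
        System.out.println(menu.committedEnabled() + " " + menu.committedMode());
    }
}
```

In this model the commit belongs to the recognizer-like object rather than to an individual grammar, so every changed grammar is updated in the same step.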
Recognizer focus is a major determining factor in grammar activation and is relevant in computing environments in which more than one application is using an underlying recognizer (e.g., desktop computing with multiple speech-enabled applications). Section 6.3.2 describes how applications can request and release focus and monitor focus through RecognizerEvents and the engine state methods.
Recognizer focus is used to turn on and off activation of grammars. The role of focus depends upon the activation mode. The three activation modes are described here in order from highest priority to lowest. An application should always use the lowest priority mode that is appropriate to its user interface functionality.
- GLOBAL activation mode: if enabled, the Grammar is always active irrespective of whether the Recognizer of this application has focus.
- RECOGNIZER_MODAL activation mode: if enabled, the Grammar is always active when the application's Recognizer has focus. Furthermore, enabling a modal grammar deactivates any grammars in the same Recognizer with the RECOGNIZER_FOCUS activation mode. (The term "modal" is analogous to "modal dialog boxes" in graphical programming.)
- RECOGNIZER_FOCUS activation mode (default mode): if enabled, the Grammar is active when the Recognizer of this application has focus. The exception is that if any other grammar of this application is enabled with RECOGNIZER_MODAL activation mode, then this grammar is not activated.
The current activation state of a grammar can be tested with the isActive method. Whenever a grammar's activation changes, either a GRAMMAR_ACTIVATED or GRAMMAR_DEACTIVATED event is issued to each attached GrammarListener. A grammar activation event typically follows a RecognizerEvent that indicates a change in focus (FOCUS_GAINED or FOCUS_LOST), or a CHANGES_COMMITTED RecognizerEvent that indicates that a change in the enabled setting of a grammar has been applied to the recognition process.
An application may have zero, one or many grammars enabled at any time. Thus, an application may have zero, one or many grammars active at any time. As the conventions below indicate, well-behaved applications always minimize the number of active grammars.
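The activation rules for the three modes can be summarized as a small decision function. The following is only an illustrative model written in plain Java: the ActivationModel class, its Mode enum and the isActive helper are hypothetical names invented for this sketch and are not part of the javax.speech.recognition package. A real application would instead call the isActive method of a Grammar after changes have been committed.

```java
// Illustrative model only: these types are not part of javax.speech.recognition.
class ActivationModel {

    /** The three activation modes named in the text. */
    enum Mode { RECOGNIZER_FOCUS, RECOGNIZER_MODAL, GLOBAL }

    /**
     * Returns true if a grammar would be active under the rules listed above.
     *
     * @param enabled            the grammar's committed enabled flag
     * @param mode               the grammar's committed activation mode
     * @param recognizerHasFocus whether the application's recognizer has focus
     * @param otherModalEnabled  whether another grammar of the same recognizer
     *                           is enabled with RECOGNIZER_MODAL mode
     */
    static boolean isActive(boolean enabled, Mode mode,
                            boolean recognizerHasFocus,
                            boolean otherModalEnabled) {
        if (!enabled) {
            return false; // a disabled grammar is never active
        }
        switch (mode) {
            case GLOBAL:
                return true; // focus is irrelevant for GLOBAL grammars
            case RECOGNIZER_MODAL:
                return recognizerHasFocus;
            case RECOGNIZER_FOCUS:
                // suppressed while a modal grammar is enabled
                return recognizerHasFocus && !otherModalEnabled;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(isActive(true, Mode.GLOBAL, false, false));          // true
        System.out.println(isActive(true, Mode.RECOGNIZER_MODAL, true, false)); // true
        System.out.println(isActive(true, Mode.RECOGNIZER_FOCUS, true, true));  // false
    }
}
```

The model makes the priority ordering visible: a GLOBAL grammar ignores focus entirely, which is why the conventions in this section ask applications to use that mode sparingly.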
The activation and deactivation of grammars is independent of the PAUSED and RESUMED states of the Recognizer. For instance, a grammar can be active even when a recognizer is PAUSED. However, when a Recognizer is paused, audio input to the Recognizer is turned off, so speech won't be detected. This independence is useful because when the recognizer is resumed, recognition against the active grammars immediately (and automatically) resumes.
Activating too many grammars and, in particular, activating multiple complex grammars has an adverse impact upon a recognizer's performance. In general terms, increasing the number of active grammars and increasing the complexity of those grammars can both lead to slower recognition response time, greater CPU load and reduced recognition accuracy (i.e., more mistakes).
Well-behaved applications adhere to the following conventions to maximize recognition performance and minimize their impact upon other applications:
- Never apply the GLOBAL activation mode to a DictationGrammar (most recognizers will throw an exception if this is attempted).
- Always use the default activation mode RECOGNIZER_FOCUS unless there is a good reason to use another mode.
- Only use the RECOGNIZER_MODAL mode when it is certain that deactivating the RECOGNIZER_FOCUS grammars will not adversely affect the user interface.
- Minimize the complexity and the number of RuleGrammars with GLOBAL activation mode. As a general rule, one very simple GLOBAL rule grammar should be sufficient for nearly all applications.
- Only enable a grammar when it is appropriate for a user to say something matching that grammar. Otherwise disable the grammar to improve recognition response time and recognition accuracy for other grammars.
- Only request focus when confident that the user's speech focus (attention) is directed to grammars of your application. Release focus when it is not required.
6.5 Rule Grammars
6.5.1 Rule Definitions
A rule grammar is defined by a set of rules. These rules are defined by logical combinations of tokens to be spoken and references to other rules. The references may refer to other rules defined in the same rule grammar or to rules imported from other grammars.
Rule grammars follow the style and conventions of grammars in the Java Speech Grammar Format (defined in the Java Speech Grammar Format Specification). Any grammar defined in the JSGF can be converted to a RuleGrammar object. Any RuleGrammar object can be printed out in JSGF. (Note that conversion from JSGF to a RuleGrammar and back to JSGF will preserve the logic of the grammar but may lose comments and may change formatting.)
Since the RuleGrammar interface extends the Grammar interface, a RuleGrammar inherits the basic grammar functionality described in the previous sections (naming, enabling, activation etc.).
The easiest way to load a RuleGrammar, or a set of RuleGrammar objects, is from a Java Speech Grammar Format file or URL. The loadJSGF methods of the Recognizer perform this task. If multiple grammars must be loaded (where a grammar references one or more imported grammars), importing by URL is most convenient. The application must specify the base URL and the name of the root grammar to be loaded.
    Recognizer rec;
    URL base = new URL("http://www.acme.com/app");
    String grammarName = "com.acme.demo";
    // loadJSGF locates the grammar relative to the base URL
    Grammar gram = rec.loadJSGF(base, grammarName);
The recognizer converts the base URL and grammar name to a URL using the same conventions as ClassLoader (the Java platform mechanism for loading class files). By converting the periods in the grammar name to slashes ('/'), appending a ".gram" suffix and combining with the base URL, the location is "http://www.acme.com/app/com/acme/demo.gram".
If the demo grammar imports sub-grammars, they will be loaded automatically using the same location mechanism.
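The name-to-location conversion described above can be reproduced with a few lines of string handling. This is a minimal sketch, not the recognizer's own code: the GrammarLocation class and its locate method are hypothetical, and the handling of a trailing slash on the base URL is an assumption made for the sketch rather than documented behavior.

```java
// Hypothetical helper that mirrors the conversion described in the text.
class GrammarLocation {

    /**
     * Converts a base URL and a grammar name to the expected grammar location:
     * periods become slashes and a ".gram" suffix is appended.
     */
    static String locate(String baseUrl, String grammarName) {
        String path = grammarName.replace('.', '/') + ".gram";
        // Assumption: exactly one slash separates the base URL from the path.
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + path;
    }

    public static void main(String[] args) {
        // prints http://www.acme.com/app/com/acme/demo.gram
        System.out.println(locate("http://www.acme.com/app", "com.acme.demo"));
    }
}
```

Laying grammar files out in directories that follow their grammar names, in the same way that class files follow package names, is what allows imported grammars to be found automatically.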
Alternatively, a RuleGrammar can be created by calling the newRuleGrammar method of a Recognizer. This method creates an empty grammar with a specified grammar name.
Once a RuleGrammar has been loaded, or has been created with the newRuleGrammar method, the following methods of a RuleGrammar are used to create, modify and manage the rules of the grammar.
Any of the methods of RuleGrammar that affect the grammar (setRule, deleteRule, setEnabled etc.) take effect only after they are committed (as described in Section 6.4.2).
The rule definitions of a RuleGrammar can be considered as a collection of named Rule objects. Each Rule object is referenced by its rulename (a String). The different types of Rule object are described in Section 6.5.3.
Unlike most collections in Java, the RuleGrammar is a collection that does not share objects with the application. This is because recognizers often need to perform special processing of the rule objects and store additional information internally. The implication for applications is that a call to setRule is required to change any rule. The following code shows an example where changing a rule object does not affect the grammar.
    RuleGrammar gram;
    // Create a rule for the word blue
    // Add the rule to the RuleGrammar and make it public
    RuleToken word = new RuleToken("blue");
    gram.setRule("ruleName", word, true);
    // Change the word
    word.setText("green");
    // getRule returns blue (not green)
    System.out.println(gram.getRule("ruleName"));
To ensure that the changed "green" token is loaded into the grammar, the application must call setRule again after changing the word to "green". Furthermore, for either change to take effect in the recognition process, the changes need to be committed (see Section 6.4.2).
6.5.2 Imports
Complex systems of rules are most easily built by dividing the rules into multiple grammars. For example, a grammar could be developed for recognizing numbers. That grammar could then be imported into two separate grammars that define dates and currency amounts. Those two grammars could then be imported into a travel booking application and so on. This type of hierarchical grammar construction is similar in many respects to object-oriented programming and shares the advantage of easy reuse of grammars.
An import declaration in JSGF and an import in a RuleGrammar are most similar to the import statement of the Java programming language. Unlike a "#include" in the C programming language, the imported grammar is not copied; it is simply referenceable. (A full specification of import semantics is provided in the Java Speech Grammar Format specification.)
The RuleGrammar interface defines three methods for handling imports as shown in Table 6-2.
The resolve method of the RuleGrammar interface is useful in managing imports. Given any rulename, the resolve method returns an object that represents the fully-qualified rulename for the rule that it references.
6.5.3 Rule Classes
A RuleGrammar is primarily a collection of defined rules. The programmatic rule structure used to control Recognizers follows exactly the definition of rules in the Java Speech Grammar Format. Any rule is defined by a Rule object. It may be any one of the Rule classes described in Table 6-3. The exceptions are the RuleParse class, which is returned by the parse method of RuleGrammar, and the Rule class, which is an abstract class and the parent of all other Rule objects.
The following is an example of a grammar in Java Speech Grammar Format. The "Hello World!" example shows how this JSGF grammar can be loaded from a text file. Below we consider how to create the same grammar programmatically.
    grammar com.sun.speech.test;
    public <test> = [a] test {TAG} | another <rule>;
    <rule> = word;
The following code shows the simplest way to create this grammar. It uses the ruleForJSGF method to convert partial JSGF text to a Rule object. Partial JSGF is defined as any legal JSGF text that may appear on the right hand side of a rule definition - technically speaking, any legal JSGF rule expansion.
    Recognizer rec;
    // Create a new grammar
    RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");
    // Create the <test> rule
    Rule test = gram.ruleForJSGF("[a] test {TAG} | another <rule>");
    gram.setRule("test", // rulename
                 test,   // rule definition
                 true);  // true -> make it public
    // Create the <rule> rule
    gram.setRule("rule", gram.ruleForJSGF("word"), false);
    // Commit the grammar
    rec.commitChanges();
6.5.3.1 Advanced Rule Programming
In advanced programs there is often a need to define rules using the set of Rule objects described above. For these applications, using rule objects is more efficient than creating a JSGF string and using the ruleForJSGF method.
To create a rule by code, the detailed structure of the rule needs to be understood. At the top level of our example grammar, the <test> rule is an alternative: the user may say something that matches "[a] test {TAG}" or say something matching "another <rule>". The two alternatives are each sequences containing two items. In the first alternative, the brackets around the token "a" indicate it is optional. The "{TAG}" following the second token ("test") attaches a tag to the token. The second alternative is a sequence with a token ("another") and a reference to another rule ("<rule>").
The code to construct this Grammar follows (this code example is not compact - it is written for clarity of details).
    Recognizer rec;
    RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");
    // Rule we are building
    RuleAlternatives test;
    // Temporary rules
    RuleCount r1;
    RuleTag r2;
    RuleSequence seq1, seq2;
    // Create "[a]"
    r1 = new RuleCount(new RuleToken("a"), RuleCount.OPTIONAL);
    // Create "test {TAG}" - a tagged token
    r2 = new RuleTag(new RuleToken("test"), "TAG");
    // Join "[a]" and "test {TAG}" into a sequence "[a] test {TAG}"
    seq1 = new RuleSequence(r1);
    seq1.append(r2);
    // Create the sequence "another <rule>"
    seq2 = new RuleSequence(new RuleToken("another"));
    seq2.append(new RuleName("rule"));
    // Build "[a] test {TAG} | another <rule>"
    test = new RuleAlternatives(seq1);
    test.append(seq2);
    // Add <test> to the RuleGrammar as a public rule
    gram.setRule("test", test, true);
    // Provide the definition of <rule>, a non-public RuleToken
    gram.setRule("rule", new RuleToken("word"), false);
    // Commit the grammar changes
    rec.commitChanges();
6.5.4 Dynamic Grammars
Grammars may be modified and updated. The changes allow an application to account for shifts in the application's context, changes in the data available to it, and so on. This flexibility allows application developers considerable freedom in creating dynamic and natural speech interfaces.
For example, in an email application the list of known users may change during the normal operation of the program. The <sendEmail> command,
    <sendEmail> = send email to <user>;
references the <user> rule which may need to be changed as new email arrives. This code snippet shows the update and commit of a change in users.
    Recognizer rec;
    RuleGrammar gram;
    String names[] = {"amy", "alan", "paul"};
    Rule userRule = new RuleAlternatives(names);
    // third argument is the public flag; false assumes <user> is only
    // referenced from <sendEmail> and need not be public
    gram.setRule("user", userRule, false);
    // apply the changes
    rec.commitChanges();
Committing grammar changes can, in certain cases, be a slow process. It might take a few tenths of a second or up to several seconds. The time to commit changes depends on a number of factors. First, recognizers have different mechanisms for committing changes, making some recognizers faster than others. Second, the time to commit changes may depend on the extent of the changes - more changes may require more time to commit. Third, the time to commit may depend upon the type of changes. For example, some recognizers optimize for changes to lists of tokens (e.g. name lists). Finally, faster computers make changes more quickly.
The other factor which influences dynamic changes is the timing of the commit. As Section 6.4.2 describes, grammar changes are not always committed instantaneously. For example, if the recognizer is busy recognizing speech (in the PROCESSING state), then the commit of changes is deferred until the recognition of that speech is completed.
6.5.5 Parsing
Parsing is the process of matching text to a grammar. Applications use parsing to break down spoken input into a form that is more easily handled in software. Parsing is most useful when the structure of the grammars clearly separates the parts of spoken text that an application needs to process. Examples are given below of this type of structuring.
The text may be in the form of a String or array of String objects (one String per token), or in the form of a FinalRuleResult object that represents what a recognizer heard a user say. The RuleGrammar interface defines three forms of the parse method - one for each form of text.
The parse method returns a RuleParse object (a descendant of Rule) that represents how the text matches the RuleGrammar. The structure of the RuleParse object mirrors the structure of rules defined in the RuleGrammar. Each Rule object in the structure of the rule being parsed against is mirrored by a matching Rule object in the returned RuleParse object.
The difference between the structures comes about because the text being parsed defines a single phrase that a user has spoken whereas a RuleGrammar defines all the phrases the user could say. Thus the text defines a single path through the grammar and all the choices in the grammar (alternatives, and rules that occur optionally or occur zero or more times) are resolvable.
The mapping between the objects in the rules defined in the RuleGrammar and the objects in the RuleParse structure is shown in Table 6-4. Note that except for the RuleCount and RuleName objects, the objects in the parse tree are of the same type as the rule objects being parsed against (marked with "**"), but the internal data may differ.
As an example, take the following simple extract from a grammar. The public rule, <command>, may be spoken in many ways. For example, "open", "move that door" or "close that door please".
    public <command> = <action> [<object>] [<polite>];
    <action> = open {OP} | close {CL} | move {MV};
    <object> = [<this_that_etc>] (window | door);
    <this_that_etc> = a | the | this | that | the current;
    <polite> = please | kindly;
Note how the rules are defined to clearly separate the segments of spoken input that an application must process. Specifically, the <action> and <object> rules indicate how an application must respond to a command. Furthermore, anything said that matches the <polite> rule can be safely ignored, and usually the <this_that_etc> rule can be ignored too.
The parse for "open" against <command> has the following structure which matches the structure of the grammar above.
    RuleParse(<command> =
      RuleSequence(
        RuleParse(<action> =
          RuleAlternatives(
            RuleTag(
              RuleToken("open"), "OP")))))
The match of the <command> rule is represented by a RuleParse object. Because the definition of <command> is a sequence of 3 items (2 of which are optional), the parse of <command> is a sequence. Because only one of the 3 items is spoken (in "open"), the sequence contains a single item. That item is the parse of the <action> rule.
The reference to <action> in the definition of <command> is represented by a RuleName object in the grammar definition, and this maps to a RuleParse object when parsed. The <action> rule is defined by a set of three alternatives (RuleAlternatives object) which maps to another RuleAlternatives object in the parse but with only the single spoken alternative represented. Since the phrase spoken was "open", the parse matches the first of the three alternatives which is a tagged token. Therefore the parse includes a RuleTag object which contains a RuleToken object for "open".
The following is the parse for "close that door please".
    RuleParse(<command> =
      RuleSequence(
        RuleParse(<action> =
          RuleAlternatives(
            RuleTag(
              RuleToken("close"), "CL")))
        RuleSequence(
          RuleParse(<object> =
            RuleSequence(
              RuleSequence(
                RuleParse(<this_that_etc> =
                  RuleAlternatives(
                    RuleToken("that"))))
              RuleAlternatives(
                RuleToken("door"))))
        RuleSequence(
          RuleParse(<polite> =
            RuleAlternatives(
              RuleToken("please"))))
      ))
There are three parsing issues that application developers should consider.
- There may be several legal ways to parse the text against the grammar. This is known as an ambiguous parse. In this instance the parse method will return one of the legal parses but the application is not informed of the ambiguity. As a general rule, most developers will want to avoid ambiguous parses by proper grammar design. Advanced applications will use specialized parsers if they need to handle ambiguity.
- If a FinalRuleResult is parsed against the RuleGrammar and the rule within that grammar that it matched, then it should successfully parse. However, it is not guaranteed to parse if the RuleGrammar has been modified or if the FinalRuleResult is a REJECTED result. (Result rejection is described in Section 6.7.)
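To illustrate why a well-structured grammar makes a parse easy to process, the sketch below rebuilds the "close that door please" parse as a simple tree and extracts the action tag and the object tokens. It is a stand-in written in plain Java under stated assumptions: the ParseTreeSketch class, its Node type and its helper methods are hypothetical and only imitate the shape of a RuleParse structure; they are not the javax.speech.recognition rule classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse tree; not the javax.speech.recognition classes.
class ParseTreeSketch {

    static final class Node {
        final String kind;         // "parse", "sequence", "alternatives", "tag" or "token"
        final String value;        // rule name, tag text or token text; null otherwise
        final List<Node> children;

        Node(String kind, String value, Node... children) {
            this.kind = kind;
            this.value = value;
            this.children = Arrays.asList(children);
        }
    }

    static Node parse(String rule, Node body) { return new Node("parse", rule, body); }
    static Node seq(Node... items)            { return new Node("sequence", null, items); }
    static Node alt(Node chosen)              { return new Node("alternatives", null, chosen); }
    static Node tag(Node item, String tag)    { return new Node("tag", tag, item); }
    static Node token(String text)            { return new Node("token", text); }

    /** Collects the values of all nodes of the given kind, in spoken order. */
    static List<String> collect(Node node, String kind) {
        List<String> out = new ArrayList<>();
        collectInto(node, kind, out);
        return out;
    }

    private static void collectInto(Node node, String kind, List<String> out) {
        if (node.kind.equals(kind)) {
            out.add(node.value);
        }
        for (Node child : node.children) {
            collectInto(child, kind, out);
        }
    }

    /** Finds the first parse of the named rule (depth-first), or null if absent. */
    static Node find(Node node, String rule) {
        if (node.kind.equals("parse") && node.value.equals(rule)) {
            return node;
        }
        for (Node child : node.children) {
            Node hit = find(child, rule);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Mirrors the parse shown above for "close that door please". */
    static Node closeThatDoorPlease() {
        return parse("command", seq(
            parse("action", alt(tag(token("close"), "CL"))),
            seq(parse("object", seq(
                seq(parse("this_that_etc", alt(token("that")))),
                alt(token("door"))))),
            seq(parse("polite", alt(token("please"))))));
    }

    public static void main(String[] args) {
        Node root = closeThatDoorPlease();
        System.out.println(collect(root, "tag"));                    // [CL]
        System.out.println(collect(find(root, "object"), "token"));  // [that, door]
    }
}
```

An application working this way needs only the tag of the <action> rule and the tokens under <object>; anything matched by <polite> is never looked at.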
6.6 Dictation Grammars
Dictation grammars come closest to the ultimate goal of a speech recognition system that takes natural spoken input and transcribes it as text. Dictation grammars are used for free text entry in applications such as email and word processing.
A Recognizer that supports dictation provides a single DictationGrammar which is obtained from the recognizer's getDictationGrammar method. A recognizer that supports the Java Speech API is not required to provide a DictationGrammar. Applications that require a recognizer with dictation capability can explicitly request dictation when creating a recognizer by setting the DictationGrammarSupported property of the RecognizerModeDesc to true (see Section 4.2 for details).
A DictationGrammar is more complex than a rule grammar, but fortunately, a DictationGrammar is often easier to use than a rule grammar. This is because the DictationGrammar is built into the recognizer so most of the complexity is handled by the recognizer and hidden from the application. However, recognition of a dictation grammar is typically more computationally expensive and less accurate than that of simple rule grammars.
The DictationGrammar inherits its basic functionality from the Grammar interface. That functionality is detailed in Section 6.4 and includes grammar naming, enabling, activation, committing and so on.
As with all grammars, changes to a DictationGrammar need to be committed before they take effect. Commits are described in Section 6.4.2.
In addition to the specific functionality described below, a DictationGrammar is typically adaptive. In an adaptive system, a recognizer improves its performance (accuracy and possibly speed) by adapting to the style of language used by a speaker. The recognizer may adapt to the specific sounds of a speaker (the way they say words). Equally importantly for dictation, a recognizer can adapt to a user's normal vocabulary and to the patterns of those words. Such adaptation (technically known as language model adaptation) is a part of the recognizer's implementation of the DictationGrammar and does not affect an application. The adaptation data for a dictation grammar is maintained as part of a speaker profile (see Section 6.9).
The DictationGrammar extends and specializes the Grammar interface by adding the following functionality:
The following methods provided by the DictationGrammar interface allow an application to manage word lists and text context.
6.6.1 Dictation Context
Dictation recognizers use a range of information to improve recognition accuracy. Learning the words a user speaks and the patterns of those words can substantially improve accuracy.
Because patterns of words are important, context is important. The context of a word is simply the set of surrounding words. As an example, consider the following sentence "If I have seen further it is by standing on the shoulders of Giants" (Sir Isaac Newton). If we are editing this sentence and place the cursor after the word "standing" then the preceding context is "...further it is by standing" and the following context is "on the shoulders of Giants...".
Given this context, the recognizer is able to more reliably predict what a user might say, and greater predictability can improve recognition accuracy. In this example, the user might insert the word "up" but is less likely to insert the word "JavaBeans".
Through the setContext method of the DictationGrammar interface, an application should tell the recognizer the current textual context. Furthermore, if the context changes (for example, due to a mouse click to move the cursor) the application should update the context.
Different recognizers process context differently. The main consideration for the application is the amount of context to provide to the recognizer. As a minimum, a few words of preceding and following context should be provided. However, some recognizers may take advantage of several paragraphs or more.
There are two setContext methods:
    void setContext(String preceding, String following);
    void setContext(String preceding[], String following[]);
The first form takes plain text context strings. The second version should be used when the result tokens returned by the recognizer are available. Internally, the recognizer processes context according to tokens, so providing tokens makes the use of context more efficient and more reliable because the recognizer does not have to guess the tokenization.
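As a rough illustration of preparing the two arguments for the plain-text form of setContext, the following sketch splits an editor's text at the cursor and keeps a limited number of words on each side. The DictationContextSketch class, the contextAround method and the maxWords limit are assumptions made for this example; the sketch also assumes the cursor sits on a word boundary and that words are separated by whitespace.

```java
import java.util.Arrays;

// Hypothetical helper for computing the text around a cursor position.
class DictationContextSketch {

    /**
     * Returns a two-element array: the preceding context and the following
     * context around the caret, each limited to maxWords whole words.
     */
    static String[] contextAround(String text, int caret, int maxWords) {
        if (caret < 0 || caret > text.length()) {
            throw new IndexOutOfBoundsException("caret: " + caret);
        }
        String before = text.substring(0, caret).trim();
        String after = text.substring(caret).trim();
        return new String[] { lastWords(before, maxWords), firstWords(after, maxWords) };
    }

    /** Keeps at most n words from the end of s. */
    static String lastWords(String s, int n) {
        if (s.isEmpty() || n <= 0) {
            return "";
        }
        String[] words = s.split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }

    /** Keeps at most n words from the start of s. */
    static String firstWords(String s, int n) {
        if (s.isEmpty() || n <= 0) {
            return "";
        }
        String[] words = s.split("\\s+");
        int to = Math.min(n, words.length);
        return String.join(" ", Arrays.copyOfRange(words, 0, to));
    }

    public static void main(String[] args) {
        String text = "If I have seen further it is by standing on the shoulders of Giants";
        int caret = text.indexOf("standing") + "standing".length();
        String[] context = contextAround(text, caret, 5);
        System.out.println(context[0]); // further it is by standing
        System.out.println(context[1]); // on the shoulders of Giants
    }
}
```

The two strings would then be passed as the preceding and following arguments of setContext, and recomputed whenever the cursor moves.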
6.7 Recognition Results
A recognition result is provided by a Recognizer to an application when the recognizer "hears" incoming speech that matches an active grammar. The result tells the application what words the user said and provides a range of other useful information, including alternative guesses and audio data.
In this section, both the basic and advanced capabilities of the result system in the Java Speech API are described. The sections relevant to basic rule grammar-based applications are those that cover result finalization (Section 6.7.1), the hierarchy of result interfaces (Section 6.7.2), the data provided through those interfaces (Section 6.7.3), and common techniques for handling finalized rule results (Section 6.7.9).
For dictation applications the relevant sections include those listed above plus the sections covering token finalization (Section 6.7.8), handling of finalized dictation results (Section 6.7.10) and result correction and training (Section 6.7.12).
For more advanced applications relevant sections might include the result life cycle (Section 6.7.4), attachment of ResultListeners (Section 6.7.5), the relationship of recognizer and result states (Section 6.7.6), grammar finalization (Section 6.7.7), result audio (Section 6.7.11), rejected results (Section 6.7.13), result timing (Section 6.7.14), and the loading and storing of vendor formatted results (Section 6.7.15).
6.7.1 Result Finalization
The "Hello World!" example illustrates the simplest way to handle results. In that example, a RuleGrammar was loaded, committed and enabled, and a ResultListener was attached to a Recognizer to receive events associated with every result that matched that grammar. In other words, the ResultListener was attached to receive information about words spoken by a user that are heard by the recognizer.
The following is a modified extract of the "Hello World!" example to illustrate the basics of handling results. In this case, a ResultListener is attached to a Grammar (instead of a Recognizer) and it prints out everything the recognizer hears that matches that grammar. (There are, in fact, three ways in which a ResultListener can be attached: see Section 6.7.5.)
    import javax.speech.*;
    import javax.speech.recognition.*;
    public class MyResultListener extends ResultAdapter {
        // Receives RESULT_ACCEPTED event: print it
        public void resultAccepted(ResultEvent e) {
            Result r = (Result)(e.getSource());
            ResultToken tokens[] = r.getBestTokens();
            for (int i = 0; i < tokens.length; i++)
                System.out.print(tokens[i].getSpokenText() + " ");
            System.out.println();
        }
        // somewhere in app, add a ResultListener to a grammar
        {
            RuleGrammar gram = ...;
            gram.addResultListener(new MyResultListener());
        }
    }
The code shows the MyResultListener class which is an extension of the ResultAdapter class. The ResultAdapter class is a convenience implementation of the ResultListener interface (provided in the javax.speech.recognition package). When extending the ResultAdapter class we simply implement the methods for the events that we care about.
In this case, the RESULT_ACCEPTED event is handled. This event is issued to the resultAccepted method of the ResultListener and is issued when a result is finalized. Finalization of a result occurs after a recognizer has completed processing of a result. More specifically, finalization occurs when all information about a result has been produced by the recognizer and when the recognizer can guarantee that the information will not change. (Result finalization should not be confused with object finalization in the Java programming language in which objects are cleaned up before garbage collection.)
There are actually two ways to finalize a result which are signalled by the RESULT_ACCEPTED and RESULT_REJECTED events. A result is accepted when a recognizer is confident that it has correctly heard the words spoken by a user (i.e., the tokens in the Result exactly represent what a user said).
Rejection occurs when a Recognizer is not confident that it has correctly recognized a result: that is, the tokens and other information in the result do not necessarily match what a user said. Many applications will ignore the RESULT_REJECTED event and most will ignore the detail of a result when it is rejected. In some applications, a RESULT_REJECTED event is used simply to provide users with feedback that something was heard but no action was taken, for example, by displaying "???" or sounding an error beep. Rejected results and the differences between accepted and rejected results are described in more detail in Section 6.7.13.
An accepted result is not necessarily a correct result. As is pointed out in Section 2.2.3, recognizers make errors when recognizing speech for a range of reasons. The implication is that even for an accepted result, application developers should consider the potential impact of a misrecognition. Where a misrecognition could cause an action with serious consequences or could make changes that can't be undone (e.g., "delete all files"), the application should check with users before performing the action. As recognition systems continue to improve, the number of errors is steadily decreasing, but as with human speech recognition there will always be a chance of a misunderstanding.
6.7.2 Result Interface Hierarchy
A finalized result can include a considerable amount of information. This information is provided through four separate interfaces and through the implementation of these interfaces by a recognition system.
    // Result: the root result interface
    interface Result;
    // FinalResult: info on all finalized results
    interface FinalResult extends Result;
    // FinalRuleResult: a finalized result matching a RuleGrammar
    interface FinalRuleResult extends FinalResult;
    // FinalDictationResult: a final result for a DictationGrammar
    interface FinalDictationResult extends FinalResult;
    // A result implementation provided by a Recognizer
    public class EngineResult implements FinalRuleResult, FinalDictationResult;
At first sight, the result interfaces may seem complex. The reasons for providing several interfaces are as follows:
- The information available for a result is different in different states of the result. Before finalization, a limited amount of information is available through the Result interface. Once a result is finalized (accepted or rejected), more detailed information is available through the FinalResult interface and either the FinalRuleResult or FinalDictationResult interface.
- The type of information available for a finalized result is different for a result that matches a RuleGrammar than for a result that matches a DictationGrammar. The differences are explicitly represented by having separate interfaces for FinalRuleResult and FinalDictationResult.
- Once a result object is created as a specific Java class it cannot be changed to another class. Because a result object must eventually support the final interfaces, it must implement them when first created. Therefore, every result implements all three final interfaces when it is first created: FinalResult, FinalRuleResult and FinalDictationResult.
- When a result is first created a recognizer does not always know whether it will eventually match a RuleGrammar or a DictationGrammar. Therefore, every result object implements both the FinalRuleResult and FinalDictationResult interfaces.
- A call made to any method of any of the final interfaces before a result is finalized causes a ResultStateException.
- A call made to any method of the FinalRuleResult interface for a result that matches a DictationGrammar causes a ResultStateException. Similarly, a call made to any method of the FinalDictationResult interface for a result that matches a RuleGrammar causes a ResultStateException.
- All the result functionality is provided by interfaces in the javax.speech.recognition package rather than by classes. This is because the Java Speech API can support multiple recognizers from multiple vendors and interfaces allow the vendors greater flexibility in implementing results.
The multitude of interfaces is, in fact, designed to simplify application programming and to minimize the chance of introducing bugs into code by allowing compile-time checking of result calls. The two basic principles for calling the result interfaces are the following:
- If it is safe to call the methods of a particular interface then it is safe to call the methods of any of the parent interfaces. For example, for a finalized result matching a
RuleGrammar, the methods of theFinalRuleResultinterface are safe, so the methods of theFinalResultandResultinterfaces are also safe. Similarly, for a finalized result matching aDictationGrammar, the methods ofFinalDictationResult,FinalResultandResultcan all be called safely.- Use type casting of a result object to ensure compile-time checks of method calls. For example, in events to an unfinalized result, cast the result object to the
Resultinterface. For aRESULT_ACCEPTEDfinalization event with a result that matches aDictationGrammar, cast the result to theFinalDictationResultinterface.In the next section the different information available through the different interfaces is described. In all the following sections that deal with result states and result events, details are provided on the appropriate casting of result objects.
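The design described above can be modelled in a few lines of plain Java. The following is only a sketch under stated assumptions: the nested interfaces are simplified stand-ins that borrow the names of the real interfaces, the class SketchResult, its accept method, the Kind enum and the helper rejectsCall are hypothetical, the methods take no arguments, and IllegalStateException stands in for ResultStateException. It shows one object implementing every final interface from creation, with each final method guarded by a state check.

```java
public class ResultInterfacesSketch {
    // Simplified stand-ins; none of this is the javax.speech.recognition API.
    enum State { UNFINALIZED, ACCEPTED, REJECTED }
    enum Kind { RULE, DICTATION }

    interface Result { State getResultState(); }
    interface FinalResult extends Result { }
    interface FinalRuleResult extends FinalResult { String getRuleName(); }
    interface FinalDictationResult extends FinalResult { String[] getAlternatives(); }

    // One class implements all final interfaces from the moment it is created,
    // because its Java class cannot change once the grammar type is known.
    static class SketchResult implements FinalRuleResult, FinalDictationResult {
        private State state = State.UNFINALIZED;
        private Kind kind; // unknown until the result is finalized

        // Hypothetical hook that a recognizer would call on acceptance.
        void accept(Kind matched) { kind = matched; state = State.ACCEPTED; }

        @Override public State getResultState() { return state; }

        @Override public String getRuleName() {
            require(Kind.RULE);
            return "command";
        }

        @Override public String[] getAlternatives() {
            require(Kind.DICTATION);
            return new String[] { "alienated" };
        }

        // Guard shared by all final methods (stand-in for ResultStateException).
        private void require(Kind needed) {
            if (state == State.UNFINALIZED)
                throw new IllegalStateException("result is not finalized");
            if (kind != needed)
                throw new IllegalStateException("result matches the other grammar type");
        }
    }

    // Returns true if the call is refused with a state error.
    static boolean rejectsCall(Runnable call) {
        try { call.run(); return false; }
        catch (IllegalStateException e) { return true; }
    }

    public static void main(String[] args) {
        SketchResult r = new SketchResult();
        System.out.println(rejectsCall(r::getRuleName));     // refused: still unfinalized
        r.accept(Kind.RULE);
        FinalRuleResult rule = r;                            // narrow the type once it is safe
        System.out.println(rule.getRuleName());              // allowed: matches a rule grammar
        System.out.println(rejectsCall(r::getAlternatives)); // refused: not a dictation result
    }
}
```

Holding the result in a variable of the narrowest safe interface type, as with rule above, means the compiler rejects calls to methods that are not valid for that kind of result.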
6.7.3 Result Information
As the previous section describes, different information is available for a result depending upon the state of the result and, for finalized results, depending upon the type of grammar it matches (RuleGrammar or DictationGrammar).
6.7.3.1 Result Interface
The information available through the Result interface is available for any result in any state - finalized or unfinalized - and matching any grammar.
- Result state: The getResultState method returns the current state of the result. The three possible state values defined by static values of the Result interface are UNFINALIZED, ACCEPTED and REJECTED. (Result states are described in more detail in Section 6.7.4.)
- Grammar: The getGrammar method returns a reference to the matched Grammar, if it is known. For an ACCEPTED result, this method will return a RuleGrammar or a DictationGrammar. For a REJECTED result, this method may return a grammar, or may return null if the recognizer could not identify the grammar for this result. In the UNFINALIZED state, this method returns null before a GRAMMAR_FINALIZED event, and non-null afterwards.
- Number of finalized tokens: The numTokens method returns the total number of finalized tokens for a result. For an unfinalized result this may be zero or greater. For a finalized result this number is always greater than zero for an ACCEPTED result but may be zero or more for a REJECTED result. Once a result is finalized this number will not change.
- Finalized tokens: The getBestToken and getBestTokens methods return either a specified finalized best-guess token of a result or all the finalized best-guess tokens. The ResultToken object and token finalization are described in the following sections.
- Unfinalized tokens: In the UNFINALIZED state, the getUnfinalizedTokens method returns a list of unfinalized tokens. An unfinalized token is a recognizer's current guess of what a user has said, but the recognizer may choose to change these tokens at any time and in any way. For a finalized result, the getUnfinalizedTokens method always returns null.
In addition to the information detailed above, the Result interface provides the addResultListener and removeResultListener methods which allow a ResultListener to be attached to and removed from an individual result. ResultListener attachment is described in more detail in Section 6.7.5.
6.7.3.2 FinalResult Interface
The information available through the FinalResult interface is available for any finalized result, including results that match either a RuleGrammar or DictationGrammar.
- Audio data: a Recognizer may optionally provide audio data for a finalized result. This data is provided as an AudioClip for a token, a sequence of tokens, or for the entire result. Result audio and its management are described in more detail in Section 6.7.11.
- Training data: many recognizers have the ability to be trained and corrected. By training a recognizer or correcting its mistakes, a recognizer can adapt its recognition processes so that performance (accuracy and speed) improves over time. Several methods of the FinalResult interface support this capability and are described in detail in Section 6.7.12.
6.7.3.3 FinalDictationResult Interface
The FinalDictationResult interface contains a single method.
- Alternative tokens: The getAlternativeTokens method allows an application to request a set of alternative guesses for a single token or for a sequence of tokens in that result. In dictation systems, alternative guesses are typically used to facilitate correction of dictated text. Dictation recognizers are designed so that when they do make a misrecognition, the correct word sequence is usually amongst the best few alternative guesses. Section 6.7.10 discusses alternative guesses for dictation results in more detail.
6.7.3.4 FinalRuleResult Interface
Like the FinalDictationResult interface, the FinalRuleResult interface provides alternative guesses. The FinalRuleResult interface also provides some additional information that is useful in processing results that match a RuleGrammar.
- Alternative tokens: The getAlternativeTokens method allows an application to request a set of alternative guesses for the entire result (not for tokens). The getNumberGuesses method returns the actual number of alternative guesses available.
- Alternative grammars: The alternative guesses of a result matching a RuleGrammar do not all necessarily match the same grammar. The getRuleGrammar method returns a reference to the RuleGrammar matched by an alternative.
- Rulenames: When a result matches a RuleGrammar, it matches a specific defined rule of that RuleGrammar. The getRuleName method returns the rulename for the matched rule. Section 6.7.9 discusses RuleGrammar results in more detail.
- Tags: A tag is a string attached to a component of a RuleGrammar definition. Tags are useful in simplifying the software for processing results matching a RuleGrammar (explained in Section 6.7.9). The getTags method returns the tags for the best guess for a FinalRuleResult.
6.7.4 Result Life Cycle
A Result is produced in response to a user's speech. Unlike keyboard input, mouse input and most other forms of user input, speech is not instantaneous (see Section 6.3.3.1 for more detail). As a consequence, a speech recognition result is not produced instantaneously. Instead, a Result is produced through a sequence of events starting some time after a user starts speaking and usually finishing some time after the user stops speaking.
Figure 6-2 shows the state system of a Result and the associated ResultEvents. As in the recognizer state diagram (Figure 6-1), the blocks represent states, and the labelled arcs represent transitions that are signalled by ResultEvents.
[Figure 6-2: state system of a Result]
Every result starts in the UNFINALIZED state when a RESULT_CREATED event is issued. While the result is unfinalized, the recognizer provides information including finalized and unfinalized tokens and the identity of the grammar matched by the result. As this information is added, the RESULT_UPDATED and GRAMMAR_FINALIZED events are issued.
Once all information associated with a result is finalized, the entire result is finalized. As Section 6.7.1 explained, a result is finalized with either a RESULT_ACCEPTED or RESULT_REJECTED event placing it in either the ACCEPTED or REJECTED state. At that point all information associated with the result becomes available including the best guess tokens and the information provided through the three final result interfaces (see Section 6.7.3).
Once finalized the information available through all the result interfaces is fixed. The only exceptions are for the release of audio data and training data. If audio data is released, an AUDIO_RELEASED event is issued (see detail in Section 6.7.11). If training information is released, a TRAINING_INFO_RELEASED event is issued (see detail in Section 6.7.12).
Applications can track result states in a number of ways. Most often, applications handle results in a ResultListener implementation which receives ResultEvents as recognition proceeds.
As Section 6.7.3 explains, a recognizer conveys a range of information to an application through the stages of producing a recognition result. However, as the example in Section 6.7.1 shows, many applications only care about the last step and event in that process - the RESULT_ACCEPTED event.
The state of a result is also available through the getResultState method of the Result interface. That method returns one of the three result states: UNFINALIZED, ACCEPTED or REJECTED.
6.7.5 ResultListener Attachment
A ResultListener can be attached in one of three places to receive events associated with results: to a Grammar, to a Recognizer or to an individual Result. The different places of attachment give an application some flexibility in how it handles results.
To support ResultListeners the Grammar, Recognizer and Result interfaces all provide the addResultListener and removeResultListener methods.
Depending upon the place of attachment a listener receives events for different results and different subsets of result events.
Grammar: A ResultListener attached to a Grammar receives all ResultEvents for any result that has been finalized to match that grammar. Because the grammar is known once a GRAMMAR_FINALIZED event is produced, a ResultListener attached to a Grammar receives that event and subsequent events. Since grammars are usually defined for specific functionality it is common for most result handling to be done in the methods of listeners attached to each grammar.
Result: A ResultListener attached to a Result receives all ResultEvents starting at the time at which the listener is attached to the Result. Note that because a listener cannot be attached until a result has been created with the RESULT_CREATED event, it can never receive that event.
Recognizer: A ResultListener attached to a Recognizer receives all ResultEvents for all results produced by that Recognizer for all grammars. This form of listener attachment is useful for very simple applications (e.g., "Hello World!") and when centralized processing of results is required. Only ResultListeners attached to a Recognizer receive the RESULT_CREATED event.
6.7.6 Recognizer and Result States
The state system of a recognizer is tied to the processing of a result. Specifically, the LISTENING, PROCESSING and SUSPENDED state cycle described in Section 6.3.3 and shown in Figure 6-1 follows the production of an event.
The transition of a Recognizer from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event indicates that a recognizer has started to produce a result. The RECOGNIZER_PROCESSING event is followed by the RESULT_CREATED event to ResultListeners.
The RESULT_UPDATED and GRAMMAR_FINALIZED events are issued to ResultListeners while the recognizer is in the PROCESSING state.
As soon as the recognizer completes recognition of a result, it makes a transition from the PROCESSING state to the SUSPENDED state with a RECOGNIZER_SUSPENDED event. Immediately following that recognizer event, the result finalization event (either RESULT_ACCEPTED or RESULT_REJECTED) is issued. While the result finalization event is processed, the recognizer remains suspended. Once processing of the result finalization event is completed, the recognizer automatically transitions from the SUSPENDED state back to the LISTENING state with a CHANGES_COMMITTED event. Once back in the LISTENING state the recognizer resumes processing of audio input with the grammar committed with the CHANGES_COMMITTED event.
6.7.6.1 Updating Grammars
In many applications, grammar definitions and grammar activation need to be updated in response to spoken input from a user. For example, if speech is added to a traditional email application, the command "save this message" might result in a window being opened in which a mail folder can be selected. While that window is open, the grammars that control that window need to be activated. Thus during the event processing for the "save this message" command grammars may need to be created, updated and enabled. All this would happen during processing of the RESULT_ACCEPTED event.
For any grammar changes to take effect they must be committed (see Section 6.4.2). Because this form of grammar update is so common while processing the RESULT_ACCEPTED event (and sometimes the RESULT_REJECTED event), recognizers implicitly commit grammar changes after either result finalization event has been processed.
This implicit commit is indicated by the CHANGES_COMMITTED event that is issued when a Recognizer makes a transition from the SUSPENDED state to the LISTENING state following result finalization and the result finalization event processing (see Section 6.3.3 for details).
One desirable effect of this form of commit becomes useful in component systems. If changes in multiple components are triggered by a finalized result event, and if many of those components change grammars, then they do not each need to call the commitChanges method. The downside of multiple calls to the commitChanges method is that a syntax check must be performed upon each. Checking syntax can be computationally expensive and so multiple checks are undesirable. With the implicit commit, performed once all components have updated their grammars, computational costs are reduced.
6.7.7 Grammar Finalization
At any time during processing of a result a GRAMMAR_FINALIZED event can be issued for that result indicating that the Grammar matched by the result has been determined. This event is issued only once. It is required for any ACCEPTED result, but is optional for a result that is eventually rejected.
As Section 6.7.5 describes, the GRAMMAR_FINALIZED event is the first event received by a ResultListener attached to a Grammar.
The GRAMMAR_FINALIZED event behaves the same for results that match either a RuleGrammar or a DictationGrammar.
Following the GRAMMAR_FINALIZED event, the getGrammar method of the Result interface returns a non-null reference to the matched grammar. By issuing a GRAMMAR_FINALIZED event the Recognizer guarantees that the Grammar will not change.
Finally, the GRAMMAR_FINALIZED event does not change the result's state. A GRAMMAR_FINALIZED event is issued only when a result is in the UNFINALIZED state, and leaves the result in that state.
6.7.8 Token Finalization
A result is a dynamic object while it is being recognized. One way in which a result can be dynamic is that tokens are updated and finalized as recognition of speech proceeds. The result events allow a recognizer to inform an application of changes in either or both of the finalized and unfinalized tokens of a result.
The finalized and unfinalized tokens can be updated on any of the following result event types: RESULT_CREATED, RESULT_UPDATED, RESULT_ACCEPTED, RESULT_REJECTED.
Finalized tokens are accessed through the getBestTokens and getBestToken methods of the Result interface. The unfinalized tokens are accessed through the getUnfinalizedTokens method of the Result interface. (See Section 6.7.3 for details.)
A finalized token is a ResultToken in a Result that has been recognized in the incoming speech as matching a grammar. Furthermore, when a recognizer finalizes a token it indicates that it will not change the token at any point in the future. The numTokens method returns the number of finalized tokens.
Many recognizers do not finalize tokens until recognition of an entire result is complete. For these recognizers, the numTokens method returns zero for a result in the UNFINALIZED state.
For recognizers that do finalize tokens while a Result is in the UNFINALIZED state, the following conditions apply:
- The Result object may contain zero or more finalized tokens when the RESULT_CREATED event is issued.
- The recognizer issues RESULT_UPDATED events to the ResultListener during recognition each time one or more tokens are finalized.
- Tokens are finalized strictly in the order in which they are spoken (i.e., left to right in English text).
A result in the UNFINALIZED state may also have unfinalized tokens. An unfinalized token is a token that the recognizer has heard, but which it is not yet ready to finalize. Recognizers are not required to provide unfinalized tokens, and applications can safely choose to ignore unfinalized tokens.
For recognizers that provide unfinalized tokens, the following conditions apply:
- The Result object may contain zero or more unfinalized tokens when the RESULT_CREATED event is issued.
- The recognizer issues RESULT_UPDATED events to the ResultListener during recognition each time the unfinalized tokens change.
- For an unfinalized result, unfinalized tokens may be updated at any time and in any way. Importantly, the number of unfinalized tokens may increase, decrease or return to zero and the values of those tokens may change in any way the recognizer chooses.
Unfinalized tokens are highly changeable, so why are they useful? Many applications can provide users with visual feedback of unfinalized tokens - particularly for dictation results. This feedback informs users of the progress of the recognition and helps the user to know that something is happening. However, because these tokens may change and are more likely than finalized tokens to be incorrect, the applications should visually distinguish the unfinalized tokens by using a different font, different color or even a different window.
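One way to keep the two kinds of tokens visually distinct can be sketched in plain Java. This is a minimal illustration, not part of the API: the class TokenFeedbackSketch and its render method are hypothetical, plain strings stand in for the spoken text of ResultToken objects, and square brackets stand in for whatever font or color change a real user interface would use.

```java
public class TokenFeedbackSketch {
    // Renders finalized tokens as plain text and wraps unfinalized tokens in
    // brackets so that a user can tell stable text from text that may change.
    // A null or empty unfinalized array (as for a finalized result) adds nothing.
    static String render(String[] finalized, String[] unfinalized) {
        StringBuilder sb = new StringBuilder();
        for (String token : finalized) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(token);
        }
        if (unfinalized != null && unfinalized.length > 0) {
            if (sb.length() > 0) sb.append(' ');
            sb.append('[').append(String.join(" ", unfinalized)).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative data only: an intermediate update, then the final text.
        System.out.println(render(new String[] {"I", "come"}, new String[] {"from"}));
        System.out.println(render(new String[] {"I", "come", "from", "Australia"}, null));
    }
}
```

An application would typically call a method like this from its handling of each RESULT_UPDATED event and redraw the whole line, since the unfinalized portion can change completely between events.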
The following is an example of finalized tokens and unfinalized tokens for the sentence "I come from Australia". The lines indicate the token values after the single RESULT_CREATED event, the multiple RESULT_UPDATED events and the final RESULT_ACCEPTED event. The finalized tokens are in bold, the unfinalized tokens are in italics.
RESULT_CREATED: I come
RESULT_UPDATED: I come from
RESULT_UPDATED: I come from
RESULT_UPDATED: I come from a strange land
RESULT_UPDATED: I come from Australia
RESULT_ACCEPTED: I come from Australia
Recognizers can vary in how they support finalized and unfinalized tokens in a number of ways. For an unfinalized result, a recognizer may provide finalized tokens, unfinalized tokens, both or neither. Furthermore, for a recognizer that does support finalized and unfinalized tokens during recognition, the behavior may depend upon the number of active grammars, upon whether the result is for a RuleGrammar or DictationGrammar, upon the length of spoken sentences, and upon other more complex factors. Fortunately, unless there is a functional requirement to display or otherwise process intermediate results, an application can safely ignore all but the RESULT_ACCEPTED event.
6.7.9 Finalized Rule Results
There are some common design patterns for processing accepted finalized results that match a RuleGrammar. First we review what we know about these results.
- It is safe to cast an accepted result that matches a RuleGrammar to the FinalRuleResult interface. It is safe to call any method of the FinalRuleResult interface or its parents: FinalResult and Result.
- The getGrammar method of the Result interface returns a reference to the matched RuleGrammar. The getRuleGrammar method of the FinalRuleResult interface returns references to the RuleGrammars matched by the alternative guesses.
- The getBestToken and getBestTokens methods of the Result interface return the recognizer's best guess of what a user said.
- Result audio (see Section 6.7.11) and training information (see Section 6.7.12) are optionally available.
6.7.9.1 Result Tokens
A ResultToken in a result matching a RuleGrammar contains the same information as the RuleToken object in the RuleGrammar definition. This means that the tokenization of the result follows the tokenization of the grammar definition including compound tokens. For example, consider a grammar with the following Java Speech Grammar Format fragment which contains four tokens:
    <rule> = I went to "San Francisco";
If the user says "I went to San Francisco" then the result will contain the four tokens defined by JSGF: "I", "went", "to", "San Francisco".
The ResultToken interface defines more advanced information. Amongst that information the getStartTime and getEndTime methods may optionally return time-stamp values (or -1 if the recognizer does not provide time-alignment information).
The ResultToken interface also defines several methods for a recognizer to provide presentation hints. Those hints are ignored for RuleGrammar results - they are only used for dictation results (see Section 6.7.10.2).
Furthermore, the getSpokenText and getWrittenText methods will return an identical string which is equal to the string defined in the matched grammar.
6.7.9.2 Alternative Guesses
In a FinalRuleResult, alternative guesses are alternatives for the entire result, that is, for a complete utterance spoken by a user. (A FinalDictationResult can provide alternatives for single tokens or sequences of tokens.) Because more than one RuleGrammar can be active at a time, an alternative token sequence may match a rule in a different RuleGrammar than the best guess tokens, or may match a different rule in the same RuleGrammar as the best guess. Thus, when processing alternatives for a FinalRuleResult, an application should use the getRuleGrammar and getRuleName methods to ensure that it analyzes the alternatives correctly.
Alternatives are numbered from zero up. The 0th alternative is actually the best guess for the result so FinalRuleResult.getAlternativeTokens(0) returns the same array as Result.getBestTokens(). (The duplication is for programming convenience.) Likewise, the FinalRuleResult.getRuleGrammar(0) call will return the same result as Result.getGrammar().
The following code is an implementation of the ResultListener interface that processes the RESULT_ACCEPTED event. The implementation assumes that a Result being processed matches a RuleGrammar.
    import java.io.PrintStream;
    import javax.speech.recognition.*;

    class MyRuleResultListener extends ResultAdapter {
        public void resultAccepted(ResultEvent e) {
            // Assume that the result matches a RuleGrammar.
            // Cast the result (source of event) appropriately
            FinalRuleResult res = (FinalRuleResult) e.getSource();

            // Print out basic result information
            PrintStream out = System.out;
            out.println("Number guesses: " + res.getNumberGuesses());

            // Print out the best result and all alternatives
            for (int n=0; n < res.getNumberGuesses(); n++) {
                // Extract the n-best information
                String gname = res.getRuleGrammar(n).getName();
                String rname = res.getRuleName(n);
                ResultToken[] tokens = res.getAlternativeTokens(n);

                // Print "<grammarName.ruleName>:" followed by the spoken tokens
                out.print("Alt " + n + ": ");
                out.print("<" + gname + "." + rname + ">:");
                for (int t=0; t < tokens.length; t++)
                    out.print(" " + tokens[t].getSpokenText());
                out.println();
            }
        }
    }
For a grammar with commands to control a windowing system (shown below), a result might look like:
    Number guesses: 3
    Alt 0: <com.acme.actions.command>: move the window to the back
    Alt 1: <com.acme.actions.command>: move window to the back
    Alt 2: <com.acme.actions.command>: open window to the front
If more than one grammar or more than one public rule was active, the <grammarName.ruleName> values could vary between the alternatives.
6.7.9.3 Result Tags
Processing commands generated from a RuleGrammar becomes increasingly difficult as the complexity of the grammar rises. With the Java Speech API, speech recognizers provide two mechanisms to simplify the processing of results: tags and parsing.
A tag is a label attached to an entity within a RuleGrammar. The Java Speech Grammar Format and the RuleTag class define how tags can be attached to a grammar. The following is a grammar for very simple control of windows which includes tags attached to the important words in the grammar.
    grammar com.acme.actions;
    public <command> = <action> <object> [<where>];
    <action> = open {ACT_OP} | close {ACT_CL} | move {ACT_MV};
    <object> = [a | an | the] (window {OBJ_WIN} | icon {OBJ_ICON});
    <where> = [to the] (back {WH_BACK} | front {WH_FRONT});
This grammar allows users to speak commands such as
    open window
    move the icon
    move the window to the back
    move window back
The italicized words are the ones that are tagged in the grammar - these are the words that the application cares about. For example, in the third and fourth example commands, the spoken words are different but the tagged words are identical. Tags allow an application to ignore trivial words such as "the" and "to".
The com.acme.actions grammar can be loaded and enabled using the code in the "Hello World!" example. Since the grammar has a single public rule, <command>, the recognizer will listen for speech matching that rule, such as the example results given above.
The tags for the best result are available through the getTags method of the FinalRuleResult interface. This method returns an array of tags associated with the tokens (words) and other grammar entities matched by the result. If the best sequence of tokens is "move the window to the front", the list of tags is the following String array:
    String tags[] = {"ACT_MV", "OBJ_WIN", "WH_FRONT"};
Note how the order of the tags in the result is preserved (forward in time). These tags are easier for most applications to interpret than the original text of what the user said.
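As a rough illustration of why a tag array is easier to interpret than raw text, the sketch below sorts the tags of the com.acme.actions grammar into an action, an object and an optional location. The class TagInterpreterSketch and its describe method are hypothetical, and the sketch assumes that the tag array has already been obtained from a result; it relies only on the ACT_, OBJ_ and WH_ prefixes used in the grammar above.

```java
public class TagInterpreterSketch {
    // Sorts tags into slots by their prefix. Words without tags ("the", "to")
    // never appear in the array, so they need no special handling here.
    static String describe(String[] tags) {
        String action = "none";
        String object = "none";
        String where = "none";   // <where> is optional in the grammar
        for (String tag : tags) {
            if (tag.startsWith("ACT_")) {
                action = tag.substring("ACT_".length());
            } else if (tag.startsWith("OBJ_")) {
                object = tag.substring("OBJ_".length());
            } else if (tag.startsWith("WH_")) {
                where = tag.substring("WH_".length());
            }
        }
        return "action=" + action + " object=" + object + " where=" + where;
    }

    public static void main(String[] args) {
        // Tags for "move the window to the front" and for "open window"
        System.out.println(describe(new String[] {"ACT_MV", "OBJ_WIN", "WH_FRONT"}));
        System.out.println(describe(new String[] {"ACT_OP", "OBJ_WIN"}));
    }
}
```

Both "move the window to the back" and "move window back" produce the same tag array, so a method of this kind treats them identically without inspecting the spoken words.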
Tags can also be used to handle synonyms - multiple ways of saying the same thing. For example, "programmer", "hacker", "application developer" and "computer dude" could all be given the same tag, say "DEV". An application that looks at the "DEV" tag will not care which way the user spoke the title.
Another use of tags is for internationalization of applications. Maintaining applications for multiple languages and locales is easier if the code is insensitive to the language being used. In the same way that the "DEV" tag isolated an application from different ways of saying "programmer", tags can be used to provide an application with similar input irrespective of the language being recognized.
The following is a grammar for French with the same functionality as the grammar for English shown above.
    grammar com.acme.actions.fr;
    public <command> = <action> <object> [<where>];
    <action> = ouvrir {ACT_OP} | fermer {ACT_CL} | deplacer {ACT_MV};
    <object> = fenetre {OBJ_WIN} | icone {OBJ_ICON};
    <where> = au-dessous {WH_BACK} | au-dessus {WH_FRONT};
For this simple grammar, there are only minor differences in the structure of the grammar (e.g. the "[to the]" tokens in the <where> rule for English are absent in French). However, in more complex grammars the syntactic differences between languages become significant and tags provide a clearer improvement.
Tags do not completely solve internationalization problems. One issue to be considered is word ordering. A simple command like "open the window" can translate to the form "the window open" in some languages. More complex sentences can have more complex transformations. Thus, applications need to be aware of word ordering, and thus tag ordering, when developing international applications.
6.7.9.4 Result Parsing
More advanced applications parse results to get even more information than is available with tags. Parsing is the capability to analyze how a sequence of tokens matches a RuleGrammar. Parsing of text against a RuleGrammar is discussed in Section 6.5.5.
Parsing a FinalRuleResult produces a RuleParse object. The getTags method of a RuleParse object provides the same tag information as the getTags method of a FinalRuleResult. However, the FinalRuleResult provides tag information for only the best-guess result, whereas parsing can be applied to the alternative guesses.
An API requirement that simplifies parsing of results that match a RuleGrammar is that for such a result to be ACCEPTED (not rejected) it must exactly match the grammar - technically speaking, it must be possible to parse a FinalRuleResult against the RuleGrammar it matches. This is not guaranteed, however, if the result was rejected or if the RuleGrammar has been modified since it was committed and produced the result.
6.7.10 Finalized Dictation Results
There are some common design patterns for processing accepted finalized results that match a DictationGrammar. First we review what we know about these results.
- It is safe to cast an accepted result that matches a DictationGrammar to the FinalDictationResult interface. It is safe to call any method of the FinalDictationResult interface or its parents: FinalResult and Result.
- The getBestToken and getBestTokens methods of the Result interface return the recognizer's best guess of what a user said.
- The getAlternativeTokens method of the FinalDictationResult interface returns alternative guesses for any token or sequence of tokens.
- Result audio (see Section 6.7.11) and training information (see Section 6.7.12) are optionally available.
The ResultTokens provided in a FinalDictationResult contain specialized information that includes hints on textual presentation of tokens. Section 6.7.10.2 discusses the presentation hints in detail. In this section the methods for obtaining and using alternative tokens are described.
6.7.10.1 Alternative Guesses
Alternative tokens for a dictation result are most often used by an application for display to users for correction of dictated text. A typical scenario is that a user speaks some text - perhaps a few words, a few sentences, a few paragraphs or more. The user reviews the text and detects a recognition error. This means that the best guess token sequence is incorrect. However, very often the correct text is one of the top alternative guesses. Thus, an application will provide a user the ability to review a set of alternative guesses and to select one of them if it is the correct text. Such a correction mechanism is often more efficient than typing the correction or dictating the text again. If the correct text is not amongst the alternatives an application must support other means of entering the text.
The getAlternativeTokens method is passed a starting and an ending ResultToken. These tokens must have been obtained from the same result either through a call to getBestToken or getBestTokens in the Result interface, or through a previous call to getAlternativeTokens.
    ResultToken[][] getAlternativeTokens(
        ResultToken fromToken,
        ResultToken toToken,
        int max);
To obtain alternatives for a single token (rather than alternatives for a sequence), set toToken to null.
The int parameter allows the application to specify the number of alternatives it wants. The recognizer may choose to return any number of alternatives up to the maximum number including just one alternative (the original token sequence). An application can indicate in advance the number of alternatives it may request by setting the NumResultAlternatives parameter through the recognizer's RecognizerProperties object.
The two-dimensional array returned by the getAlternativeTokens method is the most difficult aspect of dictation alternatives to understand. The following example illustrates the major features of the return value.
Let's consider a dictation example where the user says "he felt alienated today" but the recognizer hears "he felt alien ate Ted today". The user says four words but the recognizer hears six words. In this example, the boundaries of the spoken words and best-guess align nicely: "alienated" aligns with "alien ate Ted" (incorrect tokens don't always align smoothly with the correct tokens).
Users are typically better at locating and fixing recognition errors than recognizers or applications - they provided the original speech. In this example, the user will likely identify the words "alien ate Ted" as incorrect (tokens 2 to 4 in the best-guess result). By an application-provided method such as selection by mouse and a pull-down menu, the user will request alternative guesses for the three incorrect tokens. The application calls the getAlternativeTokens method of the FinalDictationResult to obtain the recognizer's guess at the alternatives.
    // Get 6 alternatives for tokens 2 through 4.
    FinalDictationResult r = ...;
    ResultToken tok2 = r.getBestToken(2);
    ResultToken tok4 = r.getBestToken(4);
    // The method returns one ResultToken array per alternative sequence
    ResultToken[][] alt = r.getAlternativeTokens(tok2, tok4, 6);
The return array might look like the following. Each line represents a sequence of alternative tokens to "alien ate Ted". Each word in each alternative sequence represents a ResultToken object in an array.
    alt[0] = alien ate Ted     // the best guess
    alt[1] = alienate Ted      // the 1st alternative
    alt[2] = alienated         // the 2nd alternative
    alt[3] = alien hated       // the 3rd alternative
    alt[4] = a lion ate Ted    // the 4th alternative
- The first alternative is the best guess. This is usually the case if the toToken and fromToken values are from the best-guess sequence. (From a user's perspective it's not really an alternative.)
- Only five alternative sequences were returned even though six were requested. This is because a recognizer will only return alternatives it considers to be reasonable guesses. It is legal for this call to return only the best guess with no alternatives if the recognizer can't find any reasonable alternatives.
- The number of tokens is not the same in all the alternative sequences (3, 2, 1, 2, 4 tokens respectively). This return array is known as a ragged array. From a speech perspective it is easy to see why different lengths are needed, but application developers do need to be careful processing a ragged array.
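The care needed with a ragged array can be sketched as follows. This is an illustrative sketch rather than API code: the class RaggedAlternativesSketch and its toDisplayStrings method are hypothetical, and a String[][] stands in for the ResultToken[][] that getAlternativeTokens returns (a real application would take the text from each token).

```java
public class RaggedAlternativesSketch {
    // Joins the tokens of each alternative into one display string.
    // Every row can have a different length, so the inner loop bound is
    // taken from the row itself and never from the first row.
    static String[] toDisplayStrings(String[][] alt) {
        String[] lines = new String[alt.length];
        for (int i = 0; i < alt.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int t = 0; t < alt[i].length; t++) {
                if (t > 0) sb.append(' ');
                sb.append(alt[i][t]);
            }
            lines[i] = sb.toString();
        }
        return lines;
    }

    public static void main(String[] args) {
        // The alternatives from the example above, as plain strings.
        String[][] alt = {
            {"alien", "ate", "Ted"},
            {"alienate", "Ted"},
            {"alienated"},
            {"alien", "hated"},
            {"a", "lion", "ate", "Ted"}
        };
        String[] lines = toDisplayStrings(alt);
        for (int i = 0; i < lines.length; i++) {
            System.out.println(i + ": " + lines[i] + " (" + alt[i].length + " tokens)");
        }
    }
}
```

The outer loop is also bounded by the length of the returned array rather than by the number of alternatives requested, since fewer may come back.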
A complex issue to understand is that the alternatives vary according to how the application (or user) requests them. The 1st alternative to "alien ate Ted" is "alienate Ted". However, the 1st alternative to "alien" might be "a lion", the 1st alternative to "alien ate" might be "alien eight", and the 1st alternative to "alien ate Ted today" might be "align ate Ted to day".
Fortunately for application developers, users learn to select sequences that are likely to give reasonable alternatives, and recognizers are developed to make the alternatives as useful and accurate as possible.
6.7.10.2 Result Tokens
A ResultToken object represents a single token in a result. A token is most often a single word, but multi-word tokens are possible (e.g., "New York") as well as formatting characters and language-specific constructs. For a DictationGrammar the set of tokens is built into the recognizer.

Each ResultToken in a FinalDictationResult provides the following information.
- The spoken form of the token, which provides a transcript of what the user says (getSpokenText method). In a dictation system, the spoken form is typically used when displaying unfinalized tokens.
- The written form of the token, which indicates how to visually present the token (getWrittenText method). In a dictation system, the written form of finalized tokens is typically placed into the text edit window after applying the following presentation hints.
- A capitalization hint indicating whether the written form of the following token should be capitalized (first letter only), all uppercase, all lowercase, or left as-is (getCapitalizationHint method).
- A spacing hint indicating how the written form should be spaced with the previous and following tokens (getSpacingHint method).
The presentation hints in a ResultToken are important for the processing of dictation results. Dictation results are typically displayed to the user, so using the written form and the capitalization and spacing hints for formatting is important. For example, when dictation is used in word processing, the user will want the printed text to be correctly formatted.

The capitalization hint indicates how the written form of the following token should be formatted. The capitalization hint takes one of four mutually exclusive values. CAP_FIRST indicates that the first character of the following token should be capitalized. The UPPERCASE and LOWERCASE values indicate that the following token should be either all uppercase or all lowercase. CAP_AS_IS indicates that there should be no change in capitalization of the following token.

The spacing hint deals with spacing around a token. It is an int value containing three flags which are or'ed together (using the '|' operator). If none of the three spacing hint flags is set, then the getSpacingHint method returns the value SEPARATE, which is the value zero.
- The ATTACH_PREVIOUS bit is set if the token should be attached to the previous token: no space between this token and the previous token. In English, some punctuation characters have this flag set. For example, periods, commas and colons are typically attached to the previous token.
- The ATTACH_FOLLOWING bit is set if the token should be attached to the following token: no space between this token and the following token. For example, in English, opening quotes, opening parentheses and dollar signs typically attach to the following token.
- The ATTACH_GROUP bit is set if the token should be attached to previous or following tokens if they also have the ATTACH_GROUP flag set. In other words, tokens in an attachment group should be attached together. In English, a common use of the group flag is for numbers, digits and currency amounts. For example, the sequence of four spoken-form tokens, "3" "point" "1" "4", should have the group flag set, so the presentation form should not have separating spaces: "3.14".

Every language has conventions for textual representation of a spoken language. Since recognizers are language-specific and understand many of these presentation conventions, they provide the presentation hints (written form, capitalization hint and spacing hint) to simplify applications. However, applications may choose to override the recognizer's hints or may choose to do additional processing.
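The way these hints combine can be sketched as follows. This is an illustrative model, not the API itself: the Token class, the Cap enum and the numeric flag values are hypothetical stand-ins for the corresponding ResultToken members, and the spacing rule is one reasonable reading of the three flags described above.

```java
import java.util.Locale;

/** Sketch: applying capitalization and spacing hints to written forms. */
final class HintFormatter {
    // Assumed stand-in values; the real constants are defined by ResultToken.
    static final int SEPARATE = 0;
    static final int ATTACH_PREVIOUS = 1;
    static final int ATTACH_FOLLOWING = 2;
    static final int ATTACH_GROUP = 4;

    enum Cap { CAP_AS_IS, CAP_FIRST, UPPERCASE, LOWERCASE }

    /** Hypothetical stand-in for the parts of a result token used here. */
    static final class Token {
        final String written;   // written form
        final Cap capHint;      // applies to the FOLLOWING token
        final int spacingHint;  // or'ed ATTACH_* flags, or SEPARATE

        Token(String written, Cap capHint, int spacingHint) {
            this.written = written;
            this.capHint = capHint;
            this.spacingHint = spacingHint;
        }
    }

    /** Applies the capitalization hint carried by the previous token. */
    static String applyCap(String text, Cap hint) {
        switch (hint) {
            case CAP_FIRST:
                return text.isEmpty() ? text
                        : Character.toUpperCase(text.charAt(0)) + text.substring(1);
            case UPPERCASE:
                return text.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return text.toLowerCase(Locale.ROOT);
            default:
                return text;
        }
    }

    /** Returns true if a space belongs between two adjacent tokens. */
    static boolean needsSpace(Token prev, Token cur) {
        if ((prev.spacingHint & ATTACH_FOLLOWING) != 0) return false;
        if ((cur.spacingHint & ATTACH_PREVIOUS) != 0) return false;
        boolean bothGrouped = (prev.spacingHint & ATTACH_GROUP) != 0
                && (cur.spacingHint & ATTACH_GROUP) != 0;
        return !bothGrouped;
    }

    /** Builds the presentation string for a token sequence. */
    static String format(Token[] tokens) {
        StringBuilder out = new StringBuilder();
        Token prev = null;
        for (Token cur : tokens) {
            Cap cap = (prev == null) ? Cap.CAP_AS_IS : prev.capHint;
            if (prev != null && needsSpace(prev, cur)) out.append(' ');
            out.append(applyCap(cur.written, cap));
            prev = cur;
        }
        return out.toString();
    }
}
```

An application that overrides the recognizer's hints would replace needsSpace or applyCap with its own rules while keeping the same loop.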
Table 6-6 shows examples of tokens in which the spoken and written forms are different:
"New line", "new paragraph", "space bar", "no space" and "capitalize next" are all examples of conversion of an implicit command (e.g. "start a new paragraph"). For three of these, the written form is a single Unicode character. Most programmers are familiar with the new-line character '\n' and space ' ', but fewer are familiar with the Unicode character for new paragraph '\u2029'. For convenience and consistency, the ResultToken includes static variables called NEW_LINE and NEW_PARAGRAPH.

Some applications will treat a paragraph boundary as two new-line characters, others will treat it differently. Each of these commands provides hints for capitalization. For example, in English the first letter of the first word of a new paragraph is typically capitalized.
The punctuation characters, "period", "comma", "open parentheses", "exclamation mark" and the three currency symbols convert to a single Unicode character and have special presentation hints.
An important feature of the written form for most of the examples is that the application does not need to deal with synonyms (multiple ways of saying the same thing). For example, "open parentheses" may also be spoken as "open paren" or "begin paren" but in all cases the same written form is generated.
The following is an example sequence of result tokens.
This sequence of tokens should be converted to the following string:
    "\nThe INDEX is 7-2."

Conversion of spoken text to a written form is a complex task and is complicated by the different conventions of different languages and often by different conventions for the same language. The spoken form, written form and presentation hints of the ResultToken interface handle most simple conversions. Advanced applications should consider filtering the results to process more complex patterns, particularly cross-token patterns. For example "nineteen twenty eight" is typically converted to "1928" and "twenty eight dollars" to "$28" (note the movement of the dollar sign to before the numbers).

6.7.11 Result Audio
If requested by an application, some recognizers can provide audio data for results. Audio data has a number of uses. In dictation applications, providing audio feedback to users aids correction of text because the audio reminds users of what they said (it's not always easy to remember exactly what you dictate, especially in long sessions). Audio data also allows storage for future evaluation and debugging.
Audio data is provided for finalized results through the following methods of the FinalResult interface.

There are two getAudio methods in the FinalResult interface. One method accepts no parameters and returns an AudioClip for an entire result or null if audio data is not available for this result. The other getAudio method takes a start and end ResultToken as input and returns an AudioClip for the segment of the result including the start and end token or null if audio data is not available.

In both forms of the getAudio method, the recognizer will attempt to return the specified audio data. However, it is not always possible to exactly determine the start and end of words or even complete results. Sometimes segments are "clipped" and sometimes surrounding audio is included in the AudioClip.

Not all recognizers provide access to audio for results. For recognizers that do provide audio data, it is not necessarily provided for all results. For example, a recognizer might only provide audio data for dictation results. Thus, applications should always check for a null return value on a getAudio call.

The storage of audio data for results potentially requires large amounts of memory, particularly for long sessions. Thus, result audio requires special management. An application that wishes to use result audio should:
- Set the ResultAudioProvided parameter of RecognizerProperties to true. Recognizers that do not support audio data ignore this call.
- Test the availability of audio for a result using the isAudioAvailable method of the FinalResult interface.
- Use the getAudio methods to obtain audio data. These methods return null if audio data is not available.
- Once the application has finished use of the audio for a Result, it should call the releaseAudio method of FinalResult to free up resources.

A recognizer may choose to release audio data for a result if it is necessary to reclaim memory or other system resources.
When audio is released, either by a call to releaseAudio or by the recognizer, an AUDIO_RELEASED event is issued to the audioReleased method of the ResultListener.

6.7.12 Result Correction
Recognition results are not always correct. Some recognizers can be trained by informing them of the correct tokens for a result - usually when a user corrects a result.
Recognizers are not required to support correction capabilities. If a recognizer does support correction, it does not need to support correction for every result. For example, some recognizers support correction only for dictation results.
Applications are not required to provide recognizers with correction information. However, if the information is available to an application and the recognizer supports correction then it is good practice to inform the recognizer of the correction so that it can improve its future recognition performance.
The FinalResult interface provides the methods that handle correction.
Often, but certainly not always, a correction is triggered when a user corrects a recognizer by selecting amongst the alternative guesses for a result. Other instances when an application is informed of the correct result are when the user types a correction to dictated text, or when a user corrects a misrecognized command with a follow-up command.
Once an application has obtained the correct result text, it should inform the recognizer. The correction information is provided by a call to the tokenCorrection method of the FinalResult interface. This method indicates a correction of one token sequence to another token sequence. Either token sequence may contain one or more tokens. Furthermore, the correct token sequence may contain zero tokens to indicate deletion of tokens.

The tokenCorrection method accepts a correctionType parameter that indicates the reason for the correction. The legal values are defined by constants of the FinalResult interface:

MISRECOGNITION indicates that the new tokens are known to be the tokens actually spoken by the user: a correction of a recognition error. Applications can be confident that a selection of an alternative token sequence implies a MISRECOGNITION correction.
USER_CHANGE indicates that the new tokens are not the tokens originally spoken by the user but instead the user has changed his/her mind. This is a "speako" (a spoken version of a "typo"). A USER_CHANGE may be indicated if a user types over the recognized result, but sometimes the user may choose to type in the correct result.
DONT_KNOW indicates that the application does not know whether the new tokens are correcting a recognition error or indicating a change by the user. Applications should indicate this type of correction whenever unsure of the type of correction.

Why is it useful to tell a recognizer about a USER_CHANGE? Recognizers adapt to both the sounds and the patterns of words of users. A USER_CHANGE correction allows the recognizer to learn about a user's word patterns. A MISRECOGNITION correction allows the recognizer to learn about both the user's voice and the word patterns. In both cases, correcting the recognizer requests it to re-train itself based on the new information.

Training information needs to be managed because it requires substantial memory and possibly other system resources to maintain it for a result. For example, in long dictation sessions, correction data can begin to use excessive amounts of memory.
Recognizers maintain training information only when the recognizer's TrainingProvided parameter is set to true through the RecognizerProperties interface. Recognizers that do not support correction will ignore calls to the setTrainingProvided method.

If the TrainingProvided parameter is set to true, a result may include training information when it is finalized. Once an application believes the training information is no longer required for a specific FinalResult, it should call the releaseTrainingInfo method of FinalResult to indicate the recognizer can release the resources used to store the information.

At any time, the availability of training information for a result can be tested by calling the isTrainingInfoAvailable method.

Recognizers can choose to release training information even without a request to do so by the application. This does not substantially affect an application because performing correction on a result which does not have training information is not an error.

A TRAINING_INFO_RELEASED event is issued to the ResultListener when the training information is released. The event is issued identically whether the application or recognizer initiated the release.

6.7.13 Rejected Results
First, a warning: ignore rejected results unless you really understand them!
Like humans, recognizers don't have perfect hearing and so they make mistakes (recognizers still tend to make more mistakes than people). An application should never completely trust a recognition result. In particular, applications should treat important results carefully, for example, "delete all files".
Recognizers try to determine whether they have made a mistake. This process is known as rejection. But recognizers also make mistakes in rejection! In short, a recognizer cannot always tell whether or not it has made a mistake.
A recognizer may reject incoming speech for a number of reasons:
- Detected speech that only partially matched an active grammar (e.g. user spoke only half a command).
- Speech matched an active grammar but the recognizer was not confident that it was an accurate match.
Rejection is controlled by the ConfidenceLevel parameter of RecognizerProperties (see Section 6.8). The confidence value is a floating point number between 0.0 and 1.0. A value of 0.0 indicates weak rejection - the recognizer doesn't need to be very confident to accept a result. A value of 1.0 indicates strongest rejection, implying that the recognizer will reject a result unless it is very confident that the result is correct. A value of 0.5 is the recognizer's default.

6.7.13.1 Rejection Timing
A result may be rejected with a RESULT_REJECTED event at any time while it is UNFINALIZED: that is, any time after a RESULT_CREATED event but without a RESULT_ACCEPTED event occurring. (For a description of result events see Section 6.7.4.)

This means that the sequence of result events that produces a REJECTED result is:

- While in the UNFINALIZED state, zero or more RESULT_UPDATED events may be issued to update finalized and/or unfinalized tokens. Also, a single optional GRAMMAR_FINALIZED event may be issued to indicate that the matched grammar has been identified.

When a result is rejected, there is a strong probability that the information about a result normally provided through the Result, FinalResult, FinalRuleResult and FinalDictationResult interfaces is inaccurate, or more typically, not available.

Some possibilities that an application must consider:

- The GRAMMAR_FINALIZED event was not issued, so the getGrammar method returns null. In this case, all the methods of the FinalRuleResult and FinalDictationResult interfaces throw exceptions.
- If the result does match a RuleGrammar, there is no guarantee that the tokens can be parsed successfully against the grammar.

Finally, a repeat of the warning. Only use rejected results if you really know what you are doing!
6.7.14 Result Timing
Recognition of speech is not an instant process. There are intrinsic delays between the time the user starts or ends speaking a word or sentence and the time at which the corresponding result event is issued by the speech recognizer.
The most significant delay for most applications is the time between when the user stops speaking and the RESULT_ACCEPTED or RESULT_REJECTED event that indicates the recognizer has finalized the result.

The minimum finalization time is determined by the CompleteTimeout parameter that is set through the RecognizerProperties interface. This time-out indicates the period of silence after speech that the recognizer should process before finalizing a result. If the time-out is too long, the response of the recognizer (and the application) is unnecessarily delayed. If the time-out is too short, the recognizer may inappropriately break up a result (e.g. finalize a result while the user is taking a quick breath). Typical values are less than a second, but not usually less than 0.3 seconds.

There is also an IncompleteTimeout parameter that indicates the period of silence a recognizer should process if the user has said something that only partially matches an active grammar. This time-out indicates how long a recognizer should wait before rejecting an incomplete sentence. This time-out also indicates how long a recognizer should wait mid-sentence if a result could be accepted, but could also be continued and accepted after more words. The IncompleteTimeout is usually longer than the complete time-out.

Latency is the overall delay between a user finishing speaking and a result being produced. There are many factors that can affect latency. Some effects are temporary, others reflect the underlying design of speech recognizers. Factors that can increase latency include:
- Computer power (especially CPU speed and memory): less powerful computers may process speech slower than real-time. Most systems try to catch up while listening to background silence (which is easier to process than real speech).
- Grammar complexity: larger and more complex grammars tend to require more time to process. In most cases, rule grammars are processed more quickly than dictation grammars.
- Suspending: while a recognizer is in the SUSPENDED state, it must buffer incoming audio. When it returns to the LISTENING state it must catch up by processing the buffered audio. The longer the recognizer is suspended, the longer it can take to catch up to real time and the more latency increases.
- Client/server latencies: in client/server architectures, communication of the audio data, results, and other information between the client and server can introduce delays.
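The roles of the two time-outs described earlier in this section can be summarized with a small decision model. This is a deliberately simplified sketch, not a description of how any real recognizer is implemented: the FinalizationModel class, its enums and the decide method are hypothetical names, and the time-outs are expressed in seconds purely for illustration.

```java
/** Sketch: how CompleteTimeout and IncompleteTimeout shape finalization. */
final class FinalizationModel {

    /** How the speech heard so far relates to the active grammars. */
    enum Match {
        COMPLETE,    // matches a grammar and cannot be continued
        EXTENDABLE,  // could be accepted now, or continued with more words
        PARTIAL      // only partially matches (e.g. half a command)
    }

    enum Decision { KEEP_LISTENING, ACCEPT, REJECT }

    /**
     * Decides what to do after silenceSec seconds of trailing silence.
     * Complete matches wait for the (shorter) complete time-out; both
     * extendable and partial matches wait for the incomplete time-out.
     */
    static Decision decide(Match match, float silenceSec,
                           float completeTimeoutSec, float incompleteTimeoutSec) {
        switch (match) {
            case COMPLETE:
                return silenceSec >= completeTimeoutSec
                        ? Decision.ACCEPT : Decision.KEEP_LISTENING;
            case EXTENDABLE:
                return silenceSec >= incompleteTimeoutSec
                        ? Decision.ACCEPT : Decision.KEEP_LISTENING;
            default: // PARTIAL
                return silenceSec >= incompleteTimeoutSec
                        ? Decision.REJECT : Decision.KEEP_LISTENING;
        }
    }
}
```

With a complete time-out of 0.5 seconds and an incomplete time-out of 1.0 second, 0.6 seconds of silence finalizes a complete match but leaves a partial match open, which is why a longer IncompleteTimeout gives users time to pause mid-sentence.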
6.7.15 Storing Results
Result objects can be stored for future processing. This is particularly useful for dictation applications in which the correction information, audio data and alternative token information is required in future sessions on the same document because that stored information can assist document editing.
The Result object is recognizer-specific. This is because each recognizer provides an implementation of the Result interface. The implications are that (a) recognizers do not usually understand each other's results, and (b) a special mechanism is required to store and load result objects (standard Java object serialization is not sufficient).

The Recognizer interface defines the methods writeVendorResult and readVendorResult to perform this function. These methods write to an OutputStream and read from an InputStream respectively. If the correction information and audio data for a result are available, then they will be stored by this call. Applications that do not need to store this extra data should explicitly release it before storing a result.

    {
        Recognizer rec;
        OutputStream stream;
        Result result;
        ...
        try {
            rec.writeVendorResult(stream, result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

A limitation of storing vendor-specific results is that a compatible recognizer must be available to read the file. Applications that need to ensure a file containing a result can be read, even if no recognizer is available, should wrap the result data when storing it to the file. When re-loading the file at a later time, the application will unwrap the result data and provide it to a recognizer only if a suitable recognizer is available. One way to perform the wrapping is to provide the writeVendorResult method with a ByteArrayOutputStream to temporarily place the result in a byte array before storing to a file.
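One possible wrapping scheme is sketched below. It assumes the vendor data has already been captured as a byte array (for example, from the ByteArrayOutputStream handed to writeVendorResult). The ResultEnvelope class, its record layout (engine name, length, raw bytes) and the method names wrap and unwrap are hypothetical choices for illustration, not part of the API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch: wrapping opaque vendor result bytes in a self-describing record. */
final class ResultEnvelope {
    final String engineName;  // which recognizer wrote the data
    final byte[] vendorBytes; // opaque, recognizer-specific result data

    ResultEnvelope(String engineName, byte[] vendorBytes) {
        this.engineName = engineName;
        this.vendorBytes = vendorBytes;
    }

    /** Writes the engine name, a length prefix, then the raw vendor bytes. */
    static byte[] wrap(String engineName, byte[] vendorBytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(engineName);
            out.writeInt(vendorBytes.length);
            out.write(vendorBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads a record back without needing any recognizer to be present. */
    static ResultEnvelope unwrap(byte[] wrapped) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(wrapped))) {
            String name = in.readUTF();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            return new ResultEnvelope(name, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the length prefix delimits the vendor data, an application can skip over a result it cannot load and still read the rest of its document file. When a suitable recognizer is available, the unwrapped bytes can be supplied to readVendorResult through a ByteArrayInputStream.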
6.8 Recognizer Properties
A speech engine has both persistent and run-time adjustable properties. The persistent properties are defined in the RecognizerModeDesc which includes properties inherited from EngineModeDesc (see Section 4.2). The persistent properties are used in the selection and creation of a speech recognizer. Once a recognizer has been created, the same property information is available through the getEngineModeDesc method of a Recognizer (inherited from the Engine interface).

A recognizer also has seven run-time adjustable properties. Applications get and set these properties through RecognizerProperties which extends the EngineProperties interface. The RecognizerProperties for a recognizer are provided by the getEngineProperties method that the Recognizer inherits from the Engine interface. For convenience a getRecognizerProperties method is also provided in the Recognizer interface to return a correctly cast object.

The get and set methods of EngineProperties and RecognizerProperties follow the JavaBeans conventions with the form:

    Type getPropertyName();
    void setPropertyName(Type);

A recognizer can choose to ignore unreasonable values provided to a set method, or can provide upper and lower bounds.
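The get/set pattern and the "ignore unreasonable values" behavior can be illustrated with a toy property holder. This is only a sketch: SketchProperties is a hypothetical stand-in class, not the RecognizerProperties interface, and a real recognizer may bound or reject values differently. The range 0.0 to 1.0 and the default of 0.5 follow the description of the ConfidenceLevel parameter in Section 6.7.13.

```java
/** Sketch: a JavaBeans-style property that ignores unreasonable values. */
final class SketchProperties {
    // Default taken from the ConfidenceLevel description in Section 6.7.13.
    private float confidenceLevel = 0.5f;

    /** Getter in the "Type getPropertyName()" form. */
    float getConfidenceLevel() {
        return confidenceLevel;
    }

    /**
     * Setter in the "void setPropertyName(Type)" form. Values outside
     * 0.0 to 1.0 (including NaN) are silently ignored, so callers should
     * read the property back if they need to know the value in effect.
     */
    void setConfidenceLevel(float value) {
        if (value >= 0.0f && value <= 1.0f) {
            confidenceLevel = value;
        }
    }
}
```

Because a set call may be ignored or adjusted, reading the property back after setting it is the reliable way to learn the value actually in use.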
6.9 Speaker Management
A Recognizer may, optionally, provide a SpeakerManager object. The SpeakerManager allows an application to manage the SpeakerProfiles of that Recognizer. The SpeakerManager is obtained through the getSpeakerManager method of the Recognizer interface. Recognizers that do not maintain speaker profiles - known as speaker-independent recognizers - return null for this method.

A SpeakerProfile object represents a single enrollment to a recognizer. One user may have multiple SpeakerProfiles in a single recognizer, and one recognizer may store the profiles of multiple users.

The SpeakerProfile class is a reference to data stored with the recognizer. A profile is identified by three values all of which are String objects:

id: A unique identifier for a profile (per-recognizer unique). The string may be automatically generated but should be printable.
name: An identifier for a user. This may be an account name or any other name that could be entered by a user.
variant: The variant identifies a particular enrollment of a user and becomes useful when one user has more than one enrollment.

The SpeakerProfile object is a handle to all the stored data the recognizer has about a speaker in a particular enrollment. Except for the three values defined above, the speaker data stored with a profile is internal to the recognizer.

Typical data stored by a recognizer with the profile might include:

- Speaker preferences: Settings such as those provided through the RecognizerProperties (see Section 6.8).

The primary role of stored profiles is in maintaining information that enables a recognizer to adapt to characteristics of the speaker. The goal of this adaptation is to improve the performance of the speech recognizer including both recognition accuracy and speed.
The SpeakerManager provides management of all the profiles stored in the recognizer. Most often, the functionality of the SpeakerManager is used as a direct consequence of user actions, typically by providing an enrollment window to the user. The functionality provided includes:

- Current speaker: The getCurrentSpeaker and setCurrentSpeaker methods determine which speaker profile is currently being used to recognize incoming speech.
- Listing profiles: The listKnownSpeakers method returns an array of all the SpeakerProfiles known to the recognizer. A common procedure is to display that list to a user to allow the user to select a profile.
- Creation and deletion: The newSpeakerProfile and deleteSpeaker methods create a new profile or delete a profile in the recognizer.
- Read and write: The readVendorSpeakerProfile and writeVendorSpeakerProfile methods allow a speaker profile and all the recognizer's associated data to be read from or stored to a Stream. The data format will typically be proprietary.
- Save and revert: During normal operation, a recognizer will maintain and update the speaker profile as new information becomes available. Some of the events that may modify the profile include changing the RecognizerProperties, making a correction to a result, producing any result that allows the recognizer to adapt its models, and many more activities. It is normal to save the updated profile at the end of any session by calling saveCurrentSpeakerProfile. In some cases, however, a user's data may be corrupted (e.g., because they loaned their computer to another user). In this case, the application may be requested by a user to revert the profile to the last stored version by calling revertCurrentSpeaker.
- Display component: The getControlComponent method optionally returns an AWT Component object that can be displayed to a user. If supported, this component should expose the vendor's speaker management capabilities which may be more detailed than those provided by the SpeakerManager interface. The vendor functionality may also be proprietary.

An individual speaker profile may be large (perhaps several MByte) so storing, loading, creating and otherwise manipulating these objects can be slow.
The SpeakerManager is one of the capabilities of a Recognizer that is available in the deallocated state. The purpose is to allow an application to indicate the initial speaker profile to be loaded when the recognizer is allocated. To achieve this, the listKnownSpeakers, getCurrentSpeaker and setCurrentSpeaker methods can be called before calling the allocate method.

To facilitate recognizer selection, the list of speaker profiles is also a property of a recognizer presented through the RecognizerModeDesc class. This allows an application to select a recognizer that has already been trained by a user, if one is available.

In most cases, a Recognizer persistently restores the last used speaker profile when allocating a recognizer, unless asked to do otherwise.
6.10 Recognizer Audio
The current audio functionality of the Java Speech API is incompletely specified. Once a standard mechanism is established for streaming input and output audio on the Java platform the API will be revised to incorporate that functionality.
In this release of the API, the only established audio functionality is provided through the RecognizerAudioListener interface and the RecognizerAudioEvent class. Audio events issued by a recognizer are intended to support simple feedback mechanisms for a user. The three types of RecognizerAudioEvent are as follows:

SPEECH_STARTED and SPEECH_STOPPED: These events are issued when possible speech input is detected in the audio input stream. These events are usually based on a crude mechanism for speech detection so a SPEECH_STARTED event is not always followed by output of a result. Furthermore, one SPEECH_STARTED may be followed by multiple results, and one result might cover multiple SPEECH_STARTED events.
AUDIO_LEVEL: This event is issued periodically to indicate the volume of audio input to the recognizer. The level is a float and varies on a scale from 0.0 to 1.0: silence to maximum volume. The audio level is often displayed visually as a "VU Meter" - the scale on a stereo system that goes up and down with the volume.

All the RecognizerAudioEvents are produced as audio reaches the input to the recognizer. Because recognizers use internal buffers between audio input and the recognition process, the audio events can run ahead of the recognition process.
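A simple text "VU Meter" driven by the audio level might be sketched as follows. The VuMeter class and its render method are hypothetical names for illustration; the level argument is assumed to be the 0.0 to 1.0 value delivered with an AUDIO_LEVEL event, and values outside that range are clamped defensively.

```java
/** Sketch: rendering a 0.0-1.0 audio level as a fixed-width text bar. */
final class VuMeter {

    /**
     * Returns a bar such as "[#####.....]" where the number of '#'
     * characters is proportional to the level. Out-of-range levels are
     * clamped to the 0.0 to 1.0 scale before scaling to the bar width.
     */
    static String render(float level, int width) {
        float clamped = Math.max(0.0f, Math.min(1.0f, level));
        int filled = Math.round(clamped * width);
        StringBuilder bar = new StringBuilder("[");
        for (int i = 0; i < width; i++) {
            bar.append(i < filled ? '#' : '.');
        }
        return bar.append(']').toString();
    }
}
```

For example, a level of 0.5 with a width of 10 renders as a half-filled bar. Since audio events can run ahead of recognition, such a meter reflects what the microphone is hearing now, not what the recognizer has processed.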
JavaTM
Speech API Programmer's Guide
Copyright © 1997-1998
Sun Microsystems, Inc.
All rights reserved
Send comments or corrections to javaspeech-comments@sun.com