Chapter 6
Speech Recognition: javax.speech.recognition
A speech recognizer is a speech engine that converts speech to text. The javax.speech.recognition package defines the Recognizer interface to support speech recognition plus a set of supporting classes and interfaces. The basic functional capabilities of speech recognizers, some of the uses of speech recognition and some of the limitations of speech recognizers are described in Section 2.2.
As a type of speech engine, much of the functionality of a Recognizer is inherited from the Engine interface in the javax.speech package and from other classes and interfaces in that package. The javax.speech package and generic speech engine functionality are described in Chapter 4.
The Java Speech API is designed to keep simple speech applications simple and to make advanced speech applications possible for non-specialist developers. This chapter covers both the simple and advanced capabilities of the javax.speech.recognition package. Where appropriate, some of the more advanced sections are marked so that you can choose to skip them. We begin with a simple code example, and then review the speech recognition capabilities of the API in more detail through the following sections:
- "Hello World!": a simple example of speech recognition
6.1 "Hello World!"
The following example shows a simple application that uses speech recognition. For this application we need to define a grammar of everything the user can say, and we need to write the Java software that performs the recognition task.
A grammar is provided by an application to a speech recognizer to define the words that a user can say, and the patterns in which those words can be spoken. In this example, we define a grammar that allows a user to say "Hello World" or a variant. The grammar is defined using the Java Speech Grammar Format. This format is documented in the Java Speech Grammar Format Specification.
Place this grammar into a file.
    grammar javax.speech.demo;

    public <sentence> = hello world | good morning | hello mighty computer;
This trivial grammar has a single public rule called "sentence". A rule defines what may be spoken by a user. A public rule is one that may be activated for recognition.
The following code shows how to create a recognizer, load the grammar, and then wait for the user to say something that matches the grammar. When it gets a match, it deallocates the engine and exits.
    import javax.speech.*;
    import javax.speech.recognition.*;
    import java.io.FileReader;
    import java.util.Locale;

    public class HelloWorld extends ResultAdapter {
        static Recognizer rec;

        // Receives RESULT_ACCEPTED event: print it, clean up, exit
        public void resultAccepted(ResultEvent e) {
            Result r = (Result)(e.getSource());
            ResultToken tokens[] = r.getBestTokens();

            for (int i = 0; i < tokens.length; i++)
                System.out.print(tokens[i].getSpokenText() + " ");
            System.out.println();

            // Deallocate the recognizer and exit
            rec.deallocate();
            System.exit(0);
        }

        public static void main(String args[]) {
            try {
                // Create a recognizer that supports English.
                rec = Central.createRecognizer(
                    new EngineModeDesc(Locale.ENGLISH));

                // Start up the recognizer
                rec.allocate();

                // Load the grammar from a file, and enable it
                FileReader reader = new FileReader(args[0]);
                RuleGrammar gram = rec.loadJSGF(reader);
                gram.setEnabled(true);

                // Add the listener to get results
                rec.addResultListener(new HelloWorld());

                // Commit the grammar
                rec.commitChanges();

                // Request focus and start listening
                rec.requestFocus();
                rec.resume();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
This example illustrates the basic steps that all speech recognition applications must perform. Let's examine each step in detail.
- Create: The Central class of the javax.speech package is used to obtain a speech recognizer by calling the createRecognizer method. The EngineModeDesc argument provides the information needed to locate an appropriate recognizer. In this example we requested a recognizer that understands English (since the grammar is written for English).
- Load and enable grammars: The loadJSGF method reads in a JSGF document from a reader created for the file that contains the javax.speech.demo grammar. (Alternatively, the loadJSGF method can load a grammar from a URL.) Next, the grammar is enabled. Once the recognizer receives focus (see below), an enabled grammar is activated for recognition: that is, the recognizer compares incoming audio to the active grammars and listens for speech that matches those grammars.
- Attach a ResultListener: The HelloWorld class extends the ResultAdapter class, which is a trivial implementation of the ResultListener interface. An instance of the HelloWorld class is attached to the Recognizer to receive result events. These events indicate progress as the recognition of speech takes place. In this implementation, we process the RESULT_ACCEPTED event, which is provided when the recognizer completes recognition of input speech that matches an active grammar.
- Commit changes: Any changes to grammars and to their enabled status need to be committed to take effect (this includes the creation of a new grammar). The reasons for this are described in Section 6.4.2.
- Request focus and resume: For recognition of the grammar to occur, the recognizer must be in the RESUMED state and must have the speech focus. The requestFocus and resume methods achieve this.
- Process result: Once the main method is completed, the application waits until the user speaks. When the user speaks something that matches the loaded grammar, the recognizer issues a RESULT_ACCEPTED event to the listener we attached to the recognizer. The source of this event is a Result object that contains information about what the recognizer heard. The getBestTokens method returns an array of ResultTokens, each of which represents a single spoken word. These words are printed.
6.2 Recognizer as an Engine
The basic functionality provided by a Recognizer includes grammar management and the production of results when a user says things that match active grammars. The Recognizer interface extends the Engine interface to provide this functionality.
The following is a list of the functionality that the javax.speech.recognition package inherits from the javax.speech package and outlines some of the ways in which that functionality is specialized.
- The properties of a speech engine defined by the EngineModeDesc class apply to recognizers. The RecognizerModeDesc class adds information about dictation capabilities of a recognizer and about users who have trained the engine. Both EngineModeDesc and RecognizerModeDesc are described in Section 4.2.
- Recognizers are searched, selected and created through the Central class in the javax.speech package as described in Section 4.3. That section explains default creation of a recognizer, recognizer selection according to defined properties, and advanced selection and creation mechanisms.
- Recognizers inherit the basic state systems of an engine from the Engine interface, including the four allocation states, the pause and resume state, the state monitoring methods and the state update events. The engine state systems are described in Section 4.4. The two state systems added by recognizers are described in Section 6.3.
- Recognizers produce all the standard engine events (see Section 4.5). The javax.speech.recognition package also extends the EngineListener interface as RecognizerListener to provide events that are specific to recognizers.
- Other functionality inherited from the engine includes the runtime properties (see Section 4.6.1 and Section 6.8), audio management (see Section 4.6.2) and vocabulary management (see Section 4.6.3).
6.3 Recognizer State Systems
6.3.1 Inherited States
As mentioned above, a Recognizer inherits the basic state systems defined in the javax.speech package, particularly through the Engine interface. The basic engine state systems are described in Section 4.4. In this section the two state systems added for recognizers are described. These two state systems represent the status of recognition processing of audio input against grammars, and the recognizer focus.
As a summary, the following state system functionality is inherited from the javax.speech package.
- The basic engine state system represents the current allocation state of the engine: whether resources have been obtained for the engine. The four allocation states are ALLOCATED, DEALLOCATED, ALLOCATING_RESOURCES and DEALLOCATING_RESOURCES.
- The PAUSED and RESUMED states are sub-states of the ALLOCATED state. The paused and resumed states of a recognizer indicate whether audio input is on or off. Pausing a recognizer is analogous to turning off the input microphone: input audio is lost. Section 4.4.7 describes the effect of pausing and resuming a recognizer in more detail.
- The getEngineState method of the Engine interface returns a long value representing the current engine state. The value has a bit set for each of the current states of the recognizer. For example, an ALLOCATED recognizer in the RESUMED state will have both the ALLOCATED and RESUMED bits set.
- The testEngineState and waitEngineState methods are convenience methods for monitoring engine state. The test method tests for presence in a specified state. The wait method blocks until a specific state is reached.
- An EngineEvent is issued to EngineListeners each time an engine changes state. The event class includes the new and old engine states.
The recognizer adds two sub-state systems to the ALLOCATED state, in addition to the inherited pause and resume sub-state system. The two new sub-state systems represent the current activities of the recognizer's internal processing (the LISTENING, PROCESSING and SUSPENDED states) and the current recognizer focus (the FOCUS_ON and FOCUS_OFF states).
These new sub-state systems are parallel states to the PAUSED and RESUMED states and operate nearly independently as shown in Figure 6-1 (an extension of Figure 4-2).
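The bit-per-state encoding summarized above can be illustrated with a small stand-alone model. This is only a sketch: the class EngineStateModel, its transition method and the bit values are hypothetical and are not the constants or classes defined by javax.speech. It assumes that a state test succeeds only when every requested bit is set.

```java
// Hypothetical model of a bit-flag engine state. The bit values below are
// illustrative only; they are NOT the values defined by the javax.speech API.
final class EngineStateModel {
    static final long DEALLOCATED = 1L << 0;
    static final long ALLOCATED   = 1L << 1;
    static final long PAUSED      = 1L << 2;
    static final long RESUMED     = 1L << 3;
    static final long LISTENING   = 1L << 4;
    static final long FOCUS_ON    = 1L << 5;
    static final long FOCUS_OFF   = 1L << 6;

    private long state;

    EngineStateModel(long initialState) {
        this.state = initialState;
    }

    // Returns the combined state: one bit per current state.
    long getEngineState() {
        return state;
    }

    // True only if every bit in 'query' is set in the current state.
    boolean testEngineState(long query) {
        return (state & query) == query;
    }

    // Clears the 'off' bits and sets the 'on' bits, e.g. PAUSED -> RESUMED.
    void transition(long off, long on) {
        state = (state & ~off) | on;
    }
}
```

Because each sub-state system owns its own bits, changing PAUSED to RESUMED in this model leaves the focus and recognition bits untouched, which mirrors the parallel sub-state systems described in the text.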
6.3.2 Recognizer Focus
The FOCUS_ON and FOCUS_OFF states indicate whether this instance of the Recognizer currently has the speech focus. Recognizer focus is a major determining factor in grammar activation, which, in turn, determines what the recognizer is listening for at any time. The role of recognizer focus in activation and deactivation of grammars is described in Section 6.4.3.
A change in engine focus is indicated by a RecognizerEvent (which extends EngineEvent) being issued to RecognizerListeners. A FOCUS_LOST event indicates a change in state from FOCUS_ON to FOCUS_OFF. A FOCUS_GAINED event indicates a change in state from FOCUS_OFF to FOCUS_ON.
When a Recognizer has focus, the FOCUS_ON bit is set in the engine state. When a Recognizer does not have focus, the FOCUS_OFF bit is set. The following code examples monitor engine state:
    Recognizer rec;  // an allocated recognizer, obtained elsewhere

    if (rec.testEngineState(Recognizer.FOCUS_ON)) {
        // we have focus so release it
        rec.releaseFocus();
    }
    // wait until we lose it
    rec.waitEngineState(Recognizer.FOCUS_OFF);
Recognizer focus is relevant to computing environments in which more than one application is using an underlying recognizer. For example, in a desktop environment a user might be running a single speech recognition product (the underlying engine), but have multiple applications using the speech recognizer as a resource. These applications may be a mixture of Java and non-Java applications. Focus is not usually relevant in a telephony environment or in other speech application contexts in which there is only a single application processing the audio input stream.
The recognizer's focus should track the application to which the user is currently talking. When a user indicates that they want to talk to an application (e.g., by selecting the application window, or explicitly saying "switch to application X"), the application requests speech focus by calling the requestFocus method of the Recognizer.
When speech focus is no longer required (e.g., the application has been iconized) the application should call the releaseFocus method to free up focus for other applications.
Both methods are asynchronous (the methods may return before the focus is gained or lost) since a focus change may be deferred. For example, if a recognizer is in the middle of recognizing some speech, it will typically defer the focus change until the result is completed. The focus events and the engine state monitoring methods can be used to determine when focus is actually gained or lost.
The focus policy is determined by the underlying recognition engine; it is not prescribed by the javax.speech.recognition package. In most operating environments it is reasonable to assume a policy in which the last application to request focus gets the focus.
Well-behaved applications adhere to the following convention to maximize recognition performance, to minimize their impact upon other applications and to maintain a satisfactory user interface experience. An application should only request focus when it is confident that the user's speech focus (attention) is directed towards it, and it should release focus when it is not required.
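As a rough illustration of the focus behavior described in this section, the following stand-alone sketch models one engine shared by several applications. Everything in it is an assumption made for illustration: the class FocusArbiter, its methods, the "last requester wins" policy, the choice that a release leaves focus unowned, and the rule that changes made during recognition are applied when the result completes. Real engines define their own policy.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of speech focus on one engine shared by several
// applications. This is not part of the javax.speech API.
final class FocusArbiter {
    private String owner;          // application holding focus, or null
    private boolean recognizing;   // true while a result is in progress
    private final Deque<Runnable> deferred = new ArrayDeque<>();

    // Assumed policy: the last application to request focus gets it.
    void requestFocus(String app) {
        apply(() -> owner = app);
    }

    // A release only has an effect if 'app' currently holds focus.
    void releaseFocus(String app) {
        apply(() -> {
            if (app.equals(owner)) {
                owner = null;
            }
        });
    }

    void resultStarted() {
        recognizing = true;
    }

    // Deferred focus changes take effect, in order, once the result completes.
    void resultCompleted() {
        recognizing = false;
        while (!deferred.isEmpty()) {
            deferred.poll().run();
        }
    }

    boolean hasFocus(String app) {
        return app.equals(owner);
    }

    // Runs the change now, or queues it while a result is being recognized.
    private void apply(Runnable change) {
        if (recognizing) {
            deferred.add(change);
        } else {
            change.run();
        }
    }
}
```

In this model a call to requestFocus made while a result is in progress returns immediately but only takes effect later, which is why an application should rely on focus events or the state monitoring methods rather than on the return of the call.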
6.3.3 Recognition States
The most important (and most complex) state system of a recognizer represents the current recognition activity of the recognizer. An ALLOCATED Recognizer is always in one of the following three states:
- LISTENING state: The Recognizer is listening to incoming audio for speech that may match an active grammar but has not detected speech yet. A recognizer remains in this state while listening to silence and when audio input runs out because the engine is paused.
- PROCESSING state: The Recognizer is processing incoming speech that may match an active grammar. While in this state, the recognizer is producing a result.
- SUSPENDED state: The Recognizer is temporarily suspended while grammars are updated. While suspended, audio input is buffered for processing once the recognizer returns to the LISTENING and PROCESSING states.
This sub-state system is shown in Figure 6-1. The typical state cycle of a recognizer is triggered by user speech. The recognizer starts in the LISTENING state, moves to the PROCESSING state while a user speaks, moves to the SUSPENDED state once recognition of that speech is completed and while grammars are updated in response to user input, and finally returns to the LISTENING state.
In this first event cycle a Result is typically produced that represents what the recognizer heard. Each Result has a state system and the Result state system is closely coupled to this Recognizer state system. The Result state system is discussed in Section 6.7. Many applications (including the "Hello World!" example) do not care about the recognition state but do care about the simpler Result state system.
The other typical event cycle also starts in the LISTENING state. Upon receipt of a non-speech event (e.g., keyboard event, mouse click, timer event) the recognizer is suspended temporarily while grammars are updated in response to the event, and then the recognizer returns to listening.
Applications in which grammars are affected by more than speech events need to be aware of the recognition state system.
The following sections explain these event cycles in more detail and discuss why speech input events are different in some respects from other event types.
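The two event cycles outlined above can be pictured as a small state machine. The sketch below is a stand-alone model, not the javax.speech API: the class RecognitionCycle and its fire method are hypothetical, and only the state and event names are taken from the text. It assumes that any transition not described in this chapter is an error.

```java
// Hypothetical model of the recognition sub-states. State and event names
// mirror the text; the class itself is not part of javax.speech.
final class RecognitionCycle {
    enum State { LISTENING, PROCESSING, SUSPENDED }
    enum Event { RECOGNIZER_PROCESSING, RECOGNIZER_SUSPENDED, CHANGES_COMMITTED }

    private State state = State.LISTENING;

    State state() {
        return state;
    }

    // Applies an event and returns the new state. Transitions that the
    // text does not describe are rejected with an exception.
    State fire(Event event) {
        switch (event) {
            case RECOGNIZER_PROCESSING:
                // Speech detected: LISTENING -> PROCESSING.
                require(state == State.LISTENING, event);
                state = State.PROCESSING;
                break;
            case RECOGNIZER_SUSPENDED:
                // From PROCESSING (speech cycle) or LISTENING (non-speech cycle).
                require(state != State.SUSPENDED, event);
                state = State.SUSPENDED;
                break;
            case CHANGES_COMMITTED:
                // Grammar changes applied: SUSPENDED -> LISTENING.
                require(state == State.SUSPENDED, event);
                state = State.LISTENING;
                break;
        }
        return state;
    }

    private void require(boolean ok, Event event) {
        if (!ok) {
            throw new IllegalStateException(event + " is not valid in state " + state);
        }
    }
}
```

The speech cycle corresponds to firing all three events in order; the non-speech cycle skips RECOGNIZER_PROCESSING and goes from LISTENING straight to SUSPENDED and back.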
6.3.3.1 Speech Events vs. Other Events
Keyboard events, mouse events, timer events and socket events are all instantaneous in time: there is a defined instant at which they occur. The same is not true of speech for two reasons.
Firstly, speech is a temporal activity. Speaking a sentence takes time. For example, a short command such as "reload this web page" takes a second or two to speak, so it is not instantaneous. At the start of the speech the recognizer changes state, and as soon as possible after the end of the speech the recognizer produces a result containing the spoken words.
Secondly, recognizers cannot always recognize words immediately when they are spoken and cannot determine immediately when a user has stopped speaking. The reasons for these technical constraints upon recognition are outside the scope of this guide, but knowing about them is helpful in using a recognizer. (Incidentally, the same principles are generally true of human perception of speech.)
A simple example of why recognizers cannot always respond immediately is listening to a currency amount. If the user says "two dollars" or says "two dollars, fifty cents" with a short pause after the word "dollars", the recognizer can't know immediately whether the user has finished speaking after "dollars". What a recognizer must do is wait a short period, usually less than a second, to see if the user continues speaking. A second is a long time for a computer and complications can arise if the user clicks a mouse or does something else in that waiting period. (Section 6.8 explains the time-out parameters that affect this delay.)
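The waiting rule in the currency example can be sketched as a simple offline calculation. This is only an illustration under stated assumptions: the class UtteranceSegmenter, the Word type and the timeoutMs parameter are hypothetical, word timings are supplied by hand, and a real recognizer makes this decision incrementally on live audio rather than on a finished list of words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical offline illustration of an end-of-speech time-out.
// Not part of the javax.speech API.
final class UtteranceSegmenter {
    // One spoken word with start and end times in milliseconds.
    static final class Word {
        final String text;
        final long startMs;
        final long endMs;

        Word(String text, long startMs, long endMs) {
            this.text = text;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    // Groups words into utterances: a pause longer than timeoutMs between
    // the end of one word and the start of the next ends the utterance.
    static List<String> segment(List<Word> words, long timeoutMs) {
        List<String> utterances = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        long lastEnd = 0;
        for (Word w : words) {
            if (current.length() > 0 && w.startMs - lastEnd > timeoutMs) {
                utterances.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(w.text);
            lastEnd = w.endMs;
        }
        if (current.length() > 0) {
            utterances.add(current.toString());
        }
        return utterances;
    }
}
```

With a 300 ms pause after "dollars", a 500 ms time-out keeps "two dollars fifty cents" together as one utterance, while a 200 ms time-out would finalize "two dollars" before the user continues. The trade-off is that a longer time-out delays every result by that amount.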
A further complication is introduced by the input audio buffering described in Section 6.3.
Putting all this together, there is a requirement for the recognizers to explicitly represent internal state through the LISTENING, PROCESSING and SUSPENDED states.
6.3.3.2 Speech Input Event Cycle
The typical recognition state cycle for a Recognizer occurs as speech input occurs. Technically speaking, this cycle represents the recognition of a single Result. The result state system and result events are described in detail in Section 6.7. The cycle described here is a clockwise trip through the LISTENING, PROCESSING and SUSPENDED states of an ALLOCATED recognizer as shown in Figure 6-1.
The Recognizer starts in the LISTENING state with a certain set of grammars enabled and active. When incoming audio is detected that may match an active grammar, the Recognizer transitions from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event.
The Recognizer then creates a new Result object and issues a RESULT_CREATED event (a ResultEvent) to provide the result to the application. At this point the result is usually empty: it does not contain any recognized words. As recognition proceeds, words are added to the result along with other useful information.
The Recognizer remains in the PROCESSING state until it completes recognition of the result. While in the PROCESSING state the Result may be updated with new information.
The recognizer indicates completion of recognition by issuing a RECOGNIZER_SUSPENDED event to transition from the PROCESSING state to the SUSPENDED state. Once in that state, the recognizer issues a result finalization event to ResultListeners (RESULT_ACCEPTED or RESULT_REJECTED event) to indicate that all information about the result is finalized (words, grammars, audio etc.).
The Recognizer remains in the SUSPENDED state until processing of the result finalization event is completed. Applications will often make grammar changes during the result finalization because the result causes a change in application state or context.
In the SUSPENDED state the Recognizer buffers incoming audio. This buffering allows a user to continue speaking without speech data being lost. Once the Recognizer returns to the LISTENING state the buffered audio is processed to give the user the perception of real-time processing.
Once the result finalization event has been issued to all listeners, the Recognizer automatically commits all grammar changes and issues a CHANGES_COMMITTED event to return to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to GrammarListeners of changed grammars.) The commit applies all grammar changes made at any point up to the end of result finalization, such as changes made in the result finalization events.
The Recognizer is now back in the LISTENING state listening for speech that matches the new grammars.
In this event cycle the first two recognizer state transitions (marked by RECOGNIZER_PROCESSING and RECOGNIZER_SUSPENDED events) are triggered by user actions: starting and stopping speaking. The third state transition (CHANGES_COMMITTED event) is triggered programmatically some time after the RECOGNIZER_SUSPENDED event.
The SUSPENDED state serves as a temporary state in which recognizer configuration can be updated without losing audio data.
6.3.3.3 Non-Speech Event Cycle
For applications that deal only with spoken input the state cycle described above handles most normal speech interactions. For applications that handle other asynchronous input, additional state transitions are possible. Other types of asynchronous input include graphical user interface events (e.g., AWTEvent), timer events, multi-threading events, socket events and so on.
The cycle described here is a temporary transition from the LISTENING state to the SUSPENDED state and back, as shown in Figure 6-1.
When a non-speech event occurs which changes the application state or application data it may be necessary to update the recognizer's grammars. The suspend and commitChanges methods of a Recognizer are used to handle non-speech asynchronous events. The typical cycle for updating grammars in response to a non-speech asynchronous event is as follows.
Assume that the Recognizer is in the LISTENING state (the user is not currently speaking). As soon as the event is received, the application calls suspend to indicate that it is about to change grammars. In response, the recognizer issues a RECOGNIZER_SUSPENDED event and transitions from the LISTENING state to the SUSPENDED state.
With the Recognizer in the SUSPENDED state, the application makes all necessary changes to the grammars. (The grammar changes affected by this event cycle and the pending commit are described in Section 6.4.2.)
Once all grammar changes are completed the application calls the commitChanges method. In response, the recognizer applies the new grammars and issues a CHANGES_COMMITTED event to transition from the SUSPENDED state back to the LISTENING state. (It also issues GRAMMAR_CHANGES_COMMITTED events to GrammarListeners of all changed grammars.)
Finally, the Recognizer resumes recognition of the buffered audio and then live audio with the new grammars.
The suspend and commit process is designed to provide a number of features to application developers which help give users the perception of a responsive recognition system.
Because audio is buffered from the time of the asynchronous event to the time at which the CHANGES_COMMITTED event occurs, the audio is processed as if the new grammars were applied exactly at the time of the asynchronous event. The user has the perception of real-time processing.
Although audio is buffered in the SUSPENDED state, applications should make grammar changes and call commitChanges as quickly as possible. This minimizes the amount of data in the audio buffer and hence the amount of time it takes for the recognizer to "catch up". It also minimizes the possibility of a buffer overrun.
Technically speaking, an application is not required to call suspend prior to calling commitChanges. If the suspend call is omitted the Recognizer behaves as if suspend had been called immediately prior to calling commitChanges. However, an application that does not call suspend risks a commit occurring unexpectedly while it updates grammars, with the effect of leaving grammars in an inconsistent state.
6.3.4 Interactions of State Systems
The three sub-state systems of an allocated recognizer (shown in Figure 6-1) normally operate independently. There are, however, some indirect interactions.
When a recognizer is paused, audio input is stopped. However, recognizers have a buffer between audio input and the internal process that matches audio against grammars, so recognition can continue temporarily after a recognizer is paused. In other words, a PAUSED recognizer may be in the PROCESSING state.
Eventually the audio buffer will empty. If the recognizer is in the PROCESSING state at that time then the result it is working on is immediately finalized and the recognizer transitions to the SUSPENDED state. Since a well-behaved application treats the SUSPENDED state as a temporary state, the recognizer will eventually leave the SUSPENDED state by committing grammar changes and will return to the LISTENING state.
The PAUSED/RESUMED state of an engine is shared by multiple applications, so it is possible for a recognizer to be paused and resumed because of the actions of another application. Thus, an application should always leave its grammars in a state that would be appropriate for a RESUMED recognizer.
The focus state of a recognizer is independent of the PAUSED and RESUMED states. For instance, it is possible for a paused Recognizer to have FOCUS_ON. When the recognizer is resumed, it will have the focus and its grammars will be activated for recognition.
The focus state of a recognizer is very loosely coupled with the recognition state. An application that has no GLOBAL grammars (described in Section 6.4.3) will not receive any recognition results unless it has recognition focus.
6.4 Recognition Grammars
A grammar defines what a recognizer should listen for in incoming speech. Any grammar defines the set of tokens a user can say (a token is typically a single word) and the patterns in which those words are spoken.
The Java Speech API supports two types of grammars: rule grammars and dictation grammars. These grammars differ in how patterns of words are defined. They also differ in their programmatic use: a rule grammar is defined by an application, whereas a dictation grammar is defined by a recognizer and is built into the recognizer.
A rule grammar is provided by an application to a recognizer to define a set of rules that indicates what a user may say. Rules are defined by tokens, by references to other rules and by logical combinations of tokens and rule references. Rule grammars can be defined to capture a wide range of spoken input from users by the progressive combination of simple grammars and rules.
A dictation grammar is built into a recognizer. It defines a set of words (possibly tens of thousands of words) which may be spoken in a relatively unrestricted way. Dictation grammars are closest to the goal of unrestricted natural speech input to computers. Although dictation grammars are more flexible than rule grammars, recognition of rule grammars is typically faster and more accurate.
Support for a dictation grammar is optional for a recognizer. As Section 4.2 explains, an application that requires dictation functionality can request it when creating a recognizer.
A recognizer may have many rule grammars loaded at any time. However, the current Recognizer interface restricts a recognizer to a single dictation grammar. The technical reasons for this restriction are outside the scope of this guide.
6.4.1 Grammar Interface
The Grammar interface is the root interface that is extended by all grammars. The grammar functionality that is shared by all grammars is presented through this interface.
The RuleGrammar interface is an extension of the Grammar interface to support rule grammars. The DictationGrammar interface is an extension of the Grammar interface to support dictation grammars.
The following are the capabilities presented by the grammar interface:
- Grammar naming: Every grammar loaded into a recognizer must have a unique name. The getName method returns that name. Grammar names allow references to be made between grammars. The grammar naming convention is described in the Java Speech Grammar Format Specification. Briefly, the grammar naming convention is very similar to the class naming convention for the Java programming language. For example, a grammar from Acme Corp. for dates might be called "com.acme.speech.dates".
- Enabling and disabling: Grammars may be enabled or disabled using the setEnabled method. When a grammar is enabled and when specified activation conditions are met, the grammar is activated. Once a grammar is active a recognizer will listen to incoming audio for speech that matches that grammar. Enabling and activation are described in more detail below (Section 6.4.3).
- Activation mode: This is the property of a grammar that determines which conditions need to be met for a grammar to be activated. The activation mode is managed through the getActivationMode and setActivationMode methods (described in Section 6.4.3). The three available activation modes are defined as constants of the Grammar interface: RECOGNIZER_FOCUS, RECOGNIZER_MODAL and GLOBAL.
- Activation: The isActive method returns a boolean value that indicates whether a Grammar is currently active for recognition.
- GrammarListener: The addGrammarListener and removeGrammarListener methods allow a GrammarListener to be attached to and removed from a Grammar. The GrammarEvents issued to the listener indicate when grammar changes have been committed and whenever the grammar activation state changes.
- ResultListener: The addResultListener and removeResultListener methods allow a ResultListener to be attached to and removed from a Grammar. This listener receives notification of all events for any result that matches the grammar.
6.4.2 Committing Changes
The Java Speech API supports dynamic grammars; that is, it supports the ability for an application to modify grammars at runtime. In the case of rule grammars any aspect of any grammar can be changed at any time.
After making any change to a grammar through the Grammar, RuleGrammar or DictationGrammar interfaces an application must commit the changes. This applies to changes in definitions of rules in a RuleGrammar, to changing context for a DictationGrammar, to changing the enabled state, or to changing the activation mode. (It does not apply to adding or removing a GrammarListener or ResultListener.)
Changes are committed by calling the commitChanges method of the Recognizer. The commit is required for changes to affect the recognition process: that is, the processing of incoming audio.
The commit changes mechanism has two important properties:
- Updates to grammar definitions and the enabled property take effect atomically (all changes take effect at once). There are no intermediate states in which some, but not all, changes have been applied.
- The commitChanges method is a method of Recognizer so all changes to all grammars are committed at once. Again, there are no intermediate states in which some, but not all, changes have been applied.
There is one instance in which changes are committed without an explicit call to the commitChanges method. Whenever a recognition result is finalized (completed), an event is issued to ResultListeners (it is either a RESULT_ACCEPTED or RESULT_REJECTED event). Once processing of that event is completed changes are normally committed. This supports the common situation in which changes are often made to grammars in response to something a user says.
The event-driven commit is closely linked to the underlying state system of a Recognizer. The state system for recognizers is described in detail in Section 6.3.
6.4.3 Grammar Activation
A grammar is active when the recognizer is matching incoming audio against that grammar to determine whether the user is saying anything that matches that grammar. When a grammar is inactive it is not being used in the recognition process.
Applications do not directly activate and deactivate grammars. Instead they are provided with methods for (1) enabling and disabling a grammar, (2) setting the activation mode for each grammar, and (3) requesting and releasing the speech focus of a recognizer (as described in Section 6.3.2).
The enabled state of a grammar is set with the setEnabled method and tested with the isEnabled method. For programmers familiar with AWT or Swing, enabling a speech grammar is similar to enabling a graphical component.
Once enabled, certain conditions must be met for a grammar to be activated. The activation mode indicates when an application wants the grammar to be active. There are three activation modes: RECOGNIZER_FOCUS, RECOGNIZER_MODAL and GLOBAL. For each mode a certain set of activation conditions must be met for the grammar to be activated for recognition. The activation mode is managed with the setActivationMode and getActivationMode methods.
The enabled flag and the activation mode are both parameters of a grammar that need to be committed to take effect. As Section 6.4.2 described, changes need to be committed to affect the recognition processes.
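The difference between setting a grammar parameter and having it take effect can be sketched with a stand-alone model that keeps a pending value and a committed value for each parameter. The names CommitModel, GrammarSettings, effectiveEnabled and effectiveMode are hypothetical and are not part of javax.speech; the sketch also assumes that a new grammar starts disabled with the RECOGNIZER_FOCUS mode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of pending versus committed grammar settings.
// Not part of the javax.speech API.
final class CommitModel {
    enum Mode { RECOGNIZER_FOCUS, RECOGNIZER_MODAL, GLOBAL }

    static final class GrammarSettings {
        private boolean pendingEnabled;
        private boolean committedEnabled;
        private Mode pendingMode = Mode.RECOGNIZER_FOCUS;
        private Mode committedMode = Mode.RECOGNIZER_FOCUS;

        // Setters only record the requested value.
        void setEnabled(boolean enabled) { pendingEnabled = enabled; }
        void setActivationMode(Mode mode) { pendingMode = mode; }

        // Values the recognition process is currently using.
        boolean effectiveEnabled() { return committedEnabled; }
        Mode effectiveMode() { return committedMode; }
    }

    private final List<GrammarSettings> grammars = new ArrayList<>();

    GrammarSettings newGrammar() {
        GrammarSettings g = new GrammarSettings();
        grammars.add(g);
        return g;
    }

    // Applies every pending change of every grammar in a single step, so
    // the recognition process never sees a partially updated set.
    synchronized void commitChanges() {
        for (GrammarSettings g : grammars) {
            g.committedEnabled = g.pendingEnabled;
            g.committedMode = g.pendingMode;
        }
    }
}
```

Placing the commit on the recognizer-level object rather than on each grammar is what lets changes to several grammars become visible together.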
Recognizer focus is a major determining factor in grammar activation and is relevant in computing environments in which more than one application is using an underlying recognizer (e.g., desktop computing with multiple speech-enabled applications). Section 6.3.2 describes how applications can request and release focus and monitor focus through RecognizerEvents and the engine state methods.
Recognizer focus is used to turn on and off activation of grammars. The role of focus depends upon the activation mode. The three activation modes are described here in order from highest priority to lowest. An application should always use the lowest priority mode that is appropriate to its user interface functionality.
- GLOBAL activation mode: if enabled, the Grammar is always active irrespective of whether the Recognizer of this application has focus.
- RECOGNIZER_MODAL activation mode: if enabled, the Grammar is always active when the application's Recognizer has focus. Furthermore, enabling a modal grammar deactivates any grammars in the same Recognizer with the RECOGNIZER_FOCUS activation mode. (The term "modal" is analogous to "modal dialog boxes" in graphical programming.)
- RECOGNIZER_FOCUS activation mode (default mode): if enabled, the Grammar is active when the Recognizer of this application has focus. The exception is that if any other grammar of this application is enabled with RECOGNIZER_MODAL activation mode, then this grammar is not activated.
The current activation state of a grammar can be tested with the isActive method. Whenever a grammar's activation changes either a GRAMMAR_ACTIVATED or GRAMMAR_DEACTIVATED event is issued to each attached GrammarListener. A grammar activation event typically follows a RecognizerEvent that indicates a change in focus (FOCUS_GAINED or FOCUS_LOST), or a CHANGES_COMMITTED RecognizerEvent that indicates that a change in the enabled setting of a grammar has been applied to the recognition process.
An application may have zero, one or many grammars enabled at any time. Thus, an application may have zero, one or many grammars active at any time. As the conventions below indicate, well-behaved applications always minimize the number of active grammars.
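The activation rules for the three modes can be summarized as a small decision function. The following is a minimal standalone sketch, not part of the Java Speech API: the ActivationSketch class, its Mode enum and the GrammarSettings holder are hypothetical stand-ins for the committed settings of the grammars of one recognizer, and the hasFocus flag stands in for the recognizer's focus state.

```java
import java.util.Arrays;
import java.util.List;

// Standalone model of the activation rules described above.
// These are hypothetical stand-in types, NOT the javax.speech classes.
class ActivationSketch {
    enum Mode { RECOGNIZER_FOCUS, RECOGNIZER_MODAL, GLOBAL }

    // Committed settings of one grammar (assumed already committed).
    static final class GrammarSettings {
        final boolean enabled;
        final Mode mode;

        GrammarSettings(boolean enabled, Mode mode) {
            this.enabled = enabled;
            this.mode = mode;
        }
    }

    // Returns true if grammar g would be active, given all grammars of the
    // same recognizer and whether that recognizer currently has focus.
    static boolean isActive(GrammarSettings g, List<GrammarSettings> all, boolean hasFocus) {
        if (!g.enabled) {
            return false;                 // a disabled grammar is never active
        }
        switch (g.mode) {
            case GLOBAL:
                return true;              // active with or without focus
            case RECOGNIZER_MODAL:
                return hasFocus;          // active whenever the recognizer has focus
            default:                      // RECOGNIZER_FOCUS
                if (!hasFocus) {
                    return false;
                }
                for (GrammarSettings other : all) {
                    if (other.enabled && other.mode == Mode.RECOGNIZER_MODAL) {
                        return false;     // an enabled modal grammar takes precedence
                    }
                }
                return true;
        }
    }

    public static void main(String[] args) {
        GrammarSettings commands = new GrammarSettings(true, Mode.RECOGNIZER_FOCUS);
        GrammarSettings dialog = new GrammarSettings(true, Mode.RECOGNIZER_MODAL);
        GrammarSettings help = new GrammarSettings(true, Mode.GLOBAL);
        List<GrammarSettings> all = Arrays.asList(commands, dialog, help);

        System.out.println("commands active with focus: " + isActive(commands, all, true));
        System.out.println("dialog active with focus:   " + isActive(dialog, all, true));
        System.out.println("help active without focus:  " + isActive(help, all, false));
    }
}
```

In this model the modal grammar suppresses the focus-mode grammar while both are enabled, which mirrors the behavior of modal dialog boxes mentioned above.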
The activation and deactivation of grammars is independent of the PAUSED and RESUMED states of the Recognizer. For instance, a grammar can be active even when a recognizer is PAUSED. However, when a Recognizer is paused, audio input to the Recognizer is turned off, so speech won't be detected. This is useful, however, because when the recognizer is resumed, recognition against the active grammars immediately (and automatically) resumes.
Activating too many grammars and, in particular, activating multiple complex grammars has an adverse impact upon a recognizer's performance. In general terms, increasing the number of active grammars and increasing the complexity of those grammars can both lead to slower recognition response time, greater CPU load and reduced recognition accuracy (i.e., more mistakes).
Well-behaved applications adhere to the following conventions to maximize recognition performance and minimize their impact upon other applications:
- Never apply the GLOBAL activation mode to a DictationGrammar (most recognizers will throw an exception if this is attempted).
- Always use the default activation mode RECOGNIZER_FOCUS unless there is a good reason to use another mode.
- Only use the RECOGNIZER_MODAL mode when it is certain that deactivating the RECOGNIZER_FOCUS grammars will not adversely affect the user interface.
- Minimize the complexity and the number of RuleGrammars with GLOBAL activation mode. As a general rule, one very simple GLOBAL rule grammar should be sufficient for nearly all applications.
- Only enable a grammar when it is appropriate for a user to say something matching that grammar. Otherwise disable the grammar to improve recognition response time and recognition accuracy for other grammars.
- Only request focus when confident that the user's speech focus (attention) is directed to grammars of your application. Release focus when it is not required.
6.5 Rule Grammars
6.5.1 Rule Definitions
A rule grammar is defined by a set of rules. These rules are defined by logical combinations of tokens to be spoken and references to other rules. The references may refer to other rules defined in the same rule grammar or to rules imported from other grammars.
Rule grammars follow the style and conventions of grammars in the Java Speech Grammar Format (defined in the Java Speech Grammar Format Specification). Any grammar defined in the JSGF can be converted to a RuleGrammar object. Any RuleGrammar object can be printed out in JSGF. (Note that conversion from JSGF to a RuleGrammar and back to JSGF will preserve the logic of the grammar but may lose comments and may change formatting.)
Since the RuleGrammar interface extends the Grammar interface, a RuleGrammar inherits the basic grammar functionality described in the previous sections (naming, enabling, activation etc.).
The easiest way to load a RuleGrammar, or set of RuleGrammar objects, is from a Java Speech Grammar Format file or URL. The loadJSGF methods of the Recognizer perform this task. If multiple grammars must be loaded (where a grammar references one or more imported grammars), importing by URL is most convenient. The application must specify the base URL and the name of the root grammar to be loaded.
    Recognizer rec;
    URL base = new URL("http://www.acme.com/app");
    String grammarName = "com.acme.demo";
    // Load the root grammar by base URL and grammar name
    Grammar gram = rec.loadJSGF(base, grammarName);
The recognizer converts the base URL and grammar name to a URL using the same conventions as ClassLoader (the Java platform mechanism for loading class files). By converting the periods in the grammar name to slashes ('/'), appending a ".gram" suffix and combining with the base URL, the location is "http://www.acme.com/app/com/acme/demo.gram".
If the demo grammar imports sub-grammars, they will be loaded automatically using the same location mechanism.
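The name-to-location convention described above can be illustrated with a few lines of string handling. This is only a sketch of the stated convention (periods become slashes, a ".gram" suffix is appended, and the result is combined with the base URL); the GrammarLocationSketch class and its locationFor method are hypothetical names, and a real recognizer may resolve locations differently in detail.

```java
// Standalone illustration of the grammar location convention.
// GrammarLocationSketch and locationFor are hypothetical names, not JSAPI.
class GrammarLocationSketch {

    // Combines a base URL with a grammar name such as "com.acme.demo".
    static String locationFor(String baseUrl, String grammarName) {
        // Make sure exactly one slash separates the base from the path
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        // Periods in the grammar name become path separators
        return base + grammarName.replace('.', '/') + ".gram";
    }

    public static void main(String[] args) {
        // prints http://www.acme.com/app/com/acme/demo.gram
        System.out.println(locationFor("http://www.acme.com/app", "com.acme.demo"));
    }
}
```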
Alternatively, a RuleGrammar can be created by calling the newRuleGrammar method of a Recognizer. This method creates an empty grammar with a specified grammar name.
Once a RuleGrammar has been loaded, or has been created with the newRuleGrammar method, the following methods of a RuleGrammar are used to create, modify and manage the rules of the grammar.
Any of the methods of RuleGrammar that affect the grammar (setRule, deleteRule, setEnabled etc.) take effect only after they are committed (as described in Section 6.4.2).
The rule definitions of a RuleGrammar can be considered as a collection of named Rule objects. Each Rule object is referenced by its rulename (a String). The different types of Rule object are described in Section 6.5.3.
Unlike most collections in Java, the RuleGrammar is a collection that does not share objects with the application. This is because recognizers often need to perform special processing of the rule objects and store additional information internally. The implication for applications is that a call to setRule is required to change any rule. The following code shows an example where changing a rule object does not affect the grammar.
    RuleGrammar gram;
    // Create a rule for the word blue
    // Add the rule to the RuleGrammar and make it public
    RuleToken word = new RuleToken("blue");
    gram.setRule("ruleName", word, true);
    // Change the word
    word.setText("green");
    // getRule returns blue (not green)
    System.out.println(gram.getRule("ruleName"));
To ensure that the changed "green" token is loaded into the grammar, the application must call setRule again after changing the word to "green". Furthermore, for either change to take effect in the recognition process, the changes need to be committed (see Section 6.4.2).
6.5.2 Imports
Complex systems of rules are most easily built by dividing the rules into multiple grammars. For example, a grammar could be developed for recognizing numbers. That grammar could then be imported into two separate grammars that define dates and currency amounts. Those two grammars could then be imported into a travel booking application and so on. This type of hierarchical grammar construction is similar in many respects to object-oriented programming and shares the advantage of easy reuse of grammars.
An import declaration in JSGF and an import in a RuleGrammar are most similar to the import statement of the Java programming language. Unlike a "#include" in the C programming language, the imported grammar is not copied, it is simply referenceable. (A full specification of import semantics is provided in the Java Speech Grammar Format specification.)
The RuleGrammar interface defines three methods for handling imports as shown in Table 6-2.
The resolve method of the RuleGrammar interface is useful in managing imports. Given any rulename, the resolve method returns an object that represents the fully-qualified rulename for the rule that it references.
6.5.3 Rule Classes
A RuleGrammar is primarily a collection of defined rules. The programmatic rule structure used to control Recognizers follows exactly the definition of rules in the Java Speech Grammar Format. Any rule is defined by a Rule object. It may be any one of the Rule classes described in Table 6-3. The exceptions are the RuleParse class, which is returned by the parse method of RuleGrammar, and the Rule class, which is an abstract class and the parent of all other Rule objects.
The following is an example of a grammar in Java Speech Grammar Format. The "Hello World!" example shows how this JSGF grammar can be loaded from a text file. Below we consider how to create the same grammar programmatically.
    grammar com.sun.speech.test;
    public <test> = [a] test {TAG} | another <rule>;
    <rule> = word;
The following code shows the simplest way to create this grammar. It uses the ruleForJSGF method to convert partial JSGF text to a Rule object. Partial JSGF is defined as any legal JSGF text that may appear on the right hand side of a rule definition - technically speaking, any legal JSGF rule expansion.
    Recognizer rec;
    // Create a new grammar
    RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");
    // Create the <test> rule
    Rule test = gram.ruleForJSGF("[a] test {TAG} | another <rule>");
    gram.setRule("test",  // rulename
                 test,    // rule definition
                 true);   // true -> make it public
    // Create the <rule> rule
    gram.setRule("rule", gram.ruleForJSGF("word"), false);
    // Commit the grammar
    rec.commitChanges();
6.5.3.1 Advanced Rule Programming
In advanced programs there is often a need to define rules using the set of Rule objects described above. For these applications, using rule objects is more efficient than creating a JSGF string and using the ruleForJSGF method.
To create a rule by code, the detailed structure of the rule needs to be understood. At the top level of our example grammar, the <test> rule is an alternative: the user may say something that matches "[a] test {TAG}" or say something matching "another <rule>". The two alternatives are each sequences containing two items. In the first alternative, the brackets around the token "a" indicate it is optional. The "{TAG}" following the second token ("test") attaches a tag to the token. The second alternative is a sequence with a token ("another") and a reference to another rule ("<rule>").
The code to construct this Grammar follows (this code example is not compact - it is written for clarity of details).
    Recognizer rec;
    RuleGrammar gram = rec.newRuleGrammar("com.sun.speech.test");
    // Rule we are building
    RuleAlternatives test;
    // Temporary rules
    RuleCount r1;
    RuleTag r2;
    RuleSequence seq1, seq2;
    // Create "[a]"
    r1 = new RuleCount(new RuleToken("a"), RuleCount.OPTIONAL);
    // Create "test {TAG}" - a tagged token
    r2 = new RuleTag(new RuleToken("test"), "TAG");
    // Join "[a]" and "test {TAG}" into a sequence "[a] test {TAG}"
    seq1 = new RuleSequence(r1);
    seq1.append(r2);
    // Create the sequence "another <rule>";
    seq2 = new RuleSequence(new RuleToken("another"));
    seq2.append(new RuleName("rule"));
    // Build "[a] test {TAG} | another <rule>"
    test = new RuleAlternatives(seq1);
    test.append(seq2);
    // Add <test> to the RuleGrammar as a public rule
    gram.setRule("test", test, true);
    // Provide the definition of <rule>, a non-public RuleToken
    gram.setRule("rule", new RuleToken("word"), false);
    // Commit the grammar changes
    rec.commitChanges();
6.5.4 Dynamic Grammars
Grammars may be modified and updated. The changes allow an application to account for shifts in the application's context, changes in the data available to it, and so on. This flexibility allows application developers considerable freedom in creating dynamic and natural speech interfaces.
For example, in an email application the list of known users may change during the normal operation of the program. The <sendEmail> command,
    <sendEmail> = send email to <user>;
references the <user> rule which may need to be changed as new email arrives. This code snippet shows the update and commit of a change in users.
    Recognizer rec;
    RuleGrammar gram;
    String names[] = {"amy", "alan", "paul"};
    Rule userRule = new RuleAlternatives(names);
    // setRule takes a third argument for the public flag; false is
    // assumed here because <user> is only referenced by <sendEmail>
    gram.setRule("user", userRule, false);
    // apply the changes
    rec.commitChanges();
Committing grammar changes can, in certain cases, be a slow process. It might take a few tenths of a second or up to several seconds. The time to commit changes depends on a number of factors. First, recognizers have different mechanisms for committing changes, making some recognizers faster than others. Second, the time to commit changes may depend on the extent of the changes - more changes may require more time to commit. Third, the time to commit may depend upon the type of changes. For example, some recognizers optimize for changes to lists of tokens (e.g. name lists). Finally, faster computers make changes more quickly.
The other factor which influences dynamic changes is the timing of the commit. As Section 6.4.2 describes, grammar changes are not always committed instantaneously. For example, if the recognizer is busy recognizing speech (in the PROCESSING state), then the commit of changes is deferred until the recognition of that speech is completed.
6.5.5 Parsing
Parsing is the process of matching text to a grammar. Applications use parsing to break down spoken input into a form that is more easily handled in software. Parsing is most useful when the structure of the grammars clearly separates the parts of spoken text that an application needs to process. Examples are given below of this type of structuring.
The text may be in the form of a String or array of String objects (one String per token), or in the form of a FinalRuleResult object that represents what a recognizer heard a user say. The RuleGrammar interface defines three forms of the parse method - one for each form of text.
The parse method returns a RuleParse object (a descendant of Rule) that represents how the text matches the RuleGrammar. The structure of the RuleParse object mirrors the structure of rules defined in the RuleGrammar. Each Rule object in the structure of the rule being parsed against is mirrored by a matching Rule object in the returned RuleParse object.
The difference between the structures comes about because the text being parsed defines a single phrase that a user has spoken whereas a RuleGrammar defines all the phrases the user could say. Thus the text defines a single path through the grammar and all the choices in the grammar (alternatives, and rules that occur optionally or occur zero or more times) are resolvable.
The mapping between the objects in the rules defined in the RuleGrammar and the objects in the RuleParse structure is shown in Table 6-4. Note that except for the RuleCount and RuleName objects, the objects in the parse tree are of the same type as the rule objects being parsed against (marked with "**"), but the internal data may differ.
As an example, take the following simple extract from a grammar. The public rule, <command>, may be spoken in many ways. For example, "open", "move that door" or "close that door please".
    public <command> = <action> [<object>] [<polite>];
    <action> = open {OP} | close {CL} | move {MV};
    <object> = [<this_that_etc>] window | door;
    <this_that_etc> = a | the | this | that | the current;
    <polite> = please | kindly;
Note how the rules are defined to clearly separate the segments of spoken input that an application must process. Specifically, the <action> and <object> rules indicate how an application must respond to a command. Furthermore, anything said that matches the <polite> rule can be safely ignored, and usually the <this_that_etc> rule can be ignored too.
The parse for "open" against <command> has the following structure which matches the structure of the grammar above.
    RuleParse(<command> =
      RuleSequence(
        RuleParse(<action> =
          RuleAlternatives(
            RuleTag(
              RuleToken("open"), "OP")))))
The match of the <command> rule is represented by a RuleParse object. Because the definition of <command> is a sequence of 3 items (2 of which are optional), the parse of <command> is a sequence. Because only one of the 3 items is spoken (in "open"), the sequence contains a single item. That item is the parse of the <action> rule.
The reference to <action> in the definition of <command> is represented by a RuleName object in the grammar definition, and this maps to a RuleParse object when parsed. The <action> rule is defined by a set of three alternatives (RuleAlternatives object) which maps to another RuleAlternatives object in the parse but with only the single spoken alternative represented. Since the phrase spoken was "open", the parse matches the first of the three alternatives which is a tagged token. Therefore the parse includes a RuleTag object which contains a RuleToken object for "open".
The following is the parse for "close that door please".
    RuleParse(<command> =
      RuleSequence(
        RuleParse(<action> =
          RuleAlternatives(
            RuleTag(
              RuleToken("close"), "CL")))
        RuleSequence(
          RuleParse(<object> =
            RuleSequence(
              RuleSequence(
                RuleParse(<this_that_etc> =
                  RuleAlternatives(
                    RuleToken("that"))))
              RuleAlternatives(
                RuleToken("door"))))
        RuleSequence(
          RuleParse(<polite> =
            RuleAlternatives(
              RuleToken("please"))))
      ))
There are three parsing issues that application developers should consider.
- There may be several legal ways to parse the text against the grammar. This is known as an ambiguous parse. In this instance the parse method will return one of the legal parses but the application is not informed of the ambiguity. As a general rule, most developers will want to avoid ambiguous parses by proper grammar design. Advanced applications will use specialized parsers if they need to handle ambiguity.
- If a FinalRuleResult is parsed against the RuleGrammar and the rule within that grammar that it matched, then it should successfully parse. However, it is not guaranteed to parse if the RuleGrammar has been modified or if the FinalRuleResult is a REJECTED result. (Result rejection is described in Section 6.7.)
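To illustrate why a well-structured parse is convenient for applications, the following standalone sketch walks a simplified parse tree for "close that door please" and collects its tags and spoken words. The ParseWalkSketch class and its Token, Tagged and Group node types are hypothetical stand-ins that only imitate the nesting of the RuleParse structure shown above; they are not the javax.speech.recognition rule classes, and the tree is built by hand rather than returned by a parse method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone model of walking a parse tree. The node classes are
// hypothetical stand-ins, NOT the JSAPI Rule classes.
class ParseWalkSketch {
    static abstract class Node { }

    // A spoken word
    static final class Token extends Node {
        final String text;
        Token(String text) { this.text = text; }
    }

    // A node with an attached tag, as in "close {CL}"
    static final class Tagged extends Node {
        final Node child;
        final String tag;
        Tagged(Node child, String tag) { this.child = child; this.tag = tag; }
    }

    // Any container node (sequence, alternatives or rule reference)
    static final class Group extends Node {
        final List<Node> children;
        Group(Node... children) { this.children = Arrays.asList(children); }
    }

    // Collects tags in the order in which they were spoken.
    static List<String> tags(Node root) {
        List<String> out = new ArrayList<>();
        walk(root, out, true);
        return out;
    }

    // Collects the spoken words in order.
    static List<String> words(Node root) {
        List<String> out = new ArrayList<>();
        walk(root, out, false);
        return out;
    }

    private static void walk(Node node, List<String> out, boolean wantTags) {
        if (node instanceof Token) {
            if (!wantTags) out.add(((Token) node).text);
        } else if (node instanceof Tagged) {
            Tagged tagged = (Tagged) node;
            walk(tagged.child, out, wantTags);
            if (wantTags) out.add(tagged.tag);
        } else if (node instanceof Group) {
            for (Node child : ((Group) node).children) {
                walk(child, out, wantTags);
            }
        }
    }

    // Hand-built tree imitating the parse of "close that door please".
    static Node closeThatDoorPlease() {
        return new Group(
            new Group(new Tagged(new Token("close"), "CL")),      // <action>
            new Group(new Group(new Token("that")),               // <this_that_etc>
                      new Token("door")),                         // <object>
            new Group(new Token("please")));                      // <polite>
    }

    public static void main(String[] args) {
        Node parse = closeThatDoorPlease();
        System.out.println("tags:  " + tags(parse));
        System.out.println("words: " + words(parse));
    }
}
```

In this model an application that only cares about the action can act on the single tag and ignore the words contributed by the politeness and determiner rules.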
6.6 Dictation Grammars
Dictation grammars come closest to the ultimate goal of a speech recognition system that takes natural spoken input and transcribes it as text. Dictation grammars are used for free text entry in applications such as email and word processing.
A Recognizer that supports dictation provides a single DictationGrammar which is obtained from the recognizer's getDictationGrammar method. A recognizer that supports the Java Speech API is not required to provide a DictationGrammar. Applications that require a recognizer with dictation capability can explicitly request dictation when creating a recognizer by setting the DictationGrammarSupported property of the RecognizerModeDesc to true (see Section 4.2 for details).
A DictationGrammar is more complex than a rule grammar, but fortunately, a DictationGrammar is often easier to use than a rule grammar. This is because the DictationGrammar is built into the recognizer so most of the complexity is handled by the recognizer and hidden from the application. However, recognition of a dictation grammar is typically more computationally expensive and less accurate than that of simple rule grammars.
The DictationGrammar inherits its basic functionality from the Grammar interface. That functionality is detailed in Section 6.4 and includes grammar naming, enabling, activation, committing and so on.
As with all grammars, changes to a DictationGrammar need to be committed before they take effect. Commits are described in Section 6.4.2.
In addition to the specific functionality described below, a DictationGrammar is typically adaptive. In an adaptive system, a recognizer improves its performance (accuracy and possibly speed) by adapting to the style of language used by a speaker. The recognizer may adapt to the specific sounds of a speaker (the way they say words). Equally importantly for dictation, a recognizer can adapt to a user's normal vocabulary and to the patterns of those words. Such adaptation (technically known as language model adaptation) is a part of the recognizer's implementation of the DictationGrammar and does not affect an application. The adaptation data for a dictation grammar is maintained as part of a speaker profile (see Section 6.9).
The DictationGrammar extends and specializes the Grammar interface by adding the following functionality:
The following methods provided by the DictationGrammar interface allow an application to manage word lists and text context.
6.6.1 Dictation Context
Dictation recognizers use a range of information to improve recognition accuracy. Learning the words a user speaks and the patterns of those words can substantially improve accuracy.
Because patterns of words are important, context is important. The context of a word is simply the set of surrounding words. As an example, consider the following sentence "If I have seen further it is by standing on the shoulders of Giants" (Sir Isaac Newton). If we are editing this sentence and place the cursor after the word "standing" then the preceding context is "...further it is by standing" and the following context is "on the shoulders of Giants...".
Given this context, the recognizer is able to more reliably predict what a user might say, and greater predictability can improve recognition accuracy. In this example, the user might insert the word "up" but is less likely to insert the word "JavaBeans".
Through the setContext method of the DictationGrammar interface, an application should tell the recognizer the current textual context. Furthermore, if the context changes (for example, due to a mouse click to move the cursor) the application should update the context.
Different recognizers process context differently. The main consideration for the application is the amount of context to provide to the recognizer. As a minimum, a few words of preceding and following context should be provided. However, some recognizers may take advantage of several paragraphs or more.
There are two setContext methods:
    void setContext(String preceding, String following);
    void setContext(String preceding[], String following[]);
The first form takes plain text context strings. The second version should be used when the result tokens returned by the recognizer are available. Internally, the recognizer processes context according to tokens so providing tokens makes the use of context more efficient and more reliable because it does not have to guess the tokenization.
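As a rough illustration of preparing context, the sketch below splits the text of an editor at the cursor position and keeps a limited number of words on each side. It is a standalone example: the DictationContextSketch class, the contextAt method and the maxWords limit are hypothetical, and the code only computes the two strings that an application might then pass to the first form of setContext. How much context is worth supplying depends on the recognizer.

```java
import java.util.Arrays;

// Standalone helper that derives preceding and following context
// from editor text. Hypothetical names; this does not call JSAPI.
class DictationContextSketch {

    // Returns { preceding, following }, each limited to maxWords words.
    static String[] contextAt(String text, int cursor, int maxWords) {
        String[] before = words(text.substring(0, cursor));
        String[] after = words(text.substring(cursor));

        // Keep the last maxWords words before the cursor
        String[] preceding = Arrays.copyOfRange(
            before, Math.max(0, before.length - maxWords), before.length);
        // Keep the first maxWords words after the cursor
        String[] following = Arrays.copyOfRange(
            after, 0, Math.min(maxWords, after.length));

        return new String[] {
            String.join(" ", preceding),
            String.join(" ", following)
        };
    }

    // Splits on whitespace; an empty or blank string yields no words.
    private static String[] words(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sentence =
            "If I have seen further it is by standing on the shoulders of Giants";
        int cursor = sentence.indexOf("standing") + "standing".length();

        String[] context = contextAt(sentence, cursor, 5);
        System.out.println("preceding: " + context[0]);
        System.out.println("following: " + context[1]);
    }
}
```

With the cursor placed after "standing" and a limit of five words, the two strings correspond to the preceding and following context of the Newton example above.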
6.7 Recognition Results
A recognition result is provided by a Recognizer to an application when the recognizer "hears" incoming speech that matches an active grammar. The result tells the application what words the user said and provides a range of other useful information, including alternative guesses and audio data.
In this section, both the basic and advanced capabilities of the result system in the Java Speech API are described. The sections relevant to basic rule grammar-based applications are those that cover result finalization (Section 6.7.1), the hierarchy of result interfaces (Section 6.7.2), the data provided through those interfaces (Section 6.7.3), and common techniques for handling finalized rule results (Section 6.7.9).
For dictation applications the relevant sections include those listed above plus the sections covering token finalization (Section 6.7.8), handling of finalized dictation results (Section 6.7.10) and result correction and training (Section 6.7.12).
For more advanced applications relevant sections might include the result life cycle (Section 6.7.4), attachment of ResultListeners (Section 6.7.5), the relationship of recognizer and result states (Section 6.7.6), grammar finalization (Section 6.7.7), result audio (Section 6.7.11), rejected results (Section 6.7.13), result timing (Section 6.7.14), and the loading and storing of vendor formatted results (Section 6.7.15).
6.7.1 Result Finalization
The "Hello World!" example illustrates the simplest way to handle results. In that example, a RuleGrammar was loaded, committed and enabled, and a ResultListener was attached to a Recognizer to receive events associated with every result that matched that grammar. In other words, the ResultListener was attached to receive information about words spoken by a user that are heard by the recognizer.
The following is a modified extract of the "Hello World!" example to illustrate the basics of handling results. In this case, a ResultListener is attached to a Grammar (instead of a Recognizer) and it prints out everything the recognizer hears that matches that grammar. (There are, in fact, three ways in which a ResultListener can be attached: see Section 6.7.5.)
    import javax.speech.*;
    import javax.speech.recognition.*;

    public class MyResultListener extends ResultAdapter {
        // Receives RESULT_ACCEPTED event: print it
        public void resultAccepted(ResultEvent e) {
            Result r = (Result)(e.getSource());
            ResultToken tokens[] = r.getBestTokens();
            for (int i = 0; i < tokens.length; i++)
                System.out.print(tokens[i].getSpokenText() + " ");
            System.out.println();
        }

        // somewhere in app, add a ResultListener to a grammar
        {
            RuleGrammar gram = ...;
            gram.addResultListener(new MyResultListener());
        }
    }
The code shows the MyResultListener class which is an extension of the ResultAdapter class. The ResultAdapter class is a convenience implementation of the ResultListener interface (provided in the javax.speech.recognition package). When extending the ResultAdapter class we simply implement the methods for the events that we care about.
In this case, the RESULT_ACCEPTED event is handled. This event is issued to the resultAccepted method of the ResultListener and is issued when a result is finalized. Finalization of a result occurs after a recognizer has completed processing of a result. More specifically, finalization occurs when all information about a result has been produced by the recognizer and when the recognizer can guarantee that the information will not change. (Result finalization should not be confused with object finalization in the Java programming language in which objects are cleaned up before garbage collection.)
There are actually two ways to finalize a result which are signalled by the RESULT_ACCEPTED and RESULT_REJECTED events. A result is accepted when a recognizer is confident that it has correctly heard the words spoken by a user (i.e., the tokens in the Result exactly represent what a user said).
Rejection occurs when a Recognizer is not confident that it has correctly recognized a result: that is, the tokens and other information in the result do not necessarily match what a user said. Many applications will ignore the RESULT_REJECTED event and most will ignore the detail of a result when it is rejected. In some applications, a RESULT_REJECTED event is used simply to provide users with feedback that something was heard but no action was taken, for example, by displaying "???" or sounding an error beep. Rejected results and the differences between accepted and rejected results are described in more detail in Section 6.7.13.
An accepted result is not necessarily a correct result. As is pointed out in Section 2.2.3, recognizers make errors when recognizing speech for a range of reasons. The implication is that even for an accepted result, application developers should consider the potential impact of a misrecognition. Where a misrecognition could cause an action with serious consequences or could make changes that can't be undone (e.g., "delete all files"), the application should check with users before performing the action. As recognition systems continue to improve the number of errors is steadily decreasing, but as with human speech recognition there will always be a chance of a misunderstanding.
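The handling policy described above (give feedback on rejection, confirm before irreversible actions, otherwise act) can be sketched as a small decision function. This is a standalone illustration only: the ResultPolicySketch class, its Action enum and the list of destructive phrases are hypothetical, and the accepted flag stands in for whether the application received a RESULT_ACCEPTED or a RESULT_REJECTED event.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Standalone model of an application's response to a finalized result.
// Hypothetical names; this does not use the JSAPI event classes.
class ResultPolicySketch {
    enum Action { EXECUTE, CONFIRM_FIRST, SIGNAL_NOT_UNDERSTOOD }

    // Phrases whose effect cannot be undone (an assumed, application-specific list)
    private static final Set<String> DESTRUCTIVE =
        new HashSet<>(Arrays.asList("delete all files"));

    static Action decide(boolean accepted, String spokenText) {
        if (!accepted) {
            // Rejected: the tokens may not match what was said, so take no action
            return Action.SIGNAL_NOT_UNDERSTOOD;
        }
        String phrase = spokenText.trim().toLowerCase(Locale.ROOT);
        // Accepted results can still be wrong, so confirm irreversible actions
        return DESTRUCTIVE.contains(phrase) ? Action.CONFIRM_FIRST : Action.EXECUTE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, "hello world"));
        System.out.println(decide(true, "delete all files"));
        System.out.println(decide(false, "delete all files"));
    }
}
```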
6.7.2 Result Interface Hierarchy
A finalized result can include a considerable amount of information. This information is provided through four separate interfaces and through the implementation of these interfaces by a recognition system.
    // Result: the root result interface
    interface Result;
    // FinalResult: info on all finalized results
    interface FinalResult extends Result;
    // FinalRuleResult: a finalized result matching a RuleGrammar
    interface FinalRuleResult extends FinalResult;
    // FinalDictationResult: a final result for a DictationGrammar
    interface FinalDictationResult extends FinalResult;
    // A result implementation provided by a Recognizer
    public class EngineResult implements FinalRuleResult, FinalDictationResult;
At first sight, the result interfaces may seem complex. The reasons for providing several interfaces are as follows:
- The information available for a result is different in different states of the result. Before finalization, a limited amount of information is available through the Result interface. Once a result is finalized (accepted or rejected), more detailed information is available through the FinalResult interface and either the FinalRuleResult or FinalDictationResult interface.
- The type of information available for a finalized result is different for a result that matches a RuleGrammar than for a result that matches a DictationGrammar. The differences are explicitly represented by having separate interfaces for FinalRuleResult and FinalDictationResult.
- Once a result object is created as a specific Java class it cannot be changed to another class. Therefore, because a result object must eventually support the final interfaces it must implement them when first created. Therefore, every result implements all three final interfaces when it is first created: FinalResult, FinalRuleResult and FinalDictationResult.
- When a result is first created a recognizer does not always know whether it will eventually match a RuleGrammar or a DictationGrammar. Therefore, every result object implements both the FinalRuleResult and FinalDictationResult interfaces.
- A call made to any method of any of the final interfaces before a result is finalized causes a ResultStateException.
- A call made to any method of the FinalRuleResult interface for a result that matches a DictationGrammar causes a ResultStateException. Similarly, a call made to any method of the FinalDictationResult interface for a result that matches a RuleGrammar causes a ResultStateException.
- All the result functionality is provided by interfaces in the javax.speech.recognition package rather than by classes. This is because the Java Speech API can support multiple recognizers from multiple vendors and interfaces allow the vendors greater flexibility in implementing results.
The multitude of interfaces is, in fact, designed to simplify application programming and to minimize the chance of introducing bugs into code by allowing compile-time checking of result calls. The two basic principles for calling the result interfaces are the following:
- If it is safe to call the methods of a particular interface then it is safe to call the methods of any of the parent interfaces. For example, for a finalized result matching a RuleGrammar, the methods of the FinalRuleResult interface are safe, so the methods of the FinalResult and Result interfaces are also safe. Similarly, for a finalized result matching a DictationGrammar, the methods of FinalDictationResult, FinalResult and Result can all be called safely.
- Use type casting of a result object to ensure compile-time checks of method calls. For example, when handling events for an unfinalized result, cast the result object to the Result interface. For a RESULT_ACCEPTED finalization event with a result that matches a DictationGrammar, cast the result to the FinalDictationResult interface.

In the next section the different information available through the different interfaces is described. In all the following sections that deal with result states and result events, details are provided on the appropriate casting of result objects.
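The casting rules above can be illustrated with a small, self-contained sketch. It does not use the real javax.speech.recognition types: the nested interfaces, the `SketchResult` class, the `describe` helper, the constant values and the simplified method signatures are all hypothetical stand-ins, chosen only to show why one object implements every final interface and how a wrong call surfaces as an exception.

```java
// Simplified stand-ins, NOT the real javax.speech.recognition types.
class ResultCastingSketch {

    // Stand-in for the API's ResultStateException (hypothetical definition).
    static class ResultStateException extends RuntimeException {
        ResultStateException(String message) { super(message); }
    }

    interface Result {
        int UNFINALIZED = 0, ACCEPTED = 1, REJECTED = 2; // illustrative values only
        int getResultState();
    }
    interface FinalResult extends Result { }
    interface FinalRuleResult extends FinalResult {
        String getRuleName();             // simplified: no alternative index
    }
    interface FinalDictationResult extends FinalResult {
        String[] getAlternativeTokens();  // simplified: plain strings
    }

    // One class implements every final interface from creation onwards.
    static class SketchResult implements FinalRuleResult, FinalDictationResult {
        private final boolean matchesRuleGrammar;
        private int state = Result.UNFINALIZED;

        SketchResult(boolean matchesRuleGrammar) {
            this.matchesRuleGrammar = matchesRuleGrammar;
        }

        void finalizeAs(int newState) { state = newState; }

        public int getResultState() { return state; }

        public String getRuleName() {
            requireFinalized();
            if (!matchesRuleGrammar) {
                throw new ResultStateException("result matches a DictationGrammar");
            }
            return "command";
        }

        public String[] getAlternativeTokens() {
            requireFinalized();
            if (matchesRuleGrammar) {
                throw new ResultStateException("result matches a RuleGrammar");
            }
            return new String[] { "hello", "world" };
        }

        private void requireFinalized() {
            if (state == Result.UNFINALIZED) {
                throw new ResultStateException("result is not finalized");
            }
        }
    }

    // Cast only to the interface that is safe for the current state and grammar type.
    static String describe(Result result, boolean matchesRuleGrammar) {
        if (result.getResultState() == Result.UNFINALIZED) {
            return "unfinalized"; // only Result methods are safe here
        }
        if (matchesRuleGrammar) {
            FinalRuleResult ruleResult = (FinalRuleResult) result;
            return "rule:" + ruleResult.getRuleName();
        }
        FinalDictationResult dictation = (FinalDictationResult) result;
        return "dictation:" + String.join(" ", dictation.getAlternativeTokens());
    }
}
```

In this sketch the caller passes a boolean to say which grammar type was matched; an application using the real API would instead inspect the grammar returned by the result before choosing the cast.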
6.7.3 Result Information
As the previous section describes, different information is available for a result depending upon the state of the result and, for finalized results, depending upon the type of grammar it matches (RuleGrammar or DictationGrammar).

6.7.3.1 Result Interface
The information available through the Result interface is available for any result in any state - finalized or unfinalized - and matching any grammar.
- Result state: The getResultState method returns the current state of the result. The three possible state values defined by static values of the Result interface are UNFINALIZED, ACCEPTED and REJECTED. (Result states are described in more detail in Section 6.7.4.)
- Grammar: The getGrammar method returns a reference to the matched Grammar, if it is known. For an ACCEPTED result, this method will return a RuleGrammar or a DictationGrammar. For a REJECTED result, this method may return a grammar, or may return null if the recognizer could not identify the grammar for this result. In the UNFINALIZED state, this method returns null before a GRAMMAR_FINALIZED event, and non-null afterwards.
- Number of finalized tokens: The numTokens method returns the total number of finalized tokens for a result. For an unfinalized result this may be zero or greater. For a finalized result this number is always greater than zero for an ACCEPTED result but may be zero or more for a REJECTED result. Once a result is finalized this number will not change.
- Finalized tokens: The getBestToken and getBestTokens methods return either a specified finalized best-guess token of a result or all the finalized best-guess tokens. The ResultToken object and token finalization are described in the following sections.
- Unfinalized tokens: In the UNFINALIZED state, the getUnfinalizedTokens method returns a list of unfinalized tokens. An unfinalized token is a recognizer's current guess of what a user has said, but the recognizer may choose to change these tokens at any time and in any way. For a finalized result, the getUnfinalizedTokens method always returns null.

In addition to the information detailed above, the Result interface provides the addResultListener and removeResultListener methods which allow a ResultListener to be attached to and removed from an individual result. ResultListener attachment is described in more detail in Section 6.7.5.

6.7.3.2 FinalResult Interface
The information available through the FinalResult interface is available for any finalized result, including results that match either a RuleGrammar or DictationGrammar.
- Audio data: a Recognizer may optionally provide audio data for a finalized result. This data is provided as an AudioClip for a token, a sequence of tokens, or for the entire result. Result audio and its management are described in more detail in Section 6.7.11.
- Training data: many recognizers have the ability to be trained and corrected. By training a recognizer or correcting its mistakes, a recognizer can adapt its recognition processes so that performance (accuracy and speed) improves over time. Several methods of the FinalResult interface support this capability and are described in detail in Section 6.7.12.
6.7.3.3 FinalDictationResult Interface
The FinalDictationResult interface contains a single method.
- Alternative tokens: The getAlternativeTokens method allows an application to request a set of alternative guesses for a single token or for a sequence of tokens in that result. In dictation systems, alternative guesses are typically used to facilitate correction of dictated text. Dictation recognizers are designed so that when they do make a misrecognition, the correct word sequence is usually amongst the best few alternative guesses. See Section 6.7.10.

6.7.3.4 FinalRuleResult Interface
Like the FinalDictationResult interface, the FinalRuleResult interface provides alternative guesses. The FinalRuleResult interface also provides some additional information that is useful in processing results that match a RuleGrammar.
- Alternative tokens: The getAlternativeTokens method allows an application to request a set of alternative guesses for the entire result (not for tokens). The getNumberGuesses method returns the actual number of alternative guesses available.
- Alternative grammars: The alternative guesses of a result matching a RuleGrammar do not all necessarily match the same grammar. The getRuleGrammar method returns a reference to the RuleGrammar matched by an alternative.
- Rulenames: When a result matches a RuleGrammar, it matches a specific defined rule of that RuleGrammar. The getRuleName method returns the rulename for the matched rule. See Section 6.7.9 for more on RuleGrammar results.
- Tags: A tag is a string attached to a component of a RuleGrammar definition. Tags are useful in simplifying the software for processing results matching a RuleGrammar (explained in Section 6.7.9). The getTags method returns the tags for the best guess for a FinalRuleResult.

6.7.4 Result Life Cycle
A Result is produced in response to a user's speech. Unlike keyboard input, mouse input and most other forms of user input, speech is not instantaneous (see Section 6.3.3.1 for more detail). As a consequence, a speech recognition result is not produced instantaneously. Instead, a Result is produced through a sequence of events starting some time after a user starts speaking and usually finishing some time after the user stops speaking.

Figure 6-2 shows the state system of a Result and the associated ResultEvents. As in the recognizer state diagram (Figure 6-1), the blocks represent states, and the labelled arcs represent transitions that are signalled by ResultEvents.

[Figure 6-2: state system of a Result]
Every result starts in the UNFINALIZED state when a RESULT_CREATED event is issued. While unfinalized, the recognizer provides information including finalized and unfinalized tokens and the identity of the grammar matched by the result. As this information is added, the RESULT_UPDATED and GRAMMAR_FINALIZED events are issued.

Once all information associated with a result is finalized, the entire result is finalized. As Section 6.7.1 explained, a result is finalized with either a RESULT_ACCEPTED or RESULT_REJECTED event placing it in either the ACCEPTED or REJECTED state. At that point all information associated with the result becomes available including the best guess tokens and the information provided through the three final result interfaces (see Section 6.7.3).

Once finalized the information available through all the result interfaces is fixed. The only exceptions are for the release of audio data and training data. If audio data is released, an AUDIO_RELEASED event is issued (see detail in Section 6.7.11). If training information is released, a TRAINING_INFO_RELEASED event is issued (see detail in Section 6.7.12).

Applications can track result states in a number of ways. Most often, applications handle results in a ResultListener implementation which receives ResultEvents as recognition proceeds.

As Section 6.7.3 explains, a recognizer conveys a range of information to an application through the stages of producing a recognition result. However, as the example in Section 6.7.1 shows, many applications only care about the last step and event in that process - the RESULT_ACCEPTED event.

The state of a result is also available through the getResultState method of the Result interface. That method returns one of the three result states: UNFINALIZED, ACCEPTED or REJECTED.

6.7.5 ResultListener Attachment
A ResultListener can be attached in one of three places to receive events associated with results: to a Grammar, to a Recognizer or to an individual Result. The different places of attachment give an application some flexibility in how it handles results.

To support ResultListeners the Grammar, Recognizer and Result interfaces all provide the addResultListener and removeResultListener methods.

Depending upon the place of attachment a listener receives events for different results and different subsets of result events.
Grammar: A ResultListener attached to a Grammar receives all ResultEvents for any result that has been finalized to match that grammar. Because the grammar is known once a GRAMMAR_FINALIZED event is produced, a ResultListener attached to a Grammar receives that event and subsequent events. Since grammars are usually defined for specific functionality it is common for most result handling to be done in the methods of listeners attached to each grammar.
Result: A ResultListener attached to a Result receives all ResultEvents starting at the time at which the listener is attached to the Result. Note that because a listener cannot be attached until a result has been created with the RESULT_CREATED event, it can never receive that event.
Recognizer: A ResultListener attached to a Recognizer receives all ResultEvents for all results produced by that Recognizer for all grammars. This form of listener attachment is useful for very simple applications (e.g., "Hello World!") and when centralized processing of results is required. Only ResultListeners attached to a Recognizer receive the RESULT_CREATED event.

6.7.6 Recognizer and Result States
The state system of a recognizer is tied to the processing of a result. Specifically, the LISTENING, PROCESSING and SUSPENDED state cycle described in Section 6.3.3 and shown in Figure 6-1 follows the production of an event.

The transition of a Recognizer from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event indicates that a recognizer has started to produce a result. The RECOGNIZER_PROCESSING event is followed by the RESULT_CREATED event to ResultListeners.

The RESULT_UPDATED and GRAMMAR_FINALIZED events are issued to ResultListeners while the recognizer is in the PROCESSING state.

As soon as the recognizer completes recognition of a result, it makes a transition from the PROCESSING state to the SUSPENDED state with a RECOGNIZER_SUSPENDED event. Immediately following that recognizer event, the result finalization event (either RESULT_ACCEPTED or RESULT_REJECTED) is issued. While the result finalization event is processed, the recognizer remains suspended. Once processing of the result finalization event is completed, the recognizer automatically transitions from the SUSPENDED state back to the LISTENING state with a CHANGES_COMMITTED event. Once back in the LISTENING state the recognizer resumes processing of audio input with the grammar committed with the CHANGES_COMMITTED event.

6.7.6.1 Updating Grammars
In many applications, grammar definitions and grammar activation need to be updated in response to spoken input from a user. For example, if speech is added to a traditional email application, the command "save this message" might result in a window being opened in which a mail folder can be selected. While that window is open, the grammars that control that window need to be activated. Thus during the event processing for the "save this message" command grammars may need to be created, updated and enabled. All this would happen during processing of the RESULT_ACCEPTED event.

For any grammar changes to take effect they must be committed (see Section 6.4.2). Because this form of grammar update is so common while processing the RESULT_ACCEPTED event (and sometimes the RESULT_REJECTED event), recognizers implicitly commit grammar changes after either result finalization event has been processed.

This implicit commit is indicated by the CHANGES_COMMITTED event that is issued when a Recognizer makes a transition from the SUSPENDED state to the LISTENING state following result finalization and the result finalization event processing (see Section 6.3.3 for details).

One desirable effect of this form of commit becomes useful in component systems. If changes in multiple components are triggered by a finalized result event, and if many of those components change grammars, then they do not each need to call the commitChanges method. The downside of multiple calls to the commitChanges method is that a syntax check must be performed upon each. Checking syntax can be computationally expensive and so multiple checks are undesirable. With the implicit commit, performed once all components have updated their grammars, computational costs are reduced.

6.7.7 Grammar Finalization
At any time during the processing of a result a GRAMMAR_FINALIZED event can be issued for that result indicating the Grammar matched by the result has been determined. This event is issued only once. It is required for any ACCEPTED result, but is optional for a result that is eventually rejected.

As Section 6.7.5 describes, the GRAMMAR_FINALIZED event is the first event received by a ResultListener attached to a Grammar.

The GRAMMAR_FINALIZED event behaves the same for results that match either a RuleGrammar or a DictationGrammar.

Following the GRAMMAR_FINALIZED event, the getGrammar method of the Result interface returns a non-null reference to the matched grammar. By issuing a GRAMMAR_FINALIZED event the Recognizer guarantees that the Grammar will not change.

Finally, the GRAMMAR_FINALIZED event does not change the result's state. A GRAMMAR_FINALIZED event is issued only when a result is in the UNFINALIZED state, and leaves the result in that state.

6.7.8 Token Finalization
A result is a dynamic object as it is being recognized. One way in which a result can be dynamic is that tokens are updated and finalized as recognition of speech proceeds. The result events allow a recognizer to inform an application of changes in either or both the finalized and unfinalized tokens of a result.
The finalized and unfinalized tokens can be updated on any of the following result event types: RESULT_CREATED, RESULT_UPDATED, RESULT_ACCEPTED, RESULT_REJECTED.

Finalized tokens are accessed through the getBestTokens and getBestToken methods of the Result interface. The unfinalized tokens are accessed through the getUnfinalizedTokens method of the Result interface. (See Section 6.7.3 for details.)

A finalized token is a ResultToken in a Result that has been recognized in the incoming speech as matching a grammar. Furthermore, when a recognizer finalizes a token it indicates that it will not change the token at any point in the future. The numTokens method returns the number of finalized tokens.

Many recognizers do not finalize tokens until recognition of an entire result is complete. For these recognizers, the numTokens method returns zero for a result in the UNFINALIZED state.

For recognizers that do finalize tokens while a Result is in the UNFINALIZED state, the following conditions apply:
- The Result object may contain zero or more finalized tokens when the RESULT_CREATED event is issued.
- The recognizer issues RESULT_UPDATED events to the ResultListener during recognition each time one or more tokens are finalized.
- Tokens are finalized strictly in the order in which they are spoken (i.e., left to right in English text).
A result in the UNFINALIZED state may also have unfinalized tokens. An unfinalized token is a token that the recognizer has heard, but which it is not yet ready to finalize. Recognizers are not required to provide unfinalized tokens, and applications can safely choose to ignore unfinalized tokens.

For recognizers that provide unfinalized tokens, the following conditions apply:
- The Result object may contain zero or more unfinalized tokens when the RESULT_CREATED event is issued.
- The recognizer issues RESULT_UPDATED events to the ResultListener during recognition each time the unfinalized tokens change.
- For an unfinalized result, unfinalized tokens may be updated at any time and in any way. Importantly, the number of unfinalized tokens may increase, decrease or return to zero and the values of those tokens may change in any way the recognizer chooses.
Unfinalized tokens are highly changeable, so why are they useful? Many applications can provide users with visual feedback of unfinalized tokens - particularly for dictation results. This feedback informs users of the progress of the recognition and helps the user to know that something is happening. However, because these tokens may change and are more likely than finalized tokens to be incorrect, the applications should visually distinguish the unfinalized tokens by using a different font, different color or even a different window.
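As a rough illustration of that kind of feedback, the sketch below builds one display line from two token arrays and marks the unfinalized tokens with square brackets. The `TokenFeedbackSketch` class and its `render` method are hypothetical helpers, and plain strings stand in for the ResultToken objects an application would receive; a real user interface would more likely use a font or color change than brackets.

```java
// Hypothetical helper; plain strings stand in for ResultToken objects.
class TokenFeedbackSketch {

    // Builds one display line: finalized tokens as-is, unfinalized tokens in [brackets].
    // The unfinalized array may be null, as it is for a finalized result.
    static String render(String[] finalized, String[] unfinalized) {
        StringBuilder line = new StringBuilder();
        for (String token : finalized) {
            if (line.length() > 0) line.append(' ');
            line.append(token);
        }
        if (unfinalized != null) {
            for (String token : unfinalized) {
                if (line.length() > 0) line.append(' ');
                line.append('[').append(token).append(']');
            }
        }
        return line.toString();
    }
}
```

An application could call such a helper from its handling of each RESULT_UPDATED event and redraw the line, so that bracketed words are replaced by plain ones as the recognizer finalizes them.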
The following is an example of finalized tokens and unfinalized tokens for the sentence "I come from Australia". The lines indicate the token values after the single RESULT_CREATED event, the multiple RESULT_UPDATED events and the final RESULT_ACCEPTED event. The finalized tokens are in bold, the unfinalized tokens are in italics.

RESULT_CREATED: I come
RESULT_UPDATED: I come from
RESULT_UPDATED: I come from
RESULT_UPDATED: I come from a strange land
RESULT_UPDATED: I come from Australia
RESULT_ACCEPTED: I come from Australia

Recognizers can vary in how they support finalized and unfinalized tokens in a number of ways. For an unfinalized result, a recognizer may provide finalized tokens, unfinalized tokens, both or neither. Furthermore, for a recognizer that does support finalized and unfinalized tokens during recognition, the behavior may depend upon the number of active grammars, upon whether the result is for a RuleGrammar or DictationGrammar, upon the length of spoken sentences, and upon other more complex factors. Fortunately, unless there is a functional requirement to display or otherwise process intermediate results, an application can safely ignore all but the RESULT_ACCEPTED event.

6.7.9 Finalized Rule Results
There are some common design patterns for processing accepted finalized results that match a RuleGrammar. First we review what we know about these results.
- It is safe to cast an accepted result that matches a RuleGrammar to the FinalRuleResult interface. It is safe to call any method of the FinalRuleResult interface or its parents: FinalResult and Result.
- The getGrammar method of the Result interface returns a reference to the matched RuleGrammar. The getRuleGrammar method of the FinalRuleResult interface returns references to the RuleGrammars matched by the alternative guesses.
- The getBestToken and getBestTokens methods of the Result interface return the recognizer's best guess of what a user said.
- Result audio (see Section 6.7.11) and training information (see Section 6.7.12) are optionally available.
6.7.9.1 Result Tokens
A ResultToken in a result matching a RuleGrammar contains the same information as the RuleToken object in the RuleGrammar definition. This means that the tokenization of the result follows the tokenization of the grammar definition including compound tokens. For example, consider a grammar with the following Java Speech Grammar Format fragment which contains four tokens:

    <rule> = I went to "San Francisco";

If the user says "I went to San Francisco" then the result will contain the four tokens defined by JSGF: "I", "went", "to", "San Francisco".
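The way a quoted compound token counts as a single token can be sketched with a tiny tokenizer. This is only an illustration under narrow assumptions: the `CompoundTokenSketch` class and its `tokenize` method are hypothetical, and the input is limited to plain words and double-quoted tokens, with none of the other JSGF syntax (alternatives, grouping, tags, rule references).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; handles only plain words and "quoted tokens".
class CompoundTokenSketch {

    static List<String> tokenize(String expansion) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < expansion.length(); i++) {
            char c = expansion.charAt(i);
            if (c == '"') {
                // A quote ends whatever token is pending and toggles quoted mode.
                flush(tokens, current);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Outside quotes, whitespace separates tokens.
                flush(tokens, current);
            } else {
                // Inside quotes, whitespace is kept as part of the token.
                current.append(c);
            }
        }
        flush(tokens, current);
        return tokens;
    }

    // Adds the pending token, if any, and clears the buffer.
    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Applied to the right-hand side of the rule above, `tokenize("I went to \"San Francisco\"")` yields four tokens, with "San Francisco" kept together as one.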
The ResultToken interface defines more advanced information. Amongst that information the getStartTime and getEndTime methods may optionally return time-stamp values (or -1 if the recognizer does not provide time-alignment information).

The ResultToken interface also defines several methods for a recognizer to provide presentation hints. Those hints are ignored for RuleGrammar results - they are only used for dictation results (see Section 6.7.10.2).

Furthermore, the getSpokenText and getWrittenText methods will return an identical string which is equal to the string defined in the matched grammar.

6.7.9.2 Alternative Guesses
In a FinalRuleResult, alternative guesses are alternatives for the entire result, that is, for a complete utterance spoken by a user. (A FinalDictationResult can provide alternatives for single tokens or sequences of tokens.) Because more than one RuleGrammar can be active at a time, an alternative token sequence may match a rule in a different RuleGrammar than the best guess tokens, or may match a different rule in the same RuleGrammar as the best guess. Thus, when processing alternatives for a FinalRuleResult, an application should use the getRuleGrammar and getRuleName methods to ensure that it analyzes the alternatives correctly.

Alternatives are numbered from zero up. The 0th alternative is actually the best guess for the result so FinalRuleResult.getAlternativeTokens(0) returns the same array as Result.getBestTokens(). (The duplication is for programming convenience.) Likewise, the FinalRuleResult.getRuleGrammar(0) call will return the same result as Result.getGrammar().

The following code is an implementation of the ResultListener interface that processes the RESULT_ACCEPTED event. The implementation assumes that a Result being processed matches a RuleGrammar.
    import java.io.PrintStream;
    import javax.speech.recognition.FinalRuleResult;
    import javax.speech.recognition.ResultAdapter;
    import javax.speech.recognition.ResultEvent;
    import javax.speech.recognition.ResultToken;

    class MyRuleResultListener extends ResultAdapter {
        public void resultAccepted(ResultEvent e) {
            // Assume that the result matches a RuleGrammar.
            // Cast the result (source of event) appropriately
            FinalRuleResult res = (FinalRuleResult) e.getSource();

            // Print out basic result information
            PrintStream out = System.out;
            out.println("Number guesses: " + res.getNumberGuesses());

            // Print out the best result and all alternatives
            for (int n = 0; n < res.getNumberGuesses(); n++) {
                // Extract the n-best information
                String gname = res.getRuleGrammar(n).getName();
                String rname = res.getRuleName(n);
                ResultToken[] tokens = res.getAlternativeTokens(n);

                // Print "<grammarName.ruleName>:" followed by the spoken tokens
                out.print("Alt " + n + ": ");
                out.print("<" + gname + "." + rname + ">:");
                for (int t = 0; t < tokens.length; t++)
                    out.print(" " + tokens[t].getSpokenText());
                out.println();
            }
        }
    }

For a grammar with commands to control a windowing system (shown below), a result might look like:
    Number guesses: 3
    Alt 0: <com.acme.actions.command>: move the window to the back
    Alt 1: <com.acme.actions.command>: move window to the back
    Alt 2: <com.acme.actions.command>: open window to the front

If more than one grammar or more than one public rule was active, the <grammarName.ruleName> values could vary between the alternatives.

6.7.9.3 Result Tags
Processing commands generated from a RuleGrammar becomes increasingly difficult as the complexity of the grammar rises. With the Java Speech API, speech recognizers provide two mechanisms to simplify the processing of results: tags and parsing.

A tag is a label attached to an entity within a RuleGrammar. The Java Speech Grammar Format and the RuleTag class define how tags can be attached to a grammar. The following is a grammar for very simple control of windows which includes tags attached to the important words in the grammar.

    grammar com.acme.actions;
    public <command> = <action> <object> [<where>];
    <action> = open {ACT_OP} | close {ACT_CL} | move {ACT_MV};
    <object> = [a | an | the] (window {OBJ_WIN} | icon {OBJ_ICON});
    <where> = [to the] (back {WH_BACK} | front {WH_FRONT});

This grammar allows users to speak commands such as
open window