Chapter 1
Introduction
Speech technology, once limited to the realm of science fiction, is now available for use in real applications. The Java™ Speech API, developed by Sun Microsystems in cooperation with speech technology companies, defines a software interface that allows developers to take advantage of speech technology for personal and enterprise computing. By leveraging the inherent strengths of the Java platform, the Java Speech API enables developers of speech-enabled applications to incorporate more sophisticated and natural user interfaces into Java applications and applets that can be deployed on a wide range of platforms.
1.1 What is the Java Speech API?
The Java Speech API defines a standard, easy-to-use, cross-platform software interface to state-of-the-art speech technology. Two core speech technologies are supported through the Java Speech API: speech recognition and speech synthesis. Speech recognition provides computers with the ability to listen to spoken language and to determine what has been said. In other words, it processes audio input containing speech by converting it to text. Speech synthesis provides the reverse process of producing synthetic speech from text generated by an application, an applet or a user. It is often referred to as text-to-speech technology.
Enterprises and individuals can benefit from a wide range of applications of speech technology using the Java Speech API. For instance, interactive voice response systems are an attractive alternative to touch-tone interfaces over the telephone; dictation systems can be considerably faster than typed input for many users; and speech technology improves accessibility to computers for many people with physical limitations.
Speech interfaces give Java application developers the opportunity to implement distinct and engaging personalities for their applications and to differentiate their products. Java application developers will have access to state-of-the-art speech technology from leading speech companies. With a standard API for speech, users can choose the speech products which best meet their needs and their budget.
The Java Speech API was developed through an open development process. With the active involvement of leading speech technology companies, with input from application developers and with months of public review and comment, the specification has achieved a high degree of technical excellence. Because the specification covers a rapidly evolving technology, Sun will support and enhance the Java Speech API to maintain its leading capabilities.
The Java Speech API is an extension to the Java platform. Extensions are packages of classes written in the Java programming language (and any associated native code) that application developers can use to extend the functionality of the core part of the Java platform.
1.2 Design Goals for the Java Speech API
Along with the other Java Media APIs, the Java Speech API lets developers incorporate advanced user interfaces into Java applications. The design goals for the Java Speech API included:
- Provide support for speech synthesizers and for both command-and-control and dictation speech recognizers.
- Support integration with other capabilities of the Java platform, including the suite of Java Media APIs.
1.3 Speech-Enabled Java Applications
The existing capabilities of the Java platform make it attractive for the development of a wide range of applications. With the addition of the Java Speech API, Java application developers can extend and complement existing user interfaces with speech input and output. For existing developers of speech applications, the Java platform now offers an attractive alternative with:
- Portability: the Java programming language, APIs and virtual machine are available for a wide variety of hardware platforms and operating systems and are supported by major web browsers.
- Powerful and compact environment: the Java platform provides developers with a powerful, object-oriented, garbage collected language which enables rapid development and improved reliability.
- Network aware and secure: from its inception, the Java platform has been network aware and has included robust security.
1.3.1 Speech and other Java APIs
The Java Speech API is one of the Java Media APIs, a suite of software interfaces that provide cross-platform access to audio, video and other multimedia playback, 2D and 3D graphics, animation, telephony, advanced imaging, and more. The Java Speech API, in combination with the other Java Media APIs, allows developers to enrich Java applications and applets with rich media and communication capabilities that meet the expectations of today's users, and can enhance person-to-person communication.
The Java Speech API leverages the capabilities of other Java APIs. The Internationalization features of the Java programming language plus the use of the Unicode character set simplify the development of multi-lingual speech applications. The classes and interfaces of the Java Speech API follow the design patterns of JavaBeans™. Finally, Java Speech API events integrate with the event mechanisms of AWT, JavaBeans and the Java Foundation Classes (JFC).
1.4 Applications of Speech Technology
Speech technology is becoming increasingly important in both personal and enterprise computing as it is used to improve existing user interfaces and to support new means of human interaction with computers. Speech technology allows hands-free use of computers and supports access to computing capabilities away from the desk and over the telephone. Speech recognition and speech synthesis can improve computer accessibility for users with disabilities and can reduce the risk of repetitive strain injury and other problems caused by current interfaces.
The following sections describe some current and emerging uses of speech technology. The lists of uses are far from exhaustive. New speech products are being introduced on a weekly basis and speech technology is rapidly entering new technical domains and new markets. The coming years should see speech input and output truly revolutionize the way people interact with computers and present new and unforeseen uses of speech technology.
1.4.1 Desktop
Speech technology can augment traditional graphical user interfaces. At its simplest, it can be used to provide audible prompts with spoken "Yes/No/OK" responses that do not distract the user's focus. But increasingly, complex commands are enabling rapid access to features that have traditionally been buried in sub-menus and dialogs. For example, the command "Use 12-point, bold, Helvetica font" replaces multiple menu selections and mouse clicks.
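To illustrate how one spoken command can stand in for several menu selections, the following minimal sketch assumes a recognizer has already returned the command as text and maps it onto font settings. The class `FontCommand`, its fields and its small list of font families are hypothetical and are not part of the Java Speech API.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for font settings extracted from one spoken command. */
final class FontCommand {
    final int pointSize;   // 0 if the command named no size
    final boolean bold;
    final String family;   // null if the command named no known family

    // Matches phrases such as "12-point" in lower-cased text.
    private static final Pattern SIZE = Pattern.compile("(\\d+)-point");
    // Assumed vocabulary of font families, for illustration only.
    private static final String[] FAMILIES = {"Helvetica", "Courier", "Times"};

    private FontCommand(int pointSize, boolean bold, String family) {
        this.pointSize = pointSize;
        this.bold = bold;
        this.family = family;
    }

    /** Extracts size, weight and family from text assumed to come from a recognizer. */
    static FontCommand parse(String recognizedText) {
        String lower = recognizedText.toLowerCase(Locale.ROOT);
        Matcher m = SIZE.matcher(lower);
        int size = m.find() ? Integer.parseInt(m.group(1)) : 0;
        boolean bold = lower.contains("bold");
        String family = null;
        for (String candidate : FAMILIES) {
            if (lower.contains(candidate.toLowerCase(Locale.ROOT))) {
                family = candidate;
                break;
            }
        }
        return new FontCommand(size, bold, family);
    }

    public static void main(String[] args) {
        FontCommand c = parse("Use 12-point, bold, Helvetica font");
        // One utterance sets three properties that would otherwise need separate menu actions.
        System.out.println(c.pointSize + " " + c.bold + " " + c.family);
    }
}
```

In this sketch a single call to `parse` yields the size, weight and family together, which is the effect the spoken command has on the user interface.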
Drawing, CAD and other hands-busy applications can be enhanced by using speech commands in combination with mouse and keyboard actions to improve the speed at which users can manipulate objects. For example, while dragging an object, a speech command could be used to change its color and line type, all without moving the pointer to the menu-bar or a tool palette.
Natural language commands can improve efficiency, but in desktop environments they are increasingly being used to enhance usability as well. For many users it's easier and more natural to produce spoken commands than to remember the location of functions in menus and dialog boxes. Speech technology is unlikely to make existing user interfaces redundant any time soon, but spoken commands provide an elegant complement to existing interfaces.
Speech dictation systems are now affordable and widely available. Dictation systems can provide typing rates exceeding 100 words per minute and word accuracy over 95%. These rates substantially exceed the typing ability of most people.
Speech synthesis can enhance applications in many ways. Speech synthesis of text in a word processor is a reliable aid to proof-reading, as many users find it easier to detect grammatical and stylistic problems when listening rather than reading. Speech synthesis can provide background notification of events and status changes, such as printer activity, without requiring a user to lose current context. Applications which currently include speech output using pre-recorded messages can be enhanced by using speech synthesis to reduce the storage space by a factor of up to 1000, and by removing the restriction that the output sentences be defined in advance.
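The storage saving can be checked with rough arithmetic. The sketch below compares the bytes needed for one minute of uncompressed recorded speech with the bytes needed to store the same words as text for a synthesizer. Every figure in it (8 kHz, 16-bit mono audio; 150 spoken words per minute; 6 bytes per word including the space) is an illustrative assumption, not a value taken from this guide, and the class name is hypothetical.

```java
/** Back-of-the-envelope comparison of recorded speech and text storage. */
final class SpeechStorageEstimate {

    /** Bytes of uncompressed mono audio for the given duration. */
    static long recordedBytes(int sampleRateHz, int bytesPerSample, int seconds) {
        return (long) sampleRateHz * bytesPerSample * seconds;
    }

    /** Bytes of plain text covering the same duration of speech. */
    static long textBytes(int wordsPerMinute, int bytesPerWord, int seconds) {
        return (long) wordsPerMinute * bytesPerWord * seconds / 60;
    }

    public static void main(String[] args) {
        int seconds = 60;
        long audio = recordedBytes(8000, 2, seconds); // assumed 8 kHz, 16-bit samples
        long text = textBytes(150, 6, seconds);       // assumed speaking rate and word length
        System.out.println("recorded audio: " + audio + " bytes");
        System.out.println("text:           " + text + " bytes");
        System.out.println("ratio:          about " + (audio / text) + " to 1");
    }
}
```

With these assumptions the ratio comes out a little over 1000 to 1. Compressed audio formats would narrow the gap, so the result is best read as an order-of-magnitude estimate.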
In many situations where keyboard input is impractical and visual displays are restricted, speech may provide the only way to interact with a computer. For example, surgeons and other medical staff can use speech dictation to enter reports when their hands are busy and when touching a keyboard represents a hygiene risk. In vehicle and airline maintenance, warehousing and many other hands-busy tasks, speech interfaces can provide practical data input and output and can enable computer-based training.
1.4.2 Telephony Systems
Speech technology is being used by many enterprises to handle customer calls and internal requests for access to information, resources and services. Speech recognition over the telephone provides a more natural and substantially more efficient interface than touch-tone systems. For example, speech recognition can "flatten out" the deep menu structures used in touch-tone systems.
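A minimal sketch of what "flattening" means: in a touch-tone system each service sits at the end of a sequence of key presses, whereas a speech interface can let the caller name the service in one utterance. The menu contents, key sequences and the class `MenuFlattening` below are all hypothetical, and the sketch assumes the recognizer has already turned the caller's speech into a phrase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MenuFlattening {

    /** Hypothetical touch-tone menu: key sequence -> service reached. */
    static Map<String, String> touchToneMenu() {
        Map<String, String> menu = new LinkedHashMap<>();
        menu.put("1-2-1", "checking balance");
        menu.put("1-2-2", "savings balance");
        menu.put("2-1-3", "transfer funds");
        return menu;
    }

    /** Number of key presses in a sequence such as "1-2-1". */
    static int keyPresses(String keySequence) {
        return keySequence.split("-").length;
    }

    /** Flattened view: spoken phrase -> the key sequence it replaces. */
    static Map<String, String> spokenMenu(Map<String, String> touchTone) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : touchTone.entrySet()) {
            flat.put(entry.getValue(), entry.getKey());
        }
        return flat;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : spokenMenu(touchToneMenu()).entrySet()) {
            String keys = entry.getValue();
            System.out.println("say \"" + entry.getKey() + "\" (1 utterance) instead of pressing "
                    + keys + " (" + keyPresses(keys) + " key presses)");
        }
    }
}
```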
Systems are already available for telephone access to email, calendars and other computing facilities that have previously been available only on the desktop or with special equipment. Such systems allow convenient computer access by telephones in hotels, airports and airplanes.
Universal messaging systems can provide a single point of access to multiple media such as voice-mail, email, fax and pager messages. Such systems rely upon speech synthesis to read out messages over the telephone. For example: "Do I have any email?" "Yes, you have 7 messages including 2 high priority messages from the production manager." "Please read me the mail from the production manager." "Email arrived at 12:30pm...".
1.4.3 Personal and Embedded Devices
Speech technology is being integrated into a range of small-scale and embedded computing devices to enhance their usability and reduce their size. Such devices include Personal Digital Assistants (PDAs), telephone handsets, toys and consumer product controllers.
Speech technology is particularly compelling for such devices and is being used increasingly as the computer power of these devices increases. Speech recognition through a microphone can replace input through a much larger keyboard. A speaker for speech synthesis output is also smaller than most graphical displays.
PersonalJava™ and EmbeddedJava™ are the Java application environments targeted at these same devices. PersonalJava and EmbeddedJava are designed to operate on constrained devices with limited computing power and memory, and with more constrained input and output mechanisms for the user interface.
As an extension to the Java platform, the Java Speech API can be provided as an extension to PersonalJava and EmbeddedJava devices, allowing the devices to communicate with users without the need for keyboards or other large peripherals.
1.4.4 Speech and the Internet
The Java Speech API allows applets transmitted over the Internet or intranets to access speech capabilities on the user's machine. This provides the ability to enhance World Wide Web sites with speech and support new ways of browsing. Speech recognition can be used to control browsers, fill out forms, control applets and enhance the WWW/Internet experience in many other ways. Speech synthesis can be used to bring web pages alive, inform users of the progress of applets, and dramatically improve browsing time by reducing the amount of audio sent across the Internet.
The Java Speech API utilizes the security features of the Java platform to ensure that applets cannot maliciously use system resources on a client. For example, explicit permission is required for an applet to access a dictation recognizer since otherwise a recognizer could be used to bug a user's workspace.
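The idea of an explicit grant can be sketched with the general-purpose permission classes in `java.security`. The example below is not the Java Speech API's own mechanism: `DemoSpeechPermission`, the permission name `"dictation"` and the class `DictationAccessCheck` are hypothetical names chosen for illustration. It shows only that code holding no grant is refused, while code that has been given the permission is allowed.

```java
import java.security.BasicPermission;
import java.security.PermissionCollection;
import java.security.Permissions;

final class DictationAccessCheck {

    /** Returns true only if the granted set explicitly covers dictation access. */
    static boolean mayUseDictation(PermissionCollection granted) {
        return granted.implies(new DemoSpeechPermission("dictation"));
    }

    public static void main(String[] args) {
        // An applet with no grants: dictation access is refused by default.
        Permissions untrustedApplet = new Permissions();

        // An applet that has been explicitly granted the (hypothetical) permission.
        Permissions trustedApplet = new Permissions();
        trustedApplet.add(new DemoSpeechPermission("dictation"));

        System.out.println("untrusted applet may dictate: " + mayUseDictation(untrustedApplet));
        System.out.println("trusted applet may dictate:   " + mayUseDictation(trustedApplet));
    }
}

/** Hypothetical permission type used only in this sketch. */
final class DemoSpeechPermission extends BasicPermission {
    private static final long serialVersionUID = 1L;

    DemoSpeechPermission(String name) {
        super(name);
    }
}
```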
1.5 Implementations
The Java Speech API can enable access to the most important and useful state-of-the-art speech technologies. Sun is working with speech technology companies on implementations of the API. Already speech recognition and speech synthesis are available through the Java Speech API on multiple computing platforms.
The following are the primary mechanisms for implementing the API.
- Native implementations: most existing speech technology is implemented in C and C++ and accessed through platform-specific APIs such as the Apple Speech Managers and Microsoft's Speech API (SAPI), or via proprietary vendor APIs. Using the Java Native Interface (JNI) and Java software wrappers, speech vendors can implement (and have implemented) the Java Speech API on top of their existing speech software.
- Java software implementations: Speech synthesizers and speech recognizers can be written in Java software. These implementations will benefit from the portability of the Java platform and from the continuing improvements in the execution speed of Java virtual machines*.
- Telephony implementations: Enterprise telephony applications are typically implemented with dedicated hardware to support a large number of simultaneous connections, for example, using DSP cards. Speech recognition and speech synthesis capabilities on this dedicated hardware can be wrapped with Java software to support the Java Speech API as a special type of native implementation.
1.6 Requirements
To use the Java Speech API, a user must have certain minimum software and hardware available. The following is a broad sample of requirements. The individual requirements of speech synthesizers and speech recognizers can vary greatly and users should check product requirements closely.
- System requirements: most desktop speech recognizers and some speech synthesizers require relatively powerful computers to run effectively. Check the minimum and recommended requirements for CPU, memory and disk space when purchasing a speech product.
- Audio Hardware: Speech synthesizers require audio output. Speech recognizers require audio input. Most desktop and laptop computers now sold have satisfactory audio support. Most dictation systems perform better with good quality sound cards.
- Microphone: Desktop speech recognition systems get audio input through a microphone. Some recognizers, especially dictation systems, are sensitive to the microphone and most recognition products recommend particular microphones. Headset microphones usually provide best performance, especially in noisy environments. Table-top microphones can be used in some environments for some applications.
*As used on this web site, the terms "Java virtual machine" or "JVM" mean a virtual machine for the Java platform.
Java™ Speech API Programmer's Guide
Copyright © 1997-1998
Sun Microsystems, Inc.
All rights reserved
Send comments or corrections to javaspeech-comments@sun.com