1.0 Introduction

A speech synthesizer provides a computer with the ability to speak. Users and applications provide text to a speech synthesizer, which converts it to audio.

Figure 1: Text from an application is converted to audio output

Speech synthesizers are developed to produce natural-sounding speech output. However, natural human speech is a complex process, and the ability of speech synthesizers to mimic human speech is limited in many ways. For example, speech synthesizers do not "understand" what they say, so they do not always use the right style or phrasing and do not provide the same nuances as people.

The Java Speech Markup Language (JSML) allows applications to annotate text with additional information that can improve the quality and naturalness of synthesized speech. JSML documents can include structural information about paragraphs and sentences. JSML allows control of the production of synthesized speech, including the pronunciation of words and phrases, the emphasis of words (stressing or accenting), the placement of boundaries and pauses, and the control of pitch and speaking rate. Finally, JSML allows markers to be embedded in text and allows synthesizer-specific controls.

For the example in Figure 1, we might use JSML tags to indicate the start and end of the sentence and to emphasize the word "can":

1.1 Role of JSML

JSML has been developed to support as many types of applications as possible, and to support text markup in many different languages. To make this possible, JSML marks general information about the text and, whenever possible, uses cross-language properties.

Although JSML may be used for text in Japanese, Spanish, Tamil, Thai, English, and nearly all modern languages, a single JSML document should contain text for only a single language. Applications are therefore responsible for management and control of speech synthesizers if output of multiple languages is required.

JSML can be used by a wide range of applications to speak text from equally varied sources, including email, database information, web pages, and word processor documents. Figure 2 illustrates the basic steps in this process.

Figure 2: JSML Process

The application is responsible for converting the source information to JSML text using any special knowledge it has about the content and format of the source information. For example, an email application can provide the ability to read email messages aloud by converting messages to JSML. This could involve the conversion of email header information (sender, subject, date, etc.) to a speakable form and might also involve special processing of text in the body of the message (for handling attachments, indented text, special abbreviations, etc.).

Here is a sample of an email message converted to JSML:

<PARA>Message from <EMP>Alan Schwarz</EMP> about new
synthesis technology.
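
As a rough illustration of the conversion step, the following Python sketch turns two email header fields into a JSML paragraph. The function name email_to_jsml and its parameters are hypothetical and are not part of JSML or the Java Speech API; only the PARA and EMP elements come from this document.

```python
from xml.sax.saxutils import escape


def email_to_jsml(sender: str, subject: str) -> str:
    """Build a JSML paragraph announcing an email message.

    Hypothetical helper for illustration only. escape() replaces any
    "<", ">" or "&" in the header text so it is not read as markup.
    """
    return (
        "<PARA>Message from <EMP>" + escape(sender) + "</EMP> about "
        + escape(subject) + ".</PARA>"
    )


print(email_to_jsml("Alan Schwarz", "new synthesis technology"))
```

A real application would apply the same idea to the date, the message body, and any attachments, using whatever knowledge it has about the message format.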
2.0 Markup in JSML

2.1 Basic Markup

The special text in the following example is the text markup. This style will be familiar to you if you have used HTML, SGML, or XML.

<SENT> indicates the start of a sentence element and </SENT> ends that
sentence. Similarly, <EMP> and </EMP> mark a region to be emphasized.
2.2 Container Elements

JSML elements are either container elements or empty elements. A container element has a balanced start tag and end tag (e.g., <SENT> and </SENT>). The
text appearing between the start and end tags is the contained text as shown in
Figure 3. An element's start-tag defines the type of element and may contain one
or more attributes. All end-tags have the same name as their matching start-tag.
Figure 3: Elements and Attributes

2.2.1 Attributes

Attributes are used to provide additional information about an element. Each JSML element has a set of defined attribute names and, in some cases, the attribute value is restricted to certain strings. For example, an EMP element can
mark words with a LEVEL attribute value of strong:
2.2.2 Element Nesting

Some JSML elements allow the contained text to contain other elements. This is referred to as nesting. Nested elements cannot overlap or intertwine. For example, the following is not legal:

2.3 Empty Elements

An empty element has only one tag and does not contain any text. For example, the following results in a large break/pause in the speech at the point that the element occurs:

Because it doesn't mark any text, an empty element like BREAK doesn't need an
end-tag. Rather, the "/>" marks the end of the start-tag and of the element. Like
the container elements, empty elements can include attributes to provide
additional information (for example, SIZE="large" above).
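
Because JSML uses XML syntax (see section 2.7), a generic XML parser can catch overlapping elements before text is sent to a synthesizer. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical. It checks the syntax of a single element only, not whether element or attribute names are defined in JSML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(jsml: str) -> bool:
    """Return True if the text parses as a single well-formed XML element."""
    try:
        ET.fromstring(jsml)
        return True
    except ET.ParseError:
        return False


# Properly nested container elements plus an empty element.
print(is_well_formed('<SENT>Wait <BREAK SIZE="large"/> <EMP>here</EMP>.</SENT>'))
# Overlapping EMP and SENT elements are rejected by the parser.
print(is_well_formed('<SENT>This is <EMP>not legal.</SENT></EMP>'))
```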
2.4 Names

All JSML element and attribute names are uppercase. All JSML attribute values are case sensitive. Furthermore, the naming of elements and attributes and the values of attributes are independent. Consequently, it is possible for an element to have an attribute of the same name (though none currently do).

2.5 White Space

Within an element's start- and end-tags, single white space characters can optionally be replaced by multiple white space characters without changing the semantics of the element.

White space contained between an element's start- and end-tags, or not contained by any element, is passed to the speech synthesizer and may affect speech output.

2.6 Undefined Names

Elements or attributes with undefined names are ignored by the speech synthesizer. This feature is useful in automatic generation and processing of JSML. For example, a web browser could generate the following:

In this example, the ORIG attribute is used to preserve the original URL. The
contained text will be spoken by the speech synthesizer but the URL element tags
will be ignored, because they are not defined in JSML and therefore not known to
the synthesizer.
This mechanism does allow speech synthesizers to extend the JSML element set by interpreting these additional elements specially. However, application developers should be aware that elements not specified in JSML are not portable across synthesizers and platforms.

2.7 JSML Document Structure

JSML is a subset of XML[1] (Extensible Markup Language), which is a simple dialect of SGML. By being a subset of XML, JSML gains a standardized, extensible syntax that is not tied to the Java Speech API (JSAPI). This means that:
Having a DTD allows the application to use the full power of XML for generating text, for example, entity references that can act as a shorthand for repetitive JSML, and then to use generic text processing tools for generating the JSML.

2.7.1 Splitting JSML Documents

A JSML document must be syntactically complete. Every start tag must either belong to an empty element (no end tag required) or have a matching end tag. If text is split into multiple JSML documents to be spoken in sequence, then the text should be split between paragraphs or perhaps between sentences. This is because each document will be spoken independently and important phrasing and pitch information will be affected by inappropriate boundaries.

2.8 Escaping/Quoting Text

If text to be spoken contains a less-than sign ("<", which is \u003C) or an
ampersand ("&", which is \u0026), then the text needs to be escaped or quoted
to prevent the possibility of some of the text being mistaken for JSML tags. There
are several methods available:
A CDATA section has the general form of:

<![CDATA[text to be escaped]]>

The text that is being escaped can contain any character sequence that is not the
"]]>" sequence.

CDATA sections are processed by stripping away the <![CDATA[ and ]]>
markup and not parsing the CDATA section's contents for JSML.
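
As an illustration, the following Python sketch shows two ways an application might protect text before handing it to a synthesizer: replacing the special characters with the predefined XML entity references, and wrapping the text in a CDATA section. The function names escape_text and wrap_cdata are hypothetical; escape comes from Python's standard xml.sax.saxutils module.

```python
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Replace "&", "<" and ">" with XML entity references."""
    return escape(text)


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section; the text must not contain "]]>"."""
    if "]]>" in text:
        raise ValueError('CDATA content cannot contain "]]>"')
    return "<![CDATA[" + text + "]]>"


print(escape_text("AT&T sells for < 40 dollars"))
print(wrap_cdata("AT&T sells for < 40 dollars"))
```

Entity references keep the text readable by any XML tool, while a CDATA section avoids touching the text at all, at the cost of rejecting the "]]>" sequence.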
2.9 Comments

A JSML comment begins with a <!-- character sequence and ends with a -->
character sequence and may contain any text except the two-character sequence
--.

Comments can be placed within text that is to be spoken (the comments will not be spoken). Comments may not be placed within elements.
3.0 JSML Elements

JSML syntax consists of structural, production, and miscellaneous elements. The following table presents an overview of JSML's elements. These elements are defined in detail in the following sections. The section on structural elements also describes implicit paragraph marking, which is an alternative to the PARA
element.
4.0 Structural Elements

4.1 PARA

The PARA element declares a range of text to be a paragraph. For example:

<PARA>This is a short paragraph.</PARA><PARA>The subject has
changed, so this is a new paragraph.</PARA>

PARA elements do not contain other PARA elements; that is, PARA elements do not
embed or nest. For example, the following is not legal:

<PARA>The raven spoke.
4.2 Implicit Paragraph Marking

In JSML, a blank line (that is, a line that contains only whitespace characters) that separates one block of text from another is treated the same as explicitly marking the block as a paragraph. Strictly speaking, a blank line is not an element; however, it does serve the same function as the PARA element.
The following fragments result in the same speech:

She went to school and passed the tests.

When she returned home, the sun had set.

and

<PARA>She went to school and passed the tests.</PARA>
<PARA>When she returned home, the sun had set.</PARA>
\u0020, horizontal tabulations, \u0009, and
ideographic spaces, \u3000) in any of the following:
4.3 SENT

The SENT element declares a range of text to be a sentence. For example:

SENT elements do not contain other SENT elements; that is, SENT elements do not
embed or nest. For example, the following is not legal:
5.0 Production Elements

5.1 SAYAS

5.1.1 SUB (Substitute)

The SUB attribute defines substitute text to be spoken instead of the contained
text. For example:
5.1.2 CLASS

When the CLASS attribute value is date, the contained text should be pronounced
as a date. For example:

Note that simply stating that something is a date does not always yield the desired
pronunciation. A SUB attribute may be required. For example, 4/3/97 is
ambiguous in:

It might be spoken as "April third nineteen ninety-seven" or as "March fourth
nineteen ninety-seven." It is unambiguous if a SUB attribute is used:
When the CLASS attribute value is literal, the letters, digits, and other
characters of the contained text should be spoken individually. In English, this
amounts to spelling the text out. This is useful for speaking many acronyms and for
speaking numbers as digits. For example:

<SAYAS CLASS="literal">JSML</SAYAS>

When the CLASS attribute value is number, the contained text should be
pronounced as a number. For example:
5.1.3 PHON (Phonetic Pronunciation)

The PHON attribute uses the International Phonetic Alphabet (IPA) character
subset of Unicode to define a sequence of sounds. IPA characters are represented
by codes from \u0250 to \u02AF, by modifiers from \u02B0 to \u02FF, by
diacritics from \u0300 to \u036F, and by certain Latin, Greek and symbol
characters from the range \u0000 to \u017F. Details of the Unicode IPA support
are provided in The Unicode Standard, Version 2.0 (The Unicode Consortium,
Addison-Wesley Developers Press, 1996).

The following examples are equivalent:

<SAYAS PHON="
5.1.4 Nesting

Elements cannot be nested within the contents of a SAYAS.
Illegal example:
5.2 EMP

The EMP element specifies that a range of text should be spoken with emphasis.
The LEVEL attribute's values are strong (for strong emphasis), moderate (for
some emphasis), none (for no emphasis), and reduced (for a reduction in
emphasis).

The EMP element can also be an empty element, where it specifies that the
immediately following text[3] is to be emphasized.
The following examples have the same effect as above:

5.3 BREAK

The BREAK element is an empty element that is used to mark phrasing boundaries
in the speech output. To indicate what type of break is desired, the element can
include a SIZE attribute or a MSECS attribute, but not both. A SIZE attribute
indicates a break that is relative to the characteristics of the current speech, and a
MSECS attribute indicates a pause for an absolute amount of time.
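
The "SIZE or MSECS, but not both" rule is easy to enforce when JSML is generated by a program. The sketch below is one possible approach in Python; the helper name break_tag is hypothetical, and it does not validate the SIZE value against the strings a synthesizer accepts.

```python
from typing import Optional


def break_tag(size: Optional[str] = None, msecs: Optional[int] = None) -> str:
    """Build an empty BREAK element with at most one of SIZE or MSECS."""
    if size is not None and msecs is not None:
        raise ValueError("BREAK accepts SIZE or MSECS, but not both")
    if size is not None:
        return '<BREAK SIZE="%s"/>' % size
    if msecs is not None:
        return '<BREAK MSECS="%d"/>' % msecs
    return "<BREAK/>"


print(break_tag(size="large"))
print(break_tag(msecs=300))
```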
Where possible, the break should be defined by a

5.4 PROS

The PROS element provides prosody control for JSML. Prosody is a collection of
features of speech that includes its timing, intonation and phrasing. Proper control
of prosody can improve the understandability and naturalness of speech. The
attributes of the PROS element are better viewed as being "hints" to the synthesizer. Most of the attributes of the
PROS tag accept numeric values. These values are floating point numbers of the
form 23, 10.8, or -0.55.
The VOL attribute can have values of the following forms:

Musically-inclined developers might think of pitch in semitones and octaves. A semitone rise in pitch is approximately +5.9% and a semitone drop is -5.6%. A two-semitone shift is +12.2% or -10.9%. A one-octave shift (12 semitones) is +100% or -50%, that is, doubling or halving pitch.[4]

While speaking a sentence, pitch moves up and down in natural speech to convey extra information about what is being said. The baseline pitch represents the normal minimum pitch of a sentence. The pitch range represents the amount of variation in pitch above the baseline. Setting the baseline pitch and pitch range can affect whether speech sounds monotonous (small range) or dynamic (large range).

Figure 4: Baseline Pitch and Pitch Range

Normal baseline pitch for a female voice is between 140Hz and 280Hz, with a pitch range of 80Hz or more. Male voices are typically lower: baseline of 70-140Hz, with a range of 40-80Hz.

Note that in all cases, relative values increase the portability of JSML across speaking voices and synthesizers. Relative settings allow users to apply the same JSML to different voices (e.g., male and female voices with very different pitch ranges) and to set a local preference for speaking rate. For example, some users set the speaking rate very high (300 words per minute or faster) so they can listen to a lot of text very quickly.

The <EMP/>ACME Trading Corporation, <PROS
RANGE="-30%">which supplies cartoon goods,</PROS> was
purchased yesterday for <PROS RATE="-20%" VOL="+15%">
$2,060,000 </PROS> by <EMP> Road Runner </EMP>
Incorporated.
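
The semitone percentages quoted in this section can be derived rather than memorized. The following Python sketch assumes equal temperament, where each semitone multiplies frequency by the twelfth root of two; the function name semitones_to_percent is hypothetical.

```python
def semitones_to_percent(semitones: float) -> float:
    """Relative pitch change, in percent, for a shift of the given semitones.

    Assumes equal temperament: one semitone scales frequency by 2 ** (1 / 12).
    """
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0


for n in (1, 2, 12, -1, -2, -12):
    print("%+d semitones: %+.1f%%" % (n, semitones_to_percent(n)))
```

Rounded to one decimal place, the results agree with the values given above for one-semitone, two-semitone, and one-octave shifts.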
6.0 Other Elements

6.1 MARKER

The MARKER element requests a notification from the speech synthesizer to the
application when the MARK is reached during the synthesizer's production of
audio for the text.
6.2 ENGINE

The ENGINE element allows applications to utilize a synthesizer's special
capabilities. The element provides information, the value of the DATA attribute, to
any speech synthesizers that are identified by the ENGID attribute. The
information is generally a command in an engine-specific syntax.
Less-than signs ("<") or ampersands ("&") in a DATA attribute must be escaped to avoid being mistaken for JSML (see Escaping/Quoting Text).
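
When ENGINE attributes are generated programmatically, the escaping can be delegated to a library. The sketch below uses quoteattr from Python's standard xml.sax.saxutils module, which escapes the special characters and adds the surrounding quotes. The helper name engine_attributes, the engine identifier, and the command string are all hypothetical.

```python
from xml.sax.saxutils import quoteattr


def engine_attributes(engid: str, data: str) -> str:
    """Return the ENGID and DATA attributes with their values escaped and quoted."""
    return "ENGID=%s DATA=%s" % (quoteattr(engid), quoteattr(data))


# "acme.synth" and the command text are made-up illustration values.
print(engine_attributes("acme.synth", "rate<fast & loud"))
```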
[2] Words with the same spelling but different pronunciations. For example, "I will read it." and "I have read it."

[3] The meaning of "immediately following text" is language dependent. English speech synthesizers will emphasize the next word.

[4] Percentages for 1 to 12 semitone pitch rises are +5.9%, +12.2%, +18.9%, +26.0%, +33.5%,
+41.4%, +50%, +58.7%, +68.2%, +78.2%, +88.8%, +100%.