[Archive copy mirrored from the URL: http://java.sun.com/products/java-media/speech/forDevelopers/JSML/JSML.html; see this canonical version of the document.]
Figure 1 : Text from an application is converted to audio output
Speech synthesizers are developed to produce natural-sounding speech output. However, natural human speech is a complex process, and the ability of speech synthesizers to mimic human speech is limited in many ways. For example, speech synthesizers do not "understand" what they say, so they do not always use the right style or phrasing and do not provide the same nuances as people.
The Java™ Speech Markup Language (JSML) allows applications to annotate text with additional information that can improve the quality and naturalness of synthesized speech. JSML documents can include structural information about paragraphs and sentences. JSML allows control of the production of synthesized speech, including the pronunciation of words and phrases, the emphasis of words (stressing or accenting), the placement of boundaries and pauses, and the control of pitch and speaking rate. Finally, JSML allows markers to be embedded in text and allows synthesizer-specific controls.
For the example in Figure 1, we might use JSML tags to indicate the start and end of the sentence and to emphasize the word "can":
<SENT>Computers <EMP>can</EMP> speak.</SENT>
Although JSML may be used for text in Japanese, Spanish, Tamil, Thai, English, and nearly all modern languages, a single JSML document should contain text for only a single language. Applications are therefore responsible for management and control of speech synthesizers if output of multiple languages is required.
JSML can be used by a wide range of applications to speak text from equally varied sources, including email, database information, web pages, and word processor documents. Figure 2 illustrates the basic steps in this process.
Figure 2 : JSML Process
The application is responsible for converting the source information to JSML text using any special knowledge it has about the content and format of the source information. For example, an email application can provide the ability to read email messages aloud by converting messages to JSML. This could involve the conversion of email header information (sender, subject, date, etc.) to a speakable form and might also involve special processing of text in the body of the message (for handling attachments, indented text, special abbreviations, etc.). Here is a sample of an email message converted to JSML:
<PARA>Message from <EMP>Alan Schwarz</EMP> about new
synthesis technology.
Arrived at <SAYAS CLASS="time">2pm</SAYAS> today.</PARA>
<PARA>I've attached a diagram showing the new way we do
speech synthesis.</PARA>
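The application-side conversion step described above can be sketched in Python. The function name and the message fields it takes are illustrative assumptions, not part of JSML or of any Java Speech API class:

```python
def email_to_jsml(sender, subject, arrival_time, body):
    """Convert email fields to a JSML string (hypothetical helper)."""
    # Header fields become a spoken summary paragraph.
    header = ('<PARA>Message from <EMP>%s</EMP> about %s. '
              'Arrived at <SAYAS CLASS="time">%s</SAYAS> today.</PARA>'
              % (sender, subject, arrival_time))
    # Each blank-line-separated block of the body becomes a PARA element.
    paras = ''.join('<PARA>%s</PARA>' % p
                    for p in body.split('\n\n') if p.strip())
    return header + paras
```

A real converter would also escape "<" and "&" in the message text before embedding it in JSML (see Escaping/Quoting Text).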
<SENT>Computers <EMP>can</EMP> speak.</SENT>
In this example, <SENT> indicates the start of a sentence element and </SENT> ends that sentence. Similarly, <EMP> and </EMP> mark a region to be emphasized. SENT and EMP are referred to as elements. JSML defines eight elements. The following sections describe elements and other JSML markup in more detail.
A container element has both a start-tag and a matching end-tag (for example, <SENT> and </SENT>). The text appearing between the start and end tags is the contained text as shown in Figure 3. An element's start-tag defines the type of element and may contain one or more attributes. All end-tags have the same name as their matching start-tag.
Figure 3 : Elements and Attributes
For example, an EMP element can mark words with a LEVEL attribute value of strong:
Ich bin ein <EMP LEVEL="strong">Berliner!</EMP>
Elements must be properly nested. The first of the following examples is legal; the second is not, because the PARA and EMP tags overlap:
<PARA> text with <EMP> more text </EMP> </PARA>
<PARA> text with <EMP> more text </PARA> </EMP>
A loud noise was heard, <BREAK SIZE="large"/>and the room
became quiet.
An empty element such as BREAK doesn't need an end-tag. Rather, the "/>" marks the end of the start-tag and of the element. Like container elements, empty elements can include attributes to provide additional information (for example, SIZE="large" above).
White space contained between an element's start- and end-tags, or not contained by any element, is passed to the speech synthesizer and may affect speech output.
<URL ORIG="http://acme.com">URL is ACME dot com</URL>
Here the ORIG attribute is used to preserve the original URL. The contained text will be spoken by the speech synthesizer, but the URL element tags will be ignored, because they are not defined in JSML and therefore are not known to the synthesizer.
This mechanism does allow speech synthesizers to extend the JSML element set by interpreting these additional elements specially. However, application developers should be aware that elements not specified in JSML are not portable across synthesizers and platforms.
</JSML>
Having a DTD allows the application to use the full power of XML for generating text, for example, entity references that can act as a shorthand for repetitive JSML, and then to use generic text processing tools for generating the JSML.
If the text to be spoken contains a less-than sign ("<", which is \u003C) or an ampersand ("&", which is \u0026), then the text needs to be escaped or quoted to prevent the possibility of some of the text being mistaken for JSML tags. There are several methods available:
Each less-than sign can be replaced by "&lt;", "&#60;", or "&#x3C;".
Each ampersand can be replaced by "&amp;", "&#38;", or "&#x26;".
Alternatively, a CDATA section can be placed around the entire text. A CDATA section has the general form of:
<![CDATA[the text that is being escaped]]>
A CDATA section can be used on text that is contained by an element, for example:
<EMP>Joe Doe <![CDATA[<joe.doe@acme.com>]]></EMP>
<![CDATA[X < Y is a boolean expression.]]>
Speech synthesizers process CDATA sections by stripping away the <![CDATA[ and ]]> markup and not parsing the CDATA section's contents for JSML.
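The two quoting methods can be sketched in Python. These helper names are illustrative, not part of any JSML toolkit:

```python
def escape_jsml(text):
    """Replace characters that could be mistaken for JSML markup."""
    # Ampersands must be handled first so that the entities introduced
    # for "<" are not themselves re-escaped.
    return text.replace('&', '&amp;').replace('<', '&lt;')

def cdata(text):
    """Wrap text in a CDATA section instead of escaping it."""
    # A CDATA section cannot itself contain the closing "]]>" sequence.
    if ']]>' in text:
        raise ValueError('text contains "]]>"; escape it instead')
    return '<![CDATA[%s]]>' % text
```

Escaping changes the characters in place; a CDATA section leaves the text untouched and relies on the synthesizer stripping the markup.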
A comment begins with the <!-- character sequence, ends with the --> character sequence, and may contain any text except the two-character sequence --.
Comments can be placed within text that is to be spoken (the comments will not be spoken).
How now brown <!-- This is an example comment --> cow.
The PARA element declares a range of text to be a paragraph. For example:
<PARA>This a short paragraph.</PARA><PARA>The subject has
changed, so this is a new paragraph.</PARA>
PARA elements do not contain other PARA elements; that is, PARA elements do not embed or nest. For example, the following is not legal:
<PARA>The raven spoke.
<PARA>I've come from Norway at the command of the king.
He sues for peace.</PARA>
A paragraph boundary can also be implied without an explicit PARA element.
The following fragments result in the same speech:
<PARA>She went to school and passed the tests.</PARA>
<PARA>When she returned home, the sun had set.</PARA>
and
<PARA>She went to school and passed the tests.</PARA>

When she returned home, the sun had set.
In the second fragment, the blank line implies a paragraph boundary. A blank line is a line containing only white space characters (spaces, \u0020, horizontal tabulations, \u0009, and ideographic spaces, \u3000) terminated by any of the following:
a carriage return and line feed pair (\u000D \u000A)
a line feed (\u000A)
a line separator (\u2028)
a paragraph separator (\u2029)
The SENT element declares a range of text to be a sentence. For example:
<SENT>C'est la vie.</SENT>
SENT elements do not contain other SENT elements; that is, SENT elements do not embed or nest. For example, the following is not legal:
<SENT>He said, <SENT>"I leave tomorrow."</SENT></SENT>
The SAYAS element defines how its contained text should be spoken. Its SUB attribute defines substitute text to be spoken instead of the contained text. For example:
<SAYAS SUB="I triple E">IEEE</SAYAS>
When the CLASS attribute value is date, the contained text should be pronounced as a date. For example:
<SAYAS CLASS="date">Jan. 1952</SAYAS>
<!--spoken as January nineteen fifty-two -->
In ambiguous cases a SUB attribute may be required. For example, 4/3/97 is ambiguous in:
<SAYAS CLASS="date">4/3/97</SAYAS>
so the SUB attribute is used:
<SAYAS SUB="March fourth nineteen ninety-seven">4/3/97
</SAYAS>
When the CLASS attribute value is literal, the letters, digits, and other characters of the contained text should be spoken individually. In English, this effectively spells out the text. This is useful for speaking many acronyms and for speaking numbers as digits. For example:
<SAYAS CLASS="literal">JSML</SAYAS>
<!--spoken as J S M L -->
<SAYAS CLASS="literal">12</SAYAS><!--spoken as one two-->
<SAYAS CLASS="literal">100%</SAYAS> <!--might be spoken
as one zero zero percent sign-->
When the CLASS attribute value is number, the contained text should be pronounced as a number. For example:
<SAYAS CLASS="number">12</SAYAS> <!--spoken as twelve-->
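A small generator for SAYAS markup can illustrate how the attributes combine. The helper itself is hypothetical; only the element and attribute names come from JSML:

```python
def sayas(text, cls=None, sub=None):
    """Build a SAYAS element with optional CLASS and SUB attributes."""
    attrs = ''
    if cls is not None:
        attrs += ' CLASS="%s"' % cls
    if sub is not None:
        attrs += ' SUB="%s"' % sub
    return '<SAYAS%s>%s</SAYAS>' % (attrs, text)
```

For instance, sayas('12', cls='number') produces the number example above, and sayas('IEEE', sub='I triple E') produces the earlier substitution example.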
The PHON attribute uses the International Phonetic Alphabet (IPA) character subset of Unicode to define a sequence of sounds. IPA characters are represented by codes from \u0250 to \u02AF, by modifiers from \u02B0 to \u02FF, by diacritics from \u0300 to \u036F, and by certain Latin, Greek, and symbol characters from the range \u0000 to \u017F. Details of the Unicode IPA support are provided in The Unicode Standard, Version 2.0 (The Unicode Consortium, Addison-Wesley Developers Press, 1996).
The following examples are equivalent:
<SAYAS PHON="foʊnɛtɪks"> phonetics </SAYAS>
<SAYAS PHON="\u0066\u006F\u028A\u006E\u025B
\u0074\u026A\u006B\u0073"> phonetics </SAYAS>
Other elements should not be placed inside a SAYAS element; instead, nest the SAYAS element inside them. Use the first of the following forms rather than the second:
<PROS RATE="-30%"><SAYAS SUB="sun dot com">sun.com
</SAYAS></PROS>
<SAYAS SUB="sun dot com"><PROS RATE="-30%">sun.com
</PROS></SAYAS>
The EMP element specifies that a range of text should be spoken with emphasis. The LEVEL attribute's values are strong (for strong emphasis), moderate (for some emphasis), none (for no emphasis), and reduced (for a reduction in emphasis).
The EMP element can also be an empty element, where it specifies that the immediately following text [3] is to be emphasized.
The BREAK element is an empty element that is used to mark phrasing boundaries in the speech output. To indicate what type of break is desired, the element can include a SIZE attribute or a MSECS attribute, but not both. A SIZE attribute indicates a break that is relative to the characteristics of the current speech, and a MSECS attribute indicates a pause for an absolute amount of time.
Where possible, the break should be defined by SIZE rather than MSECS, because, in most languages, breaks are produced by special movements in pitch, by timing changes, and often with a pause. Those factors are significantly affected by speaking context. For example, a 300-millisecond break in fast speech sounds more significant than it does in slow speech.
The PROS element provides prosody control for JSML. Prosody is a collection of features of speech that includes its timing, intonation, and phrasing. Proper control of prosody can improve the understandability and naturalness of speech. Prosody settings are better viewed as "hints" to the synthesizer. Most of the attributes of the PROS element accept numeric values. These values are floating point numbers of the form 23, 10.8, or -0.55.
The RATE attribute is defined in words per minute and can have values of the following forms:
<PROS RATE="150">text at 150 words per minute</PROS>
The VOL attribute, and the PITCH and RANGE attributes, accept values of similar forms: an absolute value, a relative change, or a percentage change.
Musically-inclined developers might think of pitch in semitones and octaves. A semitone rise in pitch is approximately +5.9% and a semitone drop is -5.6%. A two-semitone shift is +12.2% or -10.9%. A one-octave shift (12 semitones) is +100% or -50%, that is, doubling or halving pitch. [4]
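These percentages follow from the equal-tempered scale, where each semitone multiplies frequency by the twelfth root of two. A quick check in Python:

```python
def semitones_to_percent(n):
    """Percentage pitch change for a shift of n equal-tempered semitones."""
    # Each semitone multiplies frequency by 2**(1/12); an octave doubles it.
    return (2.0 ** (n / 12.0) - 1.0) * 100.0

# One semitone up is about +5.9%; one down is about -5.6%.
# Twelve semitones (an octave) is exactly +100% or -50%.
```

This reproduces the footnoted table of rises and drops to one decimal place.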
While speaking a sentence, pitch moves up and down in natural speech to convey extra information about what is being said. The baseline pitch represents the normal minimum pitch of a sentence. The pitch range represents the amount of variation in pitch above the baseline. Setting the baseline pitch and pitch range can affect whether speech sounds monotonous (small range) or dynamic (large range).
Figure 4 : Baseline Pitch and Pitch Range
Normal baseline pitch for a female voice is between 140Hz and 280Hz, with a pitch range of 80Hz or more. Male voices are typically lower: a baseline of 70-140Hz, with a range of 40-80Hz.
Note that in all cases, relative values increase the portability of JSML across speaking voices and synthesizers. Relative settings allow users to apply the same JSML to different voices (e.g., male and female voices with very different pitch ranges) and to set a local preference for speaking rate. For example, some users set the speaking rate very high (300 words per minute or faster) so they can listen to a lot of text very quickly.
The <EMP/>ACME Trading Corporation, <PROS
RANGE="-30%">which supplies cartoon goods,</PROS> was
purchased yesterday for <PROS RATE="-20%" VOL="+15%">
$2,060,000 </PROS> by <EMP> Road Runner </EMP>
Incorporated.
The MARKER element requests a notification from the speech synthesizer to the application when the point identified by the MARK attribute is reached during the synthesizer's production of audio for the text. For example:
Answer <MARKER MARK="yes_no_prompt"/> yes or no.
The ENGINE element allows applications to utilize a synthesizer's special capabilities. The element provides information, the value of the DATA attribute, to any speech synthesizers that are identified by the ENGID attribute. The information is generally a command in an engine-specific syntax.
ENGINE is a container element that is treated specially by a speech synthesizer that matches any engine specified in the ENGID. A matching engine should substitute the DATA for the text contained within the element. Other engines should ignore the DATA and instead process the contained text. For example, given the code
I am <ENGINE ENGID="Acme Voice" DATA="Mr. Acme"> someone
else</ENGINE>
a synthesizer matching "Acme Voice" speaks "I am Mr. Acme", while other synthesizers speak "I am someone else".
Less-than signs ("<") or ampersands ("&") in a DATA attribute must be escaped to avoid being mistaken for JSML (see Escaping/Quoting Text).
<ENGINE ENGID="Croaker 1.0" DATA="&lt;ribbit=1>"
MARK="frog start"> no frog sound </ENGINE>
2 Words with the same spelling but different pronunciations. For example, "I will read it." and "I have read it."
3 The meaning of "immediately following text" is language dependent. English speech synthesizers will emphasize the next word.
4 Percentages for 1 to 12 semitone pitch rises are +5.9%, +12.2%, +18.9%, +26.0%, +33.5%, +41.4%, +50%, +58.7%, +68.2%, +78.2%, +88.8%, +100%. Decreases are -5.6%, -10.9%, -15.9%, -20.6%, -25.1%, -29.3%, -33.3%, -37.0%, -40.5%, -43.9%, -47.0%, -50.0%.
Java Speech Markup Language Specification (HTML generated by hunt on August 29, 1997)