Speech Synthesis Markup Language
Controlling pauses, pitch, and other characteristics of your Replica voice
Replica’s AI voices pause automatically at normal punctuation, and choose pitch and volume to reproduce the characteristics and pacing of a real voice.
In some cases, you may want additional control over pauses and other vocal characteristics. For example, you may want to make a voice line sound a bit more dramatic, which you could do by lowering the speaking rate and adding extra pauses to increase the silence between parts of the sentence. Replica supports these controls through the use of Speech Synthesis Markup Language (SSML).
If you want to apply a pitch, pace, or volume change over the entire generation, the Global Controls for these may be better suited to your task.
Speech Synthesis Markup Language (SSML)
SSML provides a standard way to mark up plain text with tags, which allow more control in generating natural synthetic speech. The following shows an example of how plain text can be marked up with SSML.
<speak>
Control the pacing,<break time="500ms"/> of your Replica voice.
<prosody pitch="2st">Change its pitch!</prosody>
<prosody volume="loud">Or make it louder.</prosody>
</speak>
How to use SSML in your request
Build the text for the Replica voice to synthesize using the supported SSML tags. Make sure to wrap the text in speak tags, as these are required. Once your SSML input is ready, send the request just like any other text, in the text parameter.
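As a minimal sketch of the steps above, the following Python wraps a line in speak tags and places the result under a text key. The function names `wrap_in_speak` and `build_payload` are hypothetical, and the payload shape is an assumption: only the text parameter comes from this page, and any other fields your request needs are not shown.

```python
import json
from xml.sax.saxutils import escape


def wrap_in_speak(body: str) -> str:
    """Wrap an SSML fragment in the required <speak> root element."""
    return f"<speak>{body}</speak>"


def build_payload(ssml: str) -> dict:
    # Hypothetical request body: only the "text" key is taken from this page.
    # Other fields your request may need (voice, credentials) are omitted.
    return {"text": ssml}


# Escape &, < and > in plain text so the SSML stays well-formed XML.
line = escape("Tom & Jerry, live tonight.")
ssml = wrap_in_speak(f'{line}<break time="500ms"/> Do not miss it.')
payload = build_payload(ssml)
print(json.dumps(payload))
```

Escaping the plain text before adding tags keeps characters such as `&` from breaking the markup.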
Supported SSML tags
Replica supports a subset of the SSML tags defined in the World Wide Web Consortium's SSML specification, Version 1.1. The W3C specification may be helpful for additional context and examples.
speak
This is the root element of the SSML input.
It is required and must enclose the whole input. There should be nothing outside of this tag, and inside it there should only be text or other Replica-supported tags.
Only one speak tag should be present per request.
Example:
<speak>
Synthesize speech with SSML in your response.
</speak>
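The rules for the speak tag can be checked locally before sending a request. The sketch below is one possible approach using Python's standard XML parser. The helper `check_ssml` is hypothetical, and the set of supported tags is taken only from the tags listed on this page.

```python
import xml.etree.ElementTree as ET

# Tags named on this page; adjust if the supported set changes.
SUPPORTED_TAGS = {"speak", "break", "prosody", "say-as"}


def check_ssml(ssml: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    try:
        # Text or a second element outside the root fails to parse as XML.
        root = ET.fromstring(ssml.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for element in root.iter():
        if element is not root and element.tag == "speak":
            problems.append("nested <speak> element")
        elif element.tag not in SUPPORTED_TAGS:
            problems.append(f"unsupported tag <{element.tag}>")
    return problems


print(check_ssml("<speak>Synthesize speech with SSML.</speak>"))
print(check_ssml("<speak>One</speak><speak>Two</speak>"))
```

This only checks structure. It does not confirm that the service accepts the attribute values inside each tag.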
break
This is an optional tag used to control the pacing of synthesized speech by adding silences or pauses. The Replica voice will automatically synthesize pauses to create natural pacing when no break tags are given.
Attribute | Description | Examples |
---|---|---|
time | Specifies the duration of a pause in seconds (s) or milliseconds (ms). Cannot be negative, and can be up to 10 seconds (10s) or 10000 milliseconds (10000ms). Must include the unit ("s" or "ms") with the time value. | time="2s" |
strength | | strength="medium" |
Example:
<speak>
Use breaks to add pauses to your Replica voice,
<break time="400ms"/>
and adjust the pacing to fit your story.
</speak>
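The constraints on the time attribute (a required unit, no negative values, at most 10 seconds) can be checked before building a break tag. This is a sketch under stated assumptions: `break_time_to_ms` and `is_valid_break_time` are hypothetical helpers, and accepting decimal values such as "0.5s" is an assumption not confirmed by this page.

```python
import re

MAX_BREAK_MS = 10_000  # 10 seconds, the upper limit stated in the table

# A non-negative number followed by a mandatory unit. Accepting decimals
# such as "0.5s" is an assumption, not something this page confirms.
_TIME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(ms|s)")


def break_time_to_ms(value: str) -> float:
    """Convert a break time such as "2s" or "400ms" to milliseconds."""
    match = _TIME_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"expected a non-negative number with 's' or 'ms': {value!r}")
    number, unit = match.groups()
    milliseconds = float(number) * (1000 if unit == "s" else 1)
    if milliseconds > MAX_BREAK_MS:
        raise ValueError(f"break time exceeds 10 seconds: {value!r}")
    return milliseconds


def is_valid_break_time(value: str) -> bool:
    """Return True if the value passes the checks in break_time_to_ms."""
    try:
        break_time_to_ms(value)
    except ValueError:
        return False
    return True


print(break_time_to_ms("400ms"))
print(is_valid_break_time("11s"))
```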
prosody
This is an optional element used to modify the volume, rate, and pitch of synthesized speech.
Attribute | Description | Examples |
---|---|---|
volume | Set the volume with a predefined value: | volume="soft" |
rate | Set the rate with a predefined value: | rate="fast" |
pitch | Set the pitch with a predefined value: | pitch="low" |
Examples:
<speak>
To make a voice sound scarier, you can:
<prosody rate="slow">Slow the rate of your Replica voice.</prosody>
<prosody pitch="-2st">Decrease its pitch!</prosody>
<prosody volume="+6dB">And make it louder.</prosody>
</speak>
You can combine multiple prosody attributes within a single tag.
<speak>
<prosody volume="+6dB" pitch="-8st">
Decrease the pitch and make it louder.
</prosody>
</speak>
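If you assemble SSML in code, a small helper can combine several prosody attributes in one tag. The following is a minimal sketch. The `prosody` function is hypothetical, and it only checks attribute names against the three listed in the table, not their values.

```python
from xml.sax.saxutils import escape, quoteattr

ALLOWED_PROSODY_ATTRIBUTES = {"volume", "rate", "pitch"}


def prosody(text: str, **attributes: str) -> str:
    """Wrap text in a <prosody> tag carrying the given attributes."""
    unknown = set(attributes) - ALLOWED_PROSODY_ATTRIBUTES
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    # quoteattr adds the surrounding quotes and escapes the value.
    attrs = "".join(f" {name}={quoteattr(value)}" for name, value in attributes.items())
    return f"<prosody{attrs}>{escape(text)}</prosody>"


fragment = prosody("Decrease the pitch and make it louder.", volume="+6dB", pitch="-8st")
print(f"<speak>{fragment}</speak>")
```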
say-as
Control how numbers, dates, and potential acronyms are interpreted.
Attribute | Description | Examples |
---|---|---|
interpret-as | The way in which to interpret the enclosed text. Options are: | interpret-as="date", interpret-as="spell-out" |
format | Formatting information to control the interpretation. | format="dmy", format="mdy", format="digits" |
Examples:
<speak>
Thank you for calling. Your next appointment is
<say-as interpret-as="date" format="dmy">
15/11/24.
</say-as>
</speak>
<speak>
Who are we talking about,
<say-as interpret-as="spell-out">
WHO
</say-as>
, you mean the World Health Organization?
</speak>
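When the date comes from your own data rather than hand-written text, it helps to format it to match the format attribute. The sketch below assumes the dmy format reads day/month/year digits as in the first example above. `say_as_date` is a hypothetical helper.

```python
from datetime import date


def say_as_date(day: date) -> str:
    """Render a date as dd/mm/yy digits and tag it to be read as a date."""
    digits = day.strftime("%d/%m/%y")  # day first, to match format="dmy"
    return f'<say-as interpret-as="date" format="dmy">{digits}</say-as>'


appointment = say_as_date(date(2024, 11, 15))
print(f"<speak>Your next appointment is {appointment}.</speak>")
```

Keeping the strftime pattern and the format attribute in one place reduces the chance that a day and month are read in the wrong order.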