Speech Synthesis Markup Language

Controlling pauses, pitch, and other characteristics of your Replica voice

Replica’s AI voices pause automatically at normal punctuation and vary pitch and volume to reproduce the characteristics and pacing of a real voice.

In some cases, you may want additional control over pauses and other vocal characteristics of the voice. For example, you may want to make a voice line sound more dramatic, which you could do by lowering the speaking rate and adding extra pauses to increase the silence between parts of the sentence. Replica supports these controls through Speech Synthesis Markup Language (SSML).

If you want to apply a pitch, pace, or volume change to the entire generation, the Global Controls may be better suited to your task.

Speech Synthesis Markup Language (SSML)

SSML provides a standard way to mark up plain text with tags, which allow more control over how natural synthetic speech is generated. The following example shows plain text marked up with SSML.

<speak>

    Control the pacing,<break time="500ms"/> of your Replica voice.

    <prosody pitch="+2st">Change its pitch!</prosody>

    <prosody volume="loud">Or make it louder.</prosody>

</speak>

How to use SSML in your request

Using the supported SSML tags, build the text for the Replica voice to synthesize. Make sure to wrap the text in speak tags, as these are required. Once your SSML input is ready, send it in the text parameter of the request, just like any other text.
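As a rough illustration of the step above, the following Python sketch wraps plain text in the required speak tags and places it in a payload. Only the text parameter comes from this page. The helper name `build_ssml`, the shape of the payload, and the choice to escape XML special characters are assumptions made for the example, not part of Replica's documented API.

```python
import json
from xml.sax.saxutils import escape


def build_ssml(text: str) -> str:
    """Wrap plain text in the required <speak> root element.

    Escaping &, < and > is an assumption based on SSML being XML.
    """
    return f"<speak>{escape(text)}</speak>"


# Hypothetical payload: only the "text" parameter is named in the documentation.
payload = {"text": build_ssml("Rock & roll, forever.")}
print(json.dumps(payload))
```

Any other request fields your integration needs would be added to the payload alongside text.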

Supported SSML tags

Replica supports a subset of the SSML tags defined in the World Wide Web Consortium's SSML specification Version 1.1. The W3C specification may be helpful for additional context and examples.

speak

This is the root element of the SSML input.

It is required and must enclose the entire SSML input. Nothing should appear outside this tag, and inside it there should only be text or other Replica-supported tags.
Only one speak tag should be present per request.

Example:

<speak>

    Synthesize speech with SSML in your response.

</speak>
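The rules for the speak tag can be checked locally before sending a request. The sketch below is one possible pre-flight check using Python's standard XML parser. The function name `check_ssml` is hypothetical, and the set of supported tags is taken from the tags described on this page, so treat it as an assumption rather than an authoritative list.

```python
import xml.etree.ElementTree as ET

# Assumption: the tags described on this page are the full supported set.
SUPPORTED_TAGS = {"speak", "break", "prosody", "say-as"}


def check_ssml(ssml: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        # Text or elements outside the single root make the parse fail.
        root = ET.fromstring(ssml.strip())
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for element in root.iter():
        if element is not root and element.tag == "speak":
            problems.append("more than one <speak> element")
        elif element.tag not in SUPPORTED_TAGS:
            problems.append(f"unsupported tag <{element.tag}>")
    return problems


print(check_ssml("<speak>Synthesize speech with SSML.</speak>"))
```

This only checks structure. It does not validate attribute values, and the service itself remains the final authority on what it accepts.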

break

This is an optional tag used to control the pacing of synthesized speech by adding silences or pauses. The Replica voice will automatically synthesize pauses to create natural pacing when no break tags are given.

Attribute	Description	Examples
time	Specifies the duration of a pause in seconds (s) or milliseconds (ms). Cannot be negative, and can be up to 10 seconds (10s) or 10000 milliseconds (10000ms). Must include the unit ("s" or "ms") with the time value.	time="2s" time="500ms"
strength	`none`: No extra pause is added, though a pause may still be synthesized. `x-weak`: Behaves the same as `none`. `weak`: Adds a small pause, equivalent to a single comma (same behaviour as `medium`). `medium`: Adds a small pause, equivalent to a single comma. `strong`: Adds a large pause, equivalent to a full stop. `x-strong`: Adds a larger pause, equivalent to the end of a paragraph.	strength="medium"

Example:

<speak>

    Use breaks to add pauses to your Replica voice, 

    <break time="400ms"/>

    and adjust the pacing to fit your story.

</speak>
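The constraints on the time attribute (a unit is required, the value cannot be negative, and the maximum is 10 seconds) can be enforced when the tag is generated. The following Python sketch shows one way to do that. The helper name `break_tag` is hypothetical, and accepting decimal values such as "0.5s" is an assumption that this page does not confirm.

```python
import re

MAX_BREAK_MS = 10_000  # 10 seconds, the documented upper limit


def break_tag(time: str) -> str:
    """Build a self-closing <break/> tag after validating the time value."""
    # A leading minus sign or a missing unit fails to match.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s)", time)
    if match is None:
        raise ValueError(f"time must be a non-negative number followed by 's' or 'ms': {time!r}")
    value, unit = float(match.group(1)), match.group(2)
    millis = value * 1000 if unit == "s" else value
    if millis > MAX_BREAK_MS:
        raise ValueError(f"time must not exceed 10s (10000ms): {time!r}")
    return f'<break time="{time}"/>'


print(break_tag("400ms"))
```

Calling `break_tag("11s")` or `break_tag("500")` raises a ValueError instead of producing a tag the service would likely reject.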

prosody

This is an optional element used to modify the volume, rate, and pitch of synthesized speech.

Attribute	Description	Examples
volume	Set the volume with a predefined value: `silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud`. Or specify the volume as a decibel value, ±XdB, in the range -6dB to +6dB. The sign and the decibel unit "dB" are required.	volume="soft" volume="+3dB" volume="-0.5dB"
rate	Set the rate with a predefined value: `x-slow`, `slow`, `medium`, `fast`, `x-fast`. Or specify a percentage of the normal speed, X%. The percent sign "%" is required. 100% means no change; values above 100% increase the speed and values below 100% decrease it. The rate cannot be lower than 50% or higher than 150%.	rate="fast" rate="50%" rate="120%"
pitch	Set the pitch with a predefined value: `x-low`, `low`, `medium`, `high`, `x-high`. Or specify the number of semitones by which to raise or lower the pitch, ±Xst. The pitch cannot be lowered by more than 12 semitones (-12st) or raised by more than 12 semitones (+12st). The sign and the semitone unit "st" are required.	pitch="low" pitch="+6st" pitch="-3st"

Examples:

<speak>

    To make a voice sound scarier, you can:

    <prosody rate="slow">Slow the rate of your Replica voice.</prosody>

    <prosody pitch="-2st">Decrease its pitch!</prosody>

    <prosody volume="+6dB">And make it louder.</prosody>

</speak>

You can combine multiple prosody attributes within a single tag.

<speak>

    <prosody volume="+6dB" pitch="-8st">

        Decrease the pitch and make it louder.

    </prosody>

</speak>
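The value rules in the prosody table lend themselves to a small validator. The Python sketch below encodes the named values and numeric ranges listed above and builds a prosody tag from keyword arguments. The names `is_valid` and `prosody` are hypothetical, and allowing decimal values for rate and pitch is an assumption (the table only shows a decimal example for volume).

```python
import re
from xml.sax.saxutils import escape

NAMED = {
    "volume": {"silent", "x-soft", "soft", "medium", "loud", "x-loud"},
    "rate": {"x-slow", "slow", "medium", "fast", "x-fast"},
    "pitch": {"x-low", "low", "medium", "high", "x-high"},
}
# (pattern, minimum, maximum) for the numeric form of each attribute.
NUMERIC = {
    "volume": (r"([+-]\d+(?:\.\d+)?)dB", -6.0, 6.0),   # sign and "dB" required
    "rate": (r"(\d+(?:\.\d+)?)%", 50.0, 150.0),        # "%" required
    "pitch": (r"([+-]\d+(?:\.\d+)?)st", -12.0, 12.0),  # sign and "st" required
}


def is_valid(attribute: str, value: str) -> bool:
    """Check a value against the named options or the numeric range."""
    if value in NAMED[attribute]:
        return True
    pattern, low, high = NUMERIC[attribute]
    match = re.fullmatch(pattern, value)
    return match is not None and low <= float(match.group(1)) <= high


def prosody(text: str, **attributes: str) -> str:
    """Build a <prosody> element, rejecting unknown or out-of-range values."""
    for name, value in attributes.items():
        if name not in NAMED:
            raise ValueError(f"unknown prosody attribute: {name!r}")
        if not is_valid(name, value):
            raise ValueError(f"invalid {name} value: {value!r}")
    attrs = "".join(f' {name}="{value}"' for name, value in attributes.items())
    return f"<prosody{attrs}>{escape(text)}</prosody>"


print(prosody("Decrease the pitch and make it louder.", volume="+6dB", pitch="-8st"))
```

A value such as `pitch="2st"` is rejected by this sketch because the sign is missing, which matches the requirement stated in the table.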

say-as

Controls how numbers, dates, and potential acronyms are interpreted.

Attribute	Description	Examples
interpret-as	The way in which to interpret the enclosed text. Options are: `date`: Ensure dates like 11/11/24 are read as “11th of November 2024” or “November 11th 2024.” `currency`: Convert amounts like $10 into natural language (“ten dollars”). `number`: Read a number as a cardinal, as digits, or as a year. `spell-out`: Pronounce each letter individually. For example, you can choose whether WHO is pronounced as the word 'who' or spelled out as 'W H O', as in the World Health Organization.	interpret-as="date" interpret-as="spell-out"
format	Formatting information to control the interpretation. Not required if `interpret-as` is `currency` or `spell-out`. If `interpret-as` is `date`: a string with "d" (day), "m" (month), and "y" (year) in the desired pronunciation order. If `interpret-as` is `number`: `cardinal`: “2025” is read as “two thousand and twenty five”; `digits`: “2025” is read as “two zero two five”; `year`: “2025” is read as “twenty twenty five.”	format="dmy" format="mdy" format="digits"

Examples:

<speak>
  Thank you for calling. Your next appointment is 
  <say-as interpret-as="date" format="dmy">
    15/11/24.
  </say-as>
</speak>

<speak>
  Who are we talking about,
  <say-as interpret-as="spell-out">
    WHO
  </say-as>
  , you mean the World Health Organization?
</speak>
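The relationship between interpret-as and format can be captured in a small helper. The Python sketch below builds a say-as element and checks the format value against the options in the table above. The function name `say_as` is hypothetical. Treating format as mandatory for `date` and `number`, and allowing any non-repeating combination of "d", "m", and "y" for dates, are readings of the table rather than confirmed behaviour.

```python
from typing import Optional
from xml.sax.saxutils import escape

NUMBER_FORMATS = {"cardinal", "digits", "year"}


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Build a <say-as> element with a format suited to interpret_as."""
    if interpret_as in {"currency", "spell-out"}:
        # The format attribute is not required for these, so it is omitted.
        attrs = f'interpret-as="{interpret_as}"'
    elif interpret_as == "date":
        # Assumption: letters come from "d", "m", "y" with no repeats.
        if not fmt or set(fmt) - set("dmy") or len(set(fmt)) != len(fmt):
            raise ValueError(f"date format must combine 'd', 'm', 'y': {fmt!r}")
        attrs = f'interpret-as="date" format="{fmt}"'
    elif interpret_as == "number":
        if fmt not in NUMBER_FORMATS:
            raise ValueError(f"number format must be one of {sorted(NUMBER_FORMATS)}: {fmt!r}")
        attrs = f'interpret-as="number" format="{fmt}"'
    else:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("15/11/24", "date", "dmy"))
print(say_as("WHO", "spell-out"))
```

The result is a fragment, so it still needs to be placed inside the enclosing speak element before it is sent.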
