Google: Gemini 3.1 Flash TTS Preview

google/gemini-3.1-flash-tts-preview

8K Context Window

4K Max Output

Normal

Gemini 3.1 Flash TTS Preview is a text-to-speech model from Google and a substantial generational step up from Gemini 2.5 Flash TTS. It takes text input and produces audio output across 70+ languages, nearly 3× the language coverage of its predecessor. The headline addition is a system of 200+ inline audio tags (e.g. `[whispers]`, `[laughs]`, `[excited]`) that let developers steer delivery, emotion, and pacing mid-sentence, alongside a "director's chair" workflow in Google AI Studio for defining per-character Audio Profiles and scene-level context. It supports up to two speakers with independent voice and style configuration per speaker, outputs PCM audio at 24 kHz / 16-bit mono, and automatically watermarks all output with SynthID. The context window is 32k tokens.
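Because the output is described as raw PCM at 24 kHz / 16-bit mono, most players will need it wrapped in a container before it can be opened. The sketch below shows one way to do that with Python's standard `wave` module. It assumes the PCM bytes are little-endian signed samples and that you have already obtained them from the API response; the function names `pcm_to_wav_bytes` and `pcm_duration_seconds` are illustrative and not part of any SDK.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24 kHz, per the model description
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw 16-bit mono PCM (assumed little-endian) in a WAV container."""
    if len(pcm) % SAMPLE_WIDTH_BYTES != 0:
        raise ValueError("PCM length must be a whole number of 16-bit samples")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at 24 kHz / 16-bit mono."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

At this format one second of audio is 48,000 bytes, so `pcm_duration_seconds` is also a quick sanity check that a response was decoded with the right parameters.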

Capabilities

Audio Generation, Speech Recognition

Technical Specs

Input Modality

Text

Output Modality

Audio

Arch

—

Pricing

Pay per use, no monthly fees

Billing Type	Unit	Price
Text Input	—	$1.0000/M tokens
Text Output	—	$20.0000/M tokens
Reasoning	—	$1.0000/M tokens
Audio Output	—	< $0.001/minute
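As a rough illustration of how the per-token rates above combine, the sketch below estimates the token-based part of a request's cost. It uses only the Text Input and Text Output rows; the Reasoning and Audio Output rows are left out, and how the provider counts tokens for a given request is not stated here, so treat the result as an approximation. The function name `estimate_token_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Rates copied from the pricing table (USD per million tokens).
TEXT_INPUT_USD_PER_M = Decimal("1.00")
TEXT_OUTPUT_USD_PER_M = Decimal("20.00")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_token_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the token-billed cost of one request in USD."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * TEXT_INPUT_USD_PER_M
        + Decimal(output_tokens) * TEXT_OUTPUT_USD_PER_M
    )
    return total / TOKENS_PER_MILLION
```

For example, a request with 1,000 input tokens and 500 output tokens works out to (1,000 × $1 + 500 × $20) / 1,000,000 = $0.011.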

Quick Start

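from openai import OpenAI

# Point the OpenAI SDK at the UnionToken endpoint instead of api.openai.com.
client = OpenAI(
    base_url="https://api.uniontoken.ai/v1",
    api_key="YOUR_UNIONTOKEN_API_KEY",  # replace with your own key
)

# Send a single user message to the TTS preview model.
response = client.chat.completions.create(
    model="google/gemini-3.1-flash-tts-preview",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)

# Print the message content of the first choice.
print(response.choices[0].message.content)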
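The description mentions inline audio tags and a limit of two speakers. The sketch below shows one way to assemble such a script as the `content` string of a request before sending it. The `Speaker: text` line layout, the helper names `build_script` and `find_tags`, and the speaker names are assumptions made for illustration; this page does not specify the exact transcript format the model expects.

```python
import re

MAX_SPEAKERS = 2  # the model description states up to two speakers
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")  # matches tags like [whispers]


def build_script(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, text) turns into one prompt string, one turn per line."""
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"at most {MAX_SPEAKERS} speakers are supported")
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def find_tags(script: str) -> list[str]:
    """Return the bracketed audio tags in the order they appear."""
    return TAG_PATTERN.findall(script)


script = build_script([
    ("Ana", "[excited] We shipped it!"),
    ("Ben", "[whispers] Finally. [laughs]"),
])
# The script can then be used as the user message content in the request.
messages = [{"role": "user", "content": script}]
```

Checking the speaker count locally keeps an over-long cast from reaching the API, and `find_tags` gives a quick way to review which tags a script uses.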