Qwen3-TTS Integration

Run Alibaba's Qwen3-TTS locally for high-quality, multilingual text-to-speech. This guide covers setting up the OpenAI-compatible TTS server included with Libre WebUI.

Overview

Qwen3-TTS is an advanced text-to-speech system featuring:

  • 9 pre-built voices spanning English, Chinese, Japanese, and Korean
  • Support for 10 languages, including German, French, Spanish, Italian, Portuguese, and Russian
  • Voice cloning from 3-second audio samples
  • Voice design using natural language descriptions
  • Instruction control for emotion and prosody

The included server wraps Qwen3-TTS in an OpenAI-compatible API, allowing Libre WebUI to use it through the standard plugin system.

Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.12+ | 3.12 (not 3.14) |
| GPU VRAM | 4GB (0.6B models) | 8GB+ (1.7B models) |
| RAM | 8GB | 16GB+ |
| Disk | 5GB | 10GB |

Platform Support

| Platform | Backend | Notes |
| --- | --- | --- |
| NVIDIA GPU | CUDA | Best performance, bfloat16 support |
| Apple Silicon | MPS | Use 0.6B models for memory efficiency |
| CPU | PyTorch | Slower, use 0.6B models |

**Apple Silicon Users:** Use the customvoice-0.6b model variant on Mac to avoid memory pressure. The 1.7B models may cause system instability on machines with 16GB unified memory.

Quick Start

1. Install the Server

cd examples/qwen-tts-server

# Create virtual environment (Python 3.12 required)
python3.12 -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt

2. Start the Server

# NVIDIA GPU (recommended)
python server.py --model customvoice-1.7b

# Apple Silicon
python server.py --model customvoice-0.6b

# CPU (slower)
python server.py --model customvoice-0.6b

The server runs at http://localhost:8100 by default.

3. Configure Libre WebUI

The plugin is pre-configured in plugins/qwen-tts.json. Enable it in Settings → Plugins → Qwen3 TTS.

4. Test It

curl http://localhost:8100/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-tts", "input": "Hello, welcome to Libre WebUI!", "voice": "Ryan"}' \
--output speech.wav

Available Models

| Model | Size | Use Case |
| --- | --- | --- |
| customvoice-1.7b | ~3.5GB | Pre-built voices with instruction control |
| customvoice-0.6b | ~1.5GB | Lightweight variant for limited VRAM |
| voicedesign-1.7b | ~3.5GB | Create voices from text descriptions |
| base-1.7b | ~3.5GB | Voice cloning from 3-second samples |
| base-0.6b | ~1.5GB | Lightweight voice cloning |

Voices

Pre-Built Voices (CustomVoice Models)

| Voice | Language | Description |
| --- | --- | --- |
| Ryan | English | Male, clear and natural |
| Aiden | English | Male, warm tone |
| Vivian | Chinese | Female, professional |
| Serena | Chinese | Female, friendly |
| Uncle_Fu | Chinese | Male, mature |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

OpenAI Voice Aliases

For compatibility with OpenAI TTS clients, the server maps OpenAI voice names:

| OpenAI Voice | Maps To |
| --- | --- |
| alloy | Ryan |
| echo | Aiden |
| fable | Vivian |
| onyx | Uncle_Fu |
| nova | Serena |
| shimmer | Ono_Anna |
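The alias resolution amounts to a simple lookup. A minimal sketch of that behavior (the mapping comes from the table above; the function name is hypothetical, not the server's actual code):

```python
# Map OpenAI voice names to Qwen3-TTS voices (from the table above).
OPENAI_VOICE_ALIASES = {
    "alloy": "Ryan",
    "echo": "Aiden",
    "fable": "Vivian",
    "onyx": "Uncle_Fu",
    "nova": "Serena",
    "shimmer": "Ono_Anna",
}

def resolve_voice(name: str) -> str:
    """Return the Qwen3-TTS voice for an OpenAI alias, or the name unchanged."""
    return OPENAI_VOICE_ALIASES.get(name.lower(), name)
```

Native voice names like "Vivian" pass through unchanged, so either naming scheme works in the same client.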

API Reference

Speech Generation

Endpoint: POST /v1/audio/speech

{
  "model": "qwen3-tts",
  "input": "Text to convert to speech",
  "voice": "Ryan",
  "response_format": "wav",
  "instruct": "Speak with enthusiasm",
  "language": "English"
}

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | qwen3-tts | Model identifier |
| input | string | (required) | Text to synthesize (max 10,000 chars) |
| voice | string | ryan | Voice name (see table above) |
| response_format | string | wav | Audio format (only wav supported) |
| instruct | string | "" | Emotion/prosody instruction |
| language | string | auto-detect | Override language detection |
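For Python clients, the request body can be assembled like this (a sketch; the helper name and the client-side length check are illustrative, not part of the server):

```python
import json

def build_speech_request(text, voice="Ryan", instruct="", language=None):
    """Build the JSON body for POST /v1/audio/speech."""
    if len(text) > 10_000:
        raise ValueError("input exceeds the 10,000 character limit")
    payload = {
        "model": "qwen3-tts",
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    # Optional parameters are omitted when unset so the server applies defaults.
    if instruct:
        payload["instruct"] = instruct
    if language:
        payload["language"] = language
    return json.dumps(payload)

# POST this body to http://localhost:8100/v1/audio/speech with
# Content-Type: application/json and save the response bytes as a .wav file.
```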

Response: Audio file (audio/wav)

Voice Design

Endpoint: POST /v1/audio/voice-design

Create custom voices from natural language descriptions.

{
  "model": "qwen3-tts-voicedesign",
  "input": "Welcome to our service.",
  "voice_description": "A warm, friendly female voice with a slight British accent",
  "language": "English"
}

**Note:** Requires the voicedesign-1.7b model to be loaded.

Voice Cloning

Endpoint: POST /v1/audio/voice-clone

Clone a voice from a 3+ second audio sample.

curl -X POST http://localhost:8100/v1/audio/voice-clone \
-F "input=Hello, this is my cloned voice." \
-F "reference_audio=@reference.wav" \
-F "reference_text=This is what was said in the reference." \
--output cloned.wav

| Parameter | Type | Description |
| --- | --- | --- |
| input | string | Text to synthesize |
| reference_audio | file | 3+ second audio sample |
| reference_text | string | Transcript of reference audio |

**Note:** Requires the base-1.7b or base-0.6b model to be loaded.

List Voices

Endpoint: GET /v1/voices

{
  "voices": [
    {"id": "ryan", "name": "Ryan", "language": "English"},
    {"id": "aiden", "name": "Aiden", "language": "English"},
    ...
  ]
}

Health Check

Endpoint: GET /health

{"status": "healthy", "model_loaded": true}

Server Configuration

python server.py [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --host | 0.0.0.0 | Host to bind to |
| --port | 8100 | Port to bind to |
| --model | customvoice-1.7b | Model variant to load |

Network Access

To access the server from other machines on your network:

# Start server on all interfaces
python server.py --host 0.0.0.0 --port 8100

# Access from another machine
curl http://192.168.1.100:8100/v1/audio/speech ...

Update the plugin endpoint in plugins/qwen-tts.json:

{
  "endpoint": "http://192.168.1.100:8100/v1/audio/speech",
  "capabilities": {
    "tts": {
      "endpoint": "http://192.168.1.100:8100/v1/audio/speech"
    }
  }
}

Production Features

Text Sanitization

The server automatically sanitizes input text to prevent model hangs:

  • Removes emojis and symbols
  • Strips markdown formatting (*bold*, _italic_, etc.)
  • Collapses repeated characters (e.g., FUUUUU → FUU)
  • Removes stage directions (*(action)*, (whispers))
  • Normalizes whitespace
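The rules above can be approximated with a few regular expressions. This is a rough sketch of the behavior, not the actual implementation in server.py (the emoji ranges and stage-direction keywords here are assumptions):

```python
import re

def sanitize(text: str) -> str:
    """Approximate the server's text sanitization rules."""
    # Remove common emoji and symbol ranges (a subset, for illustration).
    text = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", text)
    # Remove stage directions like *(action)* or (whispers).
    text = re.sub(r"\*\(.*?\)\*|\((?:whispers|laughs|sighs)\)", "", text)
    # Strip markdown markers (*bold*, _italic_, `code`, # headings).
    text = re.sub(r"[*_`#]+", "", text)
    # Collapse 3+ repeats of the same character down to 2.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Order matters: stage directions are removed before markdown markers, since *(action)* would otherwise lose its asterisks and survive as plain text.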

Text Chunking

Long text is automatically split at sentence boundaries:

  • Maximum 500 characters per chunk
  • 30-second timeout per chunk
  • Failed chunks are skipped, remaining chunks continue
  • Chunks are concatenated into single audio response

This prevents timeouts on long AI responses while maintaining natural speech flow.
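The chunking described above can be sketched as follows (a simplification: the real server.py also enforces the 30-second per-chunk timeout, and this version does not split a single sentence that exceeds the limit):

```python
import re

def chunk_text(text: str, max_chunk_size: int = 500) -> list[str]:
    """Split text at sentence boundaries into chunks of at most max_chunk_size chars."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized independently and the resulting audio segments are concatenated into one response.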

Multi-GPU Setup

For systems with multiple GPUs, the server forces single-GPU execution to avoid tensor device mismatches:

device_map = {"": "cuda:0"}  # Uses first GPU only

To use a specific GPU:

CUDA_VISIBLE_DEVICES=1 python server.py --model customvoice-1.7b

Troubleshooting

Model Download Fails

The model downloads from Hugging Face on first run. If it fails:

# Set Hugging Face token for gated models
export HF_TOKEN=hf_...

# Or download manually
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Out of Memory (Apple Silicon)

RuntimeError: MPS backend out of memory

Use the smaller model variant:

python server.py --model customvoice-0.6b

CUDA Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

  1. Close other GPU applications
  2. Use the 0.6B model variant
  3. Reduce chunk size in server.py (max_chunk_size=300)

Server Times Out

If generation times out on long text:

  1. The server automatically chunks text and continues with remaining chunks
  2. Check server logs for which chunks timed out
  3. Consider shortening your input text

Audio Sounds Wrong

  • Repeated syllables: Usually caused by emojis or special characters. The sanitizer should handle this automatically.
  • Wrong language: Set the language parameter explicitly in the request.
  • Unnatural pauses: Text may be splitting at wrong boundaries. Check for unusual punctuation.

Plugin Configuration

The included plugin (plugins/qwen-tts.json):

{
  "id": "qwen-tts",
  "name": "Qwen3 TTS",
  "type": "tts",
  "endpoint": "http://localhost:8100/v1/audio/speech",
  "auth": {
    "header": "",
    "key_env": ""
  },
  "model_map": [
    "qwen3-tts",
    "qwen3-tts-customvoice",
    "qwen3-tts-voicedesign",
    "qwen3-tts-clone"
  ],
  "capabilities": {
    "tts": {
      "endpoint": "http://localhost:8100/v1/audio/speech",
      "model_map": [
        "qwen3-tts",
        "qwen3-tts-customvoice",
        "qwen3-tts-voicedesign",
        "qwen3-tts-clone"
      ],
      "config": {
        "voices": ["Ryan", "Aiden", "Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric", "Ono_Anna", "Sohee"],
        "default_voice": "Ryan",
        "formats": ["wav"],
        "default_format": "wav",
        "max_characters": 10000,
        "supports_streaming": false,
        "no_auth_required": true
      }
    }
  },
  "description": "Qwen3-TTS local TTS server (NVIDIA CUDA, Apple MPS, or CPU)",
  "documentation_url": "https://github.com/QwenLM/Qwen3-TTS"
}

Resources