Dia — Realistic Dialogue Text-to-Speech Model
Dia is a 1.6B parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. 19.2K+ GitHub stars. Supports multi-speaker dialogue, non-verbal sounds, and voice cloning. Apache 2.0 licensed.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install 86148916-edf9-4ed9-8348-205c9b535810 --target codex
Run after dry-run confirms the install plan.
What it is
Dia is a 1.6B parameter text-to-speech model built by Nari Labs. It generates realistic dialogue audio from text transcripts, supporting multiple speakers in a single generation, non-verbal sounds like laughter and sighs, and voice cloning from reference audio.
Dia targets podcast producers, content creators, and developers building conversational AI interfaces who need natural-sounding multi-speaker audio without recording studios.
How it saves time or tokens
Traditional multi-speaker TTS requires generating each speaker separately and splicing audio. Dia produces a complete multi-speaker conversation in a single pass. The transcript format uses speaker tags like [S1] and [S2], so you write one script and get one audio file with distinct voices.
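The speaker-tag format described above is plain text, so a script can be assembled programmatically before it is handed to the model. The sketch below is only an illustration: `build_transcript` is a hypothetical helper, not part of Dia, and it assumes speakers are numbered from 1 and written as `[S1]`, `[S2]`, and so on.

```python
from typing import Iterable, Tuple


def build_transcript(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, text) pairs into a speaker-tagged transcript.

    Hypothetical helper for illustration; Dia only sees the final string.
    """
    lines = []
    for speaker, text in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        # Collapse internal whitespace so each turn stays on one line.
        lines.append(f"[S{speaker}] {' '.join(text.split())}")
    return "\n".join(lines)


script = build_transcript([
    (1, "Have you tried the new model?"),
    (2, "(laughs) Yeah, it is surprisingly good."),
])
print(script)
```

Keeping the turns as data makes it easy to reorder or trim a conversation without hand-editing tags.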
Voice cloning with a short reference clip eliminates the need for voice actor sessions for consistent character voices.
How to use
- Install Dia from the GitHub repository: pip install git+https://github.com/nari-labs/dia.git
- Prepare a transcript with speaker tags and optional non-verbal annotations
- Run generation with the CLI or Python API
- For voice cloning, provide a reference audio file for each speaker
Example
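import soundfile as sf
from dia.model import Dia

# Load the pretrained checkpoint from Hugging Face, as in the nari-labs/dia README.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# One script, two speakers; (laughs) is a non-verbal annotation.
transcript = '''
[S1] Have you tried the new model?
[S2] (laughs) Yeah, it is surprisingly good.
[S1] Right? The voice quality is way better than I expected.
[S2] I might use it for my podcast intros.
'''

# generate() returns the waveform; write it to disk at 44.1 kHz with soundfile.
audio = model.generate(transcript)
sf.write('dialogue.wav', audio, 44100)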
The output is a single WAV file with two distinct speaker voices and the laugh rendered naturally.
Related on TokRepo
- Voice tools -- Text-to-speech and voice AI tools
- Content tools -- Content creation and production tools
Common pitfalls
- Voice cloning quality degrades with noisy or short reference clips; use clean audio of at least 10 seconds
- Non-verbal sound tags must match the model's vocabulary; unsupported tags are silently ignored
- Running the 1.6B model requires a GPU with at least 6 GB VRAM; CPU inference is possible but very slow
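The first two pitfalls can be caught before any GPU time is spent. The following is a minimal pre-flight sketch using only the standard library. The names are hypothetical, `KNOWN_TAGS` is an assumed subset built from the examples in this document rather than the model's real vocabulary, and the duration check only handles PCM WAV reference clips.

```python
import re
import wave

# Assumed subset of supported non-verbal tags, taken from this document's
# examples; consult the Dia README for the actual vocabulary.
KNOWN_TAGS = {"laughs", "sighs"}
MIN_REFERENCE_SECONDS = 10.0  # threshold suggested in the pitfalls above


def unknown_tags(transcript: str) -> list:
    """Return parenthesised annotations not in KNOWN_TAGS, in order of appearance."""
    found = re.findall(r"\(([^()]+)\)", transcript)
    return [tag for tag in found if tag.strip().lower() not in KNOWN_TAGS]


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_ok(path: str) -> bool:
    """True if the reference clip meets the minimum suggested length."""
    return wav_duration_seconds(path) >= MIN_REFERENCE_SECONDS


print(unknown_tags("[S1] (laughs) Hi. [S2] (yodels) Hello."))  # ['yodels']
```

Note that the tag check flags any parenthesised text, including ordinary asides, so treat its output as a prompt for review rather than a hard error.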
Frequently Asked Questions
How many speakers does Dia support?
Dia supports multi-speaker dialogue with distinct speaker tags. The typical use case is two speakers, but the model can handle additional speakers with diminishing voice distinction. For best results, stick to two or three speakers per generation.
What hardware does Dia need?
A GPU with at least 6 GB VRAM is recommended for real-time generation. The 1.6B parameter model runs on consumer GPUs like the RTX 3060 or above. CPU inference works but is significantly slower, making it impractical for long dialogues.
How does voice cloning work?
Provide a reference audio clip (at least 10 seconds of clean speech) for each speaker. Dia extracts voice characteristics and applies them during generation. The cloned voice maintains the speaking style and timbre of the reference across the entire dialogue.
What audio formats does Dia output?
Dia outputs WAV files by default. You can configure the sample rate (16 kHz, 22 kHz, or 44.1 kHz). For other formats like MP3 or FLAC, post-process the WAV output with ffmpeg or a Python audio library.
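As a rough sketch of the ffmpeg route mentioned above: the helper names below are hypothetical, and the code assumes an `ffmpeg` binary is available on the PATH. It builds the command separately from running it so the command can be inspected first.

```python
import subprocess
from pathlib import Path


def ffmpeg_convert_cmd(wav_path: str, target_ext: str) -> list:
    """Build an ffmpeg command that converts a WAV file to another format.

    ffmpeg infers the output format from the file extension (e.g. .mp3, .flac).
    """
    src = Path(wav_path)
    dst = src.with_suffix("." + target_ext.lstrip("."))
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]


def convert(wav_path: str, target_ext: str) -> None:
    # Raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(ffmpeg_convert_cmd(wav_path, target_ext), check=True)


print(ffmpeg_convert_cmd("dialogue.wav", "mp3"))
# ['ffmpeg', '-y', '-i', 'dialogue.wav', 'dialogue.mp3']
```

Calling `convert("dialogue.wav", "flac")` would then write `dialogue.flac` next to the original file.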
Is Dia open source?
Yes. Dia is released under the Apache 2.0 license. The model weights and code are available on GitHub. You can fine-tune the model on your own data for domain-specific voice quality.
Source & Thanks
Created by Nari Labs. Licensed under Apache 2.0. nari-labs/dia — 19,200+ GitHub stars