Dia — Realistic Dialogue Text-to-Speech Model
Dia is a 1.6B parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. 19.2K+ GitHub stars. Supports multi-speaker dialogue, non-verbal sounds, and voice cloning. Apache 2.0 licensed.
What it is
Dia is a 1.6B parameter text-to-speech model built by Nari Labs. It generates realistic dialogue audio from text transcripts, supporting multiple speakers in a single generation, non-verbal sounds like laughter and sighs, and voice cloning from reference audio.
Dia targets podcast producers, content creators, and developers building conversational AI interfaces who need natural-sounding multi-speaker audio without recording studios.
How it saves time or tokens
Traditional multi-speaker TTS requires generating each speaker separately and splicing audio. Dia produces a complete multi-speaker conversation in a single pass. The transcript format uses speaker tags like [S1] and [S2], so you write one script and get one audio file with distinct voices.
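Because the transcript is just tagged text, scripts can be assembled programmatically. A minimal sketch (the `build_transcript` helper is ours for illustration, not part of Dia's API; only the [S1]/[S2] tag format comes from Dia's documented transcript syntax):

```python
# Hypothetical helper: assemble a Dia-style transcript from (speaker, line) pairs.
# The [S1]/[S2] tag format is Dia's documented transcript syntax.

def build_transcript(turns):
    """turns: list of (speaker_number, text) tuples, e.g. (1, "Hello")."""
    return "\n".join(f"[S{speaker}] {text}" for speaker, text in turns)

script = build_transcript([
    (1, "Have you tried the new model?"),
    (2, "(laughs) Yeah, it is surprisingly good."),
])
print(script)
# [S1] Have you tried the new model?
# [S2] (laughs) Yeah, it is surprisingly good.
```

Keeping the script as structured data until the last moment makes it easy to swap lines, reassign speakers, or generate variants without hand-editing tagged text.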
Voice cloning with a short reference clip eliminates the need for voice actor sessions for consistent character voices.
How to use
- Install Dia: pip install dia-tts
- Prepare a transcript with speaker tags and optional non-verbal annotations
- Run generation with the CLI or Python API
- For voice cloning, provide a reference audio file for each speaker
Example
from dia import Dia

# Load the pretrained 1.6B checkpoint
model = Dia('nari-labs/dia-1.6b')

# Speaker tags mark turns; parenthesized annotations mark non-verbal sounds
transcript = '''
[S1] Have you tried the new model?
[S2] (laughs) Yeah, it is surprisingly good.
[S1] Right? The voice quality is way better than I expected.
[S2] I might use it for my podcast intros.
'''

# Generate one WAV file containing both voices
audio = model.generate(
    transcript,
    output_path='dialogue.wav',
    sample_rate=44100
)
The output is a single WAV file with two distinct speaker voices and the laugh rendered naturally.
Related on TokRepo
- Voice tools -- Text-to-speech and voice AI tools
- Content tools -- Content creation and production tools
Common pitfalls
- Voice cloning quality degrades with noisy or short reference clips; use clean audio of at least 10 seconds
- Non-verbal sound tags must match the model's vocabulary; unsupported tags are silently ignored
- Running the 1.6B model requires a GPU with at least 6 GB VRAM; CPU inference is possible but very slow
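The first pitfall above can be caught before generation with a pre-flight check on the reference clip. A sketch using only the standard-library `wave` module (the 10-second floor follows the pitfall above; the helper name is ours, not Dia's):

```python
# Sketch: pre-flight check that a voice-cloning reference clip is long enough.
# MIN_REFERENCE_SECONDS reflects the 10-second guidance above, not a Dia constant.
import wave

MIN_REFERENCE_SECONDS = 10.0

def clip_duration_seconds(path):
    """Return a WAV file's duration using only the stdlib wave module."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

# Demo: write a 12-second silent mono clip, then validate it.
with wave.open("reference.wav", "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 44100 * 12)

duration = clip_duration_seconds("reference.wav")
ok = duration >= MIN_REFERENCE_SECONDS
```

A duration check will not catch noisy audio, but it rejects the most common failure mode (a clip that is simply too short) before any GPU time is spent.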
Frequently Asked Questions
How many speakers does Dia support?
Dia supports multi-speaker dialogue with distinct speaker tags. The typical use case is two speakers, but the model can handle additional speakers with diminishing voice distinction. For best results, stick to two or three speakers per generation.
What hardware does Dia require?
A GPU with at least 6 GB VRAM is recommended for real-time generation. The 1.6B parameter model runs on consumer GPUs like the RTX 3060 or above. CPU inference works but is significantly slower, making it impractical for long dialogues.
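The 6 GB figure is roughly consistent with half-precision weights. A back-of-envelope estimate (assuming fp16 storage, which is our assumption rather than a documented Dia detail):

```python
# Back-of-envelope VRAM estimate, assuming fp16 (2 bytes/parameter) weights.
# fp16 storage is an assumption here, not a documented Dia detail.
params = 1.6e9
bytes_per_param = 2                         # fp16
weights_gb = params * bytes_per_param / 1e9  # ~3.2 GB for weights alone
# Activations, the KV cache, and the audio codec push the practical
# requirement toward the 6 GB cited above.
```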
How does voice cloning work?
Provide a reference audio clip (at least 10 seconds of clean speech) for each speaker. Dia extracts voice characteristics and applies them during generation. The cloned voice maintains the speaking style and timbre of the reference across the entire dialogue.
What audio formats does Dia output?
Dia outputs WAV files by default. You can configure the sample rate (16 kHz, 22 kHz, or 44.1 kHz). For other formats like MP3 or FLAC, post-process the WAV output with ffmpeg or a Python audio library.
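The ffmpeg post-processing step can be scripted from Python. A sketch that builds the conversion command and runs it only when ffmpeg and the input file are actually present (paths and bitrate are illustrative):

```python
# Sketch: convert Dia's WAV output to MP3 by shelling out to ffmpeg.
# Paths and the 192k bitrate are illustrative choices, not Dia defaults.
import os
import shutil
import subprocess

def wav_to_mp3_cmd(wav_path, mp3_path, bitrate="192k"):
    """Build the ffmpeg argument list without executing it."""
    return ["ffmpeg", "-y", "-i", wav_path,
            "-codec:a", "libmp3lame", "-b:a", bitrate, mp3_path]

cmd = wav_to_mp3_cmd("dialogue.wav", "dialogue.mp3")

# Run only when ffmpeg is on PATH and the WAV actually exists.
if shutil.which("ffmpeg") and os.path.exists("dialogue.wav"):
    subprocess.run(cmd, check=True)
```

Separating command construction from execution keeps the conversion testable and makes it easy to batch-convert a directory of generated dialogues.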
Is Dia free for commercial use?
Yes. Dia is released under the Apache 2.0 license. The model weights and code are available on GitHub. You can fine-tune the model on your own data for domain-specific voice quality.
Citations (3)
- Dia GitHub — Dia is a 1.6B parameter TTS model by Nari Labs with 19.2K+ GitHub stars
- Dia License — Apache 2.0 license for open-source TTS model
- arXiv — Text-to-speech synthesis using neural network architectures
Source & Thanks
Created by Nari Labs. Licensed under Apache 2.0. nari-labs/dia — 19,200+ GitHub stars
Related Assets
Flax — Neural Network Library for JAX
A high-performance neural network library built on JAX, providing a flexible module system used extensively across Google DeepMind and the JAX research community.
PyCaret — Low-Code Machine Learning in Python
An open-source AutoML library that wraps scikit-learn, XGBoost, LightGBM, CatBoost, and other ML libraries into a unified low-code interface for rapid experimentation.
DGL — Deep Graph Library for Scalable Graph Neural Networks
A high-performance framework for building graph neural networks on top of PyTorch, TensorFlow, or MXNet, designed for both research prototyping and production-scale graph learning.