Key Features
- Multi-speaker dialogue: Use `[S1]` and `[S2]` tags to generate natural conversations
- Non-verbal sounds: Laughter, coughing, sighing, and throat-clearing built in
- Voice cloning: Condition on reference audio to match emotion and tone
- Single-pass generation: No multi-step pipeline, generates audio directly from text
- Fast inference: 2.1x real-time on RTX 4090, 4.4GB VRAM (bfloat16 with compilation)
- 1.6B parameters: Large enough for quality, small enough to run locally
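The speaker tags and non-verbal cues above are plain text markers in the input transcript. As a minimal sketch, here is one way to assemble such a transcript in Python; the `make_transcript` helper is hypothetical (not part of Dia), only the `[S1]`/`[S2]` tag format and parenthesized non-verbals like `(laughs)` come from the feature list above.

```python
# Hypothetical helper for building a Dia-style transcript.
# Only the [S1]/[S2] tags and (laughs)-style non-verbals are Dia conventions;
# the function itself is an illustrative assumption.
def make_transcript(turns):
    """Alternate [S1]/[S2] speaker tags over a list of utterances."""
    parts = []
    for i, utterance in enumerate(turns):
        tag = "[S1]" if i % 2 == 0 else "[S2]"
        parts.append(f"{tag} {utterance}")
    return " ".join(parts)

text = make_transcript([
    "Have you tried the new model?",
    "Yes! (laughs) The dialogue sounds surprisingly natural.",
])
print(text)
# → [S1] Have you tried the new model? [S2] Yes! (laughs) The dialogue sounds surprisingly natural.
```

The resulting string is what you would pass to the model as input text.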
FAQ
Q: What is Dia?
A: Dia is a 1.6B-parameter text-to-speech model (19.2K+ GitHub stars) that generates realistic multi-speaker dialogue audio from transcripts. It supports non-verbal sounds and voice cloning, and is Apache 2.0 licensed by Nari Labs.
Q: How do I install Dia?
A: Run `pip install git+https://github.com/nari-labs/dia.git`. Requires a GPU, PyTorch 2.0+, and CUDA 12.6.
Q: What languages does Dia support?
A: Currently English only. The model generates dialogue audio with natural prosody, pauses, and non-verbal sounds.