Skills · March 31, 2026 · 1 min read

Dia — Realistic Dialogue Text-to-Speech Model

Dia is a 1.6B-parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. It has 19.2K+ GitHub stars and supports multi-speaker dialogue, non-verbal sounds, and voice cloning.

Agent ready

This asset can be installed directly by an agent. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Direct install command:
npx -y tokrepo@latest install 86148916-edf9-4ed9-8348-205c9b535810 --target codex

Do a dry run to confirm the install plan before running this command.

TL;DR
Dia generates realistic dialogue audio from transcripts with multi-speaker and voice cloning support.
§01

What it is

Dia is a 1.6B parameter text-to-speech model built by Nari Labs. It generates realistic dialogue audio from text transcripts, supporting multiple speakers in a single generation, non-verbal sounds like laughter and sighs, and voice cloning from reference audio.

Dia targets podcast producers, content creators, and developers building conversational AI interfaces who need natural-sounding multi-speaker audio without recording studios.

§02

How it saves time or tokens

Traditional multi-speaker TTS requires generating each speaker separately and splicing audio. Dia produces a complete multi-speaker conversation in a single pass. The transcript format uses speaker tags like [S1] and [S2], so you write one script and get one audio file with distinct voices.
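
As a rough illustration of the one-script workflow, the sketch below assembles a tagged transcript from a list of turns. The helper name `build_transcript` and the speaker-index convention are assumptions made for this example; they are not part of Dia.

```python
from typing import Iterable, Tuple


def build_transcript(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) pairs into one tagged script.

    Hypothetical helper: speaker 1 becomes [S1], speaker 2 becomes [S2], and so on.
    """
    lines = []
    for speaker, text in turns:
        if speaker < 1:
            raise ValueError("speaker indices start at 1")
        # Collapse stray whitespace so each turn stays on a single line.
        lines.append(f"[S{speaker}] {' '.join(text.split())}")
    return "\n".join(lines)


turns = [
    (1, "Have you tried the new model?"),
    (2, "(laughs) Yeah, it is surprisingly good."),
]
print(build_transcript(turns))
```

Keeping the turns as structured data makes it easy to reorder or trim a conversation before turning it into the single script the model consumes.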

Voice cloning with a short reference clip eliminates the need for voice actor sessions for consistent character voices.

§03

How to use

  1. Install Dia from the GitHub repository: pip install git+https://github.com/nari-labs/dia.git
  2. Prepare a transcript with speaker tags and optional non-verbal annotations
  3. Run generation with the CLI or Python API
  4. For voice cloning, provide a reference audio file for each speaker
§04

Example

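# Assumes the package from the nari-labs/dia repository is installed.
# Class and method names follow the upstream README and may differ between releases.
from dia.model import Dia

# Download the pretrained 1.6B checkpoint from the Hugging Face Hub.
model = Dia.from_pretrained('nari-labs/Dia-1.6B')

# One script for the whole conversation: [S1]/[S2] mark speaker turns,
# and parenthesised tags such as (laughs) request non-verbal sounds.
transcript = '''
[S1] Have you tried the new model?
[S2] (laughs) Yeah, it is surprisingly good.
[S1] Right? The voice quality is way better than I expected.
[S2] I might use it for my podcast intros.
'''

# Generate the full dialogue in a single pass, then write it to disk.
audio = model.generate(transcript)
model.save_audio('dialogue.wav', audio)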

The output is a single WAV file with two distinct speaker voices and the laugh rendered naturally.

§05

Common pitfalls

  • Voice cloning quality degrades with noisy or short reference clips; use clean audio of at least 10 seconds
  • Non-verbal sound tags must match the model's vocabulary; unsupported tags are silently ignored
  • Running the 1.6B model requires a GPU with at least 6 GB VRAM; CPU inference is possible but very slow
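
Because unsupported tags fail silently, it can help to check a script before generating. The sketch below is one possible pre-flight check. The `ASSUMED_TAGS` set is a placeholder vocabulary chosen for illustration, and `find_unknown_tags` is a hypothetical helper; substitute the tag list documented for the model version you run.

```python
import re

# Placeholder vocabulary for illustration only; replace it with the tag list
# documented for the model version you actually run.
ASSUMED_TAGS = {"laughs", "sighs", "coughs", "gasps", "clears throat"}

# Matches short parenthesised annotations such as "(laughs)".
TAG_PATTERN = re.compile(r"\(([a-z ]+)\)")


def find_unknown_tags(transcript: str, known=ASSUMED_TAGS) -> list:
    """Return parenthesised tags that are not in the known vocabulary, in order."""
    unknown = []
    for match in TAG_PATTERN.finditer(transcript.lower()):
        tag = match.group(1).strip()
        if tag not in known and tag not in unknown:
            unknown.append(tag)
    return unknown


script = "[S1] (laughs) That was close. [S2] (facepalms) I know."
print(find_unknown_tags(script))  # ['facepalms']
```

Note that this check flags any parenthesised phrase, so ordinary asides in the dialogue text would also be reported and need a manual look.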

FAQ

How many speakers can Dia handle in one generation?

Dia supports multi-speaker dialogue with distinct speaker tags. The typical use case is two speakers, but the model can handle additional speakers with diminishing voice distinction. For best results, stick to two or three speakers per generation.

Does Dia require a GPU?

A GPU with at least 6 GB VRAM is recommended for real-time generation. The 1.6B parameter model runs on consumer GPUs like the RTX 3060 or above. CPU inference works but is significantly slower, making it impractical for long dialogues.
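
A minimal sketch of choosing a device before loading the model, assuming PyTorch is the runtime. The function name `pick_device` and the 6 GB default (taken from the figure above) are illustrative assumptions; the check looks at total memory, not memory currently free.

```python
def pick_device(min_vram_gb: float = 6.0) -> str:
    """Return "cuda" when a GPU with enough total memory is visible, else "cpu"."""
    try:
        import torch  # optional dependency; if absent, fall back to CPU
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # total_memory is reported in bytes for the first visible CUDA device.
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda" if total_bytes / 1024**3 >= min_vram_gb else "cpu"


print(pick_device())
```

Cards sold as "6 GB" can report slightly less than 6 GiB of total memory, so a somewhat lower threshold may be needed in practice.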

How does voice cloning work in Dia?

Provide a reference audio clip (at least 10 seconds of clean speech) for each speaker. Dia extracts voice characteristics and applies them during generation. The cloned voice maintains the speaking style and timbre of the reference across the entire dialogue.
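
The length requirement can be checked before a clip is used. The sketch below reads the duration from a WAV header using only the Python standard library. The helper names are hypothetical, it handles uncompressed PCM WAV only, and it says nothing about how clean the recording is.

```python
import os
import tempfile
import wave


def clip_duration_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = 10.0) -> bool:
    """True when the clip meets the minimum length suggested above."""
    return clip_duration_seconds(path) >= min_seconds


# Demo: write 12 seconds of 16-bit mono silence, then measure it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000 * 12)

print(clip_duration_seconds(demo_path), is_long_enough(demo_path))  # 12.0 True
```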

What audio formats does Dia output?

Dia outputs WAV files by default. You can configure the sample rate (16 kHz, 22 kHz, or 44.1 kHz). For other formats like MP3 or FLAC, post-process the WAV output with ffmpeg or a Python audio library.
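
One way to do the ffmpeg post-processing step from Python is sketched below. It assumes ffmpeg is installed and on PATH; `ffmpeg_command` and `convert` are hypothetical helper names, and the output path is simply the input path with a new extension.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_command(wav_path: str, target_ext: str = "mp3") -> list:
    """Build an ffmpeg argument list that converts a WAV file to another format."""
    out_path = Path(wav_path).with_suffix("." + target_ext.lstrip("."))
    # -y overwrites an existing output; ffmpeg infers the format from the extension.
    return ["ffmpeg", "-y", "-i", str(wav_path), str(out_path)]


def convert(wav_path: str, target_ext: str = "mp3") -> None:
    """Run the conversion, failing early if ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_command(wav_path, target_ext), check=True)


print(ffmpeg_command("dialogue.wav", "flac"))
# ['ffmpeg', '-y', '-i', 'dialogue.wav', 'dialogue.flac']
```

Building the argument list separately from running it keeps the command easy to inspect or log before any file is overwritten.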

Is Dia open source?

Yes. Dia is released under the Apache 2.0 license. The model weights and code are available on GitHub. You can fine-tune the model on your own data for domain-specific voice quality.


Source and credits

Created by Nari Labs. Licensed under Apache 2.0. Repository: nari-labs/dia (19,200+ GitHub stars).
