Skills · March 31, 2026 · 1 min read

Dia — Realistic Dialogue Text-to-Speech Model

Dia is a 1.6B parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. 19.2K+ GitHub stars. Supports multi-speaker dialogue, non-verbal sounds, and voice cloning. Apa

Script Depot · Community

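Agent-ready

Directly installable by agents

This asset is installable. The agent first selects the current runtime, reviews the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed

Agent entry point

Any MCP/CLI agent

Type

Skill

Install

Single

Trust

Trust level: Established

Entry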

Dia — Realistic Dialogue Text-to-Speech Model

Direct install command

npx -y tokrepo@latest install 86148916-edf9-4ed9-8348-205c9b535810 --target codex

Do a dry run first to confirm the install plan, then run this command.

TL;DR

Dia generates realistic dialogue audio from transcripts with multi-speaker and voice cloning support.

§01

What it is

Dia is a 1.6B parameter text-to-speech model built by Nari Labs. It generates realistic dialogue audio from text transcripts, supporting multiple speakers in a single generation, non-verbal sounds like laughter and sighs, and voice cloning from reference audio.

Dia targets podcast producers, content creators, and developers building conversational AI interfaces who need natural-sounding multi-speaker audio without recording studios.

§02

How it saves time or tokens

With a single-speaker TTS system, a multi-speaker scene usually means generating each speaker separately and splicing the clips together. Dia produces the complete conversation in a single pass. The transcript format uses speaker tags like [S1] and [S2], so you write one script and get one audio file with distinct voices.
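As a rough illustration of the transcript format, the sketch below assembles a tagged script from a list of turns. The helper name `build_transcript` and the (speaker number, line) input shape are assumptions made for this example; they are not part of Dia itself.

```python
from typing import Iterable, Tuple


def build_transcript(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, line) pairs into one [S1]/[S2]-tagged script."""
    parts = []
    for speaker, line in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        text = " ".join(line.split())  # collapse stray whitespace and newlines
        if not text:
            continue  # skip empty turns
        parts.append(f"[S{speaker}] {text}")
    return " ".join(parts)


script = build_transcript([
    (1, "Have you tried the new model?"),
    (2, "(laughs) Yeah, it is surprisingly good."),
])
print(script)
```

Keeping the turns as data makes it easy to reorder or regenerate a scene without hand-editing tags.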

Voice cloning with a short reference clip eliminates the need for voice actor sessions for consistent character voices.

§03

How to use

Install Dia from the GitHub repository: pip install git+https://github.com/nari-labs/dia.git
Prepare a transcript with speaker tags and optional non-verbal annotations
Run generation with the CLI or Python API
For voice cloning, provide a reference audio file for each speaker

§04

Example

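from dia.model import Dia
import soundfile as sf

# Load the pretrained checkpoint from the Hugging Face Hub.
# Call names follow the upstream README; verify them against your installed version.
model = Dia.from_pretrained('nari-labs/Dia-1.6B')

# One script for the whole conversation: [S1]/[S2] mark the speakers,
# and (laughs) is a non-verbal annotation.
transcript = (
    '[S1] Have you tried the new model? '
    '[S2] (laughs) Yeah, it is surprisingly good. '
    '[S1] Right? The voice quality is way better than I expected. '
    '[S2] I might use it for my podcast intros.'
)

# generate() returns the waveform as an array of audio samples
audio = model.generate(transcript)

# Write the waveform to a WAV file at 44.1 kHz
sf.write('dialogue.wav', audio, 44100)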

The output is a single WAV file with two distinct speaker voices and the laugh rendered naturally.

§05

Related on TokRepo

Voice tools -- Text-to-speech and voice AI tools
Content tools -- Content creation and production tools

§06

Common pitfalls

Voice cloning quality degrades with noisy or short reference clips; use clean audio of at least 10 seconds
Non-verbal sound tags must match the model's vocabulary; unsupported tags are silently ignored
Running the 1.6B model requires a GPU with at least 6 GB VRAM; CPU inference is possible but very slow
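Because unsupported tags fail silently, it can help to check a transcript before generating. The sketch below flags parenthesised tags that are missing from an allowlist. The `ALLOWED_TAGS` set is a placeholder for illustration, not the model's actual vocabulary, and the check will also flag ordinary parenthetical text.

```python
import re
from typing import List, Set

# Placeholder allowlist for illustration only; replace it with the tag
# list documented for the model version you are running.
ALLOWED_TAGS: Set[str] = {"laughs", "sighs"}


def unsupported_tags(transcript: str, allowed: Set[str] = ALLOWED_TAGS) -> List[str]:
    """Return parenthesised tags in the transcript that are not in `allowed`."""
    found = re.findall(r"\(([^()]+)\)", transcript)
    return [tag for tag in found if tag.strip().lower() not in allowed]


print(unsupported_tags("[S1] Hi (laughs) [S2] (yodels) Sure."))
```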

FAQ

How many speakers can Dia handle in one generation?

Dia supports multi-speaker dialogue with distinct speaker tags. The typical use case is two speakers, but the model can handle additional speakers with diminishing voice distinction. For best results, stick to two or three speakers per generation.

Does Dia require a GPU?

A GPU with at least 6 GB VRAM is recommended for real-time generation. The 1.6B parameter model runs on consumer GPUs like the RTX 3060 or above. CPU inference works but is significantly slower, making it impractical for long dialogues.

How does voice cloning work in Dia?

Provide a reference audio clip (at least 10 seconds of clean speech) for each speaker. Dia extracts voice characteristics and applies them during generation. The cloned voice maintains the speaking style and timbre of the reference across the entire dialogue.
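A quick pre-flight check on clip length can catch reference audio that is too short. This is a minimal sketch using Python's standard `wave` module, so it only reads uncompressed PCM WAV files; the function names are made up for this example, and the 10-second threshold is the figure quoted above. It says nothing about noise.

```python
import wave

MIN_SECONDS = 10.0  # threshold taken from the guidance above


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """True if the clip meets the minimum reference length."""
    return wav_duration_seconds(path) >= min_seconds
```

Call `is_long_enough("speaker1.wav")` (a hypothetical file name) on each reference clip before starting a long generation.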

What audio formats does Dia output?

Dia outputs WAV files by default. You can configure the sample rate (16 kHz, 22 kHz, or 44.1 kHz). For other formats like MP3 or FLAC, post-process the WAV output with ffmpeg or a Python audio library.
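The ffmpeg route can be scripted. The sketch below assumes ffmpeg is installed and on your PATH; the helper names are invented for this example. It relies on ffmpeg picking the output codec from the file extension.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_command(wav_path: str, out_path: str) -> List[str]:
    """Build an ffmpeg call; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", wav_path, out_path]


def convert(wav_path: str, out_format: str = "mp3") -> Path:
    """Convert a WAV file to another format next to the original."""
    out_path = Path(wav_path).with_suffix(f".{out_format}")
    subprocess.run(ffmpeg_command(wav_path, str(out_path)), check=True)
    return out_path
```

For example, `convert("dialogue.wav", "flac")` would write `dialogue.flac` alongside the original, raising an error if ffmpeg fails.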

Is Dia open source?

Yes. Dia is released under the Apache 2.0 license. The model weights and code are available on GitHub. You can fine-tune the model on your own data for domain-specific voice quality.

Source and credits

Created by Nari Labs. Licensed under Apache 2.0. nari-labs/dia — 19,200+ GitHub stars

Related assets

ElevenLabs Python SDK — AI Text-to-Speech

PlantUML — Generate UML Diagrams from Plain Text

Rasa — Open Source Conversational AI Framework

ChatTTS — Expressive Text-to-Speech for Dialogue