Voice & Speech

Best AI Tools for Voice and Speech (2026)

Text-to-speech, speech recognition, voice cloning, and real-time audio AI. From Whisper transcription to ElevenLabs-quality voice synthesis.

30 tools

Coqui TTS — Deep Learning Text-to-Speech Engine

Generate speech in 1100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 44K+ GitHub stars.

TokRepo Featured 366Scripts

Video AI Toolkit — Complete Collection

Curated video AI tools: Remotion (programmatic video), Manim (math animation), MoviePy (editing), Whisper (speech-to-text), ElevenLabs (voiceover). Build automated video pipelines.

Skill Factory 351Skills

Zonos — Multilingual TTS with Voice Cloning

Zonos is an open-weight TTS model trained on 200K+ hours of speech. 7.2K+ stars. Voice cloning, 5 languages, emotion control. Apache 2.0.

Script Depot 298Scripts

Fish Speech — Multilingual TTS for 80+ Languages

Fish Speech is a state-of-the-art open-source TTS system supporting 80+ languages. 29K+ GitHub stars. 4B dual-AR model, voice cloning, emotional control with 15K+ tags, real-time inference.

AI Open Source 293Skills

Remotion AI Voiceover Skill — ElevenLabs TTS

AI skill for adding ElevenLabs text-to-speech voiceover to Remotion videos. Auto-sizes composition duration to match generated audio.

ElevenLabs 292Skills

Together AI Audio TTS/STT Skill for Claude Code

Skill that teaches Claude Code Together AI's audio API. Covers text-to-speech (REST and WebSocket streaming), speech-to-text transcription, and realtime voice interaction.

Together AI 280Skills

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on L20 GPU. MIT code.

Script Depot 267Skills

Dia — Realistic Dialogue Text-to-Speech Model

Dia is a 1.6B parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. 19.2K+ GitHub stars. Supports multi-speaker dialogue, non-verbal sounds, and voice cloning. Apa

Script Depot 253Skills

GPT-SoVITS — Few-Shot Voice Cloning and Text-to-Speech

An open-source TTS system that can clone any voice from just one minute of audio data, combining GPT-style language modeling with VITS synthesis for natural speech generation.

AI Open Source 233Skills

Index TTS — Industrial Zero-Shot Text-to-Speech System

A controllable and efficient zero-shot text-to-speech system built for industrial use, supporting voice cloning and cross-lingual synthesis with high-quality output.

Script Depot 198Skills

CosyVoice — Multilingual Voice Generation with LLM-Based TTS

CosyVoice is an open-source text-to-speech system built on large language models by Alibaba's FunAudioLLM team. It supports 9 languages and 18+ Chinese dialects with zero-shot voice cloning, streaming synthesis, and fine-grained prosody control.

AI Open Source 183Skills

Groq Whisper — Sub-Second Speech-to-Text for Voice Agents

Whisper-large-v3 on Groq runs at 166× real time: a 60-second clip transcribes in under 400 ms. OpenAI-compatible audio.transcriptions endpoint for voice agents.

Groq 171Skills
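
The speed figures in these listings follow two related conventions: a real-time multiple (such as 166×) and a real-time factor, or RTF (such as 0.04). The sketch below converts between them, assuming the usual definitions: RTF is processing time divided by audio duration, and the multiple is its reciprocal. The function names are illustrative and do not belong to any listed tool.

```python
def processing_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Time needed to process a clip at a given speed-up over real time."""
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_seconds / realtime_multiple


def rtf_to_multiple(rtf: float) -> float:
    """Convert a real-time factor (processing time / audio time) to an 'N x' multiple."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# A 60-second clip at 166x real time (the figure quoted for Groq Whisper).
groq_ms = processing_seconds(60, 166) * 1000
print(f"60 s clip at 166x: {groq_ms:.0f} ms")  # 361 ms, consistent with "<400 ms"

# An RTF of 0.04 (the figure quoted for F5-TTS) as a real-time multiple.
print(f"RTF 0.04 = {rtf_to_multiple(0.04):.0f}x real time")  # 25x
```

These are pure compute figures. In practice, network round trips and audio upload time add to the total.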

Tortoise TTS — Multi-Voice Text-to-Speech Focused on Quality

A multi-voice TTS system trained with an emphasis on audio quality. Uses autoregressive and diffusion models to produce natural, expressive speech from text.

AI Open Source 170Skills

Deepgram Aura TTS — Text-to-Speech for Voice Agents

Deepgram Aura TTS produces natural English TTS with a 250 ms time to first audio (TTFA). Streaming WebSocket, 12 voices, tuned for conversational agents rather than narration.

Deepgram 165Scripts

SenseVoice — Multilingual Speech Understanding Model

SenseVoice is an open-source speech foundation model by Alibaba's FunAudioLLM team that performs automatic speech recognition, language identification, speech emotion recognition, and audio event detection in a single model. It supports 50+ languages and runs significantly faster than Whisper.

AI Open Source 153Skills

Piper — Fast Local Text-to-Speech Engine for 30+ Languages

Lightweight neural TTS system optimized for Raspberry Pi and edge devices with offline support and dozens of voice models.

AI Open Source 99Configs

OmniVoice Studio — Open-Source Voice Cloning and TTS Desktop App

OmniVoice Studio is a self-hosted desktop application for voice cloning, text-to-speech, dubbing, and dictation. It runs entirely on your local machine, providing a privacy-first alternative to cloud-based voice synthesis services.

Script Depot 47Scripts

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

Script Depot 1,782 Code Skills

Moshi — Real-Time AI Voice Conversation Engine

Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

AI Open Source 334Skills

ChatTTS — Expressive Text-to-Speech for Dialogue

Generate natural conversational speech with laughter, pauses, and emotion. Optimized for dialogue scenarios. 39K+ GitHub stars.

Script Depot 321Scripts

WhisperX — 70x Faster Speech Recognition

WhisperX provides 70x realtime speech recognition with word-level timestamps and speaker diarization. 21K+ GitHub stars. Batched inference, under 8GB VRAM. BSD-2-Clause.

Script Depot 298Skills

Remotion Rule: Voiceover

Remotion skill rule: Adding AI-generated voiceover to Remotion compositions using TTS. Part of the official Remotion Agent Skill for programmatic video in React.

Skill Factory 290Skills

Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

OpenAI 289Skills
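
For subtitle generation, timestamped transcription segments have to be written out in a subtitle format such as SRT. The sketch below shows one way to do that using only the standard library. It assumes each segment is a `(start_seconds, end_seconds, text)` tuple; that shape and the function names are illustrative, so adapt them to whatever your transcription tool actually returns.

```python
from typing import Iterable, Tuple

# Assumed segment shape: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments, standing in for real transcription output.
example = [(0.0, 2.5, "Hello and welcome."), (2.5, 5.04, "Let's get started.")]
print(to_srt(example))
```

Saving the returned string to a `.srt` file gives a subtitle track that most video players and editors can load.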

LiveKit Agents — Build Real-Time Voice AI Agents

Framework for building real-time voice AI agents. STT, LLM, TTS pipeline with sub-second latency. Supports OpenAI, Anthropic, Deepgram, ElevenLabs. 9.9K+ stars.

LiveKit 286Skills
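
A sub-second STT, LLM, TTS pipeline implies a latency budget shared across the stages. The sketch below is rough back-of-the-envelope arithmetic, not LiveKit code. It assumes the stages run strictly in sequence and borrows two figures quoted elsewhere on this page as stand-ins: under 400 ms for transcription (Groq Whisper) and 250 ms to first audio (Deepgram Aura). The `Stage` class and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def remaining_budget_ms(target_ms: float, stages: List[Stage]) -> float:
    """Budget left after subtracting the known stage latencies from the target."""
    return target_ms - sum(stage.latency_ms for stage in stages)


# Assumed figures taken from the listings on this page, treated as upper bounds.
known_stages = [Stage("stt", 400), Stage("tts_first_audio", 250)]

llm_budget = remaining_budget_ms(1000, known_stages)
print(f"Budget left for the LLM's first token: {llm_budget:.0f} ms")  # 350 ms
```

Streaming pipelines overlap these stages, so real budgets are usually less tight than this sequential estimate suggests. Network overhead, which is ignored here, pushes the other way.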

Kokoro — Lightweight 82M TTS in 9 Languages

Kokoro is an 82M parameter text-to-speech model delivering quality comparable to larger models. 6.2K+ GitHub stars. Supports English, Spanish, French, Japanese, Chinese, and more. Apache 2.0.

Script Depot 285Skills

Bark — AI Text-to-Audio with Music & Effects

Bark is a transformer text-to-audio model by Suno that generates speech, music, and sound effects. 39.1K+ GitHub stars. 12+ languages, 100+ voice presets, non-speech audio. MIT licensed.

Script Depot 279Skills

Faster Whisper — 4x Faster Speech-to-Text

Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU/CPU, 8-bit quantization, word timestamps, VAD. MIT licensed.

Script Depot 264Skills

ElevenLabs Python SDK — AI Text-to-Speech

Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.

ElevenLabs 256SkillsCLI Tools

Fonoster — Open-Source AI Telecom Platform

Open-source alternative to Twilio for building AI voice applications. Programmable voice with Answer, Say, Gather, Dial verbs. NodeJS SDK, OAuth2, Google Speech API. MIT, 7,800+ stars.

AI Open Source 251Skills

VibeVoice — Open-Source Frontier Voice AI by Microsoft

An open-source voice AI platform from Microsoft for speech synthesis, voice conversion, and real-time audio processing.

AI Open Source 217Skills

AI Voice Technology

Voice AI has reached a turning point: the best synthetic speech is now hard to distinguish from human narration, and real-time transcription covers roughly 100 languages.

Text-to-Speech (TTS): ElevenLabs, Coqui TTS, ChatTTS, Fish Speech, and Kokoro generate natural voiceovers, with varying degrees of emotional control, multilingual support, and voice cloning from short audio samples.

Speech-to-Text (STT): OpenAI's Whisper and the community projects built on it (whisper.cpp, WhisperX, Faster Whisper) dominate transcription with near-human accuracy. Self-hosted options run entirely on local hardware for privacy-sensitive applications.

Real-Time Voice: Moshi enables full-duplex spoken conversation with about 200 ms latency, natural turn-taking, and interruption handling. Dia, by contrast, generates realistic multi-speaker dialogue audio from transcripts rather than conversing live.

Voice Cloning & Synthesis: Some tools can clone a voice from a sample as short as 15 seconds. F5-TTS and Zonos offer open voice cloning with quality rivaling commercial APIs, which is useful for content creators, podcast producers, and accessibility applications.

Voice is the most natural interface — AI has finally made it programmable.

Frequently asked questions

What is the best AI text-to-speech tool?

For quality: ElevenLabs leads with the most natural voices and the best emotional control. For self-hosting: Coqui TTS and Fish Speech offer comparable quality with no API costs. For speed: ChatTTS and Kokoro generate speech in real time. For multilingual work: Whisper-based pipelines combined with multilingual TTS handle 100+ languages.

Can AI clone my voice?

Yes. Modern voice cloning tools need as little as 15 seconds of sample audio. ElevenLabs offers cloud-based cloning, while F5-TTS and Zonos provide open-source alternatives that run locally. Quality is remarkably high: cloned voices preserve accent, tone, and style. Always get consent before cloning someone's voice.

What is the best open-source speech recognition?

OpenAI's Whisper (via whisper.cpp for local inference) is the gold standard. WhisperX adds speaker diarization (who said what) and word-level timestamps. Faster Whisper uses CTranslate2 for up to a 4x speedup. All of them run locally without sending audio to external servers, which is critical for sensitive applications such as medical or legal transcription.
