
Best AI Tools for Voice & Speech (2026)

Text-to-speech, speech-to-text, voice cloning, and real-time audio AI. From Whisper transcription to ElevenLabs-quality voice synthesis.

30 tools

Zonos — Multilingual TTS with Voice Cloning

Zonos is an open-weight TTS model trained on 200K+ hours of speech. 7.2K+ stars. Voice cloning, 5 languages, emotion control. Apache 2.0.

Script Depot 82 Scripts

Coqui TTS — Deep Learning Text-to-Speech Engine

Generate speech in 1100+ languages with voice cloning. XTTS v2 streams with under 200ms latency. 44K+ GitHub stars.

TokRepo Curated 81 Scripts

Video AI Toolkit — Complete Collection

Curated video AI tools: Remotion (programmatic video), Manim (math animation), MoviePy (editing), Whisper (speech-to-text), ElevenLabs (voiceover). Build automated video pipelines.

Skill Factory 68 Skills

F5-TTS — Flow Matching Text-to-Speech

F5-TTS is a diffusion transformer TTS system with flow matching. 14.3K+ GitHub stars. Multi-speaker, voice chat, Gradio UI, CLI inference, 0.04 RTF on L20 GPU. MIT code.

Script Depot 55 Scripts

Dia — Realistic Dialogue Text-to-Speech Model

Dia is a 1.6B parameter TTS model by Nari Labs that generates realistic dialogue audio from transcripts. 19.2K+ GitHub stars. Supports multi-speaker dialogue, non-verbal sounds, and voice cloning. Apa

Script Depot 54 Scripts

Fish Speech — Multilingual TTS for 80+ Languages

Fish Speech is a state-of-the-art open-source TTS system supporting 80+ languages. 29K+ GitHub stars. 4B dual-AR model, voice cloning, emotional control with 15K+ tags, real-time inference.

AI Open Source 53 Configs

Remotion AI Voiceover Skill — ElevenLabs TTS

AI skill for adding ElevenLabs text-to-speech voiceover to Remotion videos. Auto-sizes composition duration to match generated audio.

Skill Factory 47 Skills

Together AI Audio TTS/STT Skill for Claude Code

Skill that teaches Claude Code Together AI's audio API. Covers text-to-speech (REST and WebSocket streaming), speech-to-text transcription, and realtime voice interaction.

Prompt Lab 45 Skills

whisper.cpp — Local Speech-to-Text in Pure C/C++

High-performance port of OpenAI Whisper in C/C++. No Python, no GPU required. Runs on CPU, Apple Silicon, CUDA, and even Raspberry Pi. Real-time transcription.

Script Depot 187 Scripts

ChatTTS — Expressive Text-to-Speech for Dialogue

Generate natural conversational speech with laughter, pauses, and emotion. Optimized for dialogue scenarios. 39K+ GitHub stars.

Script Depot 91 Scripts

WhisperX — 70x Faster Speech Recognition

WhisperX provides 70x realtime speech recognition with word-level timestamps and speaker diarization. 21K+ GitHub stars. Batched inference, under 8GB VRAM. BSD-2-Clause.

Script Depot 88 Scripts

Remotion Rule: Voiceover

Remotion skill rule: Adding AI-generated voiceover to Remotion compositions using TTS. Part of the official Remotion Agent Skill for programmatic video in React.

Skill Factory 75 Skills

Fonoster — Open-Source AI Telecom Platform

Open-source alternative to Twilio for building AI voice applications. Programmable voice with Answer, Say, Gather, Dial verbs. NodeJS SDK, OAuth2, Google Speech API. MIT, 7,800+ stars.

AI Open Source 68 Scripts

Moshi — Real-Time AI Voice Conversation Engine

Open-source real-time voice AI by Kyutai. Full-duplex speech conversation with 200ms latency, emotion recognition, and on-device processing. Apache 2.0 licensed.

AI Open Source 67 Configs

Bark — AI Text-to-Audio with Music & Effects

Bark is a transformer text-to-audio model by Suno that generates speech, music, and sound effects. 39.1K+ GitHub stars. 12+ languages, 100+ voice presets, non-speech audio. MIT licensed.

Script Depot 62 Scripts

Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

Script Depot 54 Scripts

LiveKit Agents — Build Real-Time Voice AI Agents

Framework for building real-time voice AI agents. STT, LLM, TTS pipeline with sub-second latency. Supports OpenAI, Anthropic, Deepgram, ElevenLabs. 9.9K+ stars.

Script Depot 52 Scripts

ElevenLabs Python SDK — AI Text-to-Speech

Official ElevenLabs Python SDK for AI voice generation. Create realistic voiceovers with 30+ languages, voice cloning, and streaming support.

Script Depot 48 CLI Tools

Kokoro — Lightweight 82M TTS in 9 Languages

Kokoro is an 82M parameter text-to-speech model delivering quality comparable to larger models. 6.2K+ GitHub stars. Supports English, Spanish, French, Japanese, Chinese, and more. Apache 2.0.

Script Depot 48 Scripts

Faster Whisper — 4x Faster Speech-to-Text

Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU/CPU, 8-bit quantization, word timestamps, VAD. MIT licensed.

Script Depot 47 Scripts

Remotion AI Skill — Programmatic Video in React

Official Remotion Agent Skill for Claude Code and Codex. 30+ rules covering animations, transitions, captions, FFmpeg, audio visualization, voiceover, 3D, and more.

TokRepo Curated 187 Skills

VoltAgent — TypeScript AI Agent Framework

Open-source TypeScript framework for building AI agents with built-in Memory, RAG, Guardrails, MCP, Voice, and Workflow support. Includes LLM observability console for debugging.

Script Depot 90 Scripts

OpenAI Realtime Agents — Voice AI Agent Patterns

Advanced agentic patterns for voice AI built on OpenAI Realtime API. Chat-supervisor and sequential handoff patterns with WebRTC streaming. MIT, 6,800+ stars.

Agent Toolkit 73 Scripts

Replicate — Run AI Models via Simple API Calls

Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.

AI Open Source 72 Scripts

Candle — Minimalist Machine Learning Framework for Rust

Candle is a Rust-native ML framework focused on inference performance, small binaries, and serverless deployment. It runs Llama, Whisper, Stable Diffusion, and other PyTorch models in pure Rust — no Python required.

AI Open Source 70 Configs

Cloudflare Workers AI — Serverless AI Inference

Run AI models at the edge with Cloudflare Workers. Text generation, image generation, speech-to-text, translation, embeddings — all serverless with global distribution.

Script Depot 61 Scripts

Mattermost — Open Source Slack Alternative for Team Collaboration

Mattermost is an open-source messaging platform for secure team collaboration. Channels, threads, voice/video calls, playbooks, and integrations — self-hosted Slack alternative.

AI Open Source 58 Configs

VideoCaptioner — AI Subtitle Pipeline

LLM-powered video subtitle tool: Whisper transcription + AI correction + 99-language translation + styled subtitle export. 13,800+ stars.

Script Depot 58 Scripts

Open WebUI — Self-Hosted AI Chat Interface

User-friendly, self-hosted AI chat interface. Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.

Script Depot 51 Scripts

LocalAI — Run Any AI Model Locally, No GPU

LocalAI is an open-source AI engine running LLMs, vision, voice, and image models locally. 44.6K+ GitHub stars. OpenAI/Anthropic-compatible API, 35+ backends, MCP, agents. MIT licensed.

AI Open Source 41 Configs

AI Voice Technology

Voice AI has reached a turning point: the best synthetic speech is now hard to tell apart from human narration, and Whisper-based transcription covers 99 languages.

Text-to-Speech (TTS): ElevenLabs, Coqui TTS, ChatTTS, Fish Speech, and Kokoro generate natural voiceovers. Depending on the tool, they offer emotional control, multilingual support, and voice cloning from just seconds of sample audio.

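Speech-to-Text (STT): OpenAI's Whisper family (whisper.cpp, WhisperX, Faster Whisper) dominates open-source transcription with near-human accuracy. Self-hosted options run entirely on local hardware for privacy-sensitive applications.

Real-Time Voice: Moshi enables full-duplex conversation at around 200ms latency, with natural turn-taking, interruption handling, and emotion recognition. LiveKit Agents chains STT, LLM, and TTS into voice agents with sub-second latency. Dia, by contrast, generates multi-speaker dialogue audio from transcripts rather than holding a live conversation.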

Voice Cloning & Synthesis: Modern tools can clone a voice from a sample as short as 15 seconds. F5-TTS and Zonos offer openly available voice cloning with quality rivaling commercial APIs. Useful for content creators, podcast producers, and accessibility applications.

Voice is the most natural interface — AI has finally made it programmable.

Frequently asked questions

What is the best AI text-to-speech tool?

For quality: ElevenLabs leads with the most natural-sounding voices and best emotional control. For self-hosting: Coqui TTS and Fish Speech come close in quality without API costs. For speed: ChatTTS and Kokoro generate speech in real time. For multilingual: Fish Speech covers 80+ languages and Coqui TTS 1100+.
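
Speed figures on this page, such as the 0.04 RTF quoted for F5-TTS or the 70x realtime quoted for WhisperX, can be compared with a little arithmetic. The sketch below assumes the common definition of real-time factor (processing time divided by audio duration). The function names are illustrative and do not come from any of the listed tools.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def is_real_time(rtf: float) -> bool:
    """A system keeps up with live audio when its RTF is below 1."""
    return rtf < 1.0


if __name__ == "__main__":
    # Hypothetical measurement: 0.4 s of compute for 10 s of audio.
    rtf = real_time_factor(0.4, 10.0)
    print(f"RTF: {rtf:.3f}")
    print(f"Speedup: {speedup_over_real_time(rtf):.1f}x real time")
    print(f"Keeps up with live audio: {is_real_time(rtf)}")
```

Under this definition, an RTF of 0.04 is roughly 25x faster than real time, and "70x realtime" corresponds to an RTF of about 0.014. Published figures depend on hardware and settings, so they are best treated as rough guides.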

Can AI clone my voice?

Yes. Modern voice cloning tools need as little as 15 seconds of sample audio. ElevenLabs offers cloud-based cloning, while F5-TTS and Zonos provide open-source alternatives you can run locally. Quality is remarkably high — cloned voices preserve accent, tone, and speaking style. Always get consent before cloning someone's voice.

What is the best open-source speech recognition?

OpenAI's Whisper (via whisper.cpp for local inference) is the gold standard. WhisperX adds speaker diarization (who said what) and word-level timestamps. Faster Whisper uses CTranslate2 for up to 4x faster inference with less memory. All three run locally without sending audio to external servers, which is critical for privacy-sensitive applications like medical or legal transcription.
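
Timestamped transcripts are what make these tools useful for subtitles. As a minimal sketch, the code below turns timestamped segments into SubRip (SRT) text. It assumes the transcript is already a list of `(start_seconds, end_seconds, text)` tuples. That shape is a simplification for illustration, not the output format of Whisper, WhisperX, or Faster Whisper, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed input shape: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    # Hypothetical transcript segments for demonstration.
    demo = [(0.0, 1.5, "Hello"), (1.5, 3.0, "and welcome.")]
    print(to_srt(demo))
```

In practice the segments would come from whichever transcription tool is in use, mapped into this tuple shape before rendering.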
