# Parler-TTS: High-Quality Text-to-Speech Training and Inference Library

> Parler-TTS by Hugging Face provides inference and training capabilities for high-quality text-to-speech models with natural prosody and controllable speaker attributes described in plain text.

## Install

Save the commands under Quick Use as a script file and run it.

## Quick Use

```bash
pip install parler-tts
python -c "
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Load the pretrained model and its tokenizer from the Hugging Face Hub
model = ParlerTTSForConditionalGeneration.from_pretrained('parler-tts/parler-tts-mini-v1')
tokenizer = AutoTokenizer.from_pretrained('parler-tts/parler-tts-mini-v1')
"
```

## Introduction

Parler-TTS is a text-to-speech library from Hugging Face that generates natural-sounding speech from text, with the voice controlled by a text description. Instead of selecting a voice by ID, you describe the desired voice characteristics in plain English, and the model produces matching audio output.

## What Parler-TTS Does

- Generates speech from text with controllable speaker attributes
- Accepts natural language voice descriptions (e.g., calm female, deep male)
- Provides both inference and training pipelines for TTS models
- Supports multiple model sizes from mini to large
- Integrates with the Hugging Face Transformers ecosystem

## Architecture Overview

Parler-TTS uses a conditional generation architecture that pairs a neural audio codec with a text-conditioned decoder. The model takes two text inputs: the speech content (the prompt) and a voice description. According to the upstream repository, the description is passed through a frozen text encoder and reaches the decoder via cross-attention, while the prompt is embedded and fed to the decoder directly. The decoder predicts audio tokens, which the codec's decoder (DAC rather than EnCodec in the released checkpoints) converts to waveform audio.

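
The two-input design described above can be sketched as follows. This is a minimal illustration, not the library's reference implementation: `VoiceSpec`, `synthesize`, and the output file name are hypothetical names introduced here, and the `parler_tts` calls mirror the usage shown in the upstream repository's README as best understood. It assumes `torch` and `soundfile` are installed alongside `parler-tts`.

```python
from dataclasses import dataclass


@dataclass
class VoiceSpec:
    """Hypothetical helper (not part of parler_tts) that builds a voice description."""
    speaker: str = "A female speaker"
    tone: str = "calm"
    pace: str = "moderate"
    quality: str = "very clear"

    def describe(self) -> str:
        # The model only ever sees this plain-English sentence.
        return (
            f"{self.speaker} with a {self.tone} voice speaks at a "
            f"{self.pace} pace. The recording is {self.quality}."
        )


def synthesize(prompt: str, description: str,
               out_path: str = "parler_tts_out.wav",
               model_id: str = "parler-tts/parler-tts-mini-v1") -> str:
    """Generate speech for `prompt` in the voice given by `description`."""
    # Heavy dependencies are imported lazily so VoiceSpec works without them.
    import soundfile as sf
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The description conditions the voice; the prompt is the text to speak.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write(out_path, audio, model.config.sampling_rate)
    return out_path
```

Calling `synthesize("Hey, how are you doing today?", VoiceSpec().describe())` would download the checkpoint on first use and write a WAV file. Keeping the description as a separate string makes it easy to vary the voice while holding the spoken text fixed.
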
## Self-Hosting & Configuration

- Install via pip with Python 3.9+ and PyTorch
- Download pretrained models from Hugging Face Hub (parler-tts/parler-tts-mini-v1)
- Run inference on CPU or GPU (GPU recommended for real-time generation)
- Fine-tune on custom voice datasets using the included training scripts
- Export generated audio in WAV, MP3, or FLAC formats

## Key Features

- Text-described voice control without voice ID databases
- Multiple model sizes (mini, small, large) for different latency requirements
- Streaming audio generation for real-time applications
- Training pipeline for custom voice model development
- Native Hugging Face Transformers integration

## Comparison with Similar Tools

- **Bark**: generates speech with music and effects; Parler-TTS focuses on controllable voice quality
- **Kokoro**: lightweight multilingual TTS; Parler-TTS offers richer voice description control
- **Fish Speech**: multilingual focus; Parler-TTS uses text-based voice conditioning
- **F5-TTS**: flow matching approach; Parler-TTS generates audio codec tokens with a text-conditioned decoder

## FAQ

**Q: Can I describe any voice characteristics?**
A: The model responds to descriptions of gender, tone, pace, accent, and recording quality. Results depend on training data coverage.

**Q: Does Parler-TTS support languages other than English?**
A: The base models focus on English. Community fine-tunes extend to other languages.

**Q: What hardware is needed for real-time generation?**
A: The mini model runs in near-real-time on a modern GPU. CPU inference works but with higher latency.

**Q: Can I train a model on my own voice data?**
A: Yes. The library includes training scripts and documentation for fine-tuning on custom datasets.

## Sources

- https://github.com/huggingface/parler-tts
- https://huggingface.co/parler-tts

---

Source: https://tokrepo.com/en/workflows/asset-64bcbec2
Author: Script Depot