Introduction
DeepSpec is an open-source framework from DeepSeek AI for training and evaluating speculative decoding algorithms. Speculative decoding accelerates LLM inference by letting a smaller draft model propose several tokens, which the larger target model (the verifier) then checks in a single parallel pass, accepting or rejecting each one. This yields significant speedups without changing the output distribution.
What DeepSpec Does
- Trains draft models optimized for speculative decoding with target LLMs
- Evaluates acceptance rates and speedup ratios across decoding strategies
- Benchmarks different speculative decoding algorithms on standard tasks
- Provides reproducible training pipelines for research and production
- Supports multiple draft-verifier pairing configurations
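The acceptance-rate and speedup metrics listed above can be illustrated with a minimal sketch. The `DecodeRun` record and its field names are hypothetical and chosen for this example; they are not DeepSpec's actual API, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecodeRun:
    """Counters from one benchmark run (hypothetical record, not a DeepSpec type)."""
    drafted_tokens: int     # tokens proposed by the draft model
    accepted_tokens: int    # drafted tokens the target model accepted
    target_passes: int      # number of target-model forward passes
    generated_tokens: int   # total tokens emitted
    wall_seconds: float     # end-to-end decoding time


def acceptance_rate(run: DecodeRun) -> float:
    """Fraction of drafted tokens that survived verification."""
    return run.accepted_tokens / run.drafted_tokens if run.drafted_tokens else 0.0


def tokens_per_target_pass(run: DecodeRun) -> float:
    """Average tokens emitted per target forward pass (1.0 for plain decoding)."""
    return run.generated_tokens / run.target_passes


def speedup(baseline: DecodeRun, speculative: DecodeRun) -> float:
    """Throughput ratio: speculative tokens/s divided by baseline tokens/s."""
    base_tps = baseline.generated_tokens / baseline.wall_seconds
    spec_tps = speculative.generated_tokens / speculative.wall_seconds
    return spec_tps / base_tps


# Illustrative numbers only: 250 target passes, 4 drafted tokens per pass,
# 750 accepted, plus one token from the target itself on every pass.
baseline = DecodeRun(0, 0, 1000, 1000, 20.0)
spec = DecodeRun(1000, 750, 250, 1000, 8.0)
print(acceptance_rate(spec), tokens_per_target_pass(spec), speedup(baseline, spec))
```

Reporting tokens per target pass alongside wall-clock speedup helps separate algorithmic gains (more accepted tokens) from implementation overhead.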
Architecture Overview
DeepSpec implements the full speculative decoding pipeline: draft model training with distillation from the target model, tree-based speculative sampling for higher acceptance rates, and a verification step that guarantees the output distribution matches the target model's exactly. The framework is modular, so researchers can swap components to test new algorithms.
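To make the verification guarantee concrete, here is a minimal sketch of the standard speculative sampling accept/reject rule for a single token. It is a generic illustration and not DeepSpec's implementation; the function names and the dictionary representation of distributions are assumptions made for this example.

```python
import random
from fractions import Fraction


def verify_token(token, p, q, rng=random):
    """One accept/reject step of standard speculative sampling.

    p, q: dicts mapping token -> probability under the target and draft models.
    token: a token that was sampled from q. Returns the token to emit.
    """
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return token
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    tokens = list(residual)
    return rng.choices(tokens, weights=[residual[t] for t in tokens])[0]


def output_distribution(p, q):
    """Exact distribution of the emitted token under the rule above."""
    vocab = set(p) | set(q)
    accept = {t: min(p.get(t, 0), q.get(t, 0)) for t in vocab}
    reject_mass = 1 - sum(accept.values())
    residual = {t: max(0, p.get(t, 0) - q.get(t, 0)) for t in vocab}
    total = sum(residual.values())
    return {
        t: accept[t] + (reject_mass * residual[t] / total if total else 0)
        for t in vocab
    }


# Exact check with rationals: the emitted-token distribution equals p.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(3, 5), "c": Fraction(1, 5)}
print(output_distribution(p, q) == p)
```

A token is accepted with probability min(1, p/q), so the accepted mass on each token is min(p, q); the residual resampling adds back max(0, p - q), and the two sum to p regardless of how poor the draft distribution is.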
Self-Hosting & Configuration
- Requires Python 3.10+ and PyTorch with CUDA support
- Configure draft and target model paths in the YAML config
- Adjust tree width and depth parameters to trade drafting cost against accepted tokens per step (output quality is unaffected)
- Distributed training supported via DeepSpeed or FSDP
- Export optimized draft models for deployment with vLLM or TGI
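The configuration options above might be modeled as in the following sketch. The class and field names (`SpecDecodeConfig`, `tree_width`, `tree_depth`) are hypothetical and do not reflect DeepSpec's real config schema; the point is only to show how tree width and depth bound the number of drafted tokens per step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecDecodeConfig:
    """Hypothetical settings mirroring the options listed above."""
    draft_model_path: str
    target_model_path: str
    tree_width: int = 4   # candidate children expanded per node
    tree_depth: int = 5   # maximum draft tokens along one path

    def __post_init__(self):
        if self.tree_width < 1 or self.tree_depth < 1:
            raise ValueError("tree_width and tree_depth must be >= 1")

    def max_draft_nodes(self) -> int:
        """Upper bound on drafted tokens per step for a full, unpruned tree."""
        return sum(self.tree_width ** level
                   for level in range(1, self.tree_depth + 1))


cfg = SpecDecodeConfig("path/to/draft", "path/to/target",
                       tree_width=2, tree_depth=3)
print(cfg.max_draft_nodes())
```

A full tree grows geometrically with depth, which is why practical systems usually prune it; a width of 1 reduces the tree to a plain chain of `tree_depth` draft tokens.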
Key Features
- End-to-end pipeline from draft model training to production deployment
- Tree-based speculative sampling improves acceptance rates over naive approaches
- Guaranteed output equivalence with the target model (no quality degradation)
- Comprehensive benchmarking suite for comparing decoding strategies
- Integration with popular serving frameworks for production use
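The benefit of tree-based drafting over a single chain can be sketched with greedy verification, the simplest case. This is an illustrative sketch and not DeepSpec's algorithm: the nested-dict tree and the `target_next` callable are assumptions, and the callable stands in for what a real system would compute in one batched target forward pass over all tree nodes.

```python
def longest_accepted_path(tree, target_next, prefix=()):
    """Greedy verification of a draft token tree.

    tree: nested dict {token: subtree} of draft candidates following `prefix`.
    target_next: callable mapping a token prefix to the target's greedy token.
    Returns the accepted draft tokens followed by one token from the target.
    """
    choice = target_next(prefix)
    if choice in tree:
        # A draft candidate matches the target's token: descend that branch.
        return [choice] + longest_accepted_path(
            tree[choice], target_next, prefix + (choice,))
    # No candidate matches: emit the target's own token and stop.
    return [choice]


# Toy target that always continues "the cat sat on".
target_tokens = ["the", "cat", "sat", "on"]
target_next = lambda prefix: target_tokens[len(prefix)]

chain = {"the": {"dog": {}}}
tree = {"the": {"dog": {}, "cat": {"sat": {}, "ran": {}}}, "a": {}}

print(longest_accepted_path(chain, target_next))
print(longest_accepted_path(tree, target_next))
```

The chain draft loses everything after its first wrong guess, while the tree keeps an alternative branch alive, so more tokens are accepted per target pass. In both cases the emitted tokens are exactly what greedy decoding with the target alone would produce.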
Comparison with Similar Tools
- vLLM — high-throughput serving engine with built-in speculative decoding support
- SGLang — fast LLM serving with RadixAttention but separate speculative decoding
- Medusa — parallel decoding heads approach rather than separate draft models
- TensorRT-LLM — NVIDIA's inference optimization with speculative decoding support
- llama.cpp — local inference in C++ with basic speculative decoding
FAQ
Q: How much speedup can speculative decoding achieve? A: Typical speedups range from 1.5x to 3x, depending on how often the target accepts draft tokens, how cheap the draft model is relative to the target, and the characteristics of the task.
Q: Does speculative decoding change the model output? A: No. The verification step guarantees that the output distribution is identical to running the target model alone.
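Q: What models can be used as draft models? A: Any smaller model that shares the target's tokenizer and vocabulary can serve as a draft, which in practice usually means a smaller model from the same family. DeepSpec also supports training custom draft models from scratch.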
Q: Can I use DeepSpec with open-weight models? A: Yes. It works with any model pair where you have weight access for both draft and target.
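As a rough guide to the speedup question above, the standard analysis of chain-style speculative decoding gives a closed-form estimate. The sketch below assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times a target pass. These parameter names are chosen for this example, and the estimate ignores tree drafting and system overhead.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected tokens per step divided by the relative cost of one step.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Illustrative inputs only, not measured DeepSpec results.
for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

The estimate shows why draft quality dominates: raising the acceptance probability increases tokens per step faster than a cheaper draft model reduces the cost of each step.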