# DeepSpec — Full-Stack Speculative Decoding Training and Evaluation by DeepSeek

> Open-source codebase from DeepSeek for training, evaluating, and deploying speculative decoding algorithms that accelerate LLM inference.

## Quick Use
```bash
git clone https://github.com/deepseek-ai/DeepSpec.git
cd DeepSpec
pip install -e .
python train.py --config configs/default.yaml
```

## Introduction
DeepSpec is an open-source framework from DeepSeek AI for training and evaluating speculative decoding algorithms. Speculative decoding accelerates LLM inference by having a smaller draft model propose several tokens ahead, which the larger target (verifier) model then checks in a single parallel pass, accepting or rejecting each one. Because every emitted token is either accepted by the target or resampled from it, the speedup comes without changing the output distribution.
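
The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy variant, not DeepSpec's implementation: `target_next`, `draft_next`, and `speculative_greedy` are hypothetical names, and the "models" are simple deterministic functions standing in for real networks.

```python
from typing import Callable, List

# A "model" here is a function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy target model (assumption): a fixed rule over a 10-token vocabulary."""
    return (sum(prefix) * 7 + len(prefix)) % 10


def draft_next(prefix: List[int]) -> int:
    """Toy draft model (assumption): agrees with the target except at every 4th position."""
    guess = target_next(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def speculative_greedy(prompt: List[int], draft: NextToken, target: NextToken,
                       num_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every position (a real system batches this into one pass).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
        else:
            # All k draft tokens matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def plain_greedy(prompt: List[int], target: NextToken, num_new: int) -> List[int]:
    """Baseline: ordinary one-token-at-a-time greedy decoding with the target."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens
```

Every token appended in `speculative_greedy` is one the target itself would have chosen, so the result matches `plain_greedy` token for token; the draft only changes how many target steps are needed.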

## What DeepSpec Does
- Trains draft models optimized for speculative decoding with target LLMs
- Evaluates acceptance rates and speedup ratios across decoding strategies
- Benchmarks different speculative decoding algorithms on standard tasks
- Provides reproducible training pipelines for research and production
- Supports multiple draft-verifier pairing configurations
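
As a rough illustration of the metrics mentioned in the list above, acceptance rate and speedup ratio can be computed from per-step records. The `StepRecord` format and function names below are hypothetical and are not taken from DeepSpec's evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepRecord:
    """One verification step (hypothetical log format, not DeepSpec's)."""
    proposed: int  # draft tokens proposed in this step
    accepted: int  # draft tokens the target accepted (0..proposed)


def acceptance_rate(steps: List[StepRecord]) -> float:
    """Fraction of proposed draft tokens that the target accepted."""
    proposed = sum(s.proposed for s in steps)
    return sum(s.accepted for s in steps) / proposed if proposed else 0.0


def mean_tokens_per_step(steps: List[StepRecord]) -> float:
    """Average tokens emitted per target pass: accepted draft tokens plus one from the target."""
    return sum(s.accepted + 1 for s in steps) / len(steps) if steps else 0.0


def speedup_ratio(baseline_seconds: float, speculative_seconds: float) -> float:
    """Wall-clock speedup of speculative decoding over plain target decoding."""
    return baseline_seconds / speculative_seconds
```

Acceptance rate alone does not determine speedup, since the draft model's own cost also matters, which is why measured wall-clock time is tracked separately here.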

## Architecture Overview
DeepSpec implements the full speculative decoding pipeline: draft model training with distillation from the target model, tree-based speculative sampling for higher acceptance rates, and a verification step that keeps the output distribution identical to the target model's. The framework is modular, so researchers can swap components to test new algorithms.
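
The source does not say which distillation objective DeepSpec uses. One common choice is the forward KL divergence from the target's next-token distribution to the draft's; the sketch below assumes that choice and computes it for a single position in plain Python. The function names are illustrative only.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_kl(target_logits: List[float], draft_logits: List[float]) -> float:
    """Forward KL(target || draft) for one token position (assumed objective)."""
    log_p = log_softmax(target_logits)  # teacher: target model
    log_q = log_softmax(draft_logits)   # student: draft model
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the draft reproduces the target's distribution and grows as they diverge, which is the property that links distillation quality to acceptance rate.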

## Self-Hosting & Configuration
- Requires Python 3.10+ and PyTorch with CUDA support
- Configure draft and target model paths in the YAML config
- Adjust tree width and depth parameters to trade drafting cost against the number of tokens accepted per step
- Distributed training supported via DeepSpeed or FSDP
- Export optimized draft models for deployment with vLLM or TGI
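
To get a feel for why tree width and depth need tuning, the sketch below counts candidate tokens in a full draft tree. It assumes every node has exactly `width` children, which may not match how DeepSpec defines these parameters; pruned or uneven trees are smaller.

```python
def draft_tree_nodes(width: int, depth: int) -> int:
    """Candidate tokens in a full tree: width^1 + width^2 + ... + width^depth."""
    if width < 1 or depth < 0:
        raise ValueError("width must be >= 1 and depth must be >= 0")
    return sum(width ** level for level in range(1, depth + 1))
```

With `width=1` the tree is a plain chain of `depth` tokens, while `width=3, depth=3` already yields 39 candidates for the target to verify, so the verification batch grows quickly with width.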

## Key Features
- End-to-end pipeline from draft model training to production deployment
- Tree-based speculative sampling improves acceptance rates over naive approaches
- Guaranteed output equivalence with the target model (no quality degradation)
- Comprehensive benchmarking suite for comparing decoding strategies
- Integration with popular serving frameworks for production use

## Comparison with Similar Tools
- **vLLM** — high-throughput serving engine with built-in speculative decoding support
- **SGLang** — fast LLM serving with RadixAttention but separate speculative decoding
- **Medusa** — parallel decoding heads approach rather than separate draft models
- **TensorRT-LLM** — NVIDIA's inference optimization with speculative decoding support
- **llama.cpp** — local inference in C++ with basic speculative decoding

## FAQ
**Q: How much speedup can speculative decoding achieve?**
A: Typical speedups range from 1.5x to 3x depending on the draft model quality and task characteristics.
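
A commonly used idealized model for chain-style speculative decoding makes this dependence concrete. It assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft pass costs `cost_ratio` times one target pass. These are simplifying assumptions, not DeepSpec measurements.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("alpha must be in [0, 1] and gamma must be >= 0")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one from the target
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup: tokens per step divided by the relative cost of one step."""
    step_cost = gamma * cost_ratio + 1  # gamma draft passes plus one target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

For example, `expected_speedup(0.8, 4, 0.05)` is about 2.8, while dropping the acceptance probability to 0.5 gives about 1.6, which shows how strongly the result depends on draft quality.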

**Q: Does speculative decoding change the model output?**
A: No. The verification step guarantees that the output distribution is identical to running the target model alone.
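
This guarantee can be checked exactly for one token under the standard speculative sampling rule: accept a draft token `x` with probability `min(1, p(x)/q(x))`, and on rejection resample from the normalized residual `max(0, p - q)`. The sketch below computes the resulting output distribution in closed form; it illustrates the general technique and is not DeepSpec's code.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept/reject step.

    p: target model's next-token probabilities
    q: draft model's next-token probabilities
    """
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(px, qx) for px, qx in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, the token is resampled from the normalized residual max(0, p - q).
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        return accepted  # p == q: nothing is ever rejected
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]
```

For any valid `p` and `q` the result equals `p` up to floating-point error, because `min(p, q) + max(0, p - q) = p` at every token.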

**Q: What models can be used as draft models?**
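A: Generally, any smaller model that shares the target's tokenizer and vocabulary can serve as the draft, which in practice usually means a smaller model from the same family. DeepSpec also supports training custom draft models from scratch.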

**Q: Can I use DeepSpec with open-weight models?**
A: Yes. It works with any model pair where you have weight access for both draft and target.

## Sources
- https://github.com/deepseek-ai/DeepSpec

---
Source: https://tokrepo.com/en/workflows/asset-033cfc51
Author: AI Open Source