DeepSpec — Full-Stack Speculative Decoding Training and Evaluation by DeepSeek

Introduction

DeepSpec is an open-source framework from DeepSeek AI for training and evaluating speculative decoding algorithms. Speculative decoding accelerates LLM inference by having a small draft model propose several tokens, which the larger target model then verifies in a single parallel forward pass, keeping the accepted prefix and discarding the rest. This yields speedups without changing the target model's output distribution.
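The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration for greedy decoding only, not DeepSpec's implementation: the function names and the toy models are hypothetical, and the target is called once per position here, whereas a real system scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy(target: Model, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding with the target model."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each drafted position in order.
        accepted: List[int] = []
        for i in range(k):
            t = target(out + proposal[:i])
            accepted.append(t)  # t is always the target's own choice
            if t != proposal[i]:
                break           # first mismatch: stop after the correction
        else:
            # All k drafts matched, so the target contributes one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + max_new]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for a large model."""
    return (sum(prefix) * 31 + 7) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft: agrees with the target except at every third position."""
    return toy_target(prefix) if len(prefix) % 3 else 0
```

Because every emitted token is the target's own greedy choice, `speculative_greedy` returns the same sequence as `greedy` for any draft model; the draft only affects how many target passes are needed.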

What DeepSpec Does

Trains draft models optimized for speculative decoding with target LLMs
Evaluates acceptance rates and speedup ratios across decoding strategies
Benchmarks different speculative decoding algorithms on standard tasks
Provides reproducible training pipelines for research and production
Supports multiple draft-target pairing configurations
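As a rough illustration of the metrics named in the list above, the sketch below computes an acceptance rate, tokens per target pass, and a wall-clock speedup from per-step counts. The class and function names are hypothetical and the definitions are one common convention, not necessarily the ones DeepSpec reports.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpecStats:
    acceptance_rate: float  # accepted draft tokens / proposed draft tokens
    tokens_per_step: float  # mean tokens emitted per target forward pass
    speedup: float          # baseline wall time / speculative wall time


def summarize(accepted_per_step: Sequence[int], draft_len: int,
              baseline_seconds: float, speculative_seconds: float) -> SpecStats:
    """Summarize one evaluation run (assumes a fixed draft length per step)."""
    steps = len(accepted_per_step)
    if steps == 0 or draft_len <= 0 or speculative_seconds <= 0:
        raise ValueError("need at least one step and positive draft_len/time")
    accepted = sum(accepted_per_step)
    return SpecStats(
        acceptance_rate=accepted / (steps * draft_len),
        # Each step also emits one target token (a correction or a bonus token).
        tokens_per_step=(accepted + steps) / steps,
        speedup=baseline_seconds / speculative_seconds,
    )
```

For example, `summarize([4, 2, 0, 3], draft_len=4, baseline_seconds=10.0, speculative_seconds=4.0)` gives an acceptance rate of 0.5625, 3.25 tokens per step, and a 2.5x speedup.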

Architecture Overview

DeepSpec implements the full speculative decoding pipeline: draft model training with distillation from the target model, tree-based speculative sampling for higher acceptance rates, and a verification step that guarantees the output distribution matches the target model's exactly. The framework is modular, letting researchers swap components to test new algorithms.
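The verification step can be illustrated with the standard speculative sampling rule for a single drafted token: accept it with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The sketch below covers only this single-token chain case, not the tree-based variant, and `verify_token` is a hypothetical name rather than a DeepSpec function.

```python
import random
from typing import Sequence


def verify_token(p: Sequence[float], q: Sequence[float], x: int,
                 rng: random.Random) -> int:
    """Accept or replace one drafted token.

    p: target model's next-token distribution
    q: draft model's next-token distribution
    x: token sampled from q (so q[x] > 0 is assumed)
    """
    accept_prob = min(1.0, p[x] / q[x])
    if rng.random() < accept_prob:
        return x
    # Rejected: resample from the residual max(0, p - q), renormalized.
    # A rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]
```

Drawing `x` from `q` and passing it through `verify_token` many times should produce token frequencies that approach `p`, even when `q` is a poor approximation; a worse `q` only lowers the acceptance rate.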

Self-Hosting & Configuration

Requires Python 3.10+ and PyTorch with CUDA support
Configure draft and target model paths in the YAML config
Adjust tree width and depth parameters to trade drafting and verification cost against accepted tokens per step
Distributed training supported via DeepSpeed or FSDP
Export optimized draft models for deployment with vLLM or TGI
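To see why tree width and depth matter for cost, the sketch below counts the drafted tokens the target must score per step. The source does not define these parameters, so this assumes "width" is the branching factor at each node and "depth" is the number of drafted levels in a full, unpruned tree; `draft_tree_size` is a hypothetical helper, and practical systems typically prune the tree.

```python
def draft_tree_size(width: int, depth: int) -> int:
    """Number of drafted tokens in a full tree (root/prefix not counted).

    Assumes every node has `width` children and the tree has `depth` levels.
    """
    if width < 1 or depth < 1:
        raise ValueError("width and depth must be at least 1")
    # Level 1 has width nodes, level 2 has width**2, and so on.
    return sum(width ** level for level in range(1, depth + 1))
```

Under these assumptions, width 1 reduces to a plain chain (`draft_tree_size(1, 5)` is 5), while width 2 and depth 3 already requires verifying 14 drafted tokens per step, so the count grows quickly with depth.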

Key Features

End-to-end pipeline from draft model training to production deployment
Tree-based speculative sampling improves acceptance rates over naive approaches
Output distribution guaranteed identical to the target model's (no quality degradation)
Comprehensive benchmarking suite for comparing decoding strategies
Integration with popular serving frameworks for production use

Comparison with Similar Tools

vLLM — high-throughput serving engine with built-in speculative decoding support
SGLang — fast LLM serving with RadixAttention but separate speculative decoding
Medusa — parallel decoding heads approach rather than separate draft models
TensorRT-LLM — NVIDIA's inference optimization with speculative decoding support
llama.cpp — local inference in C++ with basic speculative decoding

FAQ

Q: How much speedup can speculative decoding achieve? A: Typical speedups range from 1.5x to 3x depending on the draft model quality and task characteristics.
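A back-of-the-envelope model shows how draft quality drives this range. The sketch assumes each drafted token is accepted independently with probability `alpha`, a chain of `gamma` drafted tokens per step, and a draft forward pass costing a fraction `cost_ratio` of a target pass. These parameter names are illustrative, and the model ignores batching and memory effects.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha**(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("alpha must be in [0, 1] and gamma >= 0")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per step divided by the step's cost in target-pass units."""
    step_cost = gamma * cost_ratio + 1.0  # gamma draft passes + one target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

With `alpha=0.8`, `gamma=4`, and `cost_ratio=0.05`, this gives about 3.36 tokens per step and an estimated speedup of about 2.8x; dropping `alpha` to 0.5 lowers the estimate to about 1.6x.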

Q: Does speculative decoding change the model output? A: No. The verification step guarantees that the output distribution is identical to running the target model alone.
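A short derivation shows why this holds for the standard single-token acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q). Here p is the target distribution, q is the draft distribution, and β is the overall acceptance probability; the symbols are introduced for this sketch and do not come from the DeepSpec documentation.

```latex
\begin{align*}
\beta &= \sum_{y} q(y)\,\min\!\left(1, \frac{p(y)}{q(y)}\right)
       = \sum_{y} \min\bigl(p(y), q(y)\bigr), \\
\sum_{y} \max\bigl(0,\, p(y) - q(y)\bigr)
      &= \sum_{y} \Bigl(p(y) - \min\bigl(p(y), q(y)\bigr)\Bigr) = 1 - \beta, \\
\Pr[\text{output} = x]
      &= \underbrace{\min\bigl(p(x), q(x)\bigr)}_{\text{drafted and accepted}}
       + \underbrace{(1 - \beta)\,
         \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta}}_{\text{rejected, then resampled}} \\
      &= \min\bigl(p(x), q(x)\bigr) + p(x) - \min\bigl(p(x), q(x)\bigr) = p(x).
\end{align*}
```

The draft distribution q cancels out of the final expression, so it affects only how often tokens are accepted, never which distribution the output follows.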

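Q: What models can be used as draft models? A: Any smaller model that shares the target model's tokenizer and vocabulary can serve as a draft, which in practice usually means a smaller model from the same family. DeepSpec also supports training custom draft models from scratch.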

Q: Can I use DeepSpec with open-weight models? A: Yes. It works with any model pair where you have weight access for both draft and target.

Sources

https://github.com/deepseek-ai/DeepSpec
