Configs · Jul 5, 2026 · 3 min read

DeepSpec — Full-Stack Speculative Decoding Training and Evaluation by DeepSeek

Open-source codebase from DeepSeek for training, evaluating, and deploying speculative decoding algorithms that accelerate LLM inference.

Ready-to-run agent install

To install this asset, the agent chooses its runtime, checks the install plan, and runs the matching command.

Native · 98/100 · Policy: allow

  • Agent surface: Any MCP/CLI agent
  • Kind: Skill
  • Install: Single
  • Trust: Established
  • Entrypoint: DeepSpec Overview

Direct install command:

npx -y tokrepo@latest install 033cfc51-7809-11f1-9bc6-00163e2b0d79 --target codex

Run after dry-run confirms the install plan.

Introduction

DeepSpec is an open-source framework from DeepSeek AI for training and evaluating speculative decoding algorithms. Speculative decoding accelerates LLM inference by having a smaller draft model propose several tokens ahead, which the larger target (verifier) model then checks in a single parallel pass, accepting or rejecting each one. Because several tokens can be committed per target-model pass, generation gets faster without changing the output distribution.
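The accept-or-reject step described above can be sketched with the standard speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the renormalised residual max(0, p - q). This is a minimal illustration under stated assumptions, not DeepSpec's code: the function names, the three-token vocabulary, and the toy distributions are all hypothetical, and the source does not say which exact rule DeepSpec implements.

```python
import random


def verify_token(p, q, x, rng):
    """Accept draft token x with probability min(1, p[x]/q[x]); on rejection,
    resample from the residual distribution max(0, p - q), renormalised."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # A rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def simulate(p, q, n=20000, seed=0):
    """Draft from q, verify against p, and return (empirical output
    distribution, fraction of draft tokens accepted)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q)[0]  # draft proposal
        token, ok = verify_token(p, q, x, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]  # toy target distribution p (assumed)
    draft = [0.3, 0.5, 0.2]   # toy draft distribution q (assumed)
    dist, rate = simulate(target, draft)
    print("output distribution:", dist)
    print("acceptance rate:", rate)
```

With enough samples the empirical output distribution should approach the target distribution p rather than the draft distribution q, which is the sense in which the output is unchanged. The acceptance rate approaches the sum over tokens of min(p, q).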

What DeepSpec Does

  • Trains draft models optimized for speculative decoding with target LLMs
  • Evaluates acceptance rates and speedup ratios across decoding strategies
  • Benchmarks different speculative decoding algorithms on standard tasks
  • Provides reproducible training pipelines for research and production
  • Supports multiple draft-verifier pairing configurations

Architecture Overview

DeepSpec implements the full speculative decoding pipeline: draft model training with distillation from the target model, tree-based speculative sampling for higher acceptance rates, and a verification step that keeps the output distribution identical to the target model's. The framework is modular, letting researchers swap components to test new algorithms.
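To make the link between distillation and acceptance rate concrete, here is a small pure-Python sketch. It assumes a forward KL divergence between the target and draft next-token distributions as the distillation objective, which is a common choice but not something the source confirms for DeepSpec. The function names and logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_distillation_loss(target_logits, draft_logits):
    """Forward KL(p_target || q_draft) for one token position."""
    log_p = log_softmax(target_logits)
    log_q = log_softmax(draft_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def acceptance_probability(target_logits, draft_logits):
    """Chance that a token drafted from q passes standard speculative
    sampling verification against p: the sum over tokens of min(p, q)."""
    p = [math.exp(v) for v in log_softmax(target_logits)]
    q = [math.exp(v) for v in log_softmax(draft_logits)]
    return sum(min(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    target = [2.0, 1.0, 0.0]      # assumed target logits
    near = [1.8, 1.1, 0.1]        # draft close to the target
    far = [0.0, 1.0, 2.0]         # draft far from the target
    for name, draft in (("near", near), ("far", far)):
        print(name,
              kl_distillation_loss(target, draft),
              acceptance_probability(target, draft))
```

A draft whose distribution sits closer to the target's has a lower loss and a higher per-token acceptance probability, which is why distilling the draft model toward the target tends to pay off at inference time.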

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch with CUDA support
  • Configure draft and target model paths in the YAML config
  • Adjust tree width and depth parameters to trade drafting and verification cost against the number of tokens accepted per step
  • Distributed training supported via DeepSpeed or FSDP
  • Export optimized draft models for deployment with vLLM or TGI
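The configuration items above can be pictured as a small settings object. The sketch below is illustrative only: the field names, defaults, and placeholder paths are hypothetical and do not come from DeepSpec's actual YAML schema. It also assumes a full tree in which every node has the same number of children, so that the effect of width and depth on verification cost is easy to see. Real implementations often prune the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecConfig:
    """Hypothetical settings; names are assumptions, not DeepSpec's schema."""
    draft_model_path: str
    target_model_path: str
    tree_width: int = 2   # candidate tokens branching from each node
    tree_depth: int = 4   # draft tokens along each root-to-leaf path

    def __post_init__(self):
        if self.tree_width < 1 or self.tree_depth < 1:
            raise ValueError("tree_width and tree_depth must be >= 1")

    def candidate_tokens(self) -> int:
        """Draft tokens the target model verifies per step in a full tree:
        width + width**2 + ... + width**depth."""
        return sum(self.tree_width ** d for d in range(1, self.tree_depth + 1))


if __name__ == "__main__":
    cfg = SpecConfig("path/to/draft", "path/to/target", tree_width=2, tree_depth=4)
    print(cfg.candidate_tokens())  # 2 + 4 + 8 + 16 = 30
```

Under this full-tree assumption the verification batch grows geometrically with depth, so wider or deeper trees raise the chance of a long accepted path at the price of more work per step.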

Key Features

  • End-to-end pipeline from draft model training to production deployment
  • Tree-based speculative sampling improves acceptance rates over naive approaches
  • Guaranteed output equivalence with the target model (no quality degradation)
  • Comprehensive benchmarking suite for comparing decoding strategies
  • Integration with popular serving frameworks for production use

Comparison with Similar Tools

  • vLLM — high-throughput serving engine with built-in speculative decoding support
  • SGLang — fast LLM serving with RadixAttention but separate speculative decoding
  • Medusa — parallel decoding heads approach rather than separate draft models
  • TensorRT-LLM — NVIDIA's inference optimization with speculative decoding support
  • llama.cpp — local inference in C++ with basic speculative decoding

FAQ

Q: How much speedup can speculative decoding achieve? A: Typical speedups range from 1.5x to 3x depending on the draft model quality and task characteristics.
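As a rough guide to where such figures come from, the following sketch uses the standard analytical estimate for a linear draft chain, not DeepSpec's benchmark code. It assumes each drafted token is accepted independently with probability alpha, that the draft model proposes gamma tokens per step, and that one draft forward pass costs cost_ratio times one target forward pass. All three parameter names and the example values are assumptions.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens committed per target-model pass:
    (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Expected tokens per step divided by the step's cost, measured in
    target forward passes (gamma draft passes plus one target pass)."""
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Assumed values: 4 drafted tokens, draft pass costing 5% of a target pass.
    for alpha in (0.5, 0.8):
        print(alpha, round(estimated_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

With these assumed values the estimate comes out at roughly 1.6x for an acceptance rate of 0.5 and roughly 2.8x for 0.8, which shows how strongly the result depends on draft model quality.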

Q: Does speculative decoding change the model output? A: No. The verification step guarantees that the output distribution is identical to running the target model alone.

Q: What models can be used as draft models? A: Typically any smaller model that shares the target model's tokenizer and vocabulary, such as a smaller model from the same family. DeepSpec also supports training custom draft models from scratch.

Q: Can I use DeepSpec with open-weight models? A: Yes. It works with any model pair where you have weight access for both draft and target.
