Configs · July 5, 2026 · 1 min read

DeepSpec — Full-Stack Speculative Decoding Training and Evaluation by DeepSeek

Open-source codebase from DeepSeek for training, evaluating, and deploying speculative decoding algorithms that accelerate LLM inference.

Direct install command
npx -y tokrepo@latest install 033cfc51-7809-11f1-9bc6-00163e2b0d79 --target codex

Run a dry run first to confirm the install plan, then run this command.

Introduction

DeepSpec is an open-source framework from DeepSeek AI for training and evaluating speculative decoding algorithms. Speculative decoding accelerates LLM inference by letting a smaller draft model propose several tokens ahead, which the larger target (verifier) model then checks in a single parallel forward pass and accepts or rejects. Because each target-model pass can yield more than one token, generation gets faster while the output distribution stays the same as the target model's.
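The accept-or-reject step can be illustrated with the standard speculative sampling rule for a single drafted token. This is a minimal sketch in plain Python, not DeepSpec's implementation; the function name `speculative_accept` and its arguments are hypothetical, and both distributions are assumed to be given as probability lists over the same vocabulary.

```python
import random

def speculative_accept(p_target, q_draft, drafted_token, rng=random):
    """Return the token to emit for one drafted position.

    p_target, q_draft: probability lists over the same vocabulary.
    drafted_token: index sampled from q_draft by the draft model.
    (Hypothetical helper for illustration, not a DeepSpec API.)
    """
    p = p_target[drafted_token]
    q = q_draft[drafted_token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return drafted_token
    # On rejection, resample from the residual max(0, p - q);
    # random.choices normalizes the weights itself.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]
```

Sampling a token from `q_draft` and passing it through this function yields tokens distributed according to `p_target`, which is why the scheme does not alter the target model's output distribution.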

What DeepSpec Does

  • Trains draft models optimized for speculative decoding with target LLMs
  • Evaluates acceptance rates and speedup ratios across decoding strategies
  • Benchmarks different speculative decoding algorithms on standard tasks
  • Provides reproducible training pipelines for research and production
  • Supports multiple draft-verifier pairing configurations
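
As a rough illustration of the evaluation metrics listed above, the sketch below derives an acceptance rate and mean tokens per verification step from logged counts. It assumes chain drafting with a fixed draft length, where each verification step also emits one token from the target model (a correction after a rejection, or a bonus token when every draft is accepted). The function name and the log format are hypothetical, not DeepSpec's.

```python
def acceptance_metrics(accepted_per_step, draft_len):
    """Summarize speculative decoding logs.

    accepted_per_step: drafted tokens accepted in each verification step.
    draft_len: tokens drafted per step (assumed constant here).
    """
    steps = len(accepted_per_step)
    accepted = sum(accepted_per_step)
    return {
        # Fraction of drafted tokens that the target model accepted.
        "acceptance_rate": accepted / (steps * draft_len),
        # Accepted drafts plus one target-model token per step.
        "tokens_per_step": (accepted + steps) / steps,
    }

metrics = acceptance_metrics([4, 2, 0, 3], draft_len=4)
```

Plain autoregressive decoding produces exactly one token per target-model step, so `tokens_per_step` is a direct measure of how much work each verification pass saves.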

Architecture Overview

DeepSpec implements the full speculative decoding pipeline in three stages: draft model training with distillation from the target model, tree-based speculative sampling that drafts several candidate continuations per step to raise acceptance rates, and a verification step that keeps the output distribution identical to the target model's. The framework is modular, so researchers can swap individual components to test new algorithms.
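
Tree-based verification is commonly implemented with a tree attention mask, so that the target model can score every branch of the draft tree in one forward pass. The source does not say how DeepSpec does this; the sketch below shows the general technique under the assumption that the draft tree is given as a list of parent indices. The name `tree_attention_mask` is hypothetical.

```python
def tree_attention_mask(parents):
    """Build a mask where node i may attend to itself and its ancestors.

    parents[i] is the index of node i's parent in the draft tree,
    or -1 if node i hangs directly off the already-accepted prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk up to the root, marking each node on the path as visible.
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

# Node 0 is a first-level draft; nodes 1 and 2 branch from it; node 3 extends node 1.
mask = tree_attention_mask([-1, 0, 0, 1])
```

Each row of the mask corresponds to one candidate path, so sibling branches never see each other's tokens even though they share a single batch.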

Self-Hosting & Configuration

  • Requires Python 3.10+ and PyTorch with CUDA support
  • Configure draft and target model paths in the YAML config
  • Adjust tree width and depth parameters to trade drafting and verification cost against the number of tokens accepted per step
  • Distributed training supported via DeepSpeed or FSDP
  • Export optimized draft models for deployment with vLLM or TGI
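
To give a feel for how tree width and depth affect cost, the sketch below counts the candidate tokens in a full draft tree. The configuration layout is purely hypothetical: every key name and path is an assumption made for illustration, not DeepSpec's documented YAML schema, and real systems often prune the tree rather than expanding it fully.

```python
# Hypothetical settings mirroring what a YAML config might hold.
# None of these key names come from DeepSpec's documentation.
config = {
    "target_model_path": "/models/target",
    "draft_model_path": "/models/draft",
    "tree": {"width": 3, "depth": 4},
}

def tree_candidate_count(width, depth):
    """Drafted tokens in a full tree where every node has `width` children."""
    return sum(width ** level for level in range(1, depth + 1))

candidates = tree_candidate_count(config["tree"]["width"], config["tree"]["depth"])
```

The count grows geometrically with depth, which is why wide and deep trees raise the verification cost per step even as they increase the chance that some branch is accepted.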

Key Features

  • End-to-end pipeline from draft model training to production deployment
  • Tree-based speculative sampling improves acceptance rates over naive approaches
  • Guaranteed output equivalence with the target model (no quality degradation)
  • Comprehensive benchmarking suite for comparing decoding strategies
  • Integration with popular serving frameworks for production use

Comparison with Similar Tools

  • vLLM — high-throughput serving engine with built-in speculative decoding support
  • SGLang — fast LLM serving with RadixAttention but separate speculative decoding
  • Medusa — parallel decoding heads approach rather than separate draft models
  • TensorRT-LLM — NVIDIA's inference optimization with speculative decoding support
  • llama.cpp — local inference in C++ with basic speculative decoding

FAQ

Q: How much speedup can speculative decoding achieve? A: Typical speedups range from 1.5x to 3x depending on the draft model quality and task characteristics.
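
A commonly used back-of-the-envelope estimate shows how such a range arises. It assumes chain drafting, an independent per-token acceptance rate `alpha`, a draft length `gamma`, and a draft model whose forward pass costs `cost_ratio` times a target forward pass. These symbols and the function name are assumptions for this sketch, not DeepSpec parameters.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over plain autoregressive decoding.

    alpha: probability that a drafted token is accepted (0 <= alpha <= 1).
    gamma: number of tokens drafted per verification step.
    cost_ratio: draft forward-pass cost relative to one target pass.
    """
    if alpha == 1.0:
        tokens_per_step = gamma + 1  # every draft accepted, plus a bonus token
    else:
        # Expected tokens emitted per step: 1 + alpha + ... + alpha**gamma.
        tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # One step costs gamma draft passes plus one target pass.
    step_cost = gamma * cost_ratio + 1
    return tokens_per_step / step_cost

estimate = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
```

With these illustrative inputs the estimate is about 2.8x. With a poor draft model (`alpha` near 0) it drops below 1, meaning speculative decoding would be slower than decoding with the target model alone.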

Q: Does speculative decoding change the model output? A: No. The verification step guarantees that the output distribution is identical to running the target model alone.

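Q: What models can be used as draft models? A: Any smaller model that shares the target model's tokenizer and vocabulary can serve as a draft, which in practice usually means a smaller model from the same family. DeepSpec also supports training custom draft models from scratch.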

Q: Can I use DeepSpec with open-weight models? A: Yes. It works with any model pair where you have weight access for both draft and target.
