Configs · May 15, 2026 · 1 min read

RWKV — RNN Language Model with Transformer-Level Performance

RWKV is an open source large language model architecture that combines the training parallelism of Transformers with the constant-memory inference of RNNs, achieving competitive quality with linear time complexity and no KV cache.

Agent Ready

This asset can be read and installed directly by an Agent.

TokRepo provides generic CLI commands, an install contract, metadata JSON, per-adapter install plans, and links to the raw content, so an Agent can judge fit, risk, and next steps.

Native · 98/100 · Policy: Allow
Agent entry: Any MCP/CLI Agent
Type: Skill
Install: Single
Trust level: Established
Entry: RWKV

Generic CLI install command:

npx tokrepo install ceeb3bc3-5016-11f1-9bc6-00163e2b0d79

Introduction

RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.

What RWKV Does

  • Generates text with quality comparable to similarly-sized Transformer models
  • Runs inference with O(1) memory per token instead of O(n) for attention-based models
  • Trains efficiently on GPUs with full parallelism across the sequence dimension
  • Supports arbitrarily long context at inference time with no growth in memory use
  • Provides free sentence embeddings from the hidden state without additional training

Architecture Overview

RWKV replaces multi-head attention with a time-mixing and channel-mixing mechanism that operates as a linear recurrence. During training, the recurrence is unrolled into a parallel scan, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token-by-token, giving constant memory usage regardless of context length. This makes RWKV uniquely suited for long-context and streaming applications.
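A toy numerical sketch can make this equivalence concrete. The snippet below is my simplified, unnormalized stand-in for RWKV's WKV recurrence (real RWKV uses learned per-channel decays, a bonus term for the current token, and channel-mixing, all omitted here): the same exponentially decayed weighted average can be computed either as a direct sum over the whole sequence (parallelizable, training-style) or with a two-number running state (constant memory, inference-style).

```python
import math

def wkv_sequential(ks, vs, decay=0.9):
    """Token-by-token evaluation: the state is just two scalars, O(1) memory."""
    num = den = 0.0
    outs = []
    for k, v in zip(ks, vs):
        num = decay * num + math.exp(k) * v   # running weighted sum of values
        den = decay * den + math.exp(k)       # running sum of weights
        outs.append(num / den)
    return outs

def wkv_direct(ks, vs, decay=0.9):
    """Direct evaluation over the full sequence, as a training-time scan would."""
    outs = []
    for t in range(len(ks)):
        num = sum(decay ** (t - i) * math.exp(ks[i]) * vs[i] for i in range(t + 1))
        den = sum(decay ** (t - i) * math.exp(ks[i]) for i in range(t + 1))
        outs.append(num / den)
    return outs

ks = [0.1, -0.3, 0.5, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
seq = wkv_sequential(ks, vs)
par = wkv_direct(ks, vs)
assert all(abs(a - b) < 1e-9 for a, b in zip(seq, par))
```

Both paths produce identical outputs; only the memory profile differs, which is exactly the trade RWKV exploits.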

Self-Hosting & Configuration

  • Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
  • The rwkv Python package supports CPU, CUDA, and quantized inference strategies
  • Strategy strings like cuda fp16 or cpu fp32 control device placement and precision
  • RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
  • Fine-tuning is supported via LoRA with the official RWKV-LM training scripts
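A minimal loading sketch with the rwkv pip package, based on that package's documented conventions; the model path is a placeholder (download weights from Hugging Face first), and the vocab name shown is the one used by the "World" model series:

```python
# Configuration-dependent sketch; paths and strategy are placeholders to adapt.
import os
os.environ["RWKV_JIT_ON"] = "1"  # package convention: enable TorchScript JIT

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# The strategy string selects device and precision, e.g. "cuda fp16",
# "cpu fp32", or split forms like "cuda fp16 *20 -> cpu fp32".
model = RWKV(model="/path/to/RWKV-model.pth", strategy="cuda fp16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

print(pipeline.generate("The RWKV architecture is", token_count=50))
```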

Key Features

  • Linear time and constant memory inference, enabling arbitrarily long contexts
  • Training parallelism on par with Transformers using the parallel scan formulation
  • Competitive benchmark scores with GPT-class models at equivalent parameter counts
  • Native streaming inference without the need for KV cache management
  • Active community with multilingual models trained on diverse corpora
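The "no KV cache" point can be made concrete with a schematic comparison (not real model code, just the memory shapes involved):

```python
# Toy illustration: an attention decoder keeps one (key, value) pair per past
# token, so its cache grows with context; a recurrent model carries a
# fixed-size state that is overwritten in place each step.
kv_cache = []        # attention-style cache: O(n) in context length
state = [0.0] * 4    # recurrent state: fixed size (4 floats in this toy)

for token in range(1000):
    kv_cache.append((token, token))            # grows every step
    state = [0.9 * s + token for s in state]   # same size every step

assert len(kv_cache) == 1000
assert len(state) == 4
```

For streaming workloads this is the practical difference: the recurrent side never needs cache eviction or window management.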

Comparison with Similar Tools

  • Llama / GPT — Standard Transformer LLMs; higher quality at large scale but quadratic attention cost
  • Mamba — State-space model with similar linear complexity; different mathematical formulation, newer ecosystem
  • RetNet — Microsoft's retention-based architecture; similar goals but less community adoption
  • Linear Attention Transformers — Various approaches to linearize attention; RWKV's recurrence is a distinct design
  • llama.cpp — Inference runtime for GGUF models; can run RWKV models after format conversion

FAQ

Q: How does RWKV quality compare to Transformers? A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models. The gap narrows with larger models and more training data.

Q: Can RWKV handle long documents? A: Yes. Because inference uses constant memory, RWKV can process arbitrarily long sequences without increasing VRAM usage or slowing down, making it ideal for long documents and streaming.

Q: Is RWKV compatible with existing LLM tooling? A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.

Q: What hardware do I need to run RWKV? A: Small models (1.5B) run on consumer GPUs with 4 GB VRAM or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB VRAM, or can use quantization to fit on less.
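As a rough sanity check on these figures, weight memory scales as parameter count times bytes per parameter (a sketch covering weights only; real usage adds activations, state, and runtime overhead):

```python
def weight_memory_gib(params_billion, bytes_per_param):
    """Memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# fp16 = 2 bytes/param; 4-bit quantization ≈ 0.5 bytes/param
assert round(weight_memory_gib(1.5, 2), 1) == 2.8    # 1.5B fp16 fits in 4 GB
assert round(weight_memory_gib(7, 2), 1) == 13.0     # 7B fp16 ≈ 13 GiB
assert round(weight_memory_gib(14, 2), 1) == 26.1    # 14B fp16 exceeds 16 GB
assert round(weight_memory_gib(14, 0.5), 1) == 6.5   # 14B at 4-bit ≈ 6.5 GiB
```

This is why the 14B model either needs a large GPU at fp16 or quantization to fit smaller cards.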
