Configs · May 15, 2026 · 3 min read

RWKV — RNN Language Model with Transformer-Level Performance

RWKV is an open source large language model architecture that combines the training parallelism of Transformers with the constant-memory inference of RNNs, achieving competitive quality with linear time complexity and no KV cache.

Agent-ready

This asset can be read and installed directly by agents.

TokRepo exposes a universal CLI command, an install contract, JSON metadata, a per-adapter plan, and the raw content to help agents judge fit, risk, and next actions.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: RWKV
Universal CLI command:
npx tokrepo install ceeb3bc3-5016-11f1-9bc6-00163e2b0d79

Introduction

RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.

What RWKV Does

  • Generates text with quality comparable to similarly-sized Transformer models
  • Runs inference with O(1) memory and per-token compute, versus the O(n) KV cache of attention-based models
  • Trains efficiently on GPUs with full parallelism across the sequence dimension
  • Supports effectively unlimited context length at inference time, with no growth in memory or per-token cost
  • Provides free sentence embeddings from the hidden state without additional training

Architecture Overview

RWKV replaces multi-head attention with a time-mixing and channel-mixing mechanism that operates as a linear recurrence. During training, the recurrence is unrolled into a parallel scan, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token-by-token, giving constant memory usage regardless of context length. This makes RWKV uniquely suited for long-context and streaming applications.
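To make the constant-memory property concrete, here is a minimal numerical sketch of a decaying linear recurrence of the kind RWKV's time-mixing builds on. It is illustrative only: the actual architecture uses learned per-channel decays, a bonus weight for the current token, and (in RWKV-7) a matrix-valued state, none of which are shown here.

    import numpy as np

    d = 8                                 # channel dimension
    decay = np.exp(-np.random.rand(d))    # per-channel decay in (0, 1)
    num = np.zeros(d)                     # running weighted sum of values
    den = np.zeros(d)                     # running sum of weights

    def step(k, v):
        """Fold one token's key/value into the fixed-size state (O(1) memory)."""
        global num, den
        w = np.exp(k)                     # positive weight for this token
        num = decay * num + w * v
        den = decay * den + w
        return num / (den + 1e-8)         # attention-like weighted average

    for t in range(100_000):              # any sequence length, same memory
        out = step(np.random.randn(d), np.random.randn(d))

During training the same recurrence is unrolled across the sequence as a scan, which is what restores Transformer-style parallelism.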

Self-Hosting & Configuration

  • Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
  • The rwkv Python package supports CPU, CUDA, and quantized inference strategies
  • Strategy strings like cuda fp16 or cpu fp32 control device placement and precision (see the loading sketch after this list)
  • RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
  • Fine-tuning is supported via LoRA with the official RWKV-LM training scripts
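As a concrete starting point, here is a minimal loading-and-generation sketch using the rwkv package's RWKV and PIPELINE classes. The checkpoint path and tokenizer name below are placeholders; substitute the ones matching the model you download from Hugging Face.

    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE

    # Placeholder checkpoint path; the strategy string follows the
    # "<device> <precision>" pattern described above, e.g. "cpu fp32".
    model = RWKV(model='/models/RWKV-7-World-1.5B.pth', strategy='cuda fp16')

    # Tokenizer name is a placeholder; use the one matching your checkpoint.
    pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

    print(pipeline.generate('RWKV is an RNN that', token_count=64))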

Key Features

  • Linear time and constant memory inference, enabling arbitrarily long contexts
  • Training parallelism on par with Transformers using the parallel scan formulation
  • Competitive benchmark scores with GPT-class models at equivalent parameter counts
  • Native streaming inference without the need for KV cache management (sketched after this list)
  • Active community with multilingual models trained on diverse corpora
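To illustrate the no-KV-cache point, here is a sketch of streaming inference reusing the model and pipeline objects from the previous example: tokens are fed one at a time and only the fixed-size recurrent state is carried forward.

    # Continues the model/pipeline from the previous sketch.
    tokens = pipeline.encode('Streaming works because the state is fixed-size.')

    state = None                          # fixed-size state replaces a KV cache
    for tok in tokens:
        logits, state = model.forward([tok], state)

    # `state` now summarizes the whole prefix in constant memory; sample the
    # next token from `logits` and loop to keep generating indefinitely.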

Comparison with Similar Tools

  • Llama / GPT — Standard Transformer LLMs; higher quality at large scale but quadratic attention cost
  • Mamba — State-space model with similar linear complexity; different mathematical formulation, newer ecosystem
  • RetNet — Microsoft's retention-based architecture; similar goals but less community adoption
  • Linear Attention Transformers — Various approaches to linearize attention; RWKV's recurrence is a distinct design
  • llama.cpp — Inference runtime for GGUF models; can run RWKV models after format conversion

FAQ

Q: How does RWKV quality compare to Transformers? A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models. The gap narrows with larger models and more training data.

Q: Can RWKV handle long documents? A: Yes. Because inference uses constant memory, RWKV can process arbitrarily long sequences without increasing VRAM usage or slowing down, making it ideal for long documents and streaming.

Q: Is RWKV compatible with existing LLM tooling? A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.

Q: What hardware do I need to run RWKV? A: Small models (1.5B) run on consumer GPUs with 4 GB VRAM or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB VRAM, or can use quantization to fit on less.
