Introduction
RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.
What RWKV Does
- Generates text with quality comparable to similarly-sized Transformer models
- Runs inference in constant (O(1)) memory: the recurrent state is fixed-size, whereas attention-based models keep a KV cache that grows O(n) with context length
- Trains efficiently on GPUs with full parallelism across the sequence dimension
- Supports arbitrarily long context at inference time, since the fixed-size state never grows with sequence length
- Provides free sentence embeddings from the hidden state without additional training
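The embedding claim follows directly from the architecture: after a sentence has been fed through the model, the recurrent state is already a fixed-size summary of it. Below is a minimal sketch using the `rwkv` pip package; the checkpoint path and token ids are placeholders, and the pooling choice (flattening and concatenating the per-layer state tensors) is one plausible recipe, not an official one.

```python
import torch
from rwkv.model import RWKV  # pip install rwkv

# Placeholder checkpoint path; any downloaded RWKV .pth checkpoint works here.
model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')

token_ids = [101, 2023, 2003, 1037, 7953]  # placeholder token ids for a sentence
_, state = model.forward(token_ids, None)   # state: list of fixed-size per-layer tensors

# One plausible pooling: flatten and concatenate the state into a single vector.
embedding = torch.cat([s.flatten() for s in state])
print(embedding.shape)  # fixed size, independent of sentence length
```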
Architecture Overview
RWKV replaces multi-head attention with time-mixing and channel-mixing blocks that operate as a linear recurrence. During training, the recurrence is expressed as a parallel scan across the sequence, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token by token, giving constant memory usage regardless of context length. This makes RWKV well suited to long-context and streaming applications.
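To make the constant-memory property concrete, here is a toy recurrence in the spirit of RWKV's time-mixing. It is a deliberate simplification: the real RWKV-7 update uses learned, data-dependent decay and a matrix-valued state, and all names below are illustrative.

```python
import numpy as np

def init_state(d_model: int) -> np.ndarray:
    """Fixed-size state: one value per channel, independent of context length."""
    return np.zeros(d_model)

def time_mix_step(state: np.ndarray, x_t: np.ndarray, decay: np.ndarray):
    """One token step: blend the incoming token into the running state.

    state <- decay * state + (1 - decay) * x_t
    The output is read from the state, so memory stays O(1) per step.
    """
    new_state = decay * state + (1.0 - decay) * x_t
    return new_state, new_state  # (updated state, output for this token)

d = 8
decay = np.full(d, 0.9)       # per-channel decay (learned in the real model)
state = init_state(d)
for t in range(1000):          # sequence length never affects memory use
    x_t = np.random.randn(d)   # stand-in for a token's channel-mixed input
    state, out = time_mix_step(state, x_t, decay)
```

The point of the sketch is that the loop body touches only the fixed-size `state`, so the cost is the same at token 10 and token 10,000.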
Self-Hosting & Configuration
- Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
- The `rwkv` Python package supports CPU, CUDA, and quantized inference strategies
- Strategy strings like `cuda fp16` or `cpu fp32` control device placement and precision (see the usage sketch after this list)
- RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
- Fine-tuning is supported via LoRA with the official RWKV-LM training scripts
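As a concrete starting point, the sketch below loads a checkpoint with the `rwkv` package and generates text through its pipeline helper. The checkpoint path is a placeholder; the vocab name shown is the one the package bundles for the World-series models, so adjust it for other model families.

```python
import os
os.environ['RWKV_JIT_ON'] = '1'  # JIT flag read by the rwkv package at import time

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Placeholder path to a downloaded checkpoint; the strategy string picks
# device and precision (e.g. 'cpu fp32' if no CUDA GPU is available).
model = RWKV(model='/path/to/RWKV-model.pth', strategy='cuda fp16')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')  # bundled vocab for World models

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
print(pipeline.generate('The capital of France is', token_count=50, args=args))
```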
Key Features
- Linear time and constant memory inference, enabling arbitrarily long contexts
- Training parallelism on par with Transformers using the parallel scan formulation
- Competitive benchmark scores with GPT-class models at equivalent parameter counts
- Native streaming inference without the need for KV cache management (see the sketch after this list)
- Active community with multilingual models trained on diverse corpora
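The streaming claim can be seen in a short loop: the state returned by `forward` is all the context the model needs, so there is no cache to manage or truncate. A minimal sketch, with a placeholder checkpoint path and greedy sampling for brevity:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

# Prime the state with a prompt, then stream one token at a time.
state = None
out, state = model.forward(pipeline.encode('Once upon a time'), state)
for _ in range(100):
    token = int(out.argmax())                    # greedy pick; sample in practice
    print(pipeline.decode([token]), end='', flush=True)
    out, state = model.forward([token], state)   # same cost at every step
```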
Comparison with Similar Tools
- Llama / GPT — Standard Transformer LLMs; higher quality at large scale but quadratic attention cost
- Mamba — State-space model with similar linear complexity; different mathematical formulation, newer ecosystem
- RetNet — Microsoft's retention-based architecture; similar goals but less community adoption
- Linear Attention Transformers — Various approaches to linearize attention; RWKV's recurrence is a distinct design
- llama.cpp — Inference runtime for GGUF models; can run RWKV models after format conversion
FAQ
Q: How does RWKV quality compare to Transformers? A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models. The gap narrows with larger models and more training data.
Q: Can RWKV handle long documents? A: Yes. Because inference uses constant memory, RWKV can process arbitrarily long sequences without increasing VRAM usage or slowing down, making it ideal for long documents and streaming.
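In practice this means a long document can be fed through in chunks while carrying the state forward; the chunk size affects only per-call latency, never total memory. A sketch under the same placeholder setup as above:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

state = None
tokens = pipeline.encode(open('long_document.txt').read())  # placeholder file
for i in range(0, len(tokens), 256):                        # chunk size is arbitrary
    out, state = model.forward(tokens[i:i + 256], state)
# `state` now summarizes the whole document in constant memory;
# `out` holds logits conditioned on all of it.
```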
Q: Is RWKV compatible with existing LLM tooling? A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.
Q: What hardware do I need to run RWKV? A: Small models (e.g. 1.5B) run on consumer GPUs with 4 GB of VRAM, or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB of VRAM, or can be quantized to fit in less (see the strategy examples below).
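Quantization and device placement are both expressed through the strategy string. The examples below follow the strategy syntax documented by the `rwkv` package; the checkpoint paths are placeholders.

```python
from rwkv.model import RWKV

# 8-bit weights on GPU: roughly halves VRAM versus fp16.
model = RWKV(model='/path/to/RWKV-14B.pth', strategy='cuda fp16i8')

# Split layers across devices: first 20 layers on GPU in int8, the rest on CPU.
model = RWKV(model='/path/to/RWKV-14B.pth', strategy='cuda fp16i8 *20 -> cpu fp32')
```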