Configs · May 15, 2026 · 3 min read

RWKV — RNN Language Model with Transformer-Level Performance

RWKV is an open-source large language model architecture that combines the training parallelism of Transformers with the constant-memory inference of RNNs, achieving competitive quality with linear time complexity and no KV cache.

Agent ready

This asset can be read and installed directly by agents. TokRepo exposes a universal CLI command, install contract, metadata JSON, adapter-aware plan, and raw content links so agents can judge fit, risk, and next actions.

Native · 98/100 · Policy: allow

  • Agent surface: Any MCP/CLI agent
  • Kind: Skill
  • Install: Single
  • Trust: Established
  • Entrypoint: RWKV

Universal CLI install command:

npx tokrepo install ceeb3bc3-5016-11f1-9bc6-00163e2b0d79

Introduction

RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.

What RWKV Does

  • Generates text with quality comparable to similarly sized Transformer models
  • Runs inference with O(1) memory per token instead of O(n) for attention-based models
  • Trains efficiently on GPUs with full parallelism across the sequence dimension
  • Processes arbitrarily long inputs at inference time without hitting a context limit or growing memory use
  • Provides sentence embeddings for free from the hidden state, without additional training (see the sketch after this list)
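
One way to realize the embedding idea, as a sketch rather than an official embedding API: run the text through the model and pool the fixed-size recurrent state into a vector. The checkpoint path, the World-model tokenizer name, and the concatenation pooling below are illustrative assumptions.

import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# Placeholder path to a checkpoint downloaded from Hugging Face.
model = RWKV(model='/path/to/rwkv-checkpoint', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')   # World-model tokenizer

def embed(text: str) -> torch.Tensor:
    # forward() returns logits and the recurrent state, a list of
    # per-layer tensors whose total size is fixed.
    _, state = model.forward(pipeline.encode(text), None)
    # Pool the state into one vector; concatenation is one of many options.
    return torch.cat([s.flatten() for s in state])

print(embed('RWKV yields embeddings from its hidden state.').shape)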

Architecture Overview

RWKV replaces multi-head attention with a time-mixing and channel-mixing mechanism that operates as a linear recurrence. During training, the recurrence is unrolled into a parallel scan, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token by token, giving constant memory usage regardless of context length. This makes RWKV well suited for long-context and streaming applications.
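
To make the recurrence concrete, below is a deliberately simplified, RWKV-4-style sketch of the WKV time-mixing step in plain Python with NumPy. It omits token-shift, gating, and the matrix-valued state of RWKV-7; the point is that each token's update reads and writes a state of fixed size.

import numpy as np

def wkv_step(k_t, v_t, state, w, u):
    # One token of a simplified WKV recurrence.
    # k_t, v_t: per-channel key and value for the current token
    # state:    (num, den, m) = running numerator, denominator, log-scale max
    # w:        per-channel decay (negative); u: bonus for the current token
    num, den, m = state
    # Output: decayed history plus a bonus-weighted current token,
    # computed stably by factoring out the running maximum m.
    q = np.maximum(m, u + k_t)
    out = (num * np.exp(m - q) + v_t * np.exp(u + k_t - q)) / \
          (den * np.exp(m - q) + np.exp(u + k_t - q))
    # State update: decay the history by w, then fold in the current token.
    q = np.maximum(m + w, k_t)
    num = num * np.exp(m + w - q) + v_t * np.exp(k_t - q)
    den = den * np.exp(m + w - q) + np.exp(k_t - q)
    return out, (num, den, q)

d = 8                                             # channels, tiny for demo
state = (np.zeros(d), np.zeros(d), np.full(d, -1e30))
w, u = -0.5 * np.ones(d), 0.3 * np.ones(d)
for _ in range(10_000):                           # any length: state never grows
    k, v = np.random.randn(d), np.random.randn(d)
    out, state = wkv_step(k, v, state, w, u)

During training, the same quantities are computed for all positions at once with a parallel scan, which is what recovers Transformer-like training throughput.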

Self-Hosting & Configuration

  • Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
  • The rwkv Python package supports CPU, CUDA, and quantized inference strategies
  • Strategy strings like cuda fp16 or cpu fp32 control device placement and precision (see the loading sketch after this list)
  • RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
  • Fine-tuning is supported via LoRA with the official RWKV-LM training scripts
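
A minimal loading sketch with the rwkv package follows; the checkpoint path is a placeholder and the tokenizer name assumes a World-series model.

import os
os.environ['RWKV_JIT_ON'] = '1'                  # optional JIT speedup

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# The strategy string selects device and precision; e.g. 'cpu fp32' for
# full precision on CPU, or 'cuda fp16i8' for int8-quantized weights on GPU.
model = RWKV(model='/path/to/rwkv-checkpoint', strategy='cuda fp16')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')  # World-model tokenizer

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
print(pipeline.generate('The RWKV architecture', token_count=64, args=args))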

Key Features

  • Linear time and constant memory inference, enabling arbitrarily long contexts
  • Training parallelism on par with Transformers using the parallel scan formulation
  • Competitive benchmark scores with GPT-class models at equivalent parameter counts
  • Native streaming inference with no KV cache to manage (see the streaming sketch after this list)
  • Active community with multilingual models trained on diverse corpora
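
The streaming sketch below threads the fixed-size state through a token-by-token loop; the checkpoint path is a placeholder and greedy decoding stands in for proper sampling.

import numpy as np
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/rwkv-checkpoint', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

# Prime the state with the prompt; from here on, memory use is constant.
logits, state = model.forward(pipeline.encode('RWKV streams tokens'), None)

for _ in range(32):
    token = int(np.argmax(logits.float().cpu().numpy()))  # greedy for brevity
    print(pipeline.decode([token]), end='', flush=True)
    logits, state = model.forward([token], state)         # state size is fixed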

Comparison with Similar Tools

  • Llama / GPT — Standard Transformer LLMs; higher quality at large scale but quadratic attention cost
  • Mamba — State-space model with similar linear complexity; different mathematical formulation, newer ecosystem
  • RetNet — Microsoft's retention-based architecture; similar goals but less community adoption
  • Linear Attention Transformers — Various approaches to linearize attention; RWKV's recurrence is a distinct design
  • llama.cpp — Inference runtime for GGUF models; can run RWKV models after format conversion

FAQ

Q: How does RWKV quality compare to Transformers? A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models. The gap narrows with larger models and more training data.

Q: Can RWKV handle long documents? A: Yes. Because inference uses constant memory, RWKV can process arbitrarily long sequences without increasing VRAM usage or slowing down, making it ideal for long documents and streaming.

Q: Is RWKV compatible with existing LLM tooling? A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.

Q: What hardware do I need to run RWKV? A: Small models (around 1.5B parameters) run on consumer GPUs with 4 GB of VRAM, or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB of VRAM, or can use quantization to fit on less.
