Introduction
RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.
What RWKV Does
- Generates text with quality comparable to similarly-sized Transformer models
- Runs inference in constant (O(1)) memory: the recurrent state is fixed-size, whereas attention-based models keep a KV cache that grows O(n) with context length
- Trains efficiently on GPUs with full parallelism across the sequence dimension
- Supports arbitrarily long context at inference time, since the fixed-size state never grows with sequence length
- Provides free sentence embeddings from the hidden state without additional training
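The embedding claim follows directly from the architecture: after a sentence has been fed through the model, the recurrent state is already a fixed-size summary of it. Below is a minimal sketch using the `rwkv` pip package; the checkpoint path and token ids are placeholders, and the pooling choice (flattening and concatenating the per-layer state tensors) is one plausible recipe, not an official one.

```python
import torch
from rwkv.model import RWKV  # pip install rwkv

# Placeholder checkpoint path; any downloaded RWKV .pth checkpoint works here.
model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')

token_ids = [101, 2023, 2003, 1037, 7953]  # placeholder token ids for a sentence
_, state = model.forward(token_ids, None)   # state: list of fixed-size per-layer tensors

# One plausible pooling: flatten and concatenate the state into a single vector.
embedding = torch.cat([s.flatten() for s in state])
print(embedding.shape)  # fixed size, independent of sentence length
```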
Architecture Overview
RWKV replaces multi-head attention with time-mixing and channel-mixing blocks that operate as a linear recurrence. During training, the recurrence is expressed as a parallel scan across the sequence, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token by token, giving constant memory usage regardless of context length. This makes RWKV well suited to long-context and streaming applications.
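To make the constant-memory property concrete, here is a toy recurrence in the spirit of RWKV's time-mixing. It is a deliberate simplification: the real RWKV-7 update uses learned, data-dependent decay and a matrix-valued state, and all names below are illustrative.

```python
import numpy as np

def init_state(d_model: int) -> np.ndarray:
    """Fixed-size state: one value per channel, independent of context length."""
    return np.zeros(d_model)

def time_mix_step(state: np.ndarray, x_t: np.ndarray, decay: np.ndarray):
    """One token step: blend the incoming token into the running state.

    state <- decay * state + (1 - decay) * x_t
    The output is read from the state, so memory stays O(1) per step.
    """
    new_state = decay * state + (1.0 - decay) * x_t
    return new_state, new_state  # (updated state, output for this token)

d = 8
decay = np.full(d, 0.9)       # per-channel decay (learned in the real model)
state = init_state(d)
for t in range(1000):          # sequence length never affects memory use
    x_t = np.random.randn(d)   # stand-in for a token's channel-mixed input
    state, out = time_mix_step(state, x_t, decay)
```

The point of the sketch is that the loop body touches only the fixed-size `state`, so the cost is the same at token 10 and token 10,000.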
Self-Hosting & Configuration
- Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
- The `rwkv` Python package supports CPU, CUDA, and quantized inference strategies
- Strategy strings like `cuda fp16` or `cpu fp32` control device placement and precision (see the usage sketch after this list)
- RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
- Fine-tuning is supported via LoRA with the official RWKV-LM training scripts
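As a concrete starting point, the sketch below loads a checkpoint with the `rwkv` package and generates text through its pipeline helper. The checkpoint path is a placeholder; the vocab name shown is the one the package bundles for the World-series models, so adjust it for other model families.

```python
import os
os.environ['RWKV_JIT_ON'] = '1'  # JIT flag read by the rwkv package at import time

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Placeholder path to a downloaded checkpoint; the strategy string picks
# device and precision (e.g. 'cpu fp32' if no CUDA GPU is available).
model = RWKV(model='/path/to/RWKV-model.pth', strategy='cuda fp16')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')  # bundled vocab for World models

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
print(pipeline.generate('The capital of France is', token_count=50, args=args))
```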
Key Features
- Linear time and constant memory inference, enabling arbitrarily long contexts
- Training parallelism on par with Transformers using the parallel scan formulation
- Competitive benchmark scores with GPT-class models at equivalent parameter counts
- Native streaming inference without the need for KV cache management (see the sketch after this list)
- Active community with multilingual models trained on diverse corpora
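The streaming claim can be seen in a short loop: the state returned by `forward` is all the context the model needs, so there is no cache to manage or truncate. A minimal sketch, with a placeholder checkpoint path and greedy sampling for brevity:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

# Prime the state with a prompt, then stream one token at a time.
state = None
out, state = model.forward(pipeline.encode('Once upon a time'), state)
for _ in range(100):
    token = int(out.argmax())                    # greedy pick; sample in practice
    print(pipeline.decode([token]), end='', flush=True)
    out, state = model.forward([token], state)   # same cost at every step
```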
Comparison with Similar Tools
- Llama / GPT — Standard Transformer LLMs; higher quality at large scale but quadratic attention cost
- Mamba — State-space model with similar linear complexity; different mathematical formulation, newer ecosystem
- RetNet — Microsoft's retention-based architecture; similar goals but less community adoption
- Linear Attention Transformers — Various approaches to linearize attention; RWKV's recurrence is a distinct design
- llama.cpp — Inference runtime for GGUF models; can run RWKV models after format conversion
FAQ
Q: How does RWKV quality compare to Transformers? A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models. The gap narrows with larger models and more training data.
Q: Can RWKV handle long documents? A: Yes. Because inference uses constant memory, RWKV can process arbitrarily long sequences without increasing VRAM usage or slowing down, making it ideal for long documents and streaming.
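In practice this means a long document can be fed through in chunks while carrying the state forward; the chunk size affects only per-call latency, never total memory. A sketch under the same placeholder setup as above:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/RWKV-model.pth', strategy='cpu fp32')
pipeline = PIPELINE(model, 'rwkv_vocab_v20230424')

state = None
tokens = pipeline.encode(open('long_document.txt').read())  # placeholder file
for i in range(0, len(tokens), 256):                        # chunk size is arbitrary
    out, state = model.forward(tokens[i:i + 256], state)
# `state` now summarizes the whole document in constant memory;
# `out` holds logits conditioned on all of it.
```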
Q: Is RWKV compatible with existing LLM tooling? A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.
Q: What hardware do I need to run RWKV? A: Small models (e.g. 1.5B) run on consumer GPUs with 4 GB of VRAM, or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB of VRAM, or can be quantized to fit in less (see the strategy examples below).
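Quantization and device placement are both expressed through the strategy string. The examples below follow the strategy syntax documented by the `rwkv` package; the checkpoint paths are placeholders.

```python
from rwkv.model import RWKV

# 8-bit weights on GPU: roughly halves VRAM versus fp16.
model = RWKV(model='/path/to/RWKV-14B.pth', strategy='cuda fp16i8')

# Split layers across devices: first 20 layers on GPU in int8, the rest on CPU.
model = RWKV(model='/path/to/RWKV-14B.pth', strategy='cuda fp16i8 *20 -> cpu fp32')
```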