# RWKV — RNN Language Model with Transformer-Level Performance

> RWKV is an open-source large language model architecture that combines the training parallelism of Transformers with the constant-memory inference of RNNs, achieving competitive quality with linear time complexity and no KV cache.

## Quick Use

```bash
# Install the RWKV pip package
pip install rwkv

# Download a model (e.g., RWKV-7 1.5B)
# Models available at https://huggingface.co/BlinkDL

# Run inference
python -c "
from rwkv.model import RWKV
model = RWKV(model='path/to/model.pth', strategy='cuda fp16')
out, state = model.forward([187, 510, 1563], None)
"
```

## Introduction

RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.

## What RWKV Does

- Generates text with quality comparable to similarly sized Transformer models
- Runs inference with O(1) memory per token instead of the O(n) KV cache of attention-based models
- Trains efficiently on GPUs with full parallelism across the sequence dimension
- Supports arbitrarily long context at inference time with no growth in memory or per-token cost
- Provides free sentence embeddings from the hidden state without additional training

## Architecture Overview

RWKV replaces multi-head attention with a time-mixing and channel-mixing mechanism that operates as a linear recurrence. During training, the recurrence is unrolled into a parallel scan, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token by token, giving constant memory usage regardless of context length. This makes RWKV particularly well suited to long-context and streaming applications.
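To make the constant-memory property concrete, here is a toy sketch of a WKV-style linear recurrence, loosely modeled on the earlier RWKV-4 formulation (RWKV-7 uses a more expressive update, and real implementations add numerical-stability tricks). All names and shapes here are illustrative, not the library's internals:

```python
import numpy as np

def wkv_step(num, den, k, v, w):
    """One step of a simplified WKV-style recurrence.

    num, den : running numerator/denominator state, shape (d,)
    k, v     : key and value vectors for the current token, shape (d,)
    w        : per-channel decay rate (w > 0), shape (d,)
    """
    decay = np.exp(-w)                 # exponential time decay
    num = decay * num + np.exp(k) * v  # decay-weighted sum of past values
    den = decay * den + np.exp(k)      # decay-weighted sum of key weights
    out = num / (den + 1e-9)           # normalized mixture of past values
    return out, num, den

d = 8
rng = np.random.default_rng(0)
w = np.full(d, 0.5)                    # fixed decay for the demo
num, den = np.zeros(d), np.zeros(d)

for _ in range(1000):                  # the sequence keeps growing...
    k, v = rng.normal(size=d), rng.normal(size=d)
    out, num, den = wkv_step(num, den, k, v, w)

print(num.shape, den.shape)            # ...but the state stays (8,), (8,)
```

The same update can be evaluated for all positions at once during training (the parallel-scan form), which is how RWKV keeps Transformer-style training throughput.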
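At the API level, the fixed-size state shows up as a value you thread through successive `forward` calls. Below is a minimal greedy-decoding sketch building on the Quick Use snippet; the model path is a placeholder, the token IDs stand in for a real tokenizer, and it assumes `out` is a logits vector over the vocabulary:

```python
from rwkv.model import RWKV

model = RWKV(model='path/to/model.pth', strategy='cuda fp16')

prompt_ids = [187, 510, 1563]                 # pre-tokenized prompt
out, state = model.forward(prompt_ids, None)  # None starts a fresh state

generated = []
for _ in range(50):
    token = int(out.argmax())  # greedy pick; real decoding would sample
    generated.append(token)
    # Feed one token at a time: the fixed-size state carries all prior
    # context, so memory does not grow with generation length.
    out, state = model.forward([token], state)

print(generated)
```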
## Self-Hosting & Configuration

- Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
- The `rwkv` Python package supports CPU, CUDA, and quantized inference strategies
- Strategy strings like `cuda fp16` or `cpu fp32` control device placement and precision
- RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
- Fine-tuning is supported via LoRA with the official RWKV-LM training scripts

## Key Features

- Linear-time, constant-memory inference, enabling arbitrarily long contexts
- Training parallelism on par with Transformers via the parallel-scan formulation
- Competitive benchmark scores with GPT-class models at equivalent parameter counts
- Native streaming inference with no KV cache to manage
- Active community with multilingual models trained on diverse corpora

## Comparison with Similar Tools

- **Llama / GPT** — Standard Transformer LLMs; higher quality at very large scale, but quadratic attention cost
- **Mamba** — State-space model with similar linear complexity; different mathematical formulation and a newer ecosystem
- **RetNet** — Microsoft's retention-based architecture; similar goals but less community adoption
- **Linear Attention Transformers** — Various approaches to linearizing attention; RWKV's recurrence is a distinct design
- **llama.cpp** — Inference runtime for GGUF models; can run RWKV models after format conversion

## FAQ

**Q: How does RWKV quality compare to Transformers?**
A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models, and the gap narrows with more parameters and more training data.

**Q: Can RWKV handle long documents?**
A: Yes. Because inference uses a fixed-size state, RWKV can process arbitrarily long sequences without growing VRAM usage or per-token latency, which makes it well suited to long documents and streaming.

**Q: Is RWKV compatible with existing LLM tooling?**
A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.

**Q: What hardware do I need to run RWKV?**
A: Small models (1.5B) run on consumer GPUs with 4 GB of VRAM, or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB of VRAM, or can be quantized to fit on less.

## Sources

- https://github.com/BlinkDL/RWKV-LM
- https://wiki.rwkv.com