# RWKV — RNN Language Model with Transformer-Level Performance

> RWKV is an open-source large language model architecture that combines the training parallelism of Transformers with the constant-memory inference of RNNs, achieving competitive quality with linear time complexity and no KV cache.

## Quick Use

```bash
# Install the RWKV pip package
pip install rwkv

# Download a model (e.g., RWKV-7 1.5B)
# Models available at https://huggingface.co/BlinkDL

# Run inference
python -c "
from rwkv.model import RWKV
model = RWKV(model='path/to/model.pth', strategy='cuda fp16')
out, state = model.forward([187, 510, 1563], None)
"
```

## Introduction

RWKV (pronounced RwaKuv) is an architecture for large language models that replaces the attention mechanism with a linear-complexity recurrence, allowing it to be trained in parallel like a Transformer while running inference with constant memory like an RNN. Created by PENG Bo, it is now at version RWKV-7 (codename Goose) and offers a practical alternative for deployment scenarios where memory and latency matter.

## What RWKV Does

- Generates text with quality comparable to similarly sized Transformer models
- Runs inference with O(1) memory per token instead of the O(n) KV cache of attention-based models
- Trains efficiently on GPUs with full parallelism across the sequence dimension
- Supports arbitrarily long context at inference time with no growth in memory or per-token cost
- Provides free sentence embeddings from the hidden state without additional training

## Architecture Overview

RWKV replaces multi-head attention with a time-mixing and channel-mixing mechanism that operates as a linear recurrence. During training, the recurrence is unrolled into a parallel scan, making it as fast to train as a Transformer. During inference, the model maintains a fixed-size hidden state that is updated token by token, giving constant memory usage regardless of context length. This makes RWKV particularly well suited to long-context and streaming applications.
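To make the constant-memory property concrete, here is a toy sketch of a WKV-style linear recurrence, loosely modeled on the earlier RWKV-4 formulation (RWKV-7 uses a more expressive update, and real implementations add numerical-stability tricks). All names and shapes here are illustrative, not the library's internals:

```python
import numpy as np

def wkv_step(num, den, k, v, w):
    """One step of a simplified WKV-style recurrence.

    num, den : running numerator/denominator state, shape (d,)
    k, v     : key and value vectors for the current token, shape (d,)
    w        : per-channel decay rate (w > 0), shape (d,)
    """
    decay = np.exp(-w)                 # exponential time decay
    num = decay * num + np.exp(k) * v  # decay-weighted sum of past values
    den = decay * den + np.exp(k)      # decay-weighted sum of key weights
    out = num / (den + 1e-9)           # normalized mixture of past values
    return out, num, den

d = 8
rng = np.random.default_rng(0)
w = np.full(d, 0.5)                    # fixed decay for the demo
num, den = np.zeros(d), np.zeros(d)

for _ in range(1000):                  # the sequence keeps growing...
    k, v = rng.normal(size=d), rng.normal(size=d)
    out, num, den = wkv_step(num, den, k, v, w)

print(num.shape, den.shape)            # ...but the state stays (8,), (8,)
```

The same update can be evaluated for all positions at once during training (the parallel-scan form), which is how RWKV keeps Transformer-style training throughput.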
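At the API level, the fixed-size state shows up as a value you thread through successive `forward` calls. Below is a minimal greedy-decoding sketch building on the Quick Use snippet; the model path is a placeholder, the token IDs stand in for a real tokenizer, and it assumes `out` is a logits vector over the vocabulary:

```python
from rwkv.model import RWKV

model = RWKV(model='path/to/model.pth', strategy='cuda fp16')

prompt_ids = [187, 510, 1563]                 # pre-tokenized prompt
out, state = model.forward(prompt_ids, None)  # None starts a fresh state

generated = []
for _ in range(50):
    token = int(out.argmax())  # greedy pick; real decoding would sample
    generated.append(token)
    # Feed one token at a time: the fixed-size state carries all prior
    # context, so memory does not grow with generation length.
    out, state = model.forward([token], state)

print(generated)
```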
## Self-Hosting & Configuration

- Models are available in various sizes (0.1B to 14B parameters) on Hugging Face
- The `rwkv` Python package supports CPU, CUDA, and quantized inference strategies
- Strategy strings like `cuda fp16` or `cpu fp32` control device placement and precision
- RWKV models can be converted to GGUF format and served with llama.cpp or Ollama
- Fine-tuning is supported via LoRA with the official RWKV-LM training scripts

## Key Features

- Linear-time, constant-memory inference, enabling arbitrarily long contexts
- Training parallelism on par with Transformers via the parallel-scan formulation
- Competitive benchmark scores with GPT-class models at equivalent parameter counts
- Native streaming inference with no KV cache to manage
- Active community with multilingual models trained on diverse corpora

## Comparison with Similar Tools

- **Llama / GPT** — Standard Transformer LLMs; higher quality at very large scale, but quadratic attention cost
- **Mamba** — State-space model with similar linear complexity; different mathematical formulation and a newer ecosystem
- **RetNet** — Microsoft's retention-based architecture; similar goals but less community adoption
- **Linear Attention Transformers** — Various approaches to linearizing attention; RWKV's recurrence is a distinct design
- **llama.cpp** — Inference runtime for GGUF models; can run RWKV models after format conversion

## FAQ

**Q: How does RWKV quality compare to Transformers?**
A: At equivalent parameter counts and training data, RWKV-7 achieves benchmark scores within a few percent of Transformer models, and the gap narrows with more parameters and more training data.

**Q: Can RWKV handle long documents?**
A: Yes. Because inference uses a fixed-size state, RWKV can process arbitrarily long sequences without growing VRAM usage or per-token latency, which makes it well suited to long documents and streaming.

**Q: Is RWKV compatible with existing LLM tooling?**
A: RWKV models can be converted to GGUF format for use with llama.cpp, Ollama, and other standard inference servers. The official Python package also provides a straightforward API.

**Q: What hardware do I need to run RWKV?**
A: Small models (1.5B) run on consumer GPUs with 4 GB of VRAM, or even on CPU. Larger models (7B, 14B) benefit from GPUs with 16+ GB of VRAM, or can be quantized to fit on less.

## Sources

- https://github.com/BlinkDL/RWKV-LM
- https://wiki.rwkv.com