LLMLingua — Prompt Compression by Microsoft Research
The Problem
LLM API costs are directly tied to token count. Long contexts in RAG pipelines, multi-document QA, and chain-of-thought prompting can consume thousands of tokens per request. Additionally, LLMs suffer from the "lost-in-the-middle" problem — they focus on the beginning and end of long contexts, missing information in the middle.
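Since input-token spend scales linearly with prompt length, compression translates directly into savings. A back-of-the-envelope sketch — the per-1K-token price and traffic numbers below are illustrative assumptions, not real pricing:

```python
# Rough monthly input-cost estimate, assuming an illustrative price of
# $0.01 per 1K input tokens (check your provider's actual pricing).
PRICE_PER_1K_TOKENS = 0.01

def monthly_input_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Total input-token cost for one month of traffic."""
    return tokens_per_request / 1000 * PRICE_PER_1K_TOKENS * requests_per_month

original = monthly_input_cost(8000, 100_000)   # 8K-token RAG prompts
compressed = monthly_input_cost(400, 100_000)  # same prompts at 20x compression

print(f"original: ${original:,.0f}/mo, compressed: ${compressed:,.0f}/mo")
```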
The Solution
LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression with minimal performance loss.
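The core idea can be illustrated with a stdlib-only toy: score each token's informativeness and keep only the top-scoring ones up to a budget. Here rarity within the prompt stands in for the token-level perplexity signal that LLMLingua actually gets from a small language model:

```python
from collections import Counter

def compress(text: str, target_tokens: int) -> str:
    """Keep the `target_tokens` most informative tokens, preserving order.

    Importance here is a toy surprisal proxy (inverse in-text frequency);
    LLMLingua instead scores tokens with a small causal LM's perplexity.
    """
    tokens = text.split()
    freq = Counter(tokens)
    # Rank token positions by rarity: rarer tokens carry more information.
    ranked = sorted(range(len(tokens)), key=lambda i: freq[tokens[i]])
    keep = set(ranked[:target_tokens])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

prompt = "the report states that the new model achieves the best accuracy on the benchmark"
print(compress(prompt, 6))  # filler words like "the" are dropped first
```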
Three Methods
| Method | Paper | Compression | Speed |
|---|---|---|---|
| LLMLingua | EMNLP 2023 | Up to 20x | Baseline |
| LongLLMLingua | ACL 2024 | 4x (+ 21.4% RAG improvement) | Same |
| LLMLingua-2 | ACL 2024 Findings | Up to 20x | 3-6x faster |
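The context-level filter that LongLLMLingua adds can be pictured with a toy, stdlib-only sketch: rank retrieved documents by relevance to the question before token-level compression, so key content is not stranded in the middle of the prompt. Simple word overlap stands in for the question-aware perplexity score the real method computes with a small LM:

```python
import re

def rank_contexts(contexts: list[str], question: str) -> list[str]:
    """Order documents by relevance to the question, most relevant first.

    Toy relevance = word overlap with the question; LongLLMLingua instead
    scores each document with question-aware perplexity from a small LM.
    """
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(doc: str) -> int:
        return len(q_words & set(re.findall(r"\w+", doc.lower())))

    return sorted(contexts, key=overlap, reverse=True)

docs = [
    "Weather data for 2023 shows rising temperatures.",
    "The main conclusion is that compression preserves accuracy.",
    "Appendix: hyperparameter settings.",
]
print(rank_contexts(docs, "What is the main conclusion?")[0])
```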
Installation
```bash
pip install llmlingua
```

Usage Examples
Basic compression:
```python
from llmlingua import PromptCompressor

# LLMLingua-2 checkpoints require use_llmlingua2=True
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

compressed = compressor.compress_prompt(
    context=["Long document text here..."],
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=200,
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(compressed["compressed_prompt"])
```

For RAG pipelines (LongLLMLingua):
```python
from llmlingua import PromptCompressor

# Default model is a small causal LM used for perplexity scoring
compressor = PromptCompressor()

# Multiple retrieved documents
contexts = [
    "Document 1: ...",
    "Document 2: ...",
    "Document 3: ...",
]

compressed = compressor.compress_prompt(
    context=contexts,
    instruction="Answer based on the provided documents.",
    question="What is the main conclusion?",
    target_token=500,
    use_context_level_filter=True,  # LongLLMLingua feature
)
```

Performance Benchmarks
- 20x compression on general prompts with <2% performance drop
- 21.4% improvement on RAG tasks using only 1/4 of tokens (LongLLMLingua)
- 3-6x speed improvement with LLMLingua-2 (uses data distillation from GPT-4)
FAQ
Q: What is LLMLingua? A: A Microsoft Research toolkit that compresses LLM prompts by up to 20x while largely preserving task performance, reducing API costs and mitigating the lost-in-the-middle problem in long contexts.
Q: Is LLMLingua free? A: Yes, fully open-source under the MIT license.
Q: Does LLMLingua work with any LLM? A: Yes, LLMLingua compresses prompts before they are sent to any LLM. It works with OpenAI, Claude, Gemini, and any other model.
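Because the output is ordinary text, wiring it into any chat API takes no special handling. A minimal sketch — `build_messages` is a hypothetical helper written for this example, not part of LLMLingua, and the commented-out client call is a placeholder for your provider's SDK:

```python
def build_messages(compressed_prompt: str) -> list[dict]:
    """Wrap a compressed prompt as standard role/content chat messages.

    The compressed prompt is plain text, so any provider that accepts
    chat messages (OpenAI, Claude, Gemini, local models) can consume it.
    """
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": compressed_prompt},
    ]

# With a compressed result from compress_prompt(...):
# messages = build_messages(compressed["compressed_prompt"])
# response = your_client.chat(...)  # placeholder for any provider's SDK call
messages = build_messages("compressed prompt text ...")
print(messages[1]["content"])
```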