LLMLingua — Prompt Compression by Microsoft Research
The Problem
LLM API costs are directly tied to token count. Long contexts in RAG pipelines, multi-document QA, and chain-of-thought prompting can consume thousands of tokens per request. Additionally, LLMs suffer from the "lost-in-the-middle" problem — they focus on the beginning and end of long contexts, missing information in the middle.
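Since input-token spend scales linearly with prompt length, compression translates directly into savings. A back-of-the-envelope sketch — the per-1K-token price and traffic numbers below are illustrative assumptions, not real pricing:

```python
# Rough monthly input-cost estimate, assuming an illustrative price of
# $0.01 per 1K input tokens (check your provider's actual pricing).
PRICE_PER_1K_TOKENS = 0.01

def monthly_input_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Total input-token cost for one month of traffic."""
    return tokens_per_request / 1000 * PRICE_PER_1K_TOKENS * requests_per_month

original = monthly_input_cost(8000, 100_000)   # 8K-token RAG prompts
compressed = monthly_input_cost(400, 100_000)  # same prompts at 20x compression

print(f"original: ${original:,.0f}/mo, compressed: ${compressed:,.0f}/mo")
```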
The Solution
LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression with minimal performance loss.
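The core idea can be illustrated with a stdlib-only toy: score each token's informativeness and keep only the top-scoring ones up to a budget. Here rarity within the prompt stands in for the token-level perplexity signal that LLMLingua actually gets from a small language model:

```python
from collections import Counter

def compress(text: str, target_tokens: int) -> str:
    """Keep the `target_tokens` most informative tokens, preserving order.

    Importance here is a toy surprisal proxy (inverse in-text frequency);
    LLMLingua instead scores tokens with a small causal LM's perplexity.
    """
    tokens = text.split()
    freq = Counter(tokens)
    # Rank token positions by rarity: rarer tokens carry more information.
    ranked = sorted(range(len(tokens)), key=lambda i: freq[tokens[i]])
    keep = set(ranked[:target_tokens])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

prompt = "the report states that the new model achieves the best accuracy on the benchmark"
print(compress(prompt, 6))  # filler words like "the" are dropped first
```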
Three Methods
| Method | Paper | Compression | Speed |
|---|---|---|---|
| LLMLingua | EMNLP 2023 | Up to 20x | Baseline |
| LongLLMLingua | ACL 2024 | 4x (+ 21.4% RAG improvement) | Same |
| LLMLingua-2 | ACL 2024 Findings | Up to 20x | 3-6x faster |
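The context-level filter that LongLLMLingua adds can be pictured with a toy, stdlib-only sketch: rank retrieved documents by relevance to the question before token-level compression, so key content is not stranded in the middle of the prompt. Simple word overlap stands in for the question-aware perplexity score the real method computes with a small LM:

```python
import re

def rank_contexts(contexts: list[str], question: str) -> list[str]:
    """Order documents by relevance to the question, most relevant first.

    Toy relevance = word overlap with the question; LongLLMLingua instead
    scores each document with question-aware perplexity from a small LM.
    """
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(doc: str) -> int:
        return len(q_words & set(re.findall(r"\w+", doc.lower())))

    return sorted(contexts, key=overlap, reverse=True)

docs = [
    "Weather data for 2023 shows rising temperatures.",
    "The main conclusion is that compression preserves accuracy.",
    "Appendix: hyperparameter settings.",
]
print(rank_contexts(docs, "What is the main conclusion?")[0])
```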
Installation
```bash
pip install llmlingua
```

Usage Examples
Basic compression:
```python
from llmlingua import PromptCompressor

# LLMLingua-2 checkpoints require use_llmlingua2=True
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

compressed = compressor.compress_prompt(
    context=["Long document text here..."],
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=200,
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(compressed["compressed_prompt"])
```

For RAG pipelines (LongLLMLingua):
```python
from llmlingua import PromptCompressor

# Default model is a small causal LM used for perplexity scoring
compressor = PromptCompressor()

# Multiple retrieved documents
contexts = [
    "Document 1: ...",
    "Document 2: ...",
    "Document 3: ...",
]

compressed = compressor.compress_prompt(
    context=contexts,
    instruction="Answer based on the provided documents.",
    question="What is the main conclusion?",
    target_token=500,
    use_context_level_filter=True,  # LongLLMLingua feature
)
```

Performance Benchmarks
- 20x compression on general prompts with <2% performance drop
- 21.4% improvement on RAG tasks using only 1/4 of tokens (LongLLMLingua)
- 3-6x speed improvement with LLMLingua-2 (uses data distillation from GPT-4)
FAQ
Q: What is LLMLingua? A: A Microsoft Research toolkit that compresses LLM prompts by up to 20x while largely preserving task performance, reducing API costs and mitigating the lost-in-the-middle problem in long contexts.
Q: Is LLMLingua free? A: Yes, fully open-source under the MIT license.
Q: Does LLMLingua work with any LLM? A: Yes, LLMLingua compresses prompts before they are sent to any LLM. It works with OpenAI, Claude, Gemini, and any other model.
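Because the output is ordinary text, wiring it into any chat API takes no special handling. A minimal sketch — `build_messages` is a hypothetical helper written for this example, not part of LLMLingua, and the commented-out client call is a placeholder for your provider's SDK:

```python
def build_messages(compressed_prompt: str) -> list[dict]:
    """Wrap a compressed prompt as standard role/content chat messages.

    The compressed prompt is plain text, so any provider that accepts
    chat messages (OpenAI, Claude, Gemini, local models) can consume it.
    """
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": compressed_prompt},
    ]

# With a compressed result from compress_prompt(...):
# messages = build_messages(compressed["compressed_prompt"])
# response = your_client.chat(...)  # placeholder for any provider's SDK call
messages = build_messages("compressed prompt text ...")
print(messages[1]["content"])
```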