Scripts · Apr 9, 2026 · 2 min read

LLMLingua — Compress Prompts 20x with Minimal Loss

A Microsoft Research tool for prompt compression: reduce token usage by up to 20x while maintaining LLM performance, and mitigate the lost-in-the-middle problem in RAG. MIT-licensed, 6,000+ GitHub stars.

Script Depot · Community
Quick Use

Use it first, then decide how deep to go

Everything you need to copy, install, and run first:

  1. Install:
pip install llmlingua
  2. Compress a prompt:
from llmlingua import PromptCompressor

compressor = PromptCompressor()
result = compressor.compress_prompt(
    context=["Your long context here..."],
    instruction="Summarize the key points.",
    target_token=500
)
print(result["compressed_prompt"])

Intro

LLMLingua is Microsoft Research's prompt compression toolkit with 6,000+ GitHub stars, published at EMNLP 2023 and ACL 2024. It reduces prompt length by up to 20x while preserving LLM performance, saving significant API costs. LLMLingua-2 offers 3-6x speed improvement over the original through GPT-4 data distillation. Especially effective for RAG pipelines where long retrieved contexts cause the "lost-in-the-middle" problem. Best for developers building production LLM apps who need to optimize token usage and costs.

See also: TokenCost for tracking LLM spending on TokRepo.


LLMLingua — Prompt Compression by Microsoft Research

The Problem

LLM API costs are directly tied to token count. Long contexts in RAG pipelines, multi-document QA, and chain-of-thought prompting can consume thousands of tokens per request. Additionally, LLMs suffer from the "lost-in-the-middle" problem — they focus on the beginning and end of long contexts, missing information in the middle.
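The savings compound quickly at production volume. A back-of-the-envelope sketch (the per-token price and request volume below are illustrative assumptions, not quoted rates):

```python
# Illustrative cost math: what 20x prompt compression saves at scale.
# PRICE_PER_1K_INPUT_TOKENS is an assumed example rate, not a real price.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD, assumed for illustration

def monthly_input_cost(tokens_per_request, requests_per_month, compression=1):
    """Monthly input-token spend, optionally with prompt compression applied."""
    effective_tokens = tokens_per_request / compression
    return effective_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_month

baseline = monthly_input_cost(8000, 100_000)                      # 8k-token RAG prompts
compressed = monthly_input_cost(8000, 100_000, compression=20)    # after 20x compression
print(f"baseline: ${baseline:,.0f}/mo, compressed: ${compressed:,.0f}/mo")
# → baseline: $8,000/mo, compressed: $400/mo
```

At these assumed numbers, 20x compression turns an $8,000/month input bill into $400/month.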

The Solution

LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression with minimal performance loss.
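Conceptually, the small model scores each token by how predictable it is and drops the most predictable ones, since they carry the least information. A toy, stdlib-only sketch of that idea, using word frequency as a crude stand-in for the small LM's probability estimate (the real library uses a small transformer, and LLMLingua-2 a trained token classifier, so this is an illustration of the principle only):

```python
from collections import Counter
import math

def compress(text, keep_ratio=0.5):
    """Toy token-dropping sketch: keep the least-predictable words.

    Word frequency stands in for a small LM's probability estimate;
    rare words have higher surprisal (-log p) and are kept, while
    highly predictable words are dropped. Original order is preserved.
    """
    words = text.split()
    freq = Counter(w.lower() for w in words)
    total = len(words)
    # Surprisal proxy: -log p(word); higher = more informative.
    scored = [(-math.log(freq[w.lower()] / total), i, w) for i, w in enumerate(words)]
    keep = int(len(words) * keep_ratio)
    kept = sorted(sorted(scored, reverse=True)[:keep], key=lambda t: t[1])
    return " ".join(w for _, _, w in kept)

print(compress("the cat sat on the mat because the mat was warm", keep_ratio=0.6))
# → cat sat on because was warm
```

Even this crude proxy drops the repeated filler ("the", one "mat") and keeps the content words, which is the intuition behind perplexity-based compression.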

Three Methods

Method        | Paper             | Compression                  | Speed
LLMLingua     | EMNLP 2023        | Up to 20x                    | Baseline
LongLLMLingua | ACL 2024          | 4x (+21.4% RAG improvement)  | Same as baseline
LLMLingua-2   | ACL 2024 Findings | Up to 20x                    | 3-6x faster

Installation

pip install llmlingua

Usage Examples

Basic compression:

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank"
)

compressed = compressor.compress_prompt(
    context=["Long document text here..."],
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=200
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(compressed["compressed_prompt"])

For RAG pipelines (LongLLMLingua):

from llmlingua import PromptCompressor

compressor = PromptCompressor()

# Multiple retrieved documents
contexts = [
    "Document 1: ...",
    "Document 2: ...",
    "Document 3: ..."
]

compressed = compressor.compress_prompt(
    context=contexts,
    instruction="Answer based on the provided documents.",
    question="What is the main conclusion?",
    target_token=500,
    use_context_level_filter=True  # LongLLMLingua feature
)
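Beyond filtering, placing the strongest documents at the edges of the prompt, where models attend best, is a common mitigation for lost-in-the-middle. A stdlib-only sketch of one such edge-placement heuristic (an illustration of the idea, not the library's exact reordering strategy; it assumes documents arrive already ranked most-relevant-first):

```python
def edge_order(docs_by_relevance):
    """Place documents ranked most-relevant-first at the prompt's edges.

    Rank 1 goes to the front, rank 2 to the back, rank 3 second,
    rank 4 second-to-last, and so on -- the weakest documents end up
    in the middle, where lost-in-the-middle hurts least.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(edge_order(["doc1", "doc2", "doc3", "doc4", "doc5"]))
# → ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The best document opens the prompt and the second-best closes it, so the positions the model reads most carefully hold the most useful context.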

Performance Benchmarks

  • 20x compression on general prompts with <2% performance drop
  • 21.4% improvement on RAG tasks using only 1/4 of tokens (LongLLMLingua)
  • 3-6x speed improvement with LLMLingua-2 (uses data distillation from GPT-4)

FAQ

Q: What is LLMLingua? A: A Microsoft Research toolkit for compressing LLM prompts by up to 20x while maintaining performance, reducing API costs and solving the lost-in-the-middle problem in long contexts.

Q: Is LLMLingua free? A: Yes, fully open-source under the MIT license.

Q: Does LLMLingua work with any LLM? A: Yes, LLMLingua compresses prompts before they are sent to any LLM. It works with OpenAI, Claude, Gemini, and any other model.



Source & Thanks

Created by Microsoft Research. Licensed under MIT.

LLMLingua — ⭐ 6,000+

Thanks to Huiqiang Jiang, Qianhui Wu, and the Microsoft Research team for advancing prompt compression.
