
Prompts · Apr 9, 2026 · 2 min read

LLMLingua — Compress Prompts 20x with Minimal Loss

A Microsoft Research tool for prompt compression. Reduces token usage by up to 20x while maintaining LLM performance. Addresses the lost-in-the-middle problem in RAG. MIT licensed, 6,000+ stars.

Script Depot · Community

Agent-ready

Install with prior review

This asset requires review. The copied prompt requests a dry run, shows the writes it would make, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm

Agent surface

Any MCP/CLI agent

Type

Prompt

Installation

Single

Trust

Trust: Established

Entry point

step-1.md

Command with prior review

npx -y tokrepo@latest install 1510da0c-33d7-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first and confirm the writes, then run this command.

TL;DR

LLMLingua compresses LLM prompts by up to 20x while preserving response quality, cutting token costs.

§01

What it is

LLMLingua is a prompt compression tool from Microsoft Research. It reduces the token count of prompts sent to large language models by up to 20x while maintaining the quality of model outputs. It also addresses the lost-in-the-middle problem in RAG pipelines, where long contexts make it harder for the model to use the passages that were retrieved.

Developers running RAG systems, long-context applications, or any LLM workflow where token costs matter will benefit from LLMLingua's compression capabilities.

§02

How it saves time or tokens

LLMLingua directly reduces token consumption. A 10,000-token prompt compressed to 500 tokens is a 20x reduction, which cuts the input-token cost of that call by 95%. For RAG pipelines, compression can also improve result quality by removing redundant context that causes the model to lose focus on relevant passages.
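The arithmetic above can be made concrete with a small sketch. The price per million input tokens is a made-up placeholder rather than a real provider rate, and `compression_summary` is a hypothetical helper name. Only input tokens are counted, on the assumption that compression leaves the length of the model's output unchanged.

```python
def compression_summary(original_tokens: int, compressed_tokens: int,
                        usd_per_million_input_tokens: float) -> dict:
    """Summarize what a compression step saves on input tokens only."""
    if original_tokens <= 0 or not 0 <= compressed_tokens <= original_tokens:
        raise ValueError("expected original_tokens > 0 and 0 <= compressed_tokens <= original_tokens")
    saved_tokens = original_tokens - compressed_tokens
    return {
        # "Nx" compression ratio, e.g. 20.0 for a 20x reduction
        "ratio": original_tokens / compressed_tokens if compressed_tokens else float("inf"),
        # Share of input tokens (and so of input cost) removed
        "fraction_saved": saved_tokens / original_tokens,
        "usd_saved_per_call": saved_tokens * usd_per_million_input_tokens / 1_000_000,
    }


# The 10,000 -> 500 token example from the text, at a placeholder price
# of 1 USD per million input tokens (an assumption for illustration).
summary = compression_summary(10_000, 500, usd_per_million_input_tokens=1.0)
print(summary)
```

Multiplying `usd_saved_per_call` by the expected number of calls gives a rough budget figure to weigh against the cost of running the compression model.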

§03

How to use

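1. Install LLMLingua via pip (pip install llmlingua).
2. Load a small compression model. The example below uses the BERT-based LLMLingua-2 model; the original LLMLingua uses a LLaMA-based model.
3. Pass your prompt through the compressor before sending it to the target LLM.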

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name='microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank',
    use_llmlingua2=True,
)

original_prompt = 'Your very long prompt with lots of context...'
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,  # Keep 50% of tokens
    force_tokens=['important_keyword'],
)
print(f'Compressed from {compressed["origin_tokens"]} to {compressed["compressed_tokens"]} tokens')
print(compressed['compressed_prompt'])

§04

Example

In a RAG pipeline, compress retrieved documents before sending them as context:

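# Retrieved documents from vector search.
# retrieve_relevant_docs, llm, and query are placeholders for your own
# retriever, LLM client, and user question; compressor is the
# PromptCompressor created in the previous section.
documents = retrieve_relevant_docs(query)
context = '\n'.join(documents)

# Compress only the context (keep about 30% of tokens), then append the question
compressed = compressor.compress_prompt(context, rate=0.3)
response = llm.generate(f'{compressed["compressed_prompt"]}\n\nQuestion: {query}')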
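The same pattern can be sketched in a form that runs without a retriever, an LLM client, or a downloaded model. Here `build_rag_prompt` and `keep_first_sentence` are hypothetical names introduced for illustration, and the stand-in compressor is a deliberately crude substitute for LLMLingua, not a description of how it works. The point is the ordering: the context is compressed first and the question is appended afterwards, so the question itself is never shortened.

```python
from typing import Callable, Sequence


def build_rag_prompt(documents: Sequence[str], query: str,
                     compress: Callable[[str], str]) -> str:
    """Join retrieved documents, compress only the context, then append the question."""
    context = "\n".join(documents)
    compressed_context = compress(context)
    # The question is added after compression so it reaches the LLM intact.
    return f"{compressed_context}\n\nQuestion: {query}"


def keep_first_sentence(text: str) -> str:
    """Stand-in compressor for this sketch: keeps the first sentence of each line."""
    return "\n".join(line.split(". ")[0] for line in text.splitlines())


docs = [
    "Paris is the capital of France. It has many museums.",
    "Berlin is the capital of Germany. It has many parks.",
]
prompt = build_rag_prompt(docs, "What is the capital of France?", keep_first_sentence)
print(prompt)
```

To use the real compressor, one option is to pass `lambda text: compressor.compress_prompt(text, rate=0.3)["compressed_prompt"]` as `compress`, assuming `compressor` is the `PromptCompressor` created in the usage section.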

§05

Related on TokRepo

AI tools for RAG — Retrieval-augmented generation tools and techniques
AI tools for content — Content processing and optimization

§06

Common pitfalls

Compression requires running a small model locally, which adds latency. Weigh compression time against the token cost it saves.
Aggressive settings (a rate below 0.2, meaning fewer than 20% of tokens are kept) can remove critical information. Start with conservative rates and test output quality.
The compression model runs fastest on a GPU, where it needs GPU memory. In CPU-only environments, expect slower processing on long prompts.
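One way to act on the advice to start conservative is to sweep from high rates to low ones and stop as soon as something required disappears. The sketch below is a minimal version under stated assumptions: `lowest_safe_rate` and `keep_leading_words` are hypothetical names, the stand-in compressor simply truncates, and a substring check is only a crude proxy for output quality. A real evaluation would compare the target LLM's answers.

```python
from typing import Callable, Iterable, Optional, Sequence


def lowest_safe_rate(prompt: str, must_keep: Sequence[str],
                     compress: Callable[[str, float], str],
                     rates: Iterable[float]) -> Optional[float]:
    """Return the smallest rate whose output still contains every must_keep string."""
    safe = None
    # Start conservative (high rate), then get progressively more aggressive.
    for rate in sorted(rates, reverse=True):
        output = compress(prompt, rate)
        if all(term in output for term in must_keep):
            safe = rate
        else:
            break  # stop at the first rate that drops something required
    return safe


def keep_leading_words(text: str, rate: float) -> str:
    """Stand-in compressor for this sketch: keeps the first `rate` share of the words."""
    words = text.split()
    return " ".join(words[:max(1, round(len(words) * rate))])


prompt = "Invoice 4417 is overdue and must be paid by Friday or the account is suspended"
best = lowest_safe_rate(prompt, ["4417", "Friday"], keep_leading_words, [0.8, 0.6, 0.4, 0.2])
print(best)
```

With LLMLingua, terms that must survive can also be passed through the `force_tokens` argument shown in the usage example, which reduces the need for this kind of check on known keywords.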

Frequently asked questions

How much compression can LLMLingua achieve?

LLMLingua can compress prompts by up to 20x in optimal cases. Typical compression ratios of 2-5x preserve most of the original information. The actual ratio depends on how much redundancy your prompts contain.
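These "Nx" ratios relate to the `rate` argument in the usage example, which its comment describes as the fraction of tokens to keep. Under that reading, an Nx ratio corresponds to a rate of roughly 1/N. The helper name `rate_for_ratio` is hypothetical, and the ratio actually achieved may differ from the one requested.

```python
def rate_for_ratio(compression_ratio: float) -> float:
    """Convert an 'Nx' compression ratio into the fraction of tokens to keep."""
    if compression_ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    return 1.0 / compression_ratio


for ratio in (2, 5, 20):
    print(f"{ratio}x -> rate={rate_for_ratio(ratio):.2f}")
```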

Does compression affect LLM output quality?

At moderate compression rates (2-5x), output quality is largely preserved. Microsoft's research shows minimal performance degradation on benchmarks. However, aggressive compression (10-20x) may lose subtle nuances in the original prompt.

What is the lost-in-the-middle problem?

Lost-in-the-middle refers to the tendency of LLMs to pay less attention to information in the middle of long contexts, focusing more on the beginning and end. LLMLingua addresses this by removing redundant content, which shortens the context and leaves less filler around the relevant passages.

Can I use LLMLingua with any LLM?

Yes. LLMLingua compresses the prompt before it reaches the target LLM, so it works with OpenAI, Anthropic, open-source models, or any LLM API. The compression step is model-agnostic.

What is LLMLingua-2?

LLMLingua-2 is the improved version that uses a BERT-based model for faster compression. It achieves better compression quality with lower latency compared to the original LLMLingua, which required a larger LLaMA-based model.

Cited sources (3)

LLMLingua GitHub: LLMLingua compresses prompts up to 20x with minimal quality loss
LLMLingua Paper: Addresses lost-in-the-middle problem in long-context LLM applications
LLMLingua-2 Paper: LLMLingua-2 uses BERT-based models for faster compression


Source and acknowledgments

Created by Microsoft Research. Licensed under MIT.

LLMLingua — ⭐ 6,000+

Thanks to Huiqiang Jiang, Qianhui Wu, and the Microsoft Research team for advancing prompt compression.

