Scripts · Apr 9, 2026 · 2 min read

LLMLingua — Compress Prompts 20x with Minimal Loss

Microsoft research tool for prompt compression. Reduce token usage up to 20x while maintaining LLM performance. Solves lost-in-the-middle for RAG. MIT, 6,000+ stars.

Introduction

LLMLingua is Microsoft Research's prompt compression toolkit with 6,000+ GitHub stars, published at EMNLP 2023 and ACL 2024. It reduces prompt length by up to 20x while preserving LLM performance, saving significant API costs. LLMLingua-2 offers 3-6x speed improvement over the original through GPT-4 data distillation. Especially effective for RAG pipelines where long retrieved contexts cause the "lost-in-the-middle" problem. Best for developers building production LLM apps who need to optimize token usage and costs.

See also: TokenCost for tracking LLM spending on TokRepo.


LLMLingua — Prompt Compression by Microsoft Research

The Problem

LLM API costs are directly tied to token count. Long contexts in RAG pipelines, multi-document QA, and chain-of-thought prompting can consume thousands of tokens per request. Additionally, LLMs suffer from the "lost-in-the-middle" problem — they focus on the beginning and end of long contexts, missing information in the middle.

The Solution

LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression with minimal performance loss.
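The core idea can be illustrated with a toy sketch. This is not the actual LLMLingua algorithm (which scores tokens with a small language model's perplexity); it just shows the shape of the operation: rank tokens by an importance score and keep only the top-scoring ones in their original order.

```python
def compress_tokens(tokens, scores, target):
    """Toy illustration: keep the `target` highest-scoring tokens, preserving order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:target]
    return [tokens[i] for i in sorted(top)]

words = ["the", "final", "report", "was", "broadly", "positive"]
# Pretend importance scores from a small LM (made-up values)
scores = [0.1, 0.6, 0.9, 0.2, 0.3, 0.8]
print(compress_tokens(words, scores, 3))  # ['final', 'report', 'positive']
```

In the real toolkit the scoring model, budget allocation, and iterative refinement are far more involved, but the output is the same kind of thing: a shorter prompt that retains the highest-information tokens.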

Three Methods

| Method        | Paper             | Compression                 | Speed       |
|---------------|-------------------|-----------------------------|-------------|
| LLMLingua     | EMNLP 2023        | Up to 20x                   | Baseline    |
| LongLLMLingua | ACL 2024          | 4x (+21.4% RAG improvement) | Same        |
| LLMLingua-2   | ACL 2024 Findings | Up to 20x                   | 3-6x faster |

Installation

pip install llmlingua

Usage Examples

Basic compression:

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True  # required when loading an LLMLingua-2 model
)

compressed = compressor.compress_prompt(
    context=["Long document text here..."],
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=200
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(compressed["compressed_prompt"])
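The compressed prompt is plain text, so it slots directly into any chat API payload. A minimal sketch of the hand-off (the `to_chat_messages` helper and the stubbed result dict are illustrative, not part of LLMLingua's API):

```python
def to_chat_messages(compressed: dict) -> list:
    """Wrap a compress_prompt result into an OpenAI-style message list."""
    return [{"role": "user", "content": compressed["compressed_prompt"]}]

# Stub mimicking compress_prompt's return shape (illustrative values)
result = {
    "compressed_prompt": "Key findings: ...",
    "origin_tokens": 4000,
    "compressed_tokens": 200,
    "ratio": "20.0x",
}
messages = to_chat_messages(result)
# `messages` is now ready for e.g. openai.chat.completions.create(..., messages=messages)
```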

For RAG pipelines (LongLLMLingua):

from llmlingua import PromptCompressor

compressor = PromptCompressor()  # default small LM compressor

# Multiple retrieved documents
contexts = [
    "Document 1: ...",
    "Document 2: ...",
    "Document 3: ..."
]

compressed = compressor.compress_prompt(
    context=contexts,
    instruction="Answer based on the provided documents.",
    question="What is the main conclusion?",
    target_token=500,
    use_context_level_filter=True  # LongLLMLingua feature
)
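A practical question is how to choose `target_token`. One heuristic (an assumption on our part, not from the LLMLingua docs) is to derive it from the downstream model's context window after reserving room for the instruction, question, and expected output:

```python
def pick_target_tokens(context_window, reserve_output,
                       instruction_tokens, question_tokens, headroom=0.8):
    """Illustrative heuristic: spend a fraction of the leftover
    context budget on the compressed context."""
    budget = context_window - reserve_output - instruction_tokens - question_tokens
    return max(1, int(budget * headroom))

# 8K context model, reserving 1,024 tokens for the answer
print(pick_target_tokens(8192, 1024, 50, 30))  # 5670
```

The 0.8 headroom factor leaves slack for tokenizer mismatches between the compressor and the target model.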

Performance Benchmarks

  • 20x compression on general prompts with <2% performance drop
  • 21.4% improvement on RAG tasks using only 1/4 of tokens (LongLLMLingua)
  • 3-6x speed improvement with LLMLingua-2 (uses data distillation from GPT-4)
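To translate these compression ratios into money, here is a back-of-the-envelope calculation. The $2.50 per 1M input tokens price is an assumption for illustration, not a figure from the source:

```python
# Assumed pricing: $2.50 per 1M input tokens (illustrative only)
PRICE_PER_TOKEN = 2.50 / 1_000_000

def monthly_savings(origin_tokens, compressed_tokens, requests_per_month):
    """Dollar savings from sending compressed prompts instead of full ones."""
    saved_tokens = (origin_tokens - compressed_tokens) * requests_per_month
    return saved_tokens * PRICE_PER_TOKEN

# 4,000-token prompts compressed 20x to 200 tokens, at 100k requests/month
print(round(monthly_savings(4_000, 200, 100_000), 2))  # 950.0
```

At higher volumes or with more expensive models, the savings scale linearly with both request count and tokens removed.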

FAQ

Q: What is LLMLingua? A: A Microsoft Research toolkit for compressing LLM prompts by up to 20x while maintaining performance, reducing API costs and solving the lost-in-the-middle problem in long contexts.

Q: Is LLMLingua free? A: Yes, fully open-source under the MIT license.

Q: Does LLMLingua work with any LLM? A: Yes, LLMLingua compresses prompts before they are sent to any LLM. It works with OpenAI, Claude, Gemini, and any other model.



Source and Acknowledgments

Created by Microsoft Research. Licensed under MIT.

LLMLingua — ⭐ 6,000+

Thanks to Huiqiang Jiang, Qianhui Wu, and the Microsoft Research team for advancing prompt compression.
