# LLMLingua — Compress Prompts 20x with Minimal Loss

> Microsoft Research tool for prompt compression. Reduces token usage by up to 20x while maintaining LLM performance, and mitigates the lost-in-the-middle problem in RAG. MIT licensed, 6,000+ GitHub stars.

## Quick Use

1. Install:
```bash
pip install llmlingua
```

2. Compress a prompt:
```python
from llmlingua import PromptCompressor

# With no arguments, the default compressor model is downloaded from Hugging Face on first use.
compressor = PromptCompressor()
result = compressor.compress_prompt(
    context=["Your long context here..."],
    instruction="Summarize the key points.",
    target_token=500
)
print(result["compressed_prompt"])
```

---

## Intro

LLMLingua is Microsoft Research's prompt compression toolkit, with 6,000+ GitHub stars and papers at EMNLP 2023 and ACL 2024. It shortens prompts by up to 20x while largely preserving LLM performance, which lowers API costs. LLMLingua-2, trained through data distillation from GPT-4, runs 3-6x faster than the original. The toolkit is especially useful in RAG pipelines, where long retrieved contexts trigger the "lost-in-the-middle" problem, and it suits developers of production LLM apps who need to cut token usage and cost.

See also: [TokenCost for tracking LLM spending](https://tokrepo.com/en/workflows/) on TokRepo.

---

## LLMLingua — Prompt Compression by Microsoft Research

### The Problem

LLM API costs are directly tied to token count. Long contexts in RAG pipelines, multi-document QA, and chain-of-thought prompting can consume thousands of tokens per request. Additionally, LLMs suffer from the "lost-in-the-middle" problem — they focus on the beginning and end of long contexts, missing information in the middle.
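
To make the cost argument concrete, the sketch below estimates input-token spend before and after compression. The price per 1,000 tokens, the request volume, and the prompt size are made-up illustration values, not quotes from any provider.

```python
def input_cost(tokens_per_request: int, requests: int, price_per_1k: float) -> float:
    """Return the total input-token cost for a batch of requests."""
    return tokens_per_request * requests * price_per_1k / 1000


# Hypothetical numbers, chosen only for illustration.
PRICE_PER_1K = 0.01        # assumed price per 1,000 input tokens
REQUESTS = 10_000          # assumed number of requests
ORIGINAL_TOKENS = 4_000    # assumed prompt size before compression
COMPRESSION = 4            # assumed compression ratio (4x)

before = input_cost(ORIGINAL_TOKENS, REQUESTS, PRICE_PER_1K)
after = input_cost(ORIGINAL_TOKENS // COMPRESSION, REQUESTS, PRICE_PER_1K)
print(f"before: {before:.2f}, after: {after:.2f}, saved: {before - after:.2f}")
```

Since input cost scales linearly with token count, a 4x shorter prompt costs a quarter as much on the input side under these assumptions.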

### The Solution

LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving up to 20x compression with minimal performance loss.
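
The idea of dropping low-information tokens until a budget is met can be illustrated with a toy stand-in. This sketch is not LLMLingua's algorithm: where LLMLingua relies on a small language model to judge tokens, the example uses a hypothetical `toy_importance` score (word rarity within the text itself) so that it runs with the standard library only.

```python
from collections import Counter


def toy_importance(words: list[str]) -> list[float]:
    """Hypothetical score: rarer words in the text count as more informative."""
    counts = Counter(w.lower() for w in words)
    return [1.0 / counts[w.lower()] for w in words]


def prune_to_budget(text: str, target_words: int) -> str:
    """Keep the `target_words` highest-scoring words, preserving original order."""
    words = text.split()
    if len(words) <= target_words:
        return text
    scores = toy_importance(words)
    # Rank positions by score (ties broken by position), then restore text order.
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:target_words])
    return " ".join(words[i] for i in keep)


prompt = "the cat sat on the mat and the dog sat on the rug"
print(prune_to_budget(prompt, 4))
```

A real compressor replaces the rarity heuristic with model-based scores, but the budgeted keep-the-best-tokens structure is the same shape.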

### Three Methods

| Method | Paper | Compression | Speed |
|--------|-------|-------------|-------|
| **LLMLingua** | EMNLP 2023 | Up to 20x | Baseline |
| **LongLLMLingua** | ACL 2024 | 4x (+ 21.4% RAG improvement) | Same |
| **LLMLingua-2** | ACL 2024 Findings | Up to 20x | 3-6x faster |

### Installation

```bash
pip install llmlingua
```

### Usage Examples

**Basic compression:**
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # required when loading an LLMLingua-2 checkpoint
)

compressed = compressor.compress_prompt(
    context=["Long document text here..."],
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=200
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(compressed["compressed_prompt"])
```
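
The token counts in the result can be turned into a savings figure. The helper below is a small sketch that assumes the `origin_tokens` and `compressed_tokens` fields are integers; `summarize` is a hypothetical name and the sample dictionary is made up.

```python
def summarize(result: dict) -> dict:
    """Derive compression ratio and percentage saved from token counts."""
    origin = result["origin_tokens"]
    compressed = result["compressed_tokens"]
    if origin <= 0 or compressed <= 0:
        raise ValueError("token counts must be positive")
    return {
        "ratio": origin / compressed,
        "saved_pct": 100.0 * (origin - compressed) / origin,
    }


# Made-up result for illustration; a real one would come from compress_prompt().
sample = {"origin_tokens": 2000, "compressed_tokens": 250}
stats = summarize(sample)
print(f"{stats['ratio']:.1f}x smaller, {stats['saved_pct']:.1f}% of tokens saved")
```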

**For RAG pipelines (LongLLMLingua):**
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor()

# Multiple retrieved documents
contexts = [
    "Document 1: ...",
    "Document 2: ...",
    "Document 3: ..."
]

compressed = compressor.compress_prompt(
    context=contexts,
    instruction="Answer based on the provided documents.",
    question="What is the main conclusion?",
    target_token=500,
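    use_context_level_filter=True,  # filter whole documents before token-level pruning
    rank_method="longllmlingua",    # question-aware ranking used by LongLLMLingua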
)
```
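
Context-level filtering can be pictured as ranking the retrieved documents against the question and keeping only what fits the budget. The sketch below is a toy stand-in, not LongLLMLingua's method: `overlap_score` is a hypothetical word-overlap measure, and token counts are approximated by whitespace-split words.

```python
def overlap_score(question: str, document: str) -> int:
    """Hypothetical relevance: number of distinct question words in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)


def filter_contexts(question: str, documents: list[str], budget: int) -> list[str]:
    """Keep the most relevant documents whose combined word count fits `budget`."""
    ranked = sorted(documents, key=lambda d: overlap_score(question, d), reverse=True)
    kept, used = [], 0
    for doc in ranked:
        size = len(doc.split())
        if used + size <= budget:
            kept.append(doc)
            used += size
    return kept


docs = [
    "bananas are yellow and sweet",
    "the compression ratio was four in the main experiment",
    "the main conclusion is that compression preserves accuracy",
]
print(filter_contexts("what is the main conclusion", docs, budget=18))
```

Placing the most relevant documents first also keeps them out of the middle of the prompt, where long-context models tend to overlook information.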

### Performance Benchmarks

- **Up to 20x compression** on general prompts with a reported performance drop below 2%
- **Up to 21.4% improvement** on RAG tasks while using only 1/4 of the tokens (LongLLMLingua)
- **3-6x speed improvement** with LLMLingua-2, which is trained via data distillation from GPT-4

### FAQ

**Q: What is LLMLingua?**
A: A Microsoft Research toolkit that compresses LLM prompts by up to 20x while maintaining performance, reducing API costs and mitigating the lost-in-the-middle problem in long contexts.

**Q: Is LLMLingua free?**
A: Yes, fully open-source under the MIT license.

**Q: Does LLMLingua work with any LLM?**
A: Yes, LLMLingua compresses prompts before they are sent to any LLM. It works with OpenAI, Claude, Gemini, and any other model.
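
Because compression happens before the request is sent, it can sit in front of any client as a plain function. The sketch below takes two injected callables, `compress` and `call_llm`, both hypothetical stand-ins: in practice `compress` would wrap `compress_prompt` and `call_llm` would wrap whichever provider SDK is in use.

```python
from typing import Callable


def ask_with_compression(
    context: str,
    question: str,
    compress: Callable[[str], str],
    call_llm: Callable[[str], str],
) -> str:
    """Compress the context, build the final prompt, and send it to the model."""
    short_context = compress(context)
    prompt = f"{short_context}\n\nQuestion: {question}"
    return call_llm(prompt)


# Stub implementations so the example runs without any model or API key.
def fake_compress(text: str) -> str:
    return " ".join(text.split()[:5])  # keep the first five words


def fake_llm(prompt: str) -> str:
    return f"received {len(prompt.split())} words"


answer = ask_with_compression(
    "one two three four five six seven", "Why?", fake_compress, fake_llm
)
print(answer)
```

Keeping the compressor and the model client as separate arguments means either one can be swapped without touching the other.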

---

## Source & Thanks

> Created by [Microsoft Research](https://github.com/microsoft). Licensed under MIT.
>
> [LLMLingua](https://github.com/microsoft/LLMLingua) — ⭐ 6,000+

Thanks to Huiqiang Jiang, Qianhui Wu, and the Microsoft Research team for advancing prompt compression.

---

<!-- ZH -->

## Introduction

LLMLingua is Microsoft Research's prompt compression toolkit, with 6,000+ GitHub stars and papers at EMNLP 2023 and ACL 2024. It compresses prompts by up to 20x with minimal performance loss, which substantially reduces API cost. It is especially effective against the "lost-in-the-middle" problem in RAG pipelines and suits developers of production-grade LLM apps who want to optimize token usage and cost.

---

## LLMLingua — Microsoft Research Prompt Compression

### Three Methods

| Method | Paper | Compression | Speed |
|--------|-------|-------------|-------|
| LLMLingua | EMNLP 2023 | Up to 20x | Baseline |
| LongLLMLingua | ACL 2024 | 4x (RAG +21.4%) | Same |
| LLMLingua-2 | ACL 2024 Findings | Up to 20x | 3–6x faster |

### Performance Benchmarks

- 20x compression on general prompts with <2% performance drop
- 21.4% improvement on RAG tasks using 1/4 the tokens
- LLMLingua-2 is 3–6x faster

### FAQ

**Q: What is LLMLingua?**
A: Microsoft Research's prompt compression toolkit — up to 20x compression with negligible impact on LLM performance.

**Q: Is it free?**
A: Completely free and open source under the MIT license.

---
Source: https://tokrepo.com/en/workflows/llmlingua-compress-prompts-20x-minimal-loss-1510da0c
Author: Script Depot