LLMLingua — Microsoft Research Prompt Compression
Introduction
LLMLingua is Microsoft Research's prompt compression toolkit, with 6,000+ GitHub stars and papers at EMNLP 2023 and ACL 2024. It compresses prompts by up to 20x with minimal performance loss, dramatically reducing API cost, and it is especially effective against the "lost in the middle" problem in RAG pipelines. It is ideal for developers building production-grade LLM apps who want to optimize token usage and cost.
Three Methods
| Method | Paper | Compression | Speed |
|---|---|---|---|
| LLMLingua | EMNLP 2023 | Up to 20x | Baseline |
| LongLLMLingua | ACL 2024 | 4x (with a 21.4% gain on RAG tasks) | Same as baseline |
| LLMLingua-2 | ACL 2024 | Up to 20x | 3–6x faster than baseline |
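All three methods share one core idea: score each token's informativeness and drop the least informative ones until a target compression rate is reached. The sketch below is a toy illustration of that idea only, not the library's actual algorithm — LLMLingua scores tokens with a small language model's perplexity, whereas here a hypothetical stopword list stands in as the importance signal.

```python
# Toy sketch of the token-dropping idea behind prompt compression:
# score tokens, then keep only the highest-scoring ones up to a budget.
# Stand-in scoring: stopwords are "cheap" to drop, everything else is kept.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "that", "it"}

def compress(prompt: str, rate: float) -> str:
    """Keep roughly `rate` of the tokens, preferring informative ones."""
    tokens = prompt.split()
    budget = max(1, round(len(tokens) * rate))
    # Sort indices so non-stopwords come first; the sort is stable,
    # so ties preserve the original word order.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: tokens[i].lower() in STOPWORDS)
    kept = sorted(ranked[:budget])  # restore original word order
    return " ".join(tokens[i] for i in kept)

example = "the capital of France is Paris and the capital of Italy is Rome"
print(compress(example, rate=0.5))
# -> "capital France Paris capital Italy Rome"
```

The real methods differ mainly in the scorer: LLMLingua and LongLLMLingua use a small causal LM's per-token perplexity, while LLMLingua-2 trains a classifier to predict which tokens to keep, which is where its 3–6x speedup comes from.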
Performance Benchmarks
- 20x compression on general prompts with <2% performance drop
- 21.4% improvement on RAG tasks using 1/4 the tokens
- LLMLingua-2 is 3–6x faster than the original LLMLingua
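The headline ratios above translate directly into API cost. A back-of-the-envelope sketch, assuming a hypothetical per-token price (only the compression ratios come from the numbers above):

```python
# Cost of a prompt before and after compression. The $0.01 / 1K-token
# price is a hypothetical placeholder, not a real provider rate.
def prompt_cost(tokens: int, compression: float, price_per_1k: float) -> float:
    """API cost after shrinking the token count by `compression`x."""
    return tokens / compression * price_per_1k / 1000

original = prompt_cost(20_000, 1, 0.01)   # uncompressed 20K-token prompt
at_20x = prompt_cost(20_000, 20, 0.01)    # same prompt at 20x compression
print(f"${original:.2f} -> ${at_20x:.2f}")
# -> "$0.20 -> $0.01", i.e. a 95% reduction at 20x
```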
FAQ
Q: What is LLMLingua? A: Microsoft Research's prompt compression toolkit — up to 20x compression with negligible impact on LLM performance.
Q: Is it free? A: Completely free and open source under the MIT license.