LLMLingua: Innovating LLM Efficiency with Prompt Compression

This note was generated by AI analysis.

Created: 2026-03-28 Source: https://www.microsoft.com/en-us/research/blog/llmlingua-innovating-llm-efficiency-with-prompt-compression/

Summary

A Microsoft Research blog post on LLMLingua (EMNLP 2023), a prompt compression method that uses a small language model (GPT-2 or LLaMA-7B) to identify and remove unimportant tokens from prompts. It reports up to 20x compression while maintaining reasoning and summarization quality on the target closed-source LLM.

Key Points

Problem: advanced prompting techniques such as chain-of-thought (CoT) and in-context learning (ICL) produce very long prompts, sometimes tens of thousands of tokens, which can exceed context windows, increase cost, and reduce performance
LLMLingua method: (1) a budget controller assigns a compression ratio to each prompt module; (2) coarse-grained (sentence-level) compression is followed by fine-grained (token-level) compression, using a small LM's perplexity as the importance signal
Compressed prompts are not human-readable but are effective for LLMs
Results (EMNLP 2023): up to 20x compression; maintains full 9-step CoT reasoning capability; 20-30% latency reduction
Tested on: GSM8K (math), BBH (reasoning), ShareGPT (conversation), Arxiv-March23 (summarization)
Recoverability: GPT-4 can restore compressed prompts to near-original, confirming semantic preservation
Integration: already integrated into the LlamaIndex RAG framework; extended as LongLLMLingua for long-context scenarios
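
The method described in the key points above (budget controller, then coarse-grained, then fine-grained compression) can be illustrated with a toy sketch. This is not the LLMLingua implementation: a smoothed unigram frequency model stands in for the small LM's perplexity, and every function name, the per-module keep ratios, and the 2x margin for the coarse pass are assumptions made only for illustration.

```python
import math
from collections import Counter


def make_surprisal(reference_tokens):
    """Toy stand-in for a small LM: -log2 of add-one-smoothed unigram frequency."""
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def surprisal(token):
        return -math.log2((counts[token] + 1) / (total + vocab))

    return surprisal


def allocate_budgets(module_lengths, keep_ratios):
    """Hypothetical budget controller: a token budget per module from its keep ratio."""
    return {name: max(1, round(module_lengths[name] * keep_ratios[name]))
            for name in module_lengths}


def coarse_select(sentences, token_budget, surprisal):
    """Coarse step: keep whole sentences, highest mean surprisal first, until the budget is covered."""
    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()
        if not tokens:
            continue  # skip empty sentences
        mean = sum(surprisal(t) for t in tokens) / len(tokens)
        scored.append((mean, idx, tokens))
    kept, used = [], 0
    for _, idx, tokens in sorted(scored, key=lambda item: -item[0]):
        if used >= token_budget:
            break
        kept.append((idx, tokens))
        used += len(tokens)
    kept.sort()  # restore original sentence order
    return [tok for _, tokens in kept for tok in tokens]


def fine_select(tokens, token_budget, surprisal):
    """Fine step: keep the token_budget highest-surprisal tokens, in original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: -surprisal(tokens[i]))
    return [tokens[i] for i in sorted(ranked[:token_budget])]


def compress(sentences, token_budget, surprisal):
    """Two-stage filter; the coarse pass keeps roughly 2x the final budget (arbitrary margin)."""
    coarse = coarse_select(sentences, 2 * token_budget, surprisal)
    return fine_select(coarse, token_budget, surprisal)
```

For example, `allocate_budgets({"instruction": 10, "demonstrations": 100, "question": 8}, {"instruction": 1.0, "demonstrations": 0.25, "question": 1.0})` leaves the instruction and question untouched and gives the demonstrations a 25-token budget, which `compress` then spends on the least predictable sentences and tokens. A real system would score tokens with a causal LM conditioned on the preceding context rather than with unigram counts.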

Insights

The key insight is elegant: use a small model's perplexity to separate tokens that are "surprising" (high information density) from tokens that are predictable (removable). High-perplexity tokens carry signal; low-perplexity tokens are largely redundant, because the target LLM can infer them from context. The 20x compression finding challenges the assumption that every token in a long prompt is necessary. LLMLingua directly addresses the context window bottleneck that limits RAG systems: if retrieved documents can be compressed 10-20x without losing key reasoning, far more context fits into the same window. LongLLMLingua extends this to dynamic multi-document QA scenarios.
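
The context-window argument above comes down to simple arithmetic, sketched here. The window size, document length, and reserved-token figures are hypothetical and not taken from the blog post, and the helper name `docs_that_fit` is invented for this note. The calculation also assumes answer quality holds at the chosen ratio, which the source reports only for its own benchmarks.

```python
import math


def docs_that_fit(window_tokens, doc_tokens, compression_ratio=1.0, reserved_tokens=0):
    """Whole documents that fit in the window after compressing each one by compression_ratio."""
    compressed = math.ceil(doc_tokens / compression_ratio)
    return (window_tokens - reserved_tokens) // compressed


# Hypothetical setup: 8,192-token window, 1,192 tokens reserved for the
# instruction, question, and answer, and 1,000-token retrieved documents.
for ratio in (1, 10, 20):
    print(ratio, docs_that_fit(8192, 1000, ratio, reserved_tokens=1192))
```

Under these assumed numbers the same window holds 7 uncompressed documents, 70 at 10x, and 140 at 20x.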

Connections

Raw Excerpt

Using a well-trained small language model, LLMLingua identifies and removes unimportant tokens from prompts. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs.
