This note was generated by AI analysis.
Summary
Microsoft Research blog post on LLMLingua (EMNLP 2023), a prompt compression method that uses a small language model (GPT-2/LLaMA-7B) to identify and remove unimportant tokens from prompts. It achieves up to 20x compression while maintaining reasoning and summarization quality for the target closed LLM.
Key Points
- Problem: advanced prompting techniques (CoT, ICL) produce very long prompts (tens of thousands of tokens) → exceeds context windows, increases cost, reduces performance
- LLMLingua method: (1) budget controller for module-level compression ratios; (2) coarse-grained (sentence-level) then fine-grained (token-level) compression using a small LM’s perplexity as importance signal
- Compressed prompts are not human-readable but are effective for LLMs
- Results (EMNLP 2023): up to 20x compression; maintains full 9-step CoT reasoning capability; 20-30% latency reduction
- Tested on: GSM8K (math), BBH (reasoning), ShareGPT (conversation), Arxiv-March23 (summarization)
- Recoverability: GPT-4 can restore compressed prompts to near-original, confirming semantic preservation
- Integration: already in LlamaIndex RAG framework; extended as LongLLMLingua for long-context scenarios
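The method described in the bullets above can be illustrated with a deliberately simplified sketch. It is not LLMLingua's implementation. A smoothed unigram model stands in for the small LM, whereas the real method scores tokens with a language model's perplexity in context. Only per-module budgets and token-level pruning are shown; the coarse-grained stage is omitted. The names `train_unigram`, `compress_tokens`, `compress_prompt`, and the `budgets` dictionary are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Toy stand-in for the small LM (assumption): add-one-smoothed unigram model.

    Returns a function mapping a token to its surprisal, -log p(token).
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves probability for unseen tokens

    def surprisal(token):
        return -math.log((counts.get(token, 0) + 1) / (total + vocab))

    return surprisal


def compress_tokens(tokens, surprisal, keep_ratio):
    """Keep the most surprising tokens, preserving their original order."""
    if not tokens:
        return []
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by surprisal, highest first; ties keep earlier positions first.
    ranked = sorted(range(len(tokens)), key=lambda i: surprisal(tokens[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [tok for i, tok in enumerate(tokens) if i in keep]


def compress_prompt(modules, surprisal, budgets):
    """Apply a per-module keep ratio (a hypothetical budget controller).

    modules: dict of module name -> text
    budgets: dict of module name -> fraction of tokens to keep (default 1.0)
    """
    return {
        name: " ".join(compress_tokens(text.split(), surprisal, budgets.get(name, 1.0)))
        for name, text in modules.items()
    }


if __name__ == "__main__":
    score = train_unigram("the the the the of of of a a cat sat on the mat".split())
    prompt = {
        "instruction": "solve the problem",
        "demos": "the the zebra of the quasar",
    }
    # Keep the instruction intact; prune the demonstrations aggressively.
    print(compress_prompt(prompt, score, {"instruction": 1.0, "demos": 0.34}))
```

Giving the instruction a keep ratio of 1.0 and the demonstrations a much lower one mirrors the idea that different prompt modules tolerate different amounts of compression. The specific ratios here are made up.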
Insights
The key insight — use a small model’s perplexity to identify which tokens are “surprising” (high information density) vs. predictable (removable) — is elegant. High-perplexity tokens carry signal; low-perplexity tokens are largely redundant. The 20x compression finding challenges the assumption that every token in a long prompt is necessary. LLMLingua directly addresses the context window bottleneck that limits RAG systems: if you can compress retrieved documents 10-20x without losing key reasoning, you can fit far more context into the same window. LongLLMLingua extends this to dynamic multi-document QA scenarios.
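The context-window argument is simple arithmetic, shown here as a rough sketch. All numbers are invented for illustration (an 8,192-token window, 1,024 tokens reserved for the instruction and answer, 2,000-token retrieved documents), and the helper `docs_that_fit` is hypothetical. It assumes every document compresses at the same ratio, which real prompts will not do exactly.

```python
import math


def docs_that_fit(window_tokens, reserved_tokens, doc_tokens, compression_ratio=1.0):
    """Count equally sized documents that fit in the window after compression."""
    available = window_tokens - reserved_tokens
    if available <= 0:
        return 0
    # Each document shrinks to roughly doc_tokens / compression_ratio tokens.
    compressed_doc = max(1, math.ceil(doc_tokens / compression_ratio))
    return available // compressed_doc


if __name__ == "__main__":
    for ratio in (1, 10, 20):
        print(f"{ratio}x compression: {docs_that_fit(8192, 1024, 2000, ratio)} documents")
```

With these made-up figures, 3 documents fit uncompressed, 35 fit at 10x, and 71 fit at 20x. Whether answer quality holds at those ratios depends on the task; the source reports results only for the benchmarks listed above.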
Connections
Raw Excerpt
Using a well-trained small language model, LLMLingua identifies and removes unimportant tokens from prompts. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs.