Summary

EN: Part 1 of a 5-part series introducing information theory to data scientists. This installment covers Shannon’s foundational 1948 paper, the concept of self-information (how much information a single event carries), and the bit as the natural unit of information. The series aims to build intuition for entropy, KL divergence, and mutual information — concepts that underpin modern ML and data compression.

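ZH: Part 1 of a 5-part introductory information theory series for data scientists. It introduces Shannon’s foundational 1948 paper, self-information (the amount of information a single event carries), and the bit as the unit of information, building the intuition needed to understand entropy, KL divergence, and mutual information.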

Key Points

  • Information theory originates from Shannon’s 1948 paper “A Mathematical Theory of Communication”
  • Self-information: I(x) = -log₂(p(x)) — rare events carry more information than common ones
  • Bit: the amount of information needed to distinguish between two equally likely outcomes
  • Surprise is the key intuition: a certain event (p=1) carries 0 bits, and as p approaches 0 the self-information grows without bound (an impossible event would carry infinitely many bits)
  • The series will progress to Shannon entropy (expected self-information), KL divergence, and mutual information
  • Content is truncated (member-only) — this covers only the introductory framing
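
The self-information formula in the key points can be checked numerically. The following is a minimal sketch, not code from the article; the function name self_information is a hypothetical choice, and rejecting p = 0 is an assumption made here because the value is unbounded in that limit.

```python
import math


def self_information(p: float) -> float:
    """Return I(x) = -log2(p(x)) in bits for an event with probability p.

    Hypothetical helper for illustration. p = 0 is rejected because the
    self-information grows without bound as p approaches 0.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    if p == 1.0:
        return 0.0  # a certain event carries no information
    return -math.log2(p)


print(self_information(0.5))    # fair coin flip -> 1.0
print(self_information(1.0))    # certain event -> 0.0
print(self_information(1 / 8))  # one of eight equally likely outcomes -> 3.0
```

Halving the probability adds exactly one bit, which is why rare events register as more surprising than common ones.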

Insights

  • The framing of information as “surprise” is the most intuitive entry point — it makes abstract math feel grounded
  • The connection to ML is direct: with one-hot labels, cross-entropy loss in neural networks is the negative log-likelihood of the true class, which is the self-information of that class under the model’s predicted distribution
  • Starting from first principles (Shannon’s original paper) builds understanding that survives framework changes
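
The link between cross-entropy loss and self-information can be illustrated with a small sketch. The function name cross_entropy_bits and the probability values below are hypothetical, chosen only for illustration. ML frameworks typically use the natural logarithm (giving nats), while base 2 is used here to match the bit-based notation in these notes.

```python
import math
from typing import Sequence


def cross_entropy_bits(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Return H(target, predicted) = -sum_i target_i * log2(predicted_i), in bits.

    Terms with target_i == 0 contribute nothing and are skipped.
    """
    return -sum(t * math.log2(q) for t, q in zip(target, predicted) if t > 0)


# Hypothetical model output over three classes; the true class is index 1.
predicted = [0.7, 0.2, 0.1]
one_hot = [0.0, 1.0, 0.0]

loss = cross_entropy_bits(one_hot, predicted)
surprise = -math.log2(predicted[1])  # self-information of the true class

print(math.isclose(loss, surprise))  # True
```

With a one-hot target, every term except the true class drops out of the sum, so the loss is simply how surprised the model is by the correct answer.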

Connections

  • Cross-entropy loss in neural networks (the standard loss for classification models) is the average self-information that the model’s predicted distribution assigns to the true labels
  • KL divergence (covered in later parts) appears as a regularization term in the VAE objective and underlies many generative model training objectives
  • Connects to PromptWizard and DSPy: both use probability distributions over outputs — information theory gives the mathematical language for reasoning about model uncertainty
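
As a preview of how these quantities relate, the sketch below computes entropy, cross-entropy, and KL divergence for discrete distributions and checks the standard identity KL(p || q) = H(p, q) - H(p). The function names and example distributions are hypothetical; the later parts of the series may present these ideas differently.

```python
import math
from typing import Sequence


def entropy_bits(p: Sequence[float]) -> float:
    """Shannon entropy H(p): the expected self-information under p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """Cross-entropy H(p, q): expected surprise when q models data drawn from p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q): extra bits paid, on average, for using q in place of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]    # hypothetical true distribution (fair coin)
q = [0.25, 0.75]  # hypothetical model distribution

print(entropy_bits(p))  # 1.0
print(math.isclose(kl_divergence_bits(p, q),
                   cross_entropy_bits(p, q) - entropy_bits(p)))  # True
```

Because H(p) does not depend on the model, minimizing cross-entropy with respect to q is equivalent to minimizing KL(p || q).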

Raw Excerpt

“Shannon defined information as surprise. The more unexpected an event, the more information it carries. Mathematically: I(x) = -log₂(p(x)). A coin flip with p=0.5 carries exactly 1 bit of information. A certainty carries 0 bits. This simple formula is the foundation of everything in information theory.”