Information Theory
Information theory is a branch of applied mathematics and electrical engineering concerned with the quantification of information. It was founded by Claude Shannon in 1948 and has since become the foundation of digital communication and data science.
1. Entropy: The Measure of Uncertainty
Shannon Entropy (H)
Entropy measures the average amount of information produced by a stochastic source of data: H(X) = -Σ p(x) log2 p(x), where the sum runs over the possible outcomes x.
- High Entropy: The outcome is highly uncertain (e.g., a fair coin toss).
- Low Entropy: The outcome is highly predictable (e.g., a biased coin).
Cross-Entropy
Used extensively as a loss function in machine learning, it measures the dissimilarity between two probability distributions (the true labels p and the predicted probabilities q): H(p, q) = -Σ p(x) log q(x).
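As a sketch of how this behaves as a loss, assuming a one-hot true distribution and natural-log convention (common in ML libraries; the `cross_entropy` helper below is hypothetical):

```python
import math

def cross_entropy(true_dist, pred_dist):
    # H(p, q) = -sum(p * log q); terms with p == 0 contribute nothing
    return -sum(p * math.log(q) for p, q in zip(true_dist, pred_dist) if p > 0)

# One-hot true label (class 1) against two model predictions:
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))  # confident and correct: low loss (~0.357)
print(cross_entropy([0, 1, 0], [0.6, 0.2, 0.2]))  # mostly wrong: higher loss (~1.609)
```

The loss grows as the predicted probability assigned to the true class shrinks, which is what makes it a useful training signal.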
2. Information Gain and Divergence
Mutual Information (I(X; Y))
Measures the amount of information that can be obtained about one random variable by observing another.
- It is symmetric: I(X; Y) = I(Y; X).
- It is non-negative: I(X; Y) ≥ 0, with equality exactly when X and Y are independent.
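Both properties can be checked from a small joint probability table. A sketch (the `mutual_information` helper is illustrative):

```python
import math

def mutual_information(joint):
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    px = [sum(row) for row in joint]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent variables carry no information about each other:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Perfectly correlated binary variables: observing Y reveals X entirely, I = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```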
KL-Divergence (D_KL)
Also known as relative entropy, D_KL(P || Q) = Σ P(x) log(P(x) / Q(x)) measures how one probability distribution P diverges from a second, expected probability distribution Q. It is not symmetric, so it is not a true distance metric.
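A quick numerical sketch of the asymmetry, using an assumed pair of coin distributions (the `kl_divergence` helper is illustrative):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum p * log2(p / q); zero only when the distributions match
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased model of it
print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits: swapping the arguments changes the value
```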
3. Coding Theory and Channel Capacity
Source Coding Theorem
Shannon's source coding theorem establishes that, on average, the number of bits needed to represent the result of an uncertain event is given by its entropy.
- This provides the theoretical limit for lossless data compression.
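One way to see the bound, with an assumed three-symbol source whose probabilities are powers of 1/2, so that a simple prefix code meets the entropy limit exactly:

```python
import math

# Assumed example: source probabilities and a prefix-free binary code for them
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code  = {"a": "0", "b": "10", "c": "11"}

# Entropy gives the theoretical minimum average bits per symbol
entropy = -sum(p * math.log2(p) for p in probs.values())
# Average codeword length actually achieved by this code
avg_len = sum(p * len(code[s]) for s, p in probs.items())

print(entropy, avg_len)  # 1.5 1.5, so this code is optimal for this source
```

For probabilities that are not powers of 1/2, a prefix code's average length exceeds the entropy, but (by the theorem) it can always be brought within one bit of it.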
Channel Capacity (C)
The tightest upper bound on the rate at which information can be reliably transmitted over a communications channel. For a Gaussian channel with bandwidth B (in hertz) and signal-to-noise ratio S/N:
C = B log2(1 + S/N)
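Plugging in the classic telephone-line numbers (an assumed 3 kHz channel at 30 dB SNR, i.e. S/N = 1000) as a sketch:

```python
import math

def channel_capacity(bandwidth_hz, snr_linear):
    # Shannon-Hartley theorem: C = B * log2(1 + S/N), in bits per second
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed example: a 3 kHz voice channel with S/N = 1000 (30 dB)
print(channel_capacity(3000, 1000))  # ~29,902 bits/s
```

Note that the SNR must be supplied as a linear ratio, not in decibels; convert with S/N = 10**(dB / 10).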