4.5. info theory

• material from Cover & Thomas, "Elements of Information Theory"

4.5.1. overview

• with a small number of points, estimated mutual information is biased upward (too high)

• founded by Claude Shannon (1948)

• a set of principles that govern the flow + transmission of info

• X is a R.V. on a (finite) discrete alphabet ~ won't cover much continuous

4.5.2. entropy, relative entropy, and mutual info

4.5.2.1. entropy

• $$H(X) = - \sum_x p(x) \log p(x) = E[h(p(X))]$$

• $$h(p) = - \log p$$

• writing $$H(p)$$ implies p parameterizes a binary distribution

• usually for discrete variables only

• by convention, $$0 \log 0 = 0$$

• intuition

• higher entropy $$\implies$$ more uniform

• lower entropy $$\implies$$ more pure

1. expectation of the random variable $$W=W(X)$$, which takes the value $$-\log p_i$$ with probability $$p_i$$

2. the minimum average number of binary questions (like "is X=1?") required to determine the value of X lies between $$H(X)$$ and $$H(X)+1$$

3. related to asymptotic behavior of sequence of i.i.d. random variables

• properties

• $$H(X) \geq 0$$ since $$p(x) \in [0, 1]$$

• function of the distribution only, not of the specific values the RV takes (the support of the RV)

• the gaussian has max entropy subject to a variance constraint

• $$H(Y|X)=\sum_x p(x)\, H(Y|X=x) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)$$

• $$H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)$$
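a minimal numpy sketch of these definitions, using a made-up 2×2 joint pmf for illustration:

```python
import numpy as np

def entropy(p):
    """H = -sum p log2 p, using the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# illustrative joint pmf p(x, y); rows index x, columns index y
pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pxy.ravel())
H_Y_given_X = H_XY - H_X   # chain rule: H(X,Y) = H(X) + H(Y|X)
print(H_X, H_Y, H_XY, H_Y_given_X)
```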

4.5.2.2. relative entropy / mutual info

• relative entropy = KL divergence - measures distance between 2 distributions

• $$D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb E_p \log \frac{p(X)}{q(X)} = \mathbb E_p[-\log q(X)] - H(p)$$

• if we knew the true distribution p of the random variable, we could construct a code with average description length H(p)

• if, instead, we used the code for a distribution q, we would need $$H(p) + D(p\|q)$$ bits on average to describe the random variable

• $$D(p\|q) \neq D(q\|p)$$

• properties

• nonnegative

• not symmetric

• mutual info I(X; Y): how much knowing one variable reduces uncertainty about the other

• $$I(X; Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)} = D(p(x,y)\,\|\,p(x)\, p(y))$$

• $$I(X; Y) = H(X) + H(Y) - H(X,Y)$$

• $$I(X; Y) = I(Y; X)$$ (symmetric)

• $$I(X; X) = H(X)$$, so entropy is sometimes called self-information

• cross-entropy: $$H_q(p) = -\sum_x p(x) \log q(x)$$
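a short numpy sketch of KL divergence and mutual information (the pmfs are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """D(p||q) = sum p log2(p/q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))       # asymmetric: D(p||q) != D(q||p)

# mutual information as the KL between joint and product of marginals
pxy = np.array([[0.25, 0.25],
                [0.40, 0.10]])  # illustrative joint pmf
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
print(kl(pxy.ravel(), np.outer(px, py).ravel()))  # I(X; Y)
```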

4.5.2.3. chain rules

• entropy - $$H(X_1, ..., X_n) = \sum_i H(X_i | X_{i-1}, ..., X_1) = H(X_n | X_{n-1}, ..., X_1) + ... + H(X_1)$$

• conditional mutual info $$I(X; Y|Z) = H(X|Z) - H(X|Y,Z)$$

• $$I(X_1, ..., X_n; Y) = \sum_i I(X_i; Y | X_{i-1}, ..., X_1)$$

• conditional relative entropy $$D(p(y|x) \,\|\, q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}$$

• $$D(p(x, y)\,\|\,q(x, y)) = D(p(x)\,\|\,q(x)) + D(p(y|x)\,\|\,q(y|x))$$
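a quick numeric check of this last chain rule on random illustrative pmfs:

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# check D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))
rng = np.random.default_rng(0)
pxy = rng.random((3, 4)); pxy /= pxy.sum()   # illustrative joint pmfs
qxy = rng.random((3, 4)); qxy /= qxy.sum()
px, qx = pxy.sum(axis=1), qxy.sum(axis=1)

lhs = kl(pxy.ravel(), qxy.ravel())
cond = sum(px[i] * kl(pxy[i] / px[i], qxy[i] / qx[i]) for i in range(3))
assert np.isclose(lhs, kl(px, qx) + cond)
```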

4.5.2.4. axiomatic approach

• fundamental theorem of information theory - it is possible to transmit information through a noisy channel at any rate less than channel capacity with an arbitrarily small probability of error

• conversely, to achieve arbitrarily high reliability, the transmission rate must not exceed the channel capacity

• uncertainty measure axioms

1. $$H(1/M, ..., 1/M) = f(M)$$ is a monotonically increasing function of M

2. f(ML) = f(M)+f(L) where M,L $$\in \mathbb{Z}^+$$

3. grouping axiom

4. $$H(p, 1-p)$$ is a continuous function of p

• $$H(p_1,...,p_M) = - \sum p_i \log p_i = E[h(p_i)]$$

• $$h(p_i) = - \log p_i$$

• the only solution satisfying the above axioms (up to the choice of log base)

• $$H(p, 1-p)$$ has its max at p = 1/2

• lemma (Gibbs' inequality) - Let $$p_1,...,p_M$$ and $$q_1,...,q_M$$ be arbitrary positive numbers with $$\sum p_i = \sum q_i = 1$$. Then $$-\sum p_i \log p_i \leq - \sum p_i \log q_i$$, with equality iff $$p_i = q_i \: \forall i$$

• intuitively, $$\sum p_i \log q_i$$ is maximized when $$p_i=q_i$$, like a dot product

• $$H(p_1,...,p_M) \leq \log M$$ with equality iff all $$p_i = 1/M$$

• $$H(X,Y) \leq H(X) + H(Y)$$ with equality iff X and Y are independent

• $$I(X; Y)=H(Y)-H(Y|X)$$

• sometimes allow p=0 via the convention $$0 \log 0 = 0$$

• information $$I(x)=\log_2 \frac{1}{p(x)}=-\log_2 p(x)$$

• reduction in uncertainty (amount of surprise in the outcome)

• if the probability of an event is small and it happens, the info is large

• entropy $$H(X)=E[I(X)]=\sum_i p(x_i)I(x_i)=-\sum_i p(x_i)\log_2 p(x_i)$$

• information gain $$IG(X,Y)=H(Y)-H(Y|X)$$

• where $$H(Y|X)=-\sum_j p(x_j) \sum_i p(y_i|x_j) \log_2 p(y_i|x_j)$$
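a toy information-gain computation in this spirit (the counts table is made up):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# counts[x, y]: co-occurrence counts of a binary feature X and label Y
counts = np.array([[30.0, 10.0],
                   [ 5.0, 55.0]])
pxy = counts / counts.sum()
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_Y = entropy(py)
H_Y_given_X = sum(px[j] * entropy(pxy[j] / px[j]) for j in range(2))
print(H_Y - H_Y_given_X)   # IG(X, Y) = H(Y) - H(Y|X) = I(X; Y)
```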

• parts

1. random variable X taking on values $$x_1,...,x_M$$ with probabilities $$p_1,...,p_M$$

2. code alphabet = set $$a_1,...,a_D$$. each symbol $$x_i$$ is assigned a finite sequence of code characters, called the code word associated with $$x_i$$

3. objective - minimize the average word length $$\sum p_i n_i$$, where $$n_i$$ is the length of the code word for $$x_i$$

• code is uniquely decipherable if every finite sequence of code characters corresponds to at most one message

• instantaneous code - no code word is a prefix of another code word
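a small sketch of the objective: Shannon codeword lengths $$\lceil -\log_2 p_i \rceil$$ satisfy the Kraft inequality (so a prefix code with those lengths exists), and the average length lies in [H, H+1), matching the binary-questions bound above (pmf made up):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]                      # illustrative pmf
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # Shannon codelengths
kraft = sum(2.0 ** -l for l in lengths)            # must be <= 1
avg_len = sum(pi * l for pi, l in zip(p, lengths))
H = -sum(pi * math.log2(pi) for pi in p)
print(lengths, kraft <= 1, H <= avg_len < H + 1)   # all True
```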

4.5.3. basic inequalities

4.5.3.1. jensen's inequality

• convex - function lies below any chord

• has nonnegative 2nd derivative (when twice differentiable)

• linear functions are both convex and concave

• Jensen's inequality - if f is a convex function and X is an R.V., $$f(E[X]) \leq E[f(X)]$$

• if f is strictly convex, equality $$\implies$$ X = E[X] with probability 1

• implications

• information inequality $$D(p\|q) \geq 0$$ with equality iff p(x)=q(x) for all x

• $$H(X) \leq \log |\mathcal X|$$, where $$|\mathcal X|$$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distr

• $$H(X|Y) \leq H(X)$$ - conditioning reduces entropy (information can't hurt)

• $$H(X_1, ..., X_n) \leq \sum_i H(X_i)$$
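a quick numeric illustration of Jensen's inequality with the convex function $$f(x) = x^2$$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)               # any random variable works
print(np.mean(x) ** 2, "<=", np.mean(x ** 2))   # f(E[X]) <= E[f(X)]
```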

4.5.4. mdl

• chapter 1: overview

• explain data given limited observations

• benefits

• occam's razor

• no overfitting (can pick both form of model and params), without need for ad hoc penalties

• bayesian interpretation

• no need for underlying truth

• description - in terms of some description method

• e.g. the shortest python program which prints a sequence then halts = Kolmogorov complexity

• invariance thm - as long as the sequence is long enough, the choice of programming language doesn't matter (Kolmogorov 1965, Chaitin 1969, Solomonoff 1964)

• not computable in general

• for small samples in practice, depends on choice of programming language

• in practice, we don't use general programming languages, but rather select a description method for which we know how to compute the length of the shortest description in that class (e.g. linear models)

• trade-off: we may fail to minimally compress some sequences which have regularity

• knowing data-generating process can help compress (e.g. recording times for something to fall from a height, generating digits of $$\pi$$ via taylor expansion, compressing natural language based on correct grammar)

• simplest version - let $$\theta$$ be the model and $$X$$ be the data

• 2-part MDL: minimize $$L(\theta) + L(X|\theta)$$

• $$L(X|\theta) = - \log P(X|\theta)$$ - Shannon code

• $$L(\theta)$$ - hard to get this, basic problem with 2-part codes

• have to do this for each model, not each model class (e.g. different linear models with the same number of parameters would have different $$L(\theta)$$)

• stochastic complexity ("refined mdl"): $$\bar{L}(X|\theta)$$ - only construct one code

• ex. $$\bar L(X|\theta) = |\theta|_0 + L(X|\theta)$$ - like 2-part code but breaks up $$\theta$$ space into different sets (e.g. same number of parameters) and assigns them equal codelength

• normalized maximum likelihood - most recent version - select from among a set of codes

• chapter 2.2.1 background

• in mdl, we only work with prefix codes (i.e. no codeword is a prefix of any other codeword)

• these are uniquely decodable

• in fact, any uniquely decodable code can be rewritten as a prefix code which achieves the same code length

• chapter 2.2.2: probability mass functions correspond to codelength functions

• coding: $$x \sim P(X)$$, codelengths $$\ell(x)$$

• Kraft inequality: $$\sum_x 2^{-\ell(x)} \leq 1$$

• a code $$C$$ and a prob distr. $$P$$ can be matched so that short codewords correspond to high probabilities and vice versa:

• given $$P$$, $$\exists C, \forall z \: L_C(z) \leq \lceil -\log P(z) \rceil$$

• given $$C'$$, $$\exists P' \; \forall z: \; -\log P'(z) = L_{C'}(z)$$

• we redefine codelength so it doesnât require actual integer lengths

• we donât care about the actual encodings, only the codelengths

• given a sample space $$\mathcal Z$$, the set of all codelength functions $$L_\mathcal Z$$ is the set of functions $$L$$ on $$\mathcal Z$$ where $$\exists \,Q$$ (distr), such that $$\sum_z Q(z) \leq 1$$ and $$\forall z,\; L(z) = -\log Q(z)$$

• uniform distr. - every codeword just has same length (fixed-length)

• we usually assume we are encoding a sequence $$x^n$$ which is large, so the rounding becomes negligible

• ex. encoding integers: send $$\log k$$ zeros, then add a 1, then uniform code from 0 to $$2^{\log k}$$
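one concrete version of such an integer code is the Elias gamma code; a sketch (the function name is mine):

```python
def elias_gamma(k: int) -> str:
    """Prefix-free code for k >= 1: floor(log2 k) zeros, then k in
    binary (whose leading 1 doubles as the separator)."""
    n = k.bit_length() - 1       # floor(log2 k)
    return "0" * n + bin(k)[2:]  # bin(k) has n + 1 bits

for k in [1, 2, 5, 13]:
    print(k, elias_gamma(k))     # 1, 010, 00101, 0001101
```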

• Given $$P(X)$$, the codelength function $$L(X) = -\log P(X)$$ minimizes expected code length for the variable $$X$$

• chapter 2.2.3 - the information inequality: $$E_P[-\log Q(X)] \geq E_P[-\log P(X)]$$, with equality iff Q = P

• if $$X$$ was generated by $$P$$, then codes with length $$-\log P(X)$$ give the shortest encodings on average

• not surprising - in a large sample, X will occur with frequency proportional to $$P(X)$$, so we want to give it a short codelength $$-\log P(X)$$

• consequently, ideal mean length = $$H(X)$$

• chapter 2.4: crude mdl ex. pick order of markov chain by minimizing $$L(H) + L(D|H)$$ where $$D$$ is data, $$H$$ is hypothesis

• we get to pick codes for $$L(H)$$ and $$L(D|H)$$

• let $$L(x^n|H) = -\log P(x^n|H)$$ (length of data is just negative log-likelihood)

• for $$L(H)$$, describe length of chain $$k$$ with code for integers, then $$k$$ parameters between 0 and 1 by describing the counts generated by the params in n samples - this tends to be harder to do well
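a crude-MDL sketch along these lines for a binary sequence; charging $$(\log_2 n)/2$$ bits per free parameter for $$L(H)$$ is my simplifying assumption, not the book's exact construction:

```python
from collections import Counter
from math import log2
import random

def crude_mdl_order(x, max_order=3):
    """Pick the Markov order k minimizing L(H) + L(D|H): L(D|H) is the
    max-likelihood negative log-likelihood, and L(H) charges an assumed
    (log2 n)/2 bits per free parameter (one per context)."""
    n = len(x)
    scores = {}
    for k in range(max_order + 1):
        ctx, nxt = Counter(), Counter()
        for i in range(k, n):
            c = tuple(x[i - k:i])        # length-k context
            ctx[c] += 1
            nxt[c + (x[i],)] += 1
        nll = -sum(m * log2(m / ctx[key[:-1]]) for key, m in nxt.items())
        scores[k] = 2 ** k * log2(n) / 2 + nll   # L(H) + L(D|H)
    return min(scores, key=scores.get)

random.seed(0)
x = [random.randint(0, 1) for _ in range(2000)]  # i.i.d. fair coin flips
print(crude_mdl_order(x))                        # likely selects order 0
```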

• chapter 2.5: universal codes + models

• note: these codes are only universal relative to the set of considered codes $$\mathcal L$$

• universal model - prob. distr corresponding to universal codes

• (different from how we use model in statistics)

• given a set of codes $$\mathcal L = \{ L_1, L_2, ... \}$$ and a sequence $$x^n$$, one of the codes $$L \in \mathcal L$$ achieves the shortest length for that sequence, $$\bar L(x^n)$$

• however, we have to pick one $$L$$ before we see $$x^n$$

• universal code - one code such that no matter which $$x^n$$ arrives, length is not much longer than the shortest length among all considered codes

• ex. 2-part codes: first send bits to pick among codes, then use the selected code to encode $$x^n$$ - overhead is small because it doesn't depend on $$n$$

• among countably infinite codes, can still send $$k$$ to index the code, although $$k$$ can get very large

• ex. bayesian universal model - assign prior distr to codes

• ex. nml is an optimal (minimizes worst-case regret) universal model

• $$\bar P_{\text{nml}} (x^n) = \frac{P(x^n | \hat \theta (x^n))}{\sum_{z^n \in \mathcal X^n} P(z^n | \hat \theta (z^n))}$$

• literally a normalized likelihood
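a small sketch of $$\bar P_{\text{nml}}$$ for the Bernoulli model, grouping the sum over $$z^n$$ by the count of ones $$k$$ (so $$\hat\theta(z^n) = k/n$$):

```python
from math import comb

def nml_normalizer(n):
    """Sum over all z^n of P(z^n | theta_hat(z^n)) for the Bernoulli
    model, where theta_hat = k/n for a sequence with k ones."""
    total = 0.0
    for k in range(n + 1):
        th = k / n
        total += comb(n, k) * th ** k * (1 - th) ** (n - k)
    return total

n = 10
Z = nml_normalizer(n)
k = 3                       # NML probability of one sequence with 3 ones
print((k / n) ** k * (1 - k / n) ** (n - k) / Z)
```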

• regret of $$\bar P$$ relative to $$M$$: $$-\log \bar P(x^n) - \min_{P \in M} \left[ -\log P(x^n) \right]$$

• measures the performance of universal models relative to a set of candidate sources M

• $$\bar P$$ is a probability distribution on $$\mathcal X^n$$ ($$\bar P$$ is not necessarily in $$M$$)

• when evaluating a code, we may look at the worst regret over all $$x^n$$, or the average