Fully justify all steps by providing one of the following: the name of the
theorem, theorem number, or the equation number in lecture. Carefully define
any notation that you introduce.
Problem 1 (20 points).
1. (5 pts) Let P and Q be two probability distributions defined on the same discrete alphabet
A. Show that
D(P∥Q) ≥ 0, (1)
with equality if and only if P = Q.
2. (1 pt) Using (1), show that for any two discrete random variables X and Y ,
I(X;Y ) ≥ 0. (2)
3. (2 pts) Show that for any two discrete random variables X and Y ,
I(X;Y ) = H(Y )−H(Y |X). (3)
4. (2 pts) Show that for any two discrete random variables X and Y ,
H(Y |X) ≤ H(Y ). (4)
5. (3 pts) Give an example of two random variables X and Y such that, for some x in the
alphabet of X,
H(Y |X = x) > H(Y ). (5)
6. (3 pts) Consider a function f : A → B, and assume that PX(a) > 0 for all a ∈ A, where A
is the alphabet of X. Show that
ı_{f(X)}(f(x)) ≤ ıX(x), (6)
with equality if and only if f(a) ≠ f(x) for all a ≠ x. Conclude that H(f(X)) ≤ H(X)
with equality if and only if f is injective.
7. (4 pts) Show that for any function φ of X,
H(X,φ(X)) = H(X). (7)
Context: The sub-problems in Problem 1 are fundamental results in information theory.
The inequalities in (1) and (2) are known as “non-negativity of relative entropy” and
“non-negativity of mutual information”.
The inequality in (4) is known by the mnemonic “conditioning decreases entropy”. Recall
that entropy summarizes the information content (the amount of uncertainty) in a random
variable: higher entropy random variables contain more uncertainty. Intuitively, (4) says that
knowing another random variable X can only reduce uncertainty in Y . For some realization
x of the random variable X, the entropy H(Y |X = x) may be greater than H(Y ) (see (5)),
but on average over all realizations x of X, H(Y |X) ≜ ∑_x PX(x) H(Y |X = x) ≤ H(Y ).
The result ı_{f(X)}(f(X)) = ıX(X) for injective f is known as “invariance to labeling”: the
information in a random variable depends only on the sizes of its probability masses, but not
on its alphabet.
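As a numerical illustration of this point (not a proof; a minimal Python/NumPy sketch, and the toy joint distribution below is our own choice, not part of the assignment), the following exhibits an x with H(Y |X = x) > H(Y ) while still H(Y |X) ≤ H(Y ):

# Numerical illustration only: conditioning decreases entropy on average,
# even though H(Y|X=x) can exceed H(Y) for a particular x.
import numpy as np

def H(p):
    """Entropy in bits of a probability vector p (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint PMF P_XY[x, y], with X in {0, 1} and Y in {0, 1} (our own example).
P_XY = np.array([[0.90, 0.00],   # given X = 0, Y is deterministic
                 [0.05, 0.05]])  # given X = 1, Y is uniform
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

H_Y = H(P_Y)
H_Y_given_x = [H(P_XY[x] / P_X[x]) for x in range(2)]
H_Y_given_X = sum(P_X[x] * H_Y_given_x[x] for x in range(2))

print("H(Y)     =", round(H_Y, 4))             # ~0.286 bits
print("H(Y|X=1) =", round(H_Y_given_x[1], 4))  # 1.0 bit  > H(Y)
print("H(Y|X)   =", round(H_Y_given_X, 4))     # 0.1 bit  <= H(Y)

In this example Y is nearly deterministic overall, but becomes maximally uncertain once the rare value X = 1 is observed; averaged over X, however, the conditional entropy is still below H(Y ).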
Problem 2 (5 points). Prove Theorem 3.1. Using this result, show that the same code also
achieves the minimum expected encoded length over all variable-length lossless codes.
Problem 3 (5 points).
1. (3 pts) Let (X1, . . . , Xn) be discrete random variables. Show that
H(X1, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi), (8)
where equality is achieved if and only if the random variables are independent.
2. (2 pts) For two random variables, show that the difference between the right side and
the left side of (8) is the mutual information, i.e.,
H(X, Y ) = H(X) + H(Y ) − I(X;Y ). (9)
Context: What (8) tells us is that among all distributions on A1 × . . . × An with the same
marginals PX1, . . . , PXn, the product distribution P_{X^n} = ∏_{i=1}^{n} PXi maximizes the entropy.
Intuitively, the entropy is a measure of uncertainty, while dependence means less uncertainty.
So what (8) tells us is that to maximize the uncertainty in a collection of random variables
under the constraint on their marginals, we need to make the random variables independent.
Maximum entropy results are important because often we do not know exactly the
probability distribution of data. However, some characteristics of that distribution might be
known, such as the alphabet size in Theorem 2.3 or the marginal distributions of the collection
of random variables in (8). Maximizing the entropy under the constraints corresponding to
those known characteristics leads to a universal upper bound on the entropy that holds for
any distribution within that constrained class.
Combining (3) and (9), we re-derive the chain rule of entropy (Theorem 2.1):
H(X, Y ) = H(X) + H(Y |X).
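A small numerical check of (8) and (9) (Python/NumPy; the joint PMF below is an arbitrary example of ours): the joint entropy of a correlated pair stays below H(X) + H(Y ), the gap equals I(X;Y ), and the product of the same marginals attains the maximum.

# Numerical illustration of (8) and (9) on a toy joint PMF (our own example).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

P_XY = np.array([[0.4, 0.1],
                 [0.1, 0.4]])       # correlated joint distribution
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)
P_prod = np.outer(P_X, P_Y)         # product of the same marginals

H_joint = H(P_XY)
H_sum = H(P_X) + H(P_Y)
I_XY = H_sum - H_joint              # the gap in (8), i.e. I(X;Y) by (9)

print("H(X,Y)           =", round(H_joint, 4))     # ~1.7219
print("H(X) + H(Y)      =", round(H_sum, 4))       # 2.0
print("I(X;Y) (the gap) =", round(I_XY, 4))        # ~0.2781 >= 0
print("H(product joint) =", round(H(P_prod), 4))   # 2.0, the maximum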
Problem 4 (10 points).
1. (5 pts) In class, we defined the relative entropy between two discrete distributions
as D(P∥Q) ≜ ∑_{a∈A} P(a) log (P(a)/Q(a)). If instead the distributions P and Q have density
functions P(x) and Q(x) on R^n, then the relative entropy is defined as
D(P∥Q) ≜ E_{X∼P} [log (P(X)/Q(X))] = ∫_{R^n} P(a) log (P(a)/Q(a)) da.
Using this definition, express the relative entropy between vector Gaussian distributions
N (µ1,Σ1) and N (µ2,Σ2) in terms of µ1,Σ1,µ2,Σ2. Assume that Σ1 and Σ2 are
of the same size and are nonsingular. Particularize your expression to the two cases:
µ1 = µ2, and Σ1 = Σ2. Deriving insight from these particularizations, discuss the
meaning of the terms in your original expression.
2. (5 pts) The mutual information between two random variables X ∈ R^k and Y ∈ R^n,
whose joint distribution has a density PXY(x,y) on R^{n+k} and whose marginal densities
are PX(x) on R^k and PY(y) on R^n, respectively, is defined as
I(X;Y) ≜ E_{(X,Y)∼PXY} [log (PXY(X,Y) / (PX(X) PY(Y)))]
= ∫_{R^k} ∫_{R^n} PXY(a,b) log (PXY(a,b) / (PX(a) PY(b))) db da.
An additive Gaussian communication channel acts on input X ∈ R^k as
Y = AX + Z, (10)
where Z ∈ R^n, Z ∼ N(0,Σ), and A ∈ R^{n×k} is a constant matrix. Assume that the channel
input X ∼ N(0,P), and that Σ and P are nonsingular covariance matrices. Express
I(X;Y) in terms of Σ, P, and A.
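For part 1, one possible numerical sanity check (a sketch only, with arbitrarily chosen µ1, Σ1, µ2, Σ2 of ours; it is not part of the required derivation) is to estimate D(P∥Q) by Monte Carlo directly from the density definition above, and then compare a closed-form answer against the estimate.

# Monte Carlo estimate of D(P||Q) for two multivariate Gaussians, straight from
# the definition D(P||Q) = E_{X~P}[log P(X)/Q(X)]; parameters are arbitrary.
import numpy as np

def gauss_logpdf(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X, in nats."""
    n = len(mu)
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    sol = np.linalg.solve(Sigma, diff.T).T
    quad = np.einsum('ij,ij->i', diff, sol)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
n = 3                                       # dimension (arbitrary choice)
mu1, mu2 = np.zeros(n), np.ones(n)
B = rng.standard_normal((n, n))
Sigma1 = B @ B.T + n * np.eye(n)            # nonsingular covariance matrices
Sigma2 = 2.0 * np.eye(n)

X = rng.multivariate_normal(mu1, Sigma1, size=200_000)   # X ~ P = N(mu1, Sigma1)
log_ratio = gauss_logpdf(X, mu1, Sigma1) - gauss_logpdf(X, mu2, Sigma2)
D_hat = np.mean(log_ratio) / np.log(2)      # convert nats to bits
print("Monte Carlo estimate of D(P||Q) ≈", round(float(D_hat), 3), "bits")

The same estimator can be reused to spot-check the two particularizations µ1 = µ2 and Σ1 = Σ2.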
Problem 5 (5 points). Consider the optimal lossless code (f⋆, f⋆−1) in Theorem 3.1. Observe
that for all x ∈ {1, 2, . . .},
ℓ(f⋆(x)) = ⌊log2(x+ 1)⌋. (11)
Using this result, show the achievability bound
E [ℓ(f⋆(X))] ≤ H(X) + 1 bits. (12)
Context: In class, we focused on the excess length of lossless codes, and showed achievability
and converse bounds on the quantity
ϵ⋆(PX, k) ≜ min_{lossless codes (f,g)} P [ℓ(f(X)) > k].
Those bounds were expressed in terms of the complementary cdf of the information random
variable ıX(X). Problem 5 focuses on the quantity E [ℓ(f⋆(X))], and (12) gives an
achievability bound on it. That bound is expressed in terms of H(X). If H(X) is large
(e.g. X is a text of 100 symbols), the bound (12) shows that E [ℓ(f⋆(X))] is approximately
bounded above by H(X).
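The following sketch (Python/NumPy; the geometric source with q = 1/2 is our own example, labeled so that PX(1) ≥ PX(2) ≥ . . .) checks (12) numerically using the length formula (11); it does not replace the proof.

# Numerical check of the achievability bound (12) via the length formula (11),
# for a geometric source PX(x) = (1-q) q^(x-1) on {1, 2, ...} (our own example).
import numpy as np

q = 0.5
x = np.arange(1, 61)                    # truncate the support; the tail is negligible
p = (1 - q) * q ** (x - 1)
p = p / p.sum()                         # renormalize after truncation

lengths = np.floor(np.log2(x + 1))      # equation (11)
E_len = float(np.sum(p * lengths))
H_X = float(-np.sum(p * np.log2(p)))

print("E[l(f*(X))] =", round(E_len, 4))     # ≈ 1.27
print("H(X) + 1    =", round(H_X + 1, 4))   # ≈ 3.00, so (12) holds here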
Problem 6 (5 points). Let A be a finite alphabet. The type of a sequence x1, . . . , xn, where
xi ∈ A, is the vector {(1/n) Na}_{a∈A}, where Na is the number of occurrences of the letter a in
(x1, . . . , xn). Let X1, . . . , Xn be drawn i.i.d. from a distribution PX on A. Show that the
information in (X1, . . . , Xn) is a function of n, the type of (X1, . . . , Xn) and PX only.
Context: Notice that {(1/n) Na}_{a∈A} is a probability distribution on A. It is known as the
empirical distribution of (X1, . . . , Xn). Because it is a function of the random vector (X1, . . . , Xn),
this probability distribution is random. As n gets larger, the empirical distribution becomes
close to the true distribution PX with high probability.
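A quick illustration of this convergence (Python/NumPy; the alphabet and PX below are arbitrary choices of ours):

# Compute the type (empirical distribution) of an i.i.d. sample from PX and
# watch it approach PX as n grows.
import numpy as np

rng = np.random.default_rng(0)
alphabet = np.array([0, 1, 2])
P_X = np.array([0.5, 0.3, 0.2])

for n in (10, 1_000, 100_000):
    sample = rng.choice(alphabet, size=n, p=P_X)
    counts = np.array([(sample == a).sum() for a in alphabet])
    type_vec = counts / n                     # the type {N_a / n}
    print(f"n = {n:>6}: type = {np.round(type_vec, 3)}")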
Problem 7 (no collaboration; 5 points). Consider the optimal lossless code (f⋆, f⋆−1) in
Theorem 3.1 designed for the distribution PX . Suppose that the actual distribution of the
information source is PY and thus the resulting expected encoded length is E [ℓ(f⋆(Y ))].
Assume that PX and PY are such that if PX(a) = 0 then PY(a) = 0 also.¹ Show that
E [ℓ(f⋆(Y ))] ≤ H(Y ) +D(PY ∥PX) + 1. (13)
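A numerical check of (13) (not a proof; PX and PY below are arbitrary choices of ours, with PX sorted in nonincreasing order so that the lengths of f⋆ designed for PX are given by (11)):

# Mismatched lossless coding: compare E[l(f*(Y))] with H(Y) + D(PY||PX) + 1.
import numpy as np

m = 8
x = np.arange(1, m + 1)
P_X = np.array([0.30, 0.20, 0.15, 0.12, 0.10, 0.06, 0.04, 0.03])  # code design
P_Y = np.array([0.10, 0.05, 0.25, 0.10, 0.20, 0.05, 0.15, 0.10])  # actual source

lengths = np.floor(np.log2(x + 1))                 # equation (11)
E_len_mismatch = float(np.sum(P_Y * lengths))      # E[l(f*(Y))]
H_Y = float(-np.sum(P_Y * np.log2(P_Y)))
D_YX = float(np.sum(P_Y * np.log2(P_Y / P_X)))     # D(PY || PX)

print("E[l(f*(Y))]          =", round(E_len_mismatch, 4))    # ≈ 2.10
print("H(Y) + D(PY||PX) + 1 =", round(H_Y + D_YX + 1, 4))    # ≈ 4.35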
Problem 8 (0 points). How much time did you spend on this set?
¹ We say that PY is absolutely continuous with respect to PX.