CS 8395: Homework 2
Overall directions
Any resource available to you may be used to answer the homework questions, and you may collaborate with
anyone in the class, with the following caveats:
• You must record the names of your collaborators at the top of your homework.
• You are strongly encouraged to avoid copying code/answers directly; instead, where possible, write
your own versions! (Copying boilerplate data/plotting code is often most efficient, but you’ll learn
more about the Deep Learning content by implementing those parts yourself.)
This homework is due November 1st at midnight, via Brightspace. Please archive your code in either
.zip or .tar.gz format, and please submit written work as a PDF.
Please also include the approximate amount of time you spent on this homework at the top of your
submission.
Rate-Distortion and Echoes
In class, the VAE we trained used a loss function composed of a data term (the MSE) and a prior term (the
KL divergence). A week or so later we derived the β-VAE, which relaxes a constrained optimization problem
into the following (unconstrained) loss:
\[
L[p, q] \;=\; \underbrace{-\log p(x \mid z)}_{\text{MSE}} \;+\; \beta\, \underbrace{\mathrm{KL}\big[\, q(z \mid x) \,\big\|\, p(z) \,\big]}_{\text{prior}}
\]
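For concreteness, here is a minimal PyTorch sketch of how this loss is typically computed for a diagonal-Gaussian encoder; the names x_hat, mu, logvar, and beta are placeholders rather than anything specified in this assignment:
\begin{verbatim}
import torch
import torch.nn.functional as F

def beta_vae_loss(x_hat, x, mu, logvar, beta=1.0):
    """Data term (MSE) plus a beta-weighted KL[q(z|x) || N(0, I)] prior term."""
    batch = x.size(0)
    # Reconstruction term: summed squared error, averaged over the batch.
    mse = F.mse_loss(x_hat, x, reduction="sum") / batch
    # Closed-form KL between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return mse + beta * kl
\end{verbatim}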
a) Describe the VAE's behavior in the extreme cases β → 0 and β → ∞.
b) We also discussed the Rate-Distortion version of the loss function, using q(z), the aggregate posterior (or
empirical marginal) distribution of z induced by q(z|x), integrated over the data x:
\[
L[p, q] \;=\; \underbrace{-\log p(x \mid z)}_{\text{MSE}} \;+\; \beta\, \underbrace{\mathrm{KL}\big[\, q(z \mid x) \,\big\|\, q(z) \,\big]}_{\text{posterior}}
\]
Explain why calculation of q(z) is hard with Gaussian encoders q(z|x).
c) Suppose instead we choose a different encoder distribution, using two (deterministic) functions m(x) and
s(x):
q∗(z|x) = m(x) + s(x) · z1
Here, z1 is ALSO sampled from q∗(z|x = x1), but using x1, a new data sample. Write out two recursions of
q∗(z|x), expanding z1 into its definition using q∗, etc., stopping at z3.
d) We know that KL[ q(z|x) ∥ q(z) ] is equivalent to I(x, z). Using the identity I(x, z) = H(z) − H(z|x),
show that KL[ q∗(z|x) ∥ q∗(z) ] = −Ex[ log |det S(x)| ], where q∗(z) is the marginal induced by q∗(z|x).
Hint: use H(z|x) = Ex[ H[ m(x) + s(x) · z1 | X = x ] ]
e) What are the main computational issues with sampling from q∗? Devise a scheme to sample from q∗,
using any approximation/simplifying assumptions you deem necessary.
GAN Construction
In this assignment you will construct the basic (“vanilla”) GAN and then the W-GAN using the MNIST dataset. You
are free to use whichever architecture you prefer. However, unless you have access to a GPU, I strongly
recommend modifying the decoder structure from the VAE assignment. There are several extant examples
using fully connected networks with modules like:
[Linear → LeakyReLU] × 4
using increasing linear layer sizes, e.g., 128 → 256 → 512 → 1024 → (28 ∗ 28). LeakyReLU often replaces
the ReLU non-linearity in direct generative models, in both convolutional and fully connected
generators. If you do choose to use convolution, I recommend an architecture that looks like:
[ConvTranspose → BatchNorm → LeakyReLU] × 2
after several fully connected layers. Using a stride of 2 for the transposed convolutions, this means you can go from
7 × 7 to 14 × 14 to 28 × 28, which is the image size.
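As a rough illustration only (you remain free to use whichever architecture you prefer), the fully connected option might look like the sketch below; the latent size z_dim, the LeakyReLU slope of 0.2, and the final Tanh are assumptions, not requirements:
\begin{verbatim}
import torch.nn as nn

def make_fc_generator(z_dim=100):
    """[Linear -> LeakyReLU] x 4, widths 128 -> 256 -> 512 -> 1024, then 28*28 outputs."""
    return nn.Sequential(
        nn.Linear(z_dim, 128), nn.LeakyReLU(0.2),
        nn.Linear(128, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 512), nn.LeakyReLU(0.2),
        nn.Linear(512, 1024), nn.LeakyReLU(0.2),
        nn.Linear(1024, 28 * 28),
        nn.Tanh(),  # assumes MNIST images rescaled to [-1, 1]
    )
\end{verbatim}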
As a reminder, for the “vanilla” GAN, the loss function of the discriminator D is
\[
L_{\mathrm{adv}} = \mathrm{BCEntropy}\big[ D(x_{\mathrm{real}}),\, 1 \big] + \mathrm{BCEntropy}\big[ D(G(z)),\, 0 \big]
\]
The loss for the generator G is the non-saturating form
\[
L_{\mathrm{gen}} = \mathrm{BCEntropy}\big[ D(G(z)),\, 1 \big],
\]
which is commonly used in place of the original minimax term −BCEntropy[ D(G(z)), 0 ].
Fully connected discriminators should be sufficient in both cases. I also recommend a slower learning
rate (∼0.0001) and at least 100 epochs. For MNIST there should generally be no need to tune the number of
critic iterations; training the critic and generator 1-to-1 should be sufficient.
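A minimal sketch of one training step using these losses and recommendations is given below; G, D, opt_g, opt_d, and z_dim are placeholders for your own generator, discriminator (assumed to end in a sigmoid and take flattened images), and Adam optimizers:
\begin{verbatim}
import torch
import torch.nn.functional as F

def vanilla_gan_step(G, D, opt_g, opt_d, x_real, z_dim=100):
    """One 1-to-1 discriminator/generator update with the BCE losses above."""
    b = x_real.size(0)
    ones = torch.ones(b, 1, device=x_real.device)
    zeros = torch.zeros(b, 1, device=x_real.device)

    # Discriminator step: real images toward 1, generated images toward 0.
    z = torch.randn(b, z_dim, device=x_real.device)
    d_loss = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1.
    z = torch.randn(b, z_dim, device=x_real.device)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
\end{verbatim}
Looping this over the MNIST training set with Adam at a learning rate of about 0.0001 for 100+ epochs matches the recommendations above.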
a) Plot the results of your GAN (even if they’re bad).
b) Implement the gradient penalty of the W-GAN (WGAN-GP), and compare the results to the regular GAN.
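For part b, the penalty itself is standard boilerplate; one possible sketch is below, assuming a critic D without a sigmoid output, flattened (batch, 784) inputs, and the common penalty weight lam = 10 (all of these are choices, not requirements). The returned value is added to the usual critic loss.
\begin{verbatim}
import torch

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """W-GAN gradient penalty: lam * (||grad_x D(x_hat)|| - 1)^2 at random interpolates."""
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)  # broadcasts over flat inputs
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads, = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                 grad_outputs=torch.ones_like(d_hat),
                                 create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
\end{verbatim}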
c) Given an image, compute its GAN embedding z by gradient descent; that is, given x, freeze the generator
and optimize ∥G(z) − x∥ over z. Test this out on the first few points in the MNIST test set. By starting
from different initializations of z, do we obtain different z∗ = argminz ∥G(z) − x∥? (Essentially, is the
argmin we find only a local minimum?)
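One way to set up the inversion in part c is sketched below; the optimizer, step count, and learning rate are arbitrary choices, and G and x are assumed to be a trained generator and a single flattened test image:
\begin{verbatim}
import torch

def invert(G, x, z_dim=100, steps=2000, lr=0.01, z_init=None):
    """Find z minimizing ||G(z) - x||^2 by gradient descent, with the generator frozen."""
    for p in G.parameters():
        p.requires_grad_(False)  # freeze G; only z is optimized
    z = torch.randn(1, z_dim) if z_init is None else z_init.detach().clone()
    z.requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = ((G(z) - x) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach(), loss.item()
\end{verbatim}
Calling this several times with different z_init values for the same x lets you compare the recovered z∗ and their reconstruction errors.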
d) Given two images xA and xB and their inverted codes zA and zB, describe in English/pseudocode a method
for traversing between their codes with minimal differences per step length, i.e., construct a (discretization
of a) path Z(t) through the space of z’s that minimizes
\[
U(Z(t)) \;=\; \int_0^1 \Big[\, \| G(Z(t)) - x_A \|_2^2 + \| G(Z(t)) - x_B \|_2^2 \,\Big] \left\| \frac{dZ(t)}{dt} \right\|^2 dt
\quad \text{such that } Z(0) = z_A,\ Z(1) = z_B \tag{1}
\]
(bonus points for implementing the proposed method)
e) Suppose we wanted to learn decodings of z so that minZ(t) U(Z(t)) is minimal. How could we modify our
GAN training procedure to accomplish this?