CS 8395: Homework 2
Overall directions

Any resource available to you may be used to answer the homework questions, and you may collaborate with anyone in the class, with the following caveats:

• You must record the names of your collaborators at the top of your homework.
• You are strongly encouraged to avoid copying code/answers directly; instead, where possible, write your own versions! (Copying boilerplate data/plotting code is often most efficient, but you'll learn more about the Deep Learning content by implementing those parts yourself.)

This homework is due November 1st at midnight, via Brightspace. Please zip your code using either the .zip or .tar.gz format, and please submit written work as a PDF. Please also include the approximate amount of time you spent on this homework at the top of your submission.

Rate-Distortion and Echoes

In class, the VAE we trained used a loss function composed of a data term (the MSE) and a prior term (the KL divergence). A week or so later we derived the β-VAE, which relaxes a constrained optimization problem to the following (unconstrained) loss:

L[p, q] = −log p(x|z) + β KL[ q(z|x) ‖ p(z) ]

where the first term is the reconstruction (MSE) term and the second is the prior term.

a) Describe the VAE's behavior in the extreme cases β → 0 and β → ∞.

b) We also discussed the Rate-Distortion version of the loss function, which uses q(z), the "posterior" or empirical marginal distribution of z induced by q(z|x) (integrated over the data x):

L[p, q] = −log p(x|z) + β KL[ q(z|x) ‖ q(z) ]

Explain why calculating q(z) is hard with Gaussian encoders q(z|x).

c) Suppose instead we choose a different encoder distribution, built from two deterministic functions m(x) and s(x):

q∗(z|x) = m(x) + s(x) · z_1

Here, z_1 is itself sampled from q∗(z|x = x_1), where x_1 is a new data sample. Write out two recursions of q∗(z|x), expanding z_1 into its definition using q∗, and so on, stopping at z_3.

d) We know that KL[ q(z|x) ‖ q(z) ] is equivalent to I(x, z). Using the identity I(x, z) = H(z) − H(z|x), show that

KL[ q∗(z|x) ‖ q∗(z) ] = E_x[ log det S(x) ],

where q∗(z) is the marginal induced by q∗(z|x). Hint: use H(z|x) = E_x[ H[ m(x) + s(x) · z_1 | X = x ] ].

e) What are the main computational issues with sampling from q∗? Devise a scheme to sample from q∗, using any approximations/simplifying assumptions you deem necessary.

GAN Construction

In this assignment you will construct the basic GAN and then the W-GAN, using the MNIST dataset. You are free to use whichever architecture you prefer. However, unless you have access to a GPU, I strongly recommend modifying the decoder structure from the VAE assignment. There are several extant examples using fully connected networks with modules like

[Linear → LeakyReLU] × 4

with increasing linear layer sizes, e.g., 128 → 256 → 512 → 1024 → (28 ∗ 28). LeakyReLU often replaces the ReLU non-linearity in direct generative models, in both convolutional and fully connected generators. If you do choose to use convolutions, I recommend an architecture that looks like

[ConvTranspose → BatchNorm → LeakyReLU] × 2

after several fully connected layers. Using a stride of 2 for the transposed convolutions, this lets you go from 7 × 7 to 14 × 14 to 28 × 28, which is the image size.

As a reminder, for the "vanilla" GAN the loss function of the discriminator D is

L_adv = BCE[ D(x_real), 1 ] + BCE[ D(G(z)), 0 ]

and the loss for the generator G is

L_gen = BCE[ D(G(z)), 1 ],

the non-saturating form (the original, saturating alternative is L_gen = −BCE[ D(G(z)), 0 ]). Here BCE[·, ·] denotes binary cross-entropy.

Fully connected discriminators should be sufficient in both cases. I also recommend a slower learning rate (∼ 0.0001) and at least 100 epochs. For MNIST there should generally be no need to tune the number of critic updates per generator step; training 1-to-1 should be sufficient.
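For concreteness, here is a minimal sketch of such a fully connected generator and discriminator, assuming PyTorch, a 128-dimensional latent code, and MNIST images normalized to [−1, 1]; the exact widths, output nonlinearity, and LeakyReLU slope are choices, not requirements:

```python
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed latent size; any reasonable value works

class Generator(nn.Module):
    """Fully connected generator: a stack of Linear -> LeakyReLU blocks, 128 -> ... -> 28*28."""
    def __init__(self, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 28 * 28), nn.Tanh(),  # Tanh assumes pixels scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

class Discriminator(nn.Module):
    """Fully connected discriminator; keep the Sigmoid for the vanilla GAN,
    drop it when reusing this network as the W-GAN critic."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```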
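And a sketch of one vanilla GAN training step implementing L_adv and L_gen above, assuming the Generator and Discriminator just sketched and the suggested learning rate of 1e-4; batches of real images (x_real) would come from a standard MNIST DataLoader:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
G, D = Generator().to(device), Discriminator().to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(x_real):
    x_real = x_real.to(device)
    b = x_real.size(0)
    ones = torch.ones(b, 1, device=device)
    zeros = torch.zeros(b, 1, device=device)
    z = torch.randn(b, LATENT_DIM, device=device)

    # Discriminator: L_adv = BCE[D(x_real), 1] + BCE[D(G(z)), 0]
    loss_D = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator (non-saturating): L_gen = BCE[D(G(z)), 1]
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```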
a) Plot the results of your GAN (even if they're bad).

b) Implement the gradient penalty of the W-GAN, and compare the results to those of the regular GAN. (A sketch of the penalty term appears after this list.)

c) Given an image, compute its GAN embedding z by gradient descent; that is, given x, freeze the generator and optimize ‖G(z) − x‖ over z. Test this on the first few points of the MNIST test set. Starting from different initializations of z, do we obtain different z∗ = argmin_z ‖G(z) − x‖? (Essentially: is the argmin only a local minimum?) (A sketch of this inversion appears after this list.)

d) Given two images x_A and x_B and their inverted codes z_A and z_B, describe in English/pseudocode a method for traversing between their codes with minimal difference per step length; i.e., construct a (discretization of a) path Z(t) through the space of z's that minimizes

U(Z(t)) = ∫_0^1 [ ‖G(Z(t)) − x_A‖_2^2 + ‖G(Z(t)) − x_B‖_2^2 ] ‖dZ(t)/dt‖^2 dt,  such that Z(0) = z_A, Z(1) = z_B.   (1)

(Bonus points for implementing the proposed method; one possible discretization is sketched after this list.)

e) Suppose we wanted to learn decodings of z so that min_{Z(t)} U(Z(t)) is minimal. How could we modify our GAN training procedure to accomplish this?
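For part b), a minimal sketch of the WGAN-GP gradient penalty, assuming PyTorch and a critic D whose final Sigmoid has been removed; the penalty weight of 10 follows the original WGAN-GP paper but should be treated as a tunable choice:

```python
import torch

def gradient_penalty(D, x_real, x_fake, lambda_gp=10.0):
    """Penalize (||grad_x D(x_hat)||_2 - 1)^2 on random interpolates x_hat."""
    b = x_real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=x_real.device)          # per-sample mixing weights
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.reshape(b, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Critic loss:    D(x_fake).mean() - D(x_real).mean() + gradient_penalty(D, x_real, x_fake)
# Generator loss: -D(G(z)).mean()
```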
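For part c), a sketch of inverting a frozen generator by gradient descent on z; the initialization, optimizer, learning rate, and number of steps are all assumptions worth experimenting with, and re-running with different seeds probes whether z∗ depends on the starting point:

```python
import torch

def invert(G, x, latent_dim=128, steps=2000, lr=0.01, seed=0):
    """Approximate z* = argmin_z ||G(z) - x||^2 with the generator weights frozen."""
    torch.manual_seed(seed)                       # different seeds -> different initial z
    G.eval()
    for p in G.parameters():
        p.requires_grad_(False)                   # freeze the generator
    z = torch.randn(1, latent_dim, device=x.device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = ((G(z) - x) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach(), loss.item()

# Example: invert the same test image from a few different seeds and compare the z*'s
# (and the reconstructions G(z*)) to see whether the optimum found is only local.
```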
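For part d) (the bonus), one possible discretization of U(Z(t)): represent the path by K interior latent points with the endpoints pinned to z_A and z_B, replace the integral by a sum and dZ/dt by finite differences, and run gradient descent on the interior points. This is only a sketch of one approach under those assumptions (z_A, z_B as 1-D latent vectors, x_A, x_B shaped like the generator output), not the required method:

```python
import torch

def path_energy(G, Z, x_A, x_B):
    """Discretized U: sum_k [ ||G(z_k)-x_A||^2 + ||G(z_k)-x_B||^2 ] * ||z_{k+1}-z_k||^2."""
    imgs = G(Z)                                               # (K+2, 1, 28, 28)
    data = ((imgs - x_A) ** 2).flatten(1).sum(1) + ((imgs - x_B) ** 2).flatten(1).sum(1)
    speed = (Z[1:] - Z[:-1]).pow(2).sum(1)                    # finite-difference ||dZ/dt||^2
    return (data[:-1] * speed).sum()

def optimize_path(G, z_A, z_B, x_A, x_B, K=16, steps=500, lr=0.01):
    """Returns a (K+2, latent_dim) discretized path from z_A to z_B."""
    ts = torch.linspace(0, 1, K + 2, device=z_A.device).unsqueeze(1)
    init = (1 - ts) * z_A + ts * z_B                          # straight-line initialization
    interior = init[1:-1].clone().requires_grad_(True)        # endpoints stay fixed
    opt = torch.optim.Adam([interior], lr=lr)
    for _ in range(steps):
        Z = torch.cat([z_A.unsqueeze(0), interior, z_B.unsqueeze(0)], dim=0)
        loss = path_energy(G, Z, x_A, x_B)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat([z_A.unsqueeze(0), interior.detach(), z_B.unsqueeze(0)], dim=0)
```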