Applied Fundamentals of Deep Learning
Date : July 4th, 2022
Question 1. [10 marks]
Circle the best answer for each of the questions below. Do not circle more than one answer per question.
Part (a) [1 mark]
Suppose we want to train a neural network to approximate f(x) = x³, x ∈ R. What is your choice for the activation function of the output layer?
(A) Tanh
(B) Softplus
(C) ReLU
(D) Linear * (1 mark)
Part (b) [1 mark]
Which one of the following techniques is NOT used to prevent over-fitting?
(A) Dropout
(B) Batch Normalization
(C) Momentum * (1 mark)
(D) Weight decay
Part (c) [1 mark]
Which one of the following is correct about normalization layers?
(A) LayerNorm is compatible with a batch size of 1. * (1 mark)
(B) BatchNorm is compatible with a batch size of 1.
(C) LayerNorm keeps track of training statistics for inference.
(D) BatchNorm does not have trainable parameters.
Part (d) [1 mark]
Suppose we have a fully-connected network with the following configuration: [10, 20, 20, 2], where each number denotes the number of neurons in the corresponding layer (e.g., input size of 10, output size of 2). What is the total number of parameters in this model, considering the bias terms?
(A) 640
(B) 682 * (1 mark)
(C) 692
(D) 680
Part (e) [1 mark]
Which one of the following is correct about CNNs?
(A) Weight sharing occurs across all spatial dimensions. * (1 mark)
(B) As we go deeper, the feature map resolution is increased.
(C) The convolution operation is rotation invariant.
(D) Strides are used to reduce the depth information.
Part (f) [1 mark]
Which one of the following is correct about deep CNN architectures?
(A) Pixel-wise transformations are used to manipulate the feature map resolution.
(B) Skip connections in ResNet multiply the output signals by the input signals.
(C) We can use proxy losses over intermediate layers to tackle the gradient vanishing problem. * (1 mark)
(D) Most recent deep architectures use 5 × 5 kernels.
Part (g) [1 mark]
Which one of the following is NOT correct about unsupervised learning?
(A) The loss function in contrastive learning is defined over the embedding space.
(B) In contrastive learning, we maximize the agreement between embeddings encoded from random augmentations of different samples. * (1 mark)
(C) Variational autoencoders impose a distribution prior over the embedding space.
(D) For transfer learning with autoencoders, we throw out the decoder after pre-training.
Part (h) [1 mark]
Which one of the following is correct about optimizers?
(A) Increasing the learning rate as the training progresses leads to faster convergence.
(B) When using the Adam optimizer, each parameter is updated with its own adaptive learning rate. * (1 mark)
(C) Grid search is preferred to find the best learning rate.
(D) The Adam optimizer integrates momentum in its updates.
Part (i) [1 mark]
Which one of the following is NOT correct about loss functions?
(A) The inputs to nn.CrossEntropyLoss() are logits.
(B) The inputs to nn.NLLLoss() are from a categorical distribution.
(C) The inputs to nn.BCELoss() are from a Bernoulli distribution.
(D) None of the above. * (1 mark)
Part (j) [1 mark]
What is the output size of applying a convolution filter of size 3 × 3 × 5 [depth, height, width] over an input image of size 3 × 63 × 105 with no padding and a stride of 2?
(A) 3 × 30 × 50.
(B) 1 × 30 × 50. * (1 mark)
(C) 3 × 30 × 30.
(D) 1 × 50 × 50.
Question 2. [10 marks]
Answer the following questions regarding training and evaluating neural networks.
Part (a) [2 marks]
Why is accuracy not a good measure for evaluating performance on imbalanced data? Give an example where the accuracy measure fails.
Part (b) [2 marks]
Which evaluation metric tells us the percentage of actual positive samples out of all the samples that our model has predicted as positive? Which evaluation metric tells us what percentage of positive samples are correctly classified as positive by our model?
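For reference, a minimal sketch (with hypothetical predictions and labels) of how the two metrics in question are computed:

import torch

# Illustrative only:
#   precision = TP / (TP + FP)  -- fraction of predicted positives that are truly positive
#   recall    = TP / (TP + FN)  -- fraction of actual positives the model finds
preds  = torch.tensor([1, 1, 0, 1, 0, 0])   # hypothetical predictions
labels = torch.tensor([1, 0, 0, 1, 1, 0])   # hypothetical ground truth

tp = ((preds == 1) & (labels == 1)).sum().item()
fp = ((preds == 1) & (labels == 0)).sum().item()
fn = ((preds == 0) & (labels == 1)).sum().item()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)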
Part (c) [2 marks]
Suppose we have access to train/validation/test splits of our data and we want to standardize the splits before passing them to our model by subtracting the mean and dividing by the standard deviation. Explain how we should compute and apply these statistics.
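One common approach, shown as a minimal sketch with hypothetical tensors:

import torch

# The statistics are computed on the training split only and then applied
# unchanged to all three splits, so no information from the validation/test
# data leaks into preprocessing.
x_train = torch.rand(1000, 20)   # hypothetical data
x_val   = torch.rand(200, 20)
x_test  = torch.rand(200, 20)

mean, std = x_train.mean(dim=0), x_train.std(dim=0)
x_train = (x_train - mean) / std
x_val   = (x_val   - mean) / std
x_test  = (x_test  - mean) / std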
Part (d) [4 marks]
Answer the following questions.
(1) How should we change our learning rate if the training loss is fluctuating?
(2) How should we change our model if the training loss is not decreasing?
(3) How should we change our model if the training loss is decreasing but validation loss is increasing?
(4) Which type of model error is more dangerous in medical diagnosis: False positive or False negative?
Question 3. [15 marks]
Answer the following questions about fully-connected neural networks.
Part (a) [3 marks]
Suppose we want to fit a model with a single weight (y = wx + b) to the following data.
(1) Can we ignore the bias term? Why?
(2) Do we need non-linearity?
Part (b) [2 marks]
In the following neural networks, we are using linear functions for the output and hidden layers. Which of the networks is more powerful? Why?
Part (c) [6 marks]
Suppose we have the following two-layer NN that receives a quadratic function of the input x and predicts a binary class. Derive the update rule for weight w1. Suppose that we use a custom non-linearity f(z) = (1/3)z³ and that E = −[t × log(y) + (1 − t) × log(1 − y)] is the binary cross-entropy loss function, with t representing the true label and y the predicted label. Also note that
Part (d) [2 marks]
In binary classification tasks, we usually set the threshold to 0.5 (p < 0.5 is considered as predicting class 0 and p >= 0.5 as predicting class 1) to impose the prior that True Positives and True Negatives are equally important to us. Suppose for a specific task we are more interested in increasing Recall than Precision. How can we address this using the threshold value? How does this affect the total number of False Positives?
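As an illustration only (hypothetical probabilities and labels), the threshold acts as a knob on the recall/false-positive trade-off:

import torch

probs  = torch.tensor([0.95, 0.60, 0.45, 0.30, 0.15])   # hypothetical predicted probabilities
labels = torch.tensor([1, 1, 1, 0, 0])                   # hypothetical ground truth

def recall_and_fp(threshold):
    preds = (probs >= threshold).long()
    tp = ((preds == 1) & (labels == 1)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    return tp / (tp + fn), fp   # (recall, number of false positives)

print(recall_and_fp(0.5))   # default threshold
print(recall_and_fp(0.3))   # lower threshold: recall rises (or stays), false positives can only increase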
Part (e) [2 marks]
Explain why we can use a linear classifier on top of a neural network with a few hidden layers to classify non-linearly separable data.
Question 4. [20 marks]
Answer the following questions regarding CNNs. In all questions assume that: (1) we are not using a bias term, (2) the sizes are represented as Depth × Height × Width.
Part (a) [2 marks]
Assume we have a feature map of size 25 × 100 × 100 and we want to map it to a feature map of size 50 × 100 × 100. Fill out the missing parts.
# Conv2d: in_ch, out_ch, kernel_size, stride, padding
self.conv = nn.Conv2d(25, ?, ?, 1, 0)
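A quick way to check any candidate configuration is to pass a dummy tensor through it; in this sketch the out_ch and k values are arbitrary placeholders, not the intended answer.

import torch
import torch.nn as nn

out_ch, k = 8, 3                        # placeholders only
conv = nn.Conv2d(25, out_ch, k, 1, 0)
x = torch.rand(1, 25, 100, 100)
print(conv(x).shape)                    # the correct choice should produce (1, 50, 100, 100)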
Part (b) [2 marks]
Suppose that the input image to a CNN is of size 3 × 40 × 100, and we apply the following kernel to this image.
(1) What is the total number of parameters?
(2) What is the total number of parameters if the input image size changes to 3 × 120 × 300?
# Conv2d: in_ch, out_ch, kernel_size, stride, padding
self.conv1 = nn.Conv2d(3, 10, 3, 1, 0)
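One way to double-check a hand count, as a small sketch (bias=False reflects assumption (1) above):

import torch.nn as nn

# The weight count of a Conv2d with no bias is out_ch * in_ch * k_h * k_w,
# and it does not depend on the spatial size of the input image.
conv1 = nn.Conv2d(3, 10, 3, 1, 0, bias=False)
print(sum(p.numel() for p in conv1.parameters()))   # 10 * 3 * 3 * 3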
Part (c) [6 marks]
Assume that we have two binary classification tasks. Task A is classifying a given image of a building as commercial/residential. Task B is to predict if a given image of a shape has vertical edges or not. Because we have only a very small number of labeled samples per task, we want to use transfer learning from a pretrained VGG. To do so, we plug a linear binary classifier into the VGG. With these assumptions, answer the following questions:
(1) Where is the best place to plug in the classifier for task A? Why?
(2) Where is the best place to plug in the classifier for task B? Why?
(3) Which layers should we freeze and which layers should we tune for both tasks?
Part (d) [8 marks]
Suppose we have the following Convolutional Denoising Autoencoder. Answer the following questions (show your calculations):
(1) what is the total number of parameters in the encoder?
(2) what is the total number of parameters in the decoder?
(3) Given a reconstruction loss function L(·, ·), what are the two arguments that should be passed to the function?
(4) What is the dimension of z if the size of input image x is 65 × 65 × 1?
class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        self.drop = nn.Dropout(0.15)
        self.encoder = nn.Sequential(
            # in_ch, out_ch, k_size, stride, padding
            nn.Conv2d(1, 16, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 7)
        )
        self.decoder = nn.Sequential(
            # in_ch, out_ch, k_size, stride, padding, out_padding
            nn.ConvTranspose2d(64, 32, 7),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, 2, 1, 1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, 2, 1, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x_noisy = torch.clip(x + 0.4 * torch.randn(*x.shape), 0., 1.)
        z = self.encoder(self.drop(x_noisy))
        x_reconstructed = self.decoder(z)
        return x, x_noisy, x_reconstructed, z
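The hand calculations above can be cross-checked empirically; a minimal sketch, assuming torch is imported and the AutoEncoder class above is defined (note that these layers include bias terms by default):

import torch

model = AutoEncoder()
x = torch.rand(1, 1, 65, 65)                                 # single 1 x 65 x 65 image
x_clean, x_noisy, x_reconstructed, z = model(x)
print(z.shape)                                               # dimension of z
print(sum(p.numel() for p in model.encoder.parameters()))   # encoder parameter count
print(sum(p.numel() for p in model.decoder.parameters()))   # decoder parameter count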
Part (e) [2 marks]
Suppose the size of the output feature map of the last convolutional layer of a hypothetical model is 256 × 8 × 8. Now suppose that, instead of using a fully-connected network, we want to use a convolutional layer to perform classification over 512 classes. Fill out the missing parts.
# Conv2d: in_ch, out_ch, kernel_size, stride, padding
self.conv = nn.Conv2d(256, ?, ?, 1, 0)
Question 5. [15 marks]
Part (a) [5 marks]
Suppose we want to design a model that can classify and denoise the input images at the same time. For simplicity, suppose our inputs are gray-scale 32 × 32 images belonging to one of 10 possible classes and we want to use only fully-connected layers in our model. The model receives these noisy input images, simultaneously classifies them into one of the 10 classes, and also removes the noise from the input image. Explain how you would design such a model and how you would train it.
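One possible design, shown as a minimal sketch with assumed hidden sizes (512 and 256) and an assumed loss weighting: a shared fully-connected encoder feeding two heads, one for 10-way classification and one for reconstructing the denoised image, trained with a weighted sum of cross-entropy and reconstruction losses.

import torch
import torch.nn as nn

class DenoiseClassify(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared fully-connected encoder (hidden sizes are assumptions).
        self.shared = nn.Sequential(nn.Linear(32 * 32, 512), nn.ReLU(),
                                    nn.Linear(512, 256), nn.ReLU())
        self.cls_head = nn.Linear(256, 10)                       # class logits
        self.rec_head = nn.Sequential(nn.Linear(256, 32 * 32),   # denoised image
                                      nn.Sigmoid())

    def forward(self, x_noisy):                  # x_noisy: (B, 1, 32, 32)
        h = self.shared(x_noisy.flatten(1))
        return self.cls_head(h), self.rec_head(h).view(-1, 1, 32, 32)

# Training sketch: loss = F.cross_entropy(logits, targets) + lambda_rec * F.mse_loss(x_hat, x_clean)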
Part (b) [5 marks]
Suppose we have a network that is trained to classify an input into one of 5 possible classes. Now suppose for a specific sample, the network outputs the following logits: [2.1, −1.4, 0.8, −0.3, 1.7]. Also suppose that, due to disagreement among our human annotators, we have the following distribution for our true class: [0.1, 0.05, 0.05, 0.1, 0.7]. Compute the KL-divergence and Cross Entropy loss between the predictions and the ground truth for this instance.
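A minimal PyTorch sketch of the computation with soft targets (the numeric results are left to the reader):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, -1.4, 0.8, -0.3, 1.7])   # model logits from the question
p_true = torch.tensor([0.1, 0.05, 0.05, 0.1, 0.7])   # annotator label distribution

q = F.softmax(logits, dim=0)                          # predicted distribution
cross_entropy = -(p_true * torch.log(q)).sum()        # H(p, q)
entropy       = -(p_true * torch.log(p_true)).sum()   # H(p)
kl_divergence = cross_entropy - entropy               # KL(p || q) = H(p, q) - H(p)

print(cross_entropy.item(), kl_divergence.item())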
Part (c) [5 marks]
Explain why we need the reparametrization trick to train VAEs. Also explain, once we have trained a VAE, how we can get the embeddings of input images and how we can generate new images.
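For reference, a minimal sketch of the trick itself, with a hypothetical latent size of 16:

import torch

# Sampling z ~ N(mu, sigma^2) directly is not differentiable w.r.t. mu and sigma,
# so we sample eps ~ N(0, I) and build z deterministically from mu, log_var and eps,
# letting gradients flow back into the encoder.
mu      = torch.zeros(4, 16, requires_grad=True)   # hypothetical encoder outputs
log_var = torch.zeros(4, 16, requires_grad=True)

eps = torch.randn_like(mu)
z   = mu + torch.exp(0.5 * log_var) * eps           # differentiable sample

# After training: embeddings of an input are typically taken as the encoder mean mu,
# and new images are generated by decoding samples from the prior,
# e.g. decoder(torch.randn(1, 16)) for this hypothetical 16-dimensional latent space.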
Question 6. [10 marks]
Fill out the __init__ and forward functions of the class to implement the following Inception block.
class InceptionBlock(nn.Module):
    def __init__(self):
        super(InceptionBlock, self).__init__()
        ...

    def forward(self, x):
        ...
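Since the block diagram is not reproduced here, the following is only a generic GoogLeNet-style sketch with assumed channel counts and an added in_ch argument; the actual branch widths and structure should follow the figure.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch=64):
        super(InceptionBlock, self).__init__()
        # Four parallel branches whose outputs are concatenated along channels
        # (all channel counts below are assumptions, not taken from the figure).
        self.branch1 = nn.Conv2d(in_ch, 32, 1)                            # 1x1 conv
        self.branch2 = nn.Sequential(nn.Conv2d(in_ch, 48, 1), nn.ReLU(),
                                     nn.Conv2d(48, 64, 3, padding=1))     # 1x1 -> 3x3
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 8, 1), nn.ReLU(),
                                     nn.Conv2d(8, 16, 5, padding=2))      # 1x1 -> 5x5
        self.branch4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                     nn.Conv2d(in_ch, 16, 1))             # pool -> 1x1

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)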