CS260C: Deep Learning
Homework 3: Due March 9, 11:59 PM (No Late Submission)
In this homework, you will implement self-attention, and practice fine-tuning BERT on a sentiment analysis
task based on the transformers library.
Requirements
• Please work in the provided Google Colab notebook:
• Submit your notebook (a .ipynb file). On Google Colab, you can download the notebook via:
File → Download → Download .ipynb. Please make sure to keep the output of your final run in
the notebook. It is strictly prohibited to falsify the output.
• Submit a PDF file. Show results of your fine-tuning, inference, and the comparison between fine-tuning
pre-trained BERT and training from scratch.
Part 1 Implementing Self-Attention (30 pts)
In this part, you are asked to implement self-attention in a multi-head self-attention function. You are
not allowed to use high-level PyTorch functions that are designed specifically for self-attention. The correctness of
your implementation will be checked by comparing its results against torch.nn.MultiheadAttention on a
number of test cases.
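For orientation only, below is a minimal sketch of the multi-head self-attention computation, assuming the function receives the projection matrices w_q, w_k, w_v, w_o directly (the signature expected in the notebook may differ; matching torch.nn.MultiheadAttention exactly also requires reusing its in_proj_weight and out_proj parameters and accounting for the transposed layout used by nn.Linear).

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Sketch of multi-head self-attention on x of shape (batch, seq_len, embed_dim).

    w_q, w_k, w_v, w_o are assumed (embed_dim, embed_dim) projection matrices.
    """
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads

    # Linear projections for queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Split the embedding dimension into (num_heads, head_dim) and move heads forward.
    def split_heads(t):
        return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    weights = F.softmax(scores, dim=-1)
    out = weights @ v  # (batch, num_heads, seq_len, head_dim)

    # Concatenate heads and apply the output projection.
    out = out.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
    return out @ w_o
```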
Part 2 Training Transformers and Fine-tuning BERT (70 pts)
In this part, we will use the transformers library to fine-tune pre-trained BERT on a sentiment analysis task.
Please familiarize yourself with its usage by reading the following pages:
For this part, there is no skeleton code; please create your own code cells.
Fine-tuning BERT
Practice fine-tuning BERT on the following setting:
• Dataset: SST-2
• Pre-trained model: distilbert-base-uncased.
The expected test accuracy is at least 90%.
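As a starting point, a minimal fine-tuning sketch using the datasets and transformers libraries is shown below. The dataset identifier ("glue", "sst2"), the hyperparameters, and the max sequence length are assumptions you should adapt; note that the SST-2 test split distributed with GLUE is unlabeled, so accuracy is typically reported on the validation split.

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Load SST-2 (GLUE) and the pre-trained tokenizer and model.
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    # SST-2 stores the text in the "sentence" column.
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

# Hyperparameters here are illustrative, not prescribed by the assignment.
args = TrainingArguments(
    output_dir="sst2-distilbert",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports accuracy on the validation split
```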
Inference
Propose 4 sentences of your own: 2 with positive sentiment and 2 with negative sentiment. Feed each of the
4 sentences to the fine-tuned model, then report and discuss its predictions.
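A possible inference sketch, reusing the tokenizer and model from the fine-tuning step; the four sentences below are hypothetical placeholders to replace with your own, and the label mapping (1 = positive) follows the GLUE SST-2 convention.

```python
import torch

# Placeholder sentences; substitute your own 2 positive / 2 negative examples.
sentences = [
    "The movie was an absolute delight from start to finish.",   # positive
    "I loved the soundtrack and the performances were superb.",  # positive
    "The plot was dull and the acting felt wooden.",             # negative
    "I regret spending two hours on this tedious film.",         # negative
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits
preds = logits.argmax(dim=-1)

for sent, pred in zip(sentences, preds):
    label = "positive" if pred.item() == 1 else "negative"
    print(f"{sent!r} -> {label}")
```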
Comparison with Training from Scratch
Compare the performance (test accuracy and inference results on your own inputs) of fine-tuning against training
from scratch. Use a model with the same architecture as distilbert-base-uncased, but initialize its
parameters randomly rather than loading the pre-trained weights. You
may still use the pre-trained tokenizer.
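One way to obtain a randomly initialized model with the same architecture is to build it from the model configuration instead of the checkpoint; a brief sketch is shown below, with the rest of the training pipeline unchanged.

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Same architecture as distilbert-base-uncased, but with randomly initialized weights.
config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=2)
scratch_model = AutoModelForSequenceClassification.from_config(config)

# The pre-trained tokenizer and data pipeline can be reused as-is;
# only the model passed to the Trainer differs from the fine-tuning setup.
```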