%% Cell type:markdown id: tags:
<a href="https://colab.research.google.com/github/mindgarage/very-deep-learning-wise2324/blob/main/exercises/Exercise_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id: tags:
# Programming Exercise 1: Introduction to Deep Learning with PyTorch
## Very Deep Learning (VDL) - Winter Semester 2023/24
---
### Group Details:
- **Group Name:** \[Enter OLAT Group Name Here\]
### Members:
- \[Participant 1 Name\], \[Matrikel-Nr 1\]
- \[Participant 2 Name\], \[Matrikel-Nr 2\]
- ...
---
**Instructions**: The tasks in this notebook are a part of Sheet 1. Look for `TODO` tags throughout the notebook and complete the sections with missing code. Once done, ensure all outputs are visible and correctly displayed. Save your notebook and submit the `.ipynb` file together with the exercise sheet PDF in a single ZIP file.
%% Cell type:markdown id: tags:
## Introduction:
Welcome to the first programming exercise of the Very Deep Learning course. In this exercise, you will be introduced to PyTorch, one of the most widely used deep learning frameworks in academia and industry. With its dynamic computation graph and vast ecosystem, PyTorch provides an intuitive and versatile platform for building various deep learning models.
The aim of this task is to familiarize you with the basics of [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/starter/introduction.html), guiding you in building, training, and evaluating a simple neural network model. You'll be working with the [FashionMNIST dataset](https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html), a collection of grayscale images representing ten fashion categories. By the end of this exercise, you should have a foundational understanding of neural networks, how they are trained, and how to evaluate their performance.
%% Cell type:code id: tags:
```
# Install dependencies.
# Note: You can execute bash commands inside Google Colab!
!pip install pytorch-lightning # reduces boilerplate in vanilla PyTorch
!pip install torchmetrics # simplifies metric computation
!pip install pandas # for reading training logs from CSV
# We also need `torch` and `torchvision`, but they come pre-installed inside Colab
```
%% Cell type:markdown id: tags:
## Step 1: Data Preparation
The first step in any deep learning pipeline is data preparation. Here, we will download the FashionMNIST dataset and prepare it for training and validation.
%% Cell type:code id: tags:
```
import pytorch_lightning as pl
pl.seed_everything(42) # seed to make randomness deterministic
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST
# Define data transformations
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,)),
])
# Download and load the FashionMNIST dataset
train_dataset = FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = FashionMNIST(root='./data', train=False, download=True, transform=transform)
# TODO: Split train dataset into train and val sets using an 80/20 ratio
# HINT: Use torch.utils.data.random_split
train_dataset, val_dataset = None, None
# TODO: Create train, val and test DataLoaders
# HINT: Set `shuffle=True` for the train set.
batch_size = 64
num_workers = 2
train_loader = None
val_loader = None
test_loader = None
```
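%% Cell type:markdown id: tags:
One possible way to complete the TODOs above (a sketch, not necessarily the intended solution): `random_split` divides the training data 80/20, and the three `DataLoader`s reuse the `batch_size` and `num_workers` defined above, shuffling only the training set.
%% Cell type:code id: tags:
```
# Example sketch for the TODOs above.
from torch.utils.data import random_split

# 80/20 split of the original training set
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

# DataLoaders; only the training loader is shuffled
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)
```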
%% Cell type:markdown id: tags:
## Step 2: Model Definition
For this tutorial, we'll use a pre-trained ResNet-18 model, which is a popular model for image classification tasks.
%% Cell type:code id: tags:
```
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
class FashionMNISTClassifier(nn.Module):
def __init__(self, num_classes=10):
super(FashionMNISTClassifier, self).__init__()
# Create a pre-trained ResNet-18 model from `torchvision` instead of writing
# our own model. This model was trained on the ImageNet dataset, which has
# RGB images with 1000 different classes.
self.model = resnet18(weights=ResNet18_Weights.DEFAULT)
# HINT: Use print(self.model) to see the architecture
# The FashionMNIST dataset has grayscale images and 10 classes.
# TODO: Modify the first layer to accept grayscale images
# TODO: Modify the last layer to fit the number of classes in FashionMNIST
def forward(self, x):
return self.model(x)
```
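%% Cell type:markdown id: tags:
A sketch of how the two TODOs in `FashionMNISTClassifier.__init__` could be completed: swap `conv1` for a single-channel convolution and `fc` for a 10-way linear layer (layer names follow the `torchvision` ResNet implementation). The change is applied to an instance here only for illustration; in the exercise the two assignments would go inside `__init__`.
%% Cell type:code id: tags:
```
# Example sketch for the TODOs above (shown on an instance for illustration).
model = FashionMNISTClassifier()
model.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # grayscale input
model.model.fc = nn.Linear(model.model.fc.in_features, 10)                            # 10 FashionMNIST classes
# Sanity check with a dummy grayscale batch of two 224x224 images
print(model(torch.randn(2, 1, 224, 224)).shape)  # expected: torch.Size([2, 10])
```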
%% Cell type:markdown id: tags:
## Step 3: Training with PyTorch Lightning
PyTorch Lightning simplifies the training loop. Let's create a Lightning module for our classification task.
%% Cell type:code id: tags:
```
import pytorch_lightning as pl
import torch.optim as optim
from torchmetrics import Accuracy
class ClassificationModule(pl.LightningModule):
""" A PyTorch Lightning module for contains both the network and the
training logic, unlike simple PyTorch code we saw in the first tutorial. """
def __init__(self, learning_rate=0.001, num_classes=10):
super(ClassificationModule, self).__init__()
self.save_hyperparameters() # allows access to constructor args with self.hparams.*
self.model = FashionMNISTClassifier(num_classes=num_classes)
# TODO: Define a loss function
# HINT: This is a classification task with multiple classes.
self.loss_fn = None
# TODO: Create an appropriate metric from torchmetrics
self.metric = None
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
images, labels = batch
outputs = self(images) # Forward pass
loss = self.loss_fn(outputs, labels)
self.log('train_loss', loss)
# Note: We do not need to manually call loss.backward() or optim.step()
# when using PyTorch Lightning
return loss
def on_validation_epoch_start(self):
self.metric.reset()
def validation_step(self, batch, batch_idx):
images, labels = batch
outputs = self(images)
loss = self.loss_fn(outputs, labels)
self.log('val_loss', loss, prog_bar=True)
# Update accuracy for current batch
_, preds = torch.max(outputs, 1)
self.metric.update(preds, labels)
return loss
def on_validation_epoch_end(self):
avg_accuracy = self.metric.compute()
self.log('val_accuracy', avg_accuracy, prog_bar=True)
def on_test_epoch_start(self):
self.metric.reset()
def test_step(self, batch, batch_idx):
images, labels = batch
outputs = self(images)
# Note: We do not need to calculate loss when evaluating
# on the test dataset, only the performance metric!
# Update accuracy for current batch
_, preds = torch.max(outputs, 1)
self.metric.update(preds, labels)
return {"test_accuracy": self.metric}
def on_test_epoch_end(self):
avg_accuracy = self.metric.compute()
self.log('test_accuracy', avg_accuracy, prog_bar=True)
def configure_optimizers(self):
optimizer = optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
return optimizer
```
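%% Cell type:markdown id: tags:
For the two TODOs in `ClassificationModule.__init__`, one reasonable choice (a sketch, not the only option) is a cross-entropy loss together with a multiclass accuracy metric from `torchmetrics`. The standalone cell below only demonstrates these calls on dummy data.
%% Cell type:code id: tags:
```
# Example sketch: inside ClassificationModule.__init__ this would read
#   self.loss_fn = nn.CrossEntropyLoss()
#   self.metric = Accuracy(task='multiclass', num_classes=num_classes)
loss_fn = nn.CrossEntropyLoss()
metric = Accuracy(task='multiclass', num_classes=10)
dummy_logits = torch.randn(4, 10)          # batch of 4 samples, 10 class logits
dummy_labels = torch.randint(0, 10, (4,))  # 4 integer class labels
print(loss_fn(dummy_logits, dummy_labels), metric(dummy_logits.argmax(dim=1), dummy_labels))
```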
%% Cell type:markdown id: tags:
## Step 4: Training the Model
Now that our Lightning module is defined, we can easily train our model.
%% Cell type:code id: tags:
```
# Initialize the classifier
classifier = ClassificationModule()
# Create a logger
# HINT: Lightning has many different kinds of loggers, such
# as Tensorboard, WandB, Comet, etc.
# https://lightning.ai/docs/pytorch/stable/api_references.html#loggers
logger = pl.loggers.CSVLogger('./logs')
# Initialize a trainer
trainer = pl.Trainer(
deterministic=True,
accelerator='gpu' if torch.cuda.is_available() else 'cpu',
logger=logger,
max_epochs=10,
)
# Train the model
trainer.fit(classifier, train_loader, val_loader)
```
%% Cell type:markdown id: tags:
## Step 5: Testing the Model
Let's write a simple loop to test the model's predictions.
%% Cell type:code id: tags:
```
# TODO: Test the network performance with test_loader
acc = None
print(f"Accuracy: {(acc[0]['test_accuracy'] * 100):.2f}%")
```
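%% Cell type:markdown id: tags:
A sketch of the test TODO above: `trainer.test` runs the test loop and returns a list with one metrics dictionary per test DataLoader, so the logged `test_accuracy` can be read from the first entry.
%% Cell type:code id: tags:
```
# Example sketch for the TODO above.
acc = trainer.test(classifier, test_loader)
print(f"Accuracy: {(acc[0]['test_accuracy'] * 100):.2f}%")
```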
%% Cell type:code id: tags:
```
import pandas as pd
log_file = './logs/lightning_logs/version_0/metrics.csv'
logs = pd.read_csv(log_file)
print(logs.head())
```
%% Cell type:code id: tags:
```
import matplotlib.pyplot as plt
def plot_metrics(df):
df_train = df[['epoch', 'step', 'train_loss']].dropna()
df_train = df_train.groupby('epoch').apply(lambda x: x.loc[x['step'].idxmax()])[['epoch', 'step', 'train_loss']]
df_val = df[['epoch', 'step', 'val_loss', 'val_accuracy']].dropna()
df_test = df[['epoch', 'step', 'test_accuracy']].dropna()
# Set up the figure and axes
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
# Plot train_loss and val_loss on the first subplot
axs[0].plot(df_train['epoch'], df_train['train_loss'], label='Train Loss', color='blue')
axs[0].plot(df_val['epoch'], df_val['val_loss'], label='Validation Loss', color='red', linestyle='dashed')
axs[0].set_title('Train Loss vs Validation Loss')
axs[0].set_xlabel('Epoch')
axs[0].set_ylabel('Loss')
axs[0].legend()
# Plot val_acc and test_accuracy on the second subplot
axs[1].plot(df_val['epoch'], df_val['val_accuracy'], label='Validation Accuracy', color='green')
axs[1].plot(df_test['epoch'], df_test['test_accuracy'], label='Test Accuracy', color='orange', linestyle='dashed')
axs[1].set_title('Validation Accuracy vs Test Accuracy')
axs[1].set_xlabel('Epoch')
axs[1].set_ylabel('Accuracy')
axs[1].legend()
plt.tight_layout()
plt.show()
plot_metrics(logs)
```
%% Cell type:code id: tags:
```
# !rm -rf ./logs
```
%% Cell type:markdown id: tags:
## Homework/Exercise
1. Complete missing code and run the notebook.
2. Experiment with different network architectures and hyperparameters to try and improve the classification accuracy.
You are allowed to change:
- network architecture, including using other torchvision models
- optimizer
- learning rate
- batch size
- loss function
You can also experiment with [learning rate scheduling](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate). In [Lightning](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.core.LightningModule.html#lightning.pytorch.core.LightningModule), you will add the scheduler in the `configure_optimizers` function by returning something like `return {"optimizer": optimizer, "lr_scheduler": scheduler}`, as sketched below.
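A minimal sketch of such a `configure_optimizers` (the choice of `StepLR` and its parameters is only an example):
```
def configure_optimizers(self):
    optimizer = optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
    # Example scheduler: halve the learning rate every 5 epochs
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
    return {"optimizer": optimizer, "lr_scheduler": scheduler}
```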
However, changing the number of training epochs is **not allowed**. Train your network for **10 epochs**!
The group with the best results gets a small prize :)
Good luck!
\documentclass{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{amsmath}
\usepackage{amssymb}
\title{Very Deep Learning \\ Exercise 1 \\ Group 4}
\author{Niklas Eberts - 409829 \\
Frederick Phillips - 404986 \\
Muhammad Saad Najib - 423595 \\
Rea Fernandes - 426401 \\
Mayank Chetan Ahuja - 426518 \\
Caina Rose Paul - 426291
}
\date{\today}
\begin{document}
\maketitle
\section{Computational Graphs}
Computational graphs are directed graphs that represent the dependencies between the variables and operations within a model or, more generally, a mathematical expression. \\
As an example, given the expression
\begin{center}
$f(x,y,z,w)=4(xyz+max(z,w))$
\end{center}
it has the following computational graph (figure \ref{fig:example_graph}):
\begin{figure}[h]
\centering
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=x] {$a$};
\node[main] (b) [below right of=a] {$b$};
\node[main] (c) [below right of=z] {$c$};
\node[main] (d) [below right of=b] {$d$};
\node[main] (e) [right of=d] {$e$};
\draw[->] (x) -- (a);
\draw[->] (y) -- (a);
\draw[->] (a) -- (b);
\draw[->] (z) -- (b);
\draw[->] (z) -- (c);
\draw[->] (w) -- (c);
\draw[->] (c) -- (d);
\draw[->] (b) -- (d);
\draw[->] (d) -- (e);
\end{tikzpicture}
\caption{Computational graph for $f(x, y, z, w)$}
\label{fig:example_graph}
\end{figure}
\\
where $a = xy$, $b = az = xyz$, $c = \max(z, w)$, $d = b + c$ and $e = 4d$. \\\\
\textbf{Tasks:} Now, consider the following expressions:
\begin{align*}
A &= (x + (y \cdot z)) \cdot w \\
B &= \sqrt{x + y + z^3} - \log{(\frac{x}{w})} \hspace{0.75cm} \text{natural log} \\
C &= w + \exp((x+y) \cdot z)
\end{align*}
\subsection{Introduce intermediate variables for all three expressions such that each variable represents a single mathematical operation.}
\subsubsection{Expression A}
$A = (x + (y \cdot z)) \cdot w$ \\\\
$a = y \cdot z$ \\
$b = x + a$ \\
$c = b \cdot w$
\subsubsection{Expression B}
$B = \sqrt{x + y + z^3} - \log{(\frac{x}{w})} \hspace{0.75cm} \text{natural log}$ \\\\
$a = z \cdot z$ \\
$b = a \cdot z$ \\
$c = x + y$ \\
$d = c + b$ \\
$e = \sqrt{d}$ \\
$f = \frac{x}{w}$ \\
$g = \log(f)$ \\
$h = e - g$
\subsubsection{Expression C}
$C = w + \exp((x+y) \cdot z)$ \\\\
$a = x + y$ \\
$b = a \cdot z$ \\
$c = \exp(b)$ \\
$d = w + c$
\subsection{Using your chosen intermediate variables, draw computational graphs for all three expressions.}
\subsubsection{Expression A}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=y, above right of=z] {$a$};
\node[main] (b) [above right of=a] {$b$};
\node[main] (c) [below right of=b] {$c$};
\draw[->] (y) -- (a);
\draw[->] (z) -- (a);
\draw[->] (a) -- (b);
\draw[->] (x) -- (b);
\draw[->] (b) -- (c);
\draw[->] (w) -- (c);
\end{tikzpicture}
\end{center}
\subsubsection{Expression B}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [above of=x] {$w$};
\node[main] (a) [right of=z] {$a$};
\node[main] (b) [above right of=a] {$b$};
\node[main] (c) [below right of=x] {$c$};
\node[main] (d) [above right of=b] {$d$};
\node[main] (e) [right of=d] {$e$};
\node[main] (f) [above right of=x, below right of=w] {$f$};
\node[main] (g) [right of=f] {$g$};
\node[main] (h) [above right of=e] {$h$};
\draw[->] (z) to [bend right=20](a);
\draw[->] (z) to [bend left=20](a);
\draw[->] (a) -- (b);
\draw[->] (z) to [bend left=20] (b);
\draw[->] (x) -- (c);
\draw[->] (y) -- (c);
\draw[->] (c) -- (d);
\draw[->] (b) -- (d);
\draw[->] (d) -- (e);
\draw[->] (w) -- (f);
\draw[->] (x) -- (f);
\draw[->] (f) -- (g);
\draw[->] (e) -- (h);
\draw[->] (g) -- (h);
\end{tikzpicture}
\end{center}
\subsubsection{Expression C}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=x, above right of=y] {$a$};
\node[main] (b) [below right of=a] {$b$};
\node[main] (c) [right of=b] {$c$};
\node[main] (d) [right of=w, below right of=c] {$d$};
\draw[->] (x) -- (a);
\draw[->] (y) -- (a);
\draw[->] (a) -- (b);
\draw[->] (z) -- (b);
\draw[->] (b) -- (c);
\draw[->] (w) -- (d);
\draw[->] (c) -- (d);
\end{tikzpicture}
\end{center}
\subsection{Given that $x = 3, y = 5, z = -1$ and $w = 1$. Perform forward propagation on any two of the three computational graphs you created in (ii).}
\subsubsection{Expression A}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=y, above right of=z] {$a$};
\node[main] (b) [above right of=a] {$b$};
\node[main] (c) [below right of=b] {$c$};
\draw[->] (y) -- node[midway, above right, sloped, pos=0.3] {5} (a);
\draw[->] (z) -- node[midway, above, sloped, pos=0.3] {-1} (a);
\draw[->] (a) -- node[midway, above, sloped, pos=0.4] {-5} (b);
\draw[->] (x) -- node[midway, above right, sloped, pos=0.3] {3} (b);
\draw[->] (b) -- node[midway, above, sloped, pos=0.4] {-2} (c);
\draw[->] (w) -- node[midway, below, sloped, pos=0.5] {1} (c);
\draw[->] (c) --++(0:1cm) node[midway, above]{-2};
\end{tikzpicture}
\end{center}
\subsubsection{Expression C}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=x, above right of=y] {$a$};
\node[main] (b) [below right of=a] {$b$};
\node[main] (c) [right of=b] {$c$};
\node[main] (d) [right of=w, below right of=c] {$d$};
\draw[->] (x) -- node[midway, above right, sloped, pos=0.3] {3} (a);
\draw[->] (y) -- node[midway, above, sloped, pos=0.3] {5} (a);
\draw[->] (a) -- node[midway, above, sloped, pos=0.4] {8} (b);
\draw[->] (z) -- node[midway, above right, sloped, pos=0.3] {-1} (b);
\draw[->] (b) -- node[midway, above, sloped, pos=0.4] {-8} (c);
\draw[->] (w) -- node[midway, above, sloped, pos=0.5] {1} (d);
\draw[->] (c) -- node[midway, above, sloped, pos=0.5] {0.0003} (d);
\draw[->] (d) --++(0:2cm) node[midway, above]{1.0003};
\end{tikzpicture}
\end{center}
\subsection{Perform backpropagation on the same two computational graphs you chose in (iii).}
\subsubsection{Expression A}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=y, above right of=z] {$a$};
\node[main] (b) [above right of=a] {$b$};
\node[main] (c) [below right of=b] {$c$};
\draw[<-] (y) -- node[midway, above right, sloped, pos=0.3] {-1} (a);
\draw[<-] (z) -- node[midway, above, sloped, pos=0.3] {5} (a);
\draw[<-] (a) -- node[midway, above, sloped, pos=0.4] {1} (b);
\draw[<-] (x) -- node[midway, above right, sloped, pos=0.3] {1} (b);
\draw[<-] (b) -- node[midway, above, sloped, pos=0.4] {1} (c);
\draw[<-] (w) -- node[midway, below, sloped, pos=0.5] {-2} (c);
\draw[->] (x) --++(0:-1cm) node[midway, above]{1};
\draw[->] (y) --++(0:-1cm) node[midway, above]{-1};
\draw[->] (z) --++(0:-1cm) node[midway, above]{5};
\draw[->] (w) --++(0:-1cm) node[midway, above]{-2};
\end{tikzpicture}
\end{center}
\subsubsection{Expression C}
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}, node distance=2cm]
\node[main] (x) {$x$};
\node[main] (y) [below of=x] {$y$};
\node[main] (z) [below of=y] {$z$};
\node[main] (w) [below of=z] {$w$};
\node[main] (a) [below right of=x, above right of=y] {$a$};
\node[main] (b) [below right of=a] {$b$};
\node[main] (c) [above right of=b] {$c$};
\node[main] (d) [right of=w, below right of=c] {$d$};
\draw[<-] (x) -- node[midway, above right, sloped, pos=0.3] {-2.718} (a);
\draw[<-] (y) -- node[midway, above, sloped, pos=0.3] {-2.718} (a);
\draw[<-] (a) -- node[midway, above, sloped, pos=0.5] {-2.718} (b);
\draw[<-] (z) -- node[midway, above right, sloped, pos=0.2] {21.744} (b);
\draw[<-] (b) -- node[midway, above, sloped, pos=0.5] {2.718} (c);
\draw[<-] (w) -- node[midway, above, sloped, pos=0.5] {1} (d);
\draw[<-] (c) -- node[midway, above, sloped, pos=0.5] {1} (d);
\draw[->] (x) --++(0:-2cm) node[midway, above]{-2.718};
\draw[->] (y) --++(0:-2cm) node[midway, above]{-2.718};
\draw[->] (z) --++(0:-2cm) node[midway, above]{21.744};
\draw[->] (w) --++(0:-2cm) node[midway, above]{1};
\end{tikzpicture}
\end{center}
\section{Vanishing and Exploding Gradients}
Consider a network with input $x \in \mathbb{R}$, 4 hidden layers, each having only one node, and one output $y \in \mathbb{R}$:
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}]
\node[main] (x) {$x$};
\node[main] (a1) [right of=x] {$a_1$};
\node[main] (a2) [right of=a1] {$a_2$};
\node[main] (a3) [right of=a2] {$a_3$};
\node[main] (a4) [right of=a3] {$a_4$};
\node[main] (y) [right of=a4] {$y$};
\draw[->] (x) -- (a1);
\draw[->] (a1) -- (a2);
\draw[->] (a2) -- (a3);
\draw[->] (a3) -- (a4);
\draw[->] (a4) -- (y);
\end{tikzpicture}
\end{center}
In the network, each node corresponds to the sigmoid of the preceding node multiplied with some weight: $a_i=\sigma(w_i \cdot a_{i-1}),\ i=1,\dots,5$, where $a_0$ corresponds to the input $x$ and $a_5$ corresponds to the output $y$. \\\\
The sigmoid function is given by
\begin{equation*}
\sigma(x)=\frac{1}{1+e^{-x}}
\end{equation*}
\subsection{What is the vanishing gradients problem?}
When gradients become extremely small during backpropagation, the partial derivatives guiding the parameter updates approach zero. As a result, the weights of the neural network remain largely unchanged, so little meaningful learning takes place. This phenomenon is referred to as the \textbf{vanishing gradient problem} and can hinder the training of deep neural networks.
\subsection{Calculate the derivative $\sigma'$ of the sigmoid function. Determine the best possible lower and upper bounds for the derivative $\sigma'$.}
\begin{equation*}
\begin{split}
\sigma (x) &= \frac{1}{1+e^{-x}} \\
\sigma' (x) &= \frac{d}{dx} \frac{1}{1+e^{-x}} \\
&= \frac{d}{dx} (1+e^{-x})^{-1} \\
&= -(1+e^{-x})^{-2} \cdot \frac{d}{dx} (1+e^{-x}) \\
&= -(1+e^{-x})^{-2} \cdot (\frac{d}{dx}(1) + \frac{d}{dx}(e^{-x})) \\
&= -(1+e^{-x})^{-2} \cdot (e^{-x} \cdot \frac{d}{dx}(-x)) \\
&= -(1+e^{-x})^{-2} \cdot (e^{-x} \cdot (-1)) \\
&= (1+e^{-x})^{-2} \cdot (e^{-x}) \\
&= \frac{e^{-x}}{(1+e^{-x})^2} \\
&= \frac{1}{(1+e^{-x})} \cdot \frac{e^{-x}}{(1+e^{-x})} \\
&= \frac{1}{(1+e^{-x})} \cdot \frac{e^{-x}+1-1}{(1+e^{-x})} \\
&= \frac{1}{(1+e^{-x})} \cdot (\frac{(1+e^{-x})}{(1+e^{-x})}-\frac{1}{(1+e^{-x})}) \\
&= \frac{1}{(1+e^{-x})} \cdot (1-\frac{1}{(1+e^{-x})}) \\
&= \sigma(x) \cdot (1 - \sigma (x))
\end{split}
\end{equation*}
The best possible bounds are $0 < \sigma'(x) \leq 0.25$: the derivative is always positive but approaches $0$ as $x \to \pm\infty$, and it attains its maximum value $\sigma'(0) = 0.25$ at $x = 0$.
\subsection{By using the chain rule, calculate the gradient $\frac{\delta y}{\delta x}$ and express your results using the weights $w_i$ and the derivative $\sigma'$}
\begin{equation*}
\begin{split}
\frac{dy}{dx} &= \frac{dy}{da_4} \cdot \frac{da_4}{da_3} \cdot \frac{da_3}{da_2} \cdot \frac{da_2}{da_1} \cdot \frac{da_1}{dx} \\
&= \frac{da_5}{da_4} \cdot \frac{da_4}{da_3} \cdot \frac{da_3}{da_2} \cdot \frac{da_2}{da_1} \cdot \frac{da_1}{da_0} \\\\
a_1 &= \sigma(w_1 \cdot a_0) \\
a_2 &= \sigma(w_2 \cdot a_1) \\
a_3 &= \sigma(w_3 \cdot a_2) \\
a_4 &= \sigma(w_4 \cdot a_3) \\
a_5 &= \sigma(w_5 \cdot a_4) \\\\
\frac{da_5}{da_4} &= \sigma'(w_5 \cdot a_4) \cdot w_5 \\
\frac{da_4}{da_3} &= \sigma'(w_4 \cdot a_3) \cdot w_4 \\
\frac{da_3}{da_2} &= \sigma'(w_3 \cdot a_2) \cdot w_3 \\
\frac{da_2}{da_1} &= \sigma'(w_2 \cdot a_1) \cdot w_2 \\
\frac{da_1}{da_0} &= \sigma'(w_1 \cdot a_0) \cdot w_1 \\\\
\frac{dy}{dx} &= \prod^5_{i=1} \sigma'(w_i \cdot a_{i-1}) \cdot w_i
\end{split}
\end{equation*}
\subsection{Do the upper or lower bounds of $\sigma'$ contribute to the vanishing gradients problem? If so, how?}
Yes, the upper bound does. Because $\sigma'$ is at most $0.25$, the repeated multiplication of sigmoid derivatives across layers produces gradients that tend to become extraordinarily small, approaching $0$. This phenomenon, known as gradient saturation, particularly affects sigmoid activations when they are used across multiple layers: in a deep network with several sigmoid layers, the repeated multiplication of small gradients leads to the vanishing gradient problem and hinders effective training of the model.
\subsection{What is meant by exploding gradients? Why do we not want the gradients to explode? When can sigmoid activations have an exploding gradient?}
\begin{itemize}
\item Exploding gradients are the opposite of vanishing gradients: the gradients become too large. More precisely, large error gradients accumulate and result in very large updates to the neural network weights during training.
\item Large changes in weights at each iteration prevent gradient descent from finding minima.
\item Sigmoid activations can lead to exploding gradients when the initial weights are very large, since the factors $w_i \cdot \sigma'$ in the chain rule can then exceed $1$.
\end{itemize}
\end{document}
\documentclass{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{amsmath}
\usepackage{amssymb}
\title{Very Deep Learning \\ Exercise 2 \\ Group 4}
\author{Niklas Eberts - 409829 \\
Frederick Phillips - 404986 \\
Muhammad Saad Najib - 423595 \\
Rea Fernandes - 426401 \\
Mayank Chetan Ahuja - 426518 \\
Caina Rose Paul - 426291
}
\date{\today}
\begin{document}
\maketitle
\section{Convolutions [3+1+1+1+1+1=8]}
Denote by $X$ the following $5 \times 5$ input image with one channel:
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
8 & 2 & 9 & 8 & 3 \\
\hline
3 & 4 & 6 & 2 & 7 \\
\hline
5 & 9 & 3 & 9 & 7 \\
\hline
6 & 1 & 3 & 6 & 8 \\
\hline
5 & 1 & 6 & 8 & 1 \\
\hline
\end{tabular}
\end{center}
Denoting a $3 \times 3$ filter by $K$. Assume we have the following $3 \times 3$ filters:
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
1 & 0 & -1 \\
\hline
2 & 0 & -2 \\
\hline
1 & 0 & -1 \\
\hline
\end{tabular}
\hspace{1cm}
\begin{tabular}{|c|c|c|}
\hline
1 & 2 & 1 \\
\hline
0 & 0 & 0 \\
\hline
-1 & -2 & -1 \\
\hline
\end{tabular}
\hspace{1cm}
\begin{tabular}{|c|c|c|}
\hline
0 & 1 & 0 \\
\hline
1 & -4 & 1 \\
\hline
0 & 1 & 0 \\
\hline
\end{tabular}
\end{center}
We adopt the convention that for a 2D array $X$ (or $K$), $X_{i,j}$ denotes the element in \textbf{row} $i$ \textbf{and column} $j$, \textbf{counting from the top left corner}. In particular, $X_{3,4}=9$ in our example. Furthermore, unless otherwise stated, we assume zero padding, i.e. $X_{i,j} \equiv 0$ if $i < 0$ or $i >$ size($X$) (similarly for $j$).
\subsection{Apply all three filters (by cross-correlation) to the above dataset, i.e.:
\begin{equation*}
Y_{i,j}=(K \star X)_{i,j}=\displaystyle\sum^1_{m=-1}\sum^1_{n=-1}K_{m+2,n+2}X_{i+m,j+n}
\end{equation*} where $Y$ is the output. Use 'same' padding and a stride of one.}
Solution
\textbf{Filter $K_1$:}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
-8 & -5 & -10 & 11 & 18 \\
\hline
-19 & -5 & 7 & 0 & 21 \\
\hline
-23 & 4 & -3 & -14 & 26 \\
\hline
-12 & 7 & -17 & -9 & 29 \\
\hline
-3 & -4 & -19 & 5 & 22 \\
\hline
\end{tabular}
\end{center}
\textbf{Filter $K_2$:}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
-10 & -17 & -18 & -17 & -16 \\
\hline
-1 & -5 & 4 & 0 & -9 \\
\hline
-3 & 6 & 5 & -6 & -6 \\
\hline
8 & 13 & 3 & 5 & 13 \\
\hline
13 & 11 & 13 & 23 & 22 \\
\hline
\end{tabular}
\end{center}
\textbf{Filter $K_3$:}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
-27 & 13 & -20 & 18 & 3 \\
\hline
5 & 4 & -6 & 22 & -16 \\
\hline
-2 & -23 & 4 & -18 & -4 \\
\hline
-13 & 15 & 4 & 4 & -18 \\
\hline
-13 & 8 & -12 & -19 & 12 \\
\hline
\end{tabular}
\end{center}
\subsection{Look at the structure of the filters. What do they do?}
Solution
All of them are Edge Detectors.
\textbf{Filter $K_1$:}
\begin{itemize}
\item It is a Vertical Sobel Filter.
\item Emphasizes changes in intensity from left to right.
\end{itemize}
\textbf{Filter $K_2$:}
\begin{itemize}
\item It is a Horizontal Sobel Filter.
\item Emphasizes changes in intensity from top to bottom.
\end{itemize}
\textbf{Filter $K_3$:}
\begin{itemize}
\item It is a Laplacian filter.
\item Emphasizes regions of rapid intensity change in all directions.
\end{itemize}
\subsection{What is the difference between 'valid' and 'same' padding?}
Solution
\textbf{VALID Padding:}
\begin{itemize}
\item No padding is added to the input image.
\item The filter window always stays inside the input image.
\item Loss of information may occur, especially on the right and bottom edges.
\item The size of the output image is less than or equal to the size of the input image.
\end{itemize}
\textbf{SAME Padding:}
\begin{itemize}
\item The input is zero-padded (by roughly half the filter size, $\lfloor k/2 \rfloor$, on each side) so that the filter can be centred on every input element.
\item The output size is the same as the input size when the stride is 1.
\item "SAME" padding is commonly used because it keeps the spatial dimensions constant across layers, which is convenient when designing and training models.
\end{itemize}
\subsection{Why do we prefer Convolutional Neural Networks (CNNs) over a Multi-Layer Perceptron (MLP) for image data?}
An MLP takes a flattened vector as input, which discards the 2D structure of an image, whereas a CNN operates directly on the image tensor. Convolutional layers therefore capture the spatial relationship between neighbouring pixels and share their weights across locations, which both improves performance on image data and greatly reduces the number of parameters. Hence, CNNs work better for complex images.
\subsection{Given an input image of size $H \times H$, a convolutional filter of size $k$, padding $p$ and stride $s$. Write down the formula for calculating the dimensions of the output of the convolution operation.}
\begin{equation*}
H_{\text{\small out}} = \frac{{H - k + 2p}}{s} + 1
\end{equation*}\text{Hence the dimensions will be }H_{\text{\small out}} \times H_{\text{\small out}}
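As a quick check with the values from Section 1 ($H = 5$, $k = 3$, stride $s = 1$ and $p = 1$ for `same' padding):
\begin{equation*}
H_{\text{\small out}} = \frac{5 - 3 + 2 \cdot 1}{1} + 1 = 5,
\end{equation*}
which matches the $5 \times 5$ outputs computed there.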
\subsection{Given an input of size $(H,W,C)$ convolved with $N$ Conv2D filters of size $k$, what are the number of trainable parameters in this convolutional layer? Write down the formula. Assume both weights and biases are present.}
\[
((k \times k \times C) + 1) \times N
\]
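For example, a layer consisting of the three $3 \times 3$ filters from Section 1 applied to the single-channel input ($k = 3$, $C = 1$, $N = 3$) has $((3 \times 3 \times 1) + 1) \times 3 = 30$ trainable parameters.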
\section{Loss Functions and Optimization [3+4=7]}
\subsection{Given $\hat{y}=\mathrm{softmax}(z)$ with
\begin{equation*}
\hat{y}_i=\frac{e^{z_i}}{\sum^N_{k=1}e^{z_k}}
\end{equation*} where $\hat{y} \in \mathbb{R}^N$ and $N$ is the number of classes of a classification problem. Calculate $\frac{\partial \hat{y}_i}{\partial z_j}$.}
Solution
\begin{equation*}
\begin{split}
\hat{y} &= S(z) \\
S(z_i) &= \frac{e^{z_i}}{\sum^N_{k=1}e^{z_k}}
\end{split}
\end{equation*}
For \(N = 3\), the softmax function is given by:
\[
S(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}
\]
Now, let's find the partial derivative with respect to \(z_j\):
The detailed computation for \(j = 1\) is as follows:
\[
\begin{aligned}
\frac{\partial S(z_1)}{\partial z_1} &= \frac{\partial}{\partial z_1}\left(\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}\right) \\
&= \frac{e^{z_1} \cdot (e^{z_1} + e^{z_2} + e^{z_3}) - e^{z_1} \cdot e^{z_1}}{(e^{z_1} + e^{z_2} + e^{z_3})^2} \\
&= \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \cdot \frac{e^{z_1} + e^{z_2} + e^{z_3} - e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \\
&= S(z_1) \cdot \left(1 - S(z_1)\right)
\end{aligned}
\]
\[
\begin{aligned}
\frac{\partial S(z_2)}{\partial z_1} &= \frac{\partial}{\partial z_1}\left(\frac{e^{z_2}}{\sum_{k=1}^N e^{z_k}}\right) \\
&= \frac{0 - e^{z_2} \cdot e^{z_1}}{\left(\sum_{k=1}^N e^{z_k}\right)^2} \\
&= -\frac{e^{z_1}e^{z_2}}{\left(\sum_{k=1}^N e^{z_k}\right)^2} \\
&= -\frac{e^{z_1}}{\sum_{k=1}^N e^{z_k}} \cdot \frac{e^{z_2}}{\sum_{k=1}^N e^{z_k}} \\
&= -S(z_1)S(z_2)
\end{aligned}
\]
\[
\begin{aligned}
\frac{\partial S(z_3)}{\partial z_1} &= \frac{\partial}{\partial z_1}\left(\frac{e^{z_3}}{\sum_{k=1}^N e^{z_k}}\right) \\
&= \frac{0 - e^{z_3} \cdot e^{z_1}}{\left(\sum_{k=1}^N e^{z_k}\right)^2} \\
&= -\frac{e^{z_1}e^{z_3}}{\left(\sum_{k=1}^N e^{z_k}\right)^2} \\
&= -\frac{e^{z_1}}{\sum_{k=1}^N e^{z_k}} \cdot \frac{e^{z_3}}{\sum_{k=1}^N e^{z_k}} \\
&= -S(z_1)S(z_3)
\end{aligned}
\]
So from this, we can see a pattern
\begin{equation*}
\begin{aligned}
\frac{\partial S(z_i)}{\partial z_j} &= \begin{cases}
S(z_i) \cdot (1 - S(z_i)), & \text{if } i = j \\
-S(z_i)S(z_j), & \text{if } i \neq j
\end{cases}
\end{aligned}
\end{equation*}
\begin{equation*}
\begin{aligned}
\frac{\partial \hat{y}_i}{\partial z_j} &= \begin{cases}
\hat{y}_i \cdot (1 - \hat{y}_j), & \text{if } i = j \\
-\hat{y}_i\hat{y}_j, & \text{if } i \neq j
\end{cases}
\end{aligned}
\end{equation*}
\subsection{Given $\hat{y}=\mathrm{softmax}(z)$, a target vector $y \in \mathbb{R}^N$ and the cross-entropy loss function defined as
\begin{equation*}
L(y, \hat{y})=-\displaystyle\sum^N_{k=1}y_k\log\hat{y}_k
\end{equation*} Calculate $\frac{\partial L}{\partial z_i}$ and simplify your results as far as possible. \textit{Hint: Make use of the chain rule, note that $y_i$ are constants and $\sum_iy_i=1$.}}
\begin{align*}
\frac{\partial L}{\partial z_i} &= -\sum_{k} y_k \frac{\partial \log(\hat{y}_k)}{\partial z_i} \\
&= -\sum_{k} y_k \frac{\partial \log(\hat{y}_k)}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial z_i} \\
&= -\sum_{k} y_k \frac{1}{\hat{y}_k} \frac{\partial \hat{y}_k}{\partial z_i} \\
&= -\sum_{k} \frac{y_k}{\hat{y}_k} \frac{\partial \hat{y}_k}{\partial z_i} \\
&= - y_i \left(\frac{1}{\hat{y}_i}\right) \hat{y}_i (1 - \hat{y}_i) - \sum_{k \neq i} y_k \left(\frac{1}{\hat{y}_k}\right) (-\hat{y}_k \hat{y}_i) \\
&= - y_i(1 - \hat{y}_i) + \sum_{k \neq i} y_k \hat{y}_i \\
&= - y_i + y_i \hat{y}_i + \sum_{k \neq i} y_k \hat{y}_i\\
&= - y_i + \hat{y}_i \Big(y_i + \sum_{k \neq i} y_k\Big)\\
&= \hat{y}_i - y_i \qquad \text{since } \textstyle\sum_{k} y_k = 1
\end{align*}
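As an illustrative check (values chosen here only as an example): for $N = 3$, $\hat{y} = (0.7, 0.2, 0.1)$ and the one-hot target $y = (0, 1, 0)$, the gradient is simply $\frac{\partial L}{\partial z} = \hat{y} - y = (0.7, -0.8, 0.1)$.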
\section{Loss Functions with Regularization [5]}
Consider the input dataset $X \in \mathbb{R}^{n \times d}$ with $n$ samples of size $d$, a target vector $y \in \mathbb{R}^n$, a weight vector $w \in \mathbb{R}^d$ and a prediction $\hat{y}=Xw$. The \textbf{regularized} mean squared error (MSE) is given by: \\
\begin{equation*}
\begin{split}
L(y, \hat{y}) &= \frac{1}{n} \displaystyle\sum^n_{i=1}(\hat{y}_i-y_i)^2+\lambda\displaystyle\sum^d_{i=1}w^2_i \\
&= \frac{1}{n} \displaystyle\sum^n_{i=1}(\hat{y}_i-y_i)^2+\lambda \|w\|^2_2
\end{split}
\end{equation*}
where $\|w\|_2$ is the Euclidean norm of $w$ and $\lambda > 0$ is some given regularization parameter.
\subsection{Determine in closed form the vector $w$ that minimizes $L$. \\
\textit{Hint: You may find the Matrix Cookbook (available online) useful.}}
In order to find the closed-form solution, we differentiate $L$ with respect to $w$, set the derivative to $0$, and then solve for $w$.
\begin{equation*}
\begin{split}
\frac{\partial L}{\partial w} &= \frac{2}{n}X^T(Xw-y)+2\lambda w \\
0 &= \frac{2}{n}X^T(Xw-y)+2\lambda w \\
0 &= \frac{2}{n}(X^TX)w -\frac{2}{n}X^Ty+2\lambda w \\
\frac{2}{n}X^Ty &= (\frac{2}{n}(X^TX)+2\lambda I)w \\
X^Ty &= (X^TX + \lambda n I)w \\
w &= (X^TX + \lambda n I)^{-1}X^Ty
\end{split}
\end{equation*}
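As a sanity check, setting $\lambda = 0$ recovers the ordinary least-squares solution $w = (X^TX)^{-1}X^Ty$, while a larger $\lambda$ shrinks the weights by adding $\lambda n I$ to $X^TX$ before inversion.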
\section{CIFAR Challenge [5]}
Follow the instructions in the Jupyter notebook Task\_2.4.ipynb to complete the CIFAR competition using PyTorch. Your task is to fill in the missing code annotated with TODO tags in the comments, and get an accuracy of at least 70\%.
\section{Depthwise Separable Convolutions [5]}
In this task, you will explore depth-wise separable convolutions, which is a special type of convolution. Follow the instructions in the Jupyter notebook Task\_2.5.ipynb and fill in the missing code annotated with TODO tags.
\end{document}
\documentclass{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{amsmath}
\usepackage{amssymb}
\title{Very Deep Learning \\ Exercise 3 \\ Group 4}
\author{Niklas Eberts - 409829 \\
Frederick Phillips - 404986 \\
Muhammad Saad Najib - 423595 \\
Rea Fernandes - 426401 \\
Mayank Chetan Ahuja - 426518 \\
Caina Rose Paul - 426291
}
\date{\today}
\begin{document}
\maketitle
\section{Backpropagation through Time [3+4+3+3=13]}
Consider the following RNN:
\begin{center}
\begin{tikzpicture}[main/.style = {draw, circle}, node distance=2cm]
\node[main] (ht-1) {$h_{t-1}$};
\node[main] (ht) [right of=ht-1] {$h_t$};
\node[main] (ht+1) [right of=ht] {$h_{t+1}$};
\draw[->] (ht-1) -- node[midway, above right, sloped, pos=0.3] {w} (ht);
\draw[->] (ht) -- node[midway, above, sloped, pos=0.3] {w} (ht+1);
\draw[->] (ht+1) --++(0:1cm) node[midway, above] {w};
\draw[<-] (ht-1) --++(0:-1cm) node[midway, above]{};
\draw[->] (ht-1.south) +(0,-1) -- +(0:0cm) node[midway, right] {$U$};
\draw[->] (ht.south) +(0,-1) -- +(0:0cm) node[midway, right] {$U$};
\draw[->] (ht+1.south) +(0,-1) -- +(0:0cm) node[midway, right] {$U$};
\draw[<-] (ht-1.south) +(0,0) -- +(90:-1cm) node[below] {$x_{t-1}$};
\draw[<-] (ht.south) +(0,0) -- +(90:-1cm) node[below] {$x_t$};
\draw[<-] (ht+1.south) +(0,0) -- +(90:-1cm) node[below] {$x_{t+1}$};
\draw[->] (ht-1.north) +(0,0) -- +(90:1cm) node[above] {$y_{t-1}$};
\draw[->] (ht.north) +(0,0) -- +(90:1cm) node[above] {$y_t$};
\draw[->] (ht+1.north) +(0,0) -- +(90:1cm) node[above] {$y_{t+1}$};
\end{tikzpicture}
\end{center}
Here, $h_0$ denotes the first hidden state which the user initializes, then for each $t \geq 1$, the
hidden state $h_t$ is given by:
\begin{equation*}
h_t = \sigma (Wh_{t-1}+Ux_t), \sigma (z) = \frac{1}{1 + e^{-z}}
\end{equation*}
Let $L$ be a loss function defined as the sum over the losses $L_t$ at every time step until some time $T \geq 1$, i.e. $L = \sum^T_{t=1}L_t$, where $L_t$ is a scalar loss depending on $h_t$. \\
In the following, we want to derive the gradient of this loss function with respect to the parameter $W$.
\subsection{Using the multivariate chain rule, show that \\ $\frac{\partial L}{\partial W} = \sum^T_{t=1}\sum^t_{k=1}\frac{\partial L_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$.}
Solution
1. Start with the expression for $L$:
\[
L = \sum_{t=1}^{T} L_t
\]
2. Apply the chain rule for each $L_t$:
\[
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}
\]
3. Sum over all time steps $T$:
\[
\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}
\]
This proves the claimed identity via the multivariate chain rule.
\subsection{Given a function $f(h) = \sigma(Wh)$ where $h \in \mathbb{R}^d$ and \\ $W \in \mathbb{R}^{n \times d}$. Here the sigmoid function $\sigma$ is applied element-wise on a vector. Show that \\ $\frac{\partial f}{\partial h} = \text{diag}(\sigma'(Wh))W \in \mathbb{R}^{n \times d}$, where $\frac{\partial f}{\partial h}$ denotes the Jacobian matrix of $f$ with respect to $h$, and diag$(\sigma'(Wh))$ is the diagonal matrix of the vector $\sigma'(Wh)$.}
Solution
\begin{equation*}
\begin{aligned}
f(h) &= \sigma(Wh), \qquad f_i(h) = \sigma\big((Wh)_i\big) = \sigma\Big(\sum_{j=1}^d W_{ij}h_j\Big) \\
\frac{\partial f_i}{\partial h_j} &= \sigma'\big((Wh)_i\big) \cdot \frac{\partial (Wh)_i}{\partial h_j} = \sigma'\big((Wh)_i\big) \cdot W_{ij}
\end{aligned}
\end{equation*}
Row $i$ of the Jacobian is therefore row $i$ of $W$ scaled by $\sigma'\big((Wh)_i\big)$, which is exactly the effect of multiplying $W$ from the left by the diagonal matrix $\text{diag}(\sigma'(Wh))$. Thus
\begin{equation*}
\frac{\partial f}{\partial h} = \text{diag}(\sigma'(Wh)) \cdot W
\end{equation*}
\subsection{Write down $\frac{\partial L}{\partial W}$ as expanded sum for $T = 3$. Use the chain rule to show that we will need to multiply $T - 1$ matrices of the form $(\text{diag}(\sigma')W)$.}
\[
\begin{aligned}
\frac{\partial L}{\partial W} ={}& \frac{\partial L_1}{\partial h_1} \frac{\partial h_1}{\partial W}
+ \frac{\partial L_2}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W}
+ \frac{\partial L_2}{\partial h_2} \frac{\partial h_2}{\partial W} \\
&+ \frac{\partial L_3}{\partial h_3} \frac{\partial h_3}{\partial h_1} \frac{\partial h_1}{\partial W}
+ \frac{\partial L_3}{\partial h_3} \frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial W}
+ \frac{\partial L_3}{\partial h_3} \frac{\partial h_3}{\partial W}
\end{aligned}
\]
From the previous subsection we know that
\begin{align}
h_t & = \sigma(Wh_{t-1} + Ux_t) \\
\frac{\partial h_t}{\partial h_{t-1}} & = \text{diag}\big(\sigma'(Wh_{t-1} + Ux_t)\big)\,W
\end{align}
The longest chain appears in the term with $t = 3$ and $k = 1$:
\[
\frac{\partial h_3}{\partial h_1} = \frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}
= \big(\text{diag}(\sigma')W\big)\big(\text{diag}(\sigma')W\big),
\]
i.e.\ a product of $T - 1 = 2$ matrices of the form $\text{diag}(\sigma')W$.
\subsection{Let $\text{diag}(\sigma')W = A =
\begin{pmatrix}
0.58 & -0.24 \\
-0.24 & 0.72
\end{pmatrix}$. Its eigendecomposition is:
\begin{equation*}
A = Q\Lambda Q^{-1} =
\begin{pmatrix}
0.8 & -0.6 \\
0.6 & 0.8
\end{pmatrix}
\begin{pmatrix}
0.4 & 0 \\
0 & 0.9
\end{pmatrix}
\begin{pmatrix}
0.8 & 0.6 \\
-0.6 & 0.8
\end{pmatrix}
\end{equation*}
Calculate $A^{30}$. What do you observe? What happens in general if the absolute value of all eigenvalues of $A$ is smaller than $1$? What happens if the absolute value of any eigenvalue of $A$ is larger than $1$? What if all eigenvalues are $1$?}
\begin{equation*}
A^2 = Q \Lambda Q^{-1} Q \Lambda Q^{-1}
\end{equation*}
where $Q^{-1}Q = I$. Therefore,
\begin{equation*}
\begin{split}
A^{30} &= Q \Lambda^{30} Q^{-1} \\
A^{30} &=
\begin{bmatrix}
0.8 & -0.6 \\
0.6 & 0.8
\end{bmatrix}
\begin{bmatrix}
0.4^{30} & 0 \\
0 & 0.9^{30}
\end{bmatrix}
\begin{bmatrix}
0.8 & 0.6 \\
-0.6 & 0.8
\end{bmatrix} \\
&= \begin{bmatrix}
0.8 & -0.6 \\
0.6 & 0.8
\end{bmatrix}
\begin{bmatrix}
1.1529 \times 10^{-12} & 0 \\
0 & 0.04239
\end{bmatrix}
\begin{bmatrix}
0.8 & 0.6 \\
-0.6 & 0.8
\end{bmatrix} \\
&= \begin{bmatrix}
9.2234 \times 10^{-13} & -0.02543 \\
6.91753 \times 10^{-13} & 0.03391
\end{bmatrix}
\begin{bmatrix}
0.8 & 0.6 \\
-0.6 & 0.8
\end{bmatrix} \\
&= \begin{bmatrix}
0.01526 & -0.020344 \\
-0.02035 & 0.027128
\end{bmatrix}
\end{split}
\end{equation*}
Observation: the entries of $A^{30}$ are much smaller than those of $A$; repeated multiplication by $A$ shrinks the values towards zero. \\\\
If the absolute values of all eigenvalues of $A$ are smaller than $1$, then $\Lambda^k \to 0$ and hence $A^k \to 0$ as $k \to \infty$, which leads to vanishing gradients.\\\\
If the absolute value of any eigenvalue of $A$ is larger than $1$, the corresponding entry of $\Lambda^k$ grows without bound, so $A^k$ diverges and the gradients can explode.\\\\
If all eigenvalues are $1$, then $\Lambda^k = I$ and (for a diagonalizable $A$) $A^k = I$ for every $k$, so the gradient magnitude is preserved across time steps: it neither vanishes nor explodes.
\section{Gated recurrent units [4+3=7]}
Consider the following UGRNN cell. The gate values are given by:
\begin{center}
\includegraphics[scale=0.7]{images/UGRNN_cell.png}
\end{center}
\begin{equation*}
\begin{split}
u_t &= \sigma(w \cdot h_{t-1}+w \cdot x_t) \\
s_t &= w \cdot (h_{t-1}+x_t) \\
h_t &= u_t \cdot h_{t-1}+(1-u_t) \cdot s_t
\end{split}
\end{equation*}
Here, for simplicity, we assume every variable to be one dimensional, no bias and a single weight $w$ is shared by both gates $u_t$ and $s_t$.
\subsection{Calculate the partial derivative $\frac{\partial h_t}{\partial h_{t-1}}$ and express your result in the form $A_t \cdot w + B_t$ for suitable functions $A_t$ and $B_t$.}
\begin{equation*}
\begin{split}
\frac{\partial h_t}{\partial h_{t-1}} &= \frac{\partial}{\partial h_{t-1}} \big(u_t \cdot h_{t-1}\big) + \frac{\partial}{\partial h_{t-1}} \big((1-u_t) \cdot s_t\big) \\
&= u_t + h_{t-1} \cdot \frac{\partial u_t}{\partial h_{t-1}} + (1-u_t) \cdot \frac{\partial s_t}{\partial h_{t-1}} - s_t \cdot \frac{\partial u_t}{\partial h_{t-1}} \\
&= u_t + (h_{t-1} - s_t) \cdot \sigma'(w \cdot h_{t-1} + w \cdot x_t) \cdot w + (1-u_t) \cdot w \\
&= \Big[(h_{t-1} - s_t) \cdot \sigma'(w \cdot h_{t-1} + w \cdot x_t) + (1-u_t)\Big] \cdot w + u_t
\end{split}
\end{equation*}
using $\frac{\partial u_t}{\partial h_{t-1}} = \sigma'(w \cdot h_{t-1} + w \cdot x_t) \cdot w$ and $\frac{\partial s_t}{\partial h_{t-1}} = w$. \\
Therefore $A_t = (h_{t-1} - s_t) \cdot \sigma'(w \cdot h_{t-1} + w \cdot x_t) + (1-u_t)$ and $B_t = u_t$.
\subsection{Use the chain rule to write down an expression for the long term derivative $\frac{\partial h_t}{\partial h_0}$. Then explain why it is possible to avoid the vanishing gradient problem.}
\begin{equation*}
\frac{\partial h_t}{\partial h_0} = \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdot \text{...} \cdot\frac{\partial h_{1}}{\partial h_0}
\end{equation*}
Each factor in this product has the form $A_t \cdot w + B_t$ with $B_t = u_t$. When the update gate $u_t$ is close to $1$, we get $\frac{\partial h_t}{\partial h_{t-1}} \approx u_t \approx 1$, i.e. the hidden state is essentially copied from one time step to the next. The long-term gradient is then a product of factors close to $1$ rather than a product of repeatedly small sigmoid derivatives, so it does not have to shrink towards zero and the vanishing gradient problem can be avoided.
\section{CNN Visualization [5]}
In this exercise, you will learn how to visualize the hidden layers of convolutional neural networks. Follow the instructions in the Task3.3\_Visualization.ipynb notebook and complete all tasks marked as TODO.
\section{Sentiment Analysis with RNNs [5]}
In this exercise, you will use recurrent neural networks to perform sentiment analysis on Amazon’s review dataset. Follow the instructions in the Task3.4\_NLP.ipynb notebook and complete all tasks marked as TODO.
\end{document}
%% Cell type:markdown id: tags:
# Programming Exercise 4: Transformers and Attention
## Very Deep Learning (VDL) - Winter Semester 2023/24
---
### Group Details:
- **Group Name:** Group 4
### Members:
- Frederick Phillips, 404986
- Niklas Eberts, 409829
- Muhammad Saad Najib, 423595
- Rea Fernandes, 426401
- Mayank Chetan Ahuja, 426518
- Caina Rose Paul, 426291
---
**Instructions**: The tasks in this notebook are a part of Sheet 4. Look for `TODO` tags throughout the notebook and complete the sections with missing code. Once done, ensure all outputs are visible and correctly displayed. Save your notebook and submit the `.ipynb` file together with the exercise sheet PDF in a single ZIP file.
%% Cell type:markdown id: tags:
## Introduction to Transformers
Transformers have revolutionized the field of natural language processing and beyond. This tutorial will guide you through the core concepts of transformer models, focusing on attention mechanisms.
Before diving into the practical aspects, familiarize yourself with the original Transformer paper: "Attention Is All You Need" by Vaswani et al. (2017). This will provide a solid theoretical foundation.
%% Cell type:code id: tags:
``` python
# Setup: Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import math
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
%% Cell type:markdown id: tags:
## Task 1: Implementing Scaled Dot-Product Attention
**Objective**: Implement the scaled dot-product attention mechanism as described in the Transformer paper.
- **Subtask 1**: Define a function for scaled dot-product attention (1).
- **Subtask 2**: Test the function with a small example (1).
%% Cell type:code id: tags:
``` python
# TODO: Implement Scaled Dot-Product Attention
class ScaledDotProductAttention(nn.Module):
def __init__(self):
super().__init__()
def forward(self, query, key, value, mask=None):
d_k = query.size(-1)
scores = query.matmul(key.transpose(-2, -1)) / math.sqrt(d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
p_attn = F.softmax(scores, dim=-1)
return p_attn.matmul(value), p_attn
# Initialize the attention mechanism
attention = ScaledDotProductAttention()
# Define query, key, value
query = torch.rand(10, 1, 512)
key = torch.rand(10, 1, 512)
value = torch.rand(10, 1, 512)
# Forward pass through the attention mechanism
output, attention_weights = attention(query, key, value)
print("Output shape: ", output.shape)
print("Attention Weights shape: ", attention_weights.shape)
```
%% Output
Output shape: torch.Size([10, 1, 512])
Attention Weights shape: torch.Size([10, 1, 1])
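%% Cell type:markdown id: tags:
As an additional (optional) check of the `mask` argument, the sketch below builds a causal mask and verifies that masked positions receive (near) zero attention weight. The tensor shapes are chosen only for illustration.
%% Cell type:code id: tags:
``` python
# Optional sketch: masked attention on a toy sequence (batch=1, seq_len=4, d_k=8)
seq_len = 4
q = torch.rand(1, seq_len, 8)
k = torch.rand(1, seq_len, 8)
v = torch.rand(1, seq_len, 8)

# Lower-triangular (causal) mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

_, weights = attention(q, k, v, mask=causal_mask)
print(weights[0])  # entries above the diagonal should be (close to) zero
```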
%% Cell type:markdown id: tags:
## Task 2: Multi-Head Attention
**Objective**: Understand and implement Multi-Head Attention.
- **Subtask 1**: Implement the Multi-Head Attention module (1).
- **Subtask 2**: Test the function with a small example (1).
%% Cell type:code id: tags:
``` python
# TODO: Implement Multi-Head Attention
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
assert d_model % num_heads == 0
self.d_model = d_model
self.num_heads = num_heads
self.head_dim = d_model // num_heads
self.q_linear = nn.Linear(d_model, d_model)
self.k_linear = nn.Linear(d_model, d_model)
self.v_linear = nn.Linear(d_model, d_model)
self.out = nn.Linear(d_model, d_model)
def forward(self, query, key, value):
N = query.shape[0]
# Get Q, K, V
Q = self.q_linear(query)
K = self.k_linear(key)
V = self.v_linear(value)
# Split the last dimension into (num_heads, head_dim)
Q = Q.reshape(N, -1, self.num_heads, self.head_dim)
K = K.reshape(N, -1, self.num_heads, self.head_dim)
V = V.reshape(N, -1, self.num_heads, self.head_dim)
# Compute scaled dot-product attention
energy = torch.einsum("nqhd,nkhd->nhqk", [Q, K])
attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)  # scale by sqrt(d_k) per head, as in the paper
out = torch.einsum("nhql,nlhd->nqhd", [attention, V]).reshape(N, -1, self.d_model)
# Pass through the final linear layer
out = self.out(out)
return out
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(64, 10, 512) # batch_size=64, sequence_length=10, d_model=512
out = mha(x, x, x)
print(out.shape) # Should print: torch.Size([64, 10, 512])
```
%% Output
torch.Size([64, 10, 512])
%% Cell type:markdown id: tags:
## Task 3: Positional Encoding
**Objective**: Implement positional encoding to add information about the sequence order.
- **Subtask 1**: Implement the positional encoding module (1).
- **Subtask 2**: Test the function with a small example (1).
%% Cell type:code id: tags:
``` python
# TODO: Implement Positional Encoding
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEncoding, self).__init__()
# Compute the positional encodings once in log space.
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) *
-(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
x = x + self.pe[:, :x.size(1)]
return x
# Instantiate the class
pe = PositionalEncoding(20)
# Create a tensor of shape (1, 10, 20)
x = torch.zeros(1, 10, 20)
# Pass the tensor through the PositionalEncoding
y = pe(x)
print(y)
```
%% Output
tensor([[[ 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00],
[ 8.4147e-01, 5.4030e-01, 3.8767e-01, 9.2180e-01, 1.5783e-01,
9.8747e-01, 6.3054e-02, 9.9801e-01, 2.5116e-02, 9.9968e-01,
9.9998e-03, 9.9995e-01, 3.9811e-03, 9.9999e-01, 1.5849e-03,
1.0000e+00, 6.3096e-04, 1.0000e+00, 2.5119e-04, 1.0000e+00],
[ 9.0930e-01, -4.1615e-01, 7.1471e-01, 6.9942e-01, 3.1170e-01,
9.5018e-01, 1.2586e-01, 9.9205e-01, 5.0217e-02, 9.9874e-01,
1.9999e-02, 9.9980e-01, 7.9621e-03, 9.9997e-01, 3.1698e-03,
9.9999e-01, 1.2619e-03, 1.0000e+00, 5.0238e-04, 1.0000e+00],
[ 1.4112e-01, -9.8999e-01, 9.2997e-01, 3.6764e-01, 4.5775e-01,
8.8908e-01, 1.8816e-01, 9.8214e-01, 7.5285e-02, 9.9716e-01,
2.9995e-02, 9.9955e-01, 1.1943e-02, 9.9993e-01, 4.7547e-03,
9.9999e-01, 1.8929e-03, 1.0000e+00, 7.5357e-04, 1.0000e+00],
[-7.5680e-01, -6.5364e-01, 9.9977e-01, -2.1631e-02, 5.9234e-01,
8.0569e-01, 2.4971e-01, 9.6832e-01, 1.0031e-01, 9.9496e-01,
3.9989e-02, 9.9920e-01, 1.5924e-02, 9.9987e-01, 6.3395e-03,
9.9998e-01, 2.5238e-03, 1.0000e+00, 1.0048e-03, 1.0000e+00],
[-9.5892e-01, 2.8366e-01, 9.1320e-01, -4.0752e-01, 7.1207e-01,
7.0211e-01, 3.1027e-01, 9.5065e-01, 1.2526e-01, 9.9212e-01,
4.9979e-02, 9.9875e-01, 1.9904e-02, 9.9980e-01, 7.9244e-03,
9.9997e-01, 3.1548e-03, 1.0000e+00, 1.2559e-03, 1.0000e+00],
[-2.7942e-01, 9.6017e-01, 6.8379e-01, -7.2968e-01, 8.1396e-01,
5.8092e-01, 3.6960e-01, 9.2919e-01, 1.5014e-01, 9.8866e-01,
5.9964e-02, 9.9820e-01, 2.3884e-02, 9.9971e-01, 9.5092e-03,
9.9995e-01, 3.7857e-03, 9.9999e-01, 1.5071e-03, 1.0000e+00],
[ 6.5699e-01, 7.5390e-01, 3.4744e-01, -9.3770e-01, 8.9544e-01,
4.4518e-01, 4.2745e-01, 9.0404e-01, 1.7493e-01, 9.8458e-01,
6.9943e-02, 9.9755e-01, 2.7864e-02, 9.9961e-01, 1.1094e-02,
9.9994e-01, 4.4167e-03, 9.9999e-01, 1.7583e-03, 1.0000e+00],
[ 9.8936e-01, -1.4550e-01, -4.3251e-02, -9.9906e-01, 9.5448e-01,
2.9827e-01, 4.8360e-01, 8.7529e-01, 1.9960e-01, 9.7988e-01,
7.9915e-02, 9.9680e-01, 3.1843e-02, 9.9949e-01, 1.2679e-02,
9.9992e-01, 5.0476e-03, 9.9999e-01, 2.0095e-03, 1.0000e+00],
[ 4.1212e-01, -9.1113e-01, -4.2718e-01, -9.0417e-01, 9.8959e-01,
1.4389e-01, 5.3783e-01, 8.4305e-01, 2.2415e-01, 9.7455e-01,
8.9879e-02, 9.9595e-01, 3.5822e-02, 9.9936e-01, 1.4264e-02,
9.9990e-01, 5.6786e-03, 9.9998e-01, 2.2607e-03, 1.0000e+00]]])
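%% Cell type:markdown id: tags:
In a full Transformer the positional encoding is added right after the token embedding. A minimal sketch (vocabulary size and dimensions chosen only for illustration):
%% Cell type:code id: tags:
``` python
# Toy vocabulary of 100 tokens embedded into d_model=20, then position information added
embedding = nn.Embedding(100, 20)
pos_enc = PositionalEncoding(20)
tokens = torch.randint(0, 100, (1, 10))                # one sequence of 10 token ids
embedded = pos_enc(embedding(tokens) * math.sqrt(20))  # scale by sqrt(d_model) as in the paper
print(embedded.shape)  # torch.Size([1, 10, 20])
```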
%% Cell type:markdown id: tags:
## Additional Resources
Here are some additional resources to deepen your understanding:
- ["Illustrated Transformer" by Jay Alammar](https://jalammar.github.io/illustrated-transformer/).
- PyTorch official documentation and tutorials.
\documentclass{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{amsmath}
\usepackage{amssymb}
\title{Very Deep Learning \\ Exercise 4 \\ Group 4}
\author{Niklas Eberts - 409829 \\
Frederick Phillips - 404986 \\
Muhammad Saad Najib - 423595 \\
Rea Fernandes - 426401 \\
Mayank Chetan Ahuja - 426518 \\
Caina Rose Paul - 426291
}
\date{\today}
\begin{document}
\maketitle
\section{Language Models [2+2+2+2=8]}
Consider a vocabulary $\mathcal{V} = \{A, B, C\}$ and sequences of length $T=10$. \\
Assume $p(x)= \prod^T_{t=1}p(x_t)$ with $p(x_t=A)=0.2$, $p(x_t=B)=0.5$, and \\
$p(x_t=C)=0.3$ for the model distribution and $p(x_t=A)=0.5$, \\
$p(x_t=B)=0.5$, and $p(x_t=C)=0$ for the data distribution.
\subsection{Calculate the amount of information (in bits) needed to predict the next character in a sequence.}
\begin{align*}
H(P_{\text{data}}, P_{\text{model}}) &= -\frac{1}{10} \log_2 \left( \left( \frac{2}{10} \right)^5 \left( \frac{5}{10} \right)^5 \right) \\
&= -\frac{1}{10} \times (-16.609) \\
&= 1.661 \text{ bits}
\end{align*}
\subsection{Calculate the perplexity Perplexity($p_{data}$, $p_{model}$) for the same model. Remark on how certain the model is about which letter to predict next.}
\begin{align*}
\text{perplexity}(p_{\text{data}}, p_{\text{model}}) &= \left( \left( \frac{2}{10} \right)^5 \left( \frac{5}{10} \right)^5 \right)^{-\frac{1}{10}} \\
&= 3.16
\end{align*}
The perplexity ($3.16 = 2^{1.661}$) is larger than $|\mathcal{V}| = 3$, so the model fits the data badly: it is roughly as uncertain as guessing uniformly among the three letters, and in particular it cannot decide between A and B.
\subsection{Let us now assume that the model distribution is the same as before, but the data distribution is $p(x_t = A) = 1$ now. Calculate the perplexity again. What does this new perplexity tell you about the model?}
\begin{align*}
\text{perplexity}(P_{\text{data}}, P_{\text{model}}) &= \left( \left( \frac{2}{10} \right)^{10} \right)^{-\frac{1}{10}} \\
&= 5
\end{align*}
The perplexity is now $5 > |\mathcal{V}| = 3$, so the model fits the data badly; it is even worse than in the previous case, because the model assigns only probability $0.2$ to the letter A, which now always occurs.
\subsection{How many bits do we need now to encode any possible outcome of $p_{data}$ using code optimized for $p_{model}$?}
\begin{align*}
-\frac{1}{10} \log_2 \left( (0.2)^{10} \right) &= -\log_2(0.2) \\
&= 2.322 \text{ bits}
\end{align*}
\section{Beam Search [2+4+2=8]}
Let $y = \{yes, ok, </s> \}$ be a vocabulary where $</s>$ is the end-of-string character. The following figure shows a search tree for generating the target string $T = t_1, t_2, \dots$ from this vocabulary. Each node represents the conditional probability $p(t_n|t_1, ..., t_{n-1}, \textbf{source})$, where source is the context variable.
\begin{center}
\begin{tikzpicture}[main/.style = {draw}, node distance=2cm]
\node[main] (start) at (0, 0) {start};
\node[main] (ok1) at (3, 1) {ok};
\node[main] (yes1) at (3, 0) {yes};
\node[main] (</s>1) at (3, -1) {$</s>$};
\node[main] (yes2) at (6, 2) {yes};
\node[main] (ok2) at (6, 3) {ok};
\node[main] (</s>2) at (6, 1) {$</s>$};
\node[main] (ok3) at (6, 0) {ok};
\node[main] (yes3) at (6, -1) {yes};
\node[main] (</s>3) at (6, -2) {$</s>$};
\node[main] (</s>4) at (9, 3) {$</s>$};
\node[main] (</s>5) at (9, 2) {$</s>$};
\node[main] (</s>6) at (9, 0) {$</s>$};
\node[main] (</s>7) at (9, -1) {$</s>$};
\node (t1) at (3, -3) {$t_1$};
\node (t2) at (6, -3) {$t_2$};
\node (t3) at (9, -3) {$t_3$};
\node (p1) at (1.5, 1.5) {$p(t_1|$source)};
\node (p2) at (4.2, 3) {$p(t_2|$source$, t_1)$};
\node (p3) at (7.5, 3.8) {$p(t_3|$source$, t_1, t_2)$};
\draw[->] (start) -- node[midway, sloped, pos=0.5, fill=white] {0.4} (ok1);
\draw[->] (start) -- node[midway, sloped, pos=0.5, fill=white] {0.6} (yes1);
\draw[->] (start) -- node[midway, sloped, pos=0.5, fill=white] {0} (</s>1);
\draw[->] (ok1) -- node[midway, sloped, pos=0.5, fill=white] {0.9} (ok2);
\draw[->] (ok1) -- node[midway, sloped, pos=0.5, fill=white] {0} (</s>2);
\draw[->] (ok1) -- node[midway, sloped, pos=0.5, fill=white] {0.1} (yes2);
\draw[->] (yes1) -- node[midway, sloped, pos=0.5, fill=white] {0.2} (ok3);
\draw[->] (yes1) -- node[midway, sloped, pos=0.5, fill=white] {0.5} (yes3);
\draw[->] (yes1) -- node[midway, sloped, pos=0.5, fill=white] {0.3} (</s>3);
\draw[->] (ok2) -- node[midway, sloped, pos=0.5, fill=white] {1.0} (</s>4);
\draw[->] (yes2) -- node[midway, sloped, pos=0.5, fill=white] {1.0} (</s>5);
\draw[->] (ok3) -- node[midway, sloped, pos=0.5, fill=white] {1.0} (</s>6);
\draw[->] (yes3) -- node[midway, sloped, pos=0.5, fill=white] {1.0} (</s>7);
\end{tikzpicture}
\end{center}
\subsection{What will be the generated target string in this case with the greedy search algorithm?}
yes yes $</s>$ (total probability $0.6 \times 0.5 \times 1.0 = 0.3$)
\subsection{Execute the beam search algorithm with the beam size $k = 2$. What is the target string generated in this case? Compute all intermediate probabilities.}
Start with an empty list of hypotheses $H$:
\begin{enumerate}
\item Step:
\begin{itemize}
\item Hypothesis 1: $h_1 = \{ ok \}$ with probability $0.4$
\item Hypothesis 2: $h_2 = \{ yes \}$ with probability $0.6$
\item Hypothesis 3: $h_3 = \{ </s> \}$ with probability $0$
\end{itemize}
The hypotheses at this step are $H = \{ h_1, h_2 \}$.
\item Step: \\
For $h_1:$
\begin{itemize}
\item Hypothesis 1: $h_{1,1} = \{ ok, ok \}$ with probability $0.36$
\item Hypothesis 2: $h_{1,2} = \{ ok, yes \}$ with probability $0.04$
\item Hypothesis 3: $h_{1,3} = \{ ok, </s> \}$ with probability $0$
\end{itemize}
For $h_2:$
\begin{itemize}
\item Hypothesis 1: $h_{2,1} = \{ yes, ok \}$ with probability $0.12$
\item Hypothesis 2: $h_{2,2} = \{ yes, yes \}$ with probability $0.3$
\item Hypothesis 3: $h_{2,3} = \{ yes, </s> \}$ with probability $0.18$
\end{itemize}
The hypotheses at this step are $H = \{ h_{1,1}, h_{2,2} \}$.
\item Step: \\
For $h_{1,1}$
\begin{itemize}
\item Hypothesis 1: $h_{1,1,1} = \{ ok, ok, </s> \}$ with probability $0.36$
\end{itemize}
For $h_{2,2}$
\begin{itemize}
\item Hypothesis 1: $h_{2,2,1} = \{ yes, yes, </s> \}$ with probability $0.3$
\end{itemize}
The hypotheses at this step are $H = \{ h_{1,1,1}, h_{2,2,1} \}$.
\end{enumerate}
The final hypothesis with the highest probability is $h_{1,1,1}$ with a probability of $0.36$. Therefore, the target string generated using the beam search algorithm with a beam size of $k = 2$ is \{ok, ok, $</s>$ \}.
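As a sanity check of the steps above (illustration only, not required by the sheet), the following sketch hard-codes the transition probabilities from the search tree as nested dictionaries and runs beam search with $k=2$; all names are chosen for this example.
\begin{verbatim}
# Transition probabilities read off the search tree; "</s>" ends a hypothesis.
tree = {
    (): {"ok": 0.4, "yes": 0.6, "</s>": 0.0},
    ("ok",): {"ok": 0.9, "yes": 0.1, "</s>": 0.0},
    ("yes",): {"ok": 0.2, "yes": 0.5, "</s>": 0.3},
}

def beam_search(k=2, max_len=3):
    beams = [((), 1.0)]                     # (prefix, probability)
    for _ in range(max_len):
        candidates = []
        for prefix, prob in beams:
            if prefix and prefix[-1] == "</s>":
                candidates.append((prefix, prob))   # finished hypothesis
                continue
            # After two tokens the only continuation in the tree is "</s>".
            successors = tree.get(prefix, {"</s>": 1.0})
            for token, p in successors.items():
                candidates.append((prefix + (token,), prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

for seq, p in beam_search(k=2):
    print(seq, round(p, 2))
# ('ok', 'ok', '</s>') 0.36
# ('yes', 'yes', '</s>') 0.3
\end{verbatim}
Setting \texttt{k=1} in this sketch reproduces the greedy result ``yes yes $</s>$'' with probability $0.3$.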
\subsection{Did the two search algorithms yield the same result? What is the benefit of using beam search over the greedy approach?}
They did not yield the same result: greedy search produced ``yes yes $</s>$'' with probability $0.3$, while beam search found ``ok ok $</s>$'' with probability $0.36$. Greedy search commits to the locally best token at every step, so it can miss sequences whose first token looks slightly worse but whose continuations are much more likely. Beam search keeps $k$ hypotheses in parallel, which delays this commitment and increases the likelihood of finding a globally better, higher-probability sequence; in practice this usually improves the quality of the generated text. The price is computation: at every step all $k$ kept hypotheses are expanded over the whole vocabulary, so the cost grows roughly linearly with the beam size $k$. Beam search is still a heuristic; only exhaustive search, whose cost grows exponentially with the sequence length, would guarantee the optimal sequence.
\section{Transformers and Attention [3]}
Consider an input matrix $X$, where each column represents an input vector, together with query, key, value matrices defined by:
\begin{equation*}
X =
\begin{pmatrix}
-3 & 0 & 2 & -2 \\
1 & 2 & 2 & 1 \\
3 & -2 & -1 & -5
\end{pmatrix}
, W_Q =
\begin{pmatrix}
2 & 1 & 0 \\
0 & 0 & 0 \\
2 & 1 & 2 \\
1 & 2 & 2
\end{pmatrix}
, W_K =
\begin{pmatrix}
1 & 1 & 2 \\
2 & 2 & 1 \\
2 & 1 & 1 \\
0 & 1 & 2
\end{pmatrix}
, W_V =
\begin{pmatrix}
3 & 0 & 1 \\
2 & 0 & 0 \\
4 & 3 & 3 \\
1 & 4 & 2
\end{pmatrix}
\end{equation*}
\subsection{Compute the corresponding self-attention output matrix, where again, each column represents the attention output of one input. You should show intermediate steps. \\\\
You may find the lecture slides on transformers or the paper Attention is all you need helpful.}
First, compute the Query, Key, and Value matrices from the given input matrix \( X \) and the weight matrices \( W_Q \), \( W_K \), and \( W_V \):
\begin{align*}
Q &= XW_Q = \begin{pmatrix}
-4 & -5 & 0 \\
7 & 5 & 6 \\
-1 & -8 & -12
\end{pmatrix}, \\
K &= XW_K = \begin{pmatrix}
1 & -3 & -8 \\
9 & 8 & 8 \\
-3 & -7 & -7
\end{pmatrix}, \\
V &= XW_V = \begin{pmatrix}
-3 & -2 & -1 \\
16 & 10 & 9 \\
-4 & -23 & -10
\end{pmatrix}.
\end{align*}
% Calculation of Attention Scores
Next, calculate the attention scores by multiplying the Query matrix with the transpose of the Key matrix:
\begin{align*}
\text{Attention Scores} &= QK^T = \begin{pmatrix}
11 & -76 & 47 \\
-56 & 151 & -98 \\
119 & -169 & 143
\end{pmatrix}.
\end{align*}
% Application of Softmax
Applying the softmax function row-wise to normalize the attention scores:
\begin{align*}
\text{softmax}(QK^T) &= \text{softmax}\begin{pmatrix}
11 & -76 & 47 \\
-56 & 151 & -98 \\
119 & -169 & 143
\end{pmatrix} \approx \begin{pmatrix}
0 & 0 & 1 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix},
\end{align*}
since within each row the largest score exceeds the other two by at least $24$, so the weights are essentially one-hot. Multiplying these weights with $V$ gives the self-attention output:
\begin{align*}
\text{Attention}(Q, K, V) = \text{softmax}(QK^T)\,V \approx \begin{pmatrix}
-4 & -23 & -10 \\
16 & 10 & 9 \\
-4 & -23 & -10
\end{pmatrix},
\end{align*}
i.e. inputs 1 and 3 attend almost entirely to input 3, and input 2 attends to itself. Scaling the scores by $1/\sqrt{d_k} = 1/\sqrt{3}$ before the softmax, as in scaled dot-product attention, does not change this result, since the gaps between the scores remain very large.
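Because the matrix products above are easy to get wrong by hand, the following NumPy sketch (purely illustrative; it follows the same convention as the computation above, i.e. row $i$ of $X$ is treated as input $i$) recomputes the scores, the softmax weights, and the attention output.
\begin{verbatim}
import numpy as np

X = np.array([[-3, 0, 2, -2],
              [ 1, 2, 2,  1],
              [ 3,-2,-1, -5]], dtype=float)
W_Q = np.array([[2, 1, 0], [0, 0, 0], [2, 1, 2], [1, 2, 2]], dtype=float)
W_K = np.array([[1, 1, 2], [2, 2, 1], [2, 1, 1], [0, 1, 2]], dtype=float)
W_V = np.array([[3, 0, 1], [2, 0, 0], [4, 3, 3], [1, 4, 2]], dtype=float)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T                              # unscaled dot-product scores

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores)
print(weights.round(3))                       # rows are essentially one-hot
print((weights @ V).round(3))                 # self-attention output
\end{verbatim}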
\section{Scaled Dot Product in Transformers [5]}
\subsection{Given a random key, query pair $k, q \in \mathbb{R}^d$. Assume for simplicity that for any $1 \le i, j \le d, k_i$ and $q_j$ are independent random variables with mean zero and variance 1. Determine the mean and variance of the dot product:
\begin{equation*}
<k,q> = \sum^d_{i=1}k_iq_i
\end{equation*}
Then explain why we would scale $<k,q> \rightarrow \frac{<k,q>}{\sqrt{d}}$ in the transformer architecture.}
\begin{equation*}
\begin{aligned}
E[q \cdot k] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] \\
&= \sum_{i=1}^{d} E[q_i k_i] \\
&= \sum_{i=1}^{d} E[q_i] E[k_i] \\
&= 0
\end{aligned}
\end{equation*}
\begin{equation*}
\begin{aligned}
\text{var}[q \cdot k] &= \text{var} \left[ \sum_{i=1}^{d} q_i k_i \right] \\
&= \sum_{i=1}^{d} \text{var}[q_i k_i] \quad \text{(the products $q_i k_i$ are independent across $i$)} \\
&= \sum_{i=1}^{d} \text{var}[q_i] \, \text{var}[k_i] \quad \text{(zero means: $\text{var}[q_i k_i] = E[q_i^2]\,E[k_i^2]$)} \\
&= \sum_{i=1}^{d} 1 \\
&= d
\end{aligned}
\end{equation*}
As calculated above, the variance of the dot product \( \langle k, q \rangle \) is \( d \), the dimensionality of the key and query vectors. For large \( d \) the raw scores therefore take on very large magnitudes, which pushes the softmax into its saturated regions, where gradients become vanishingly small, and can also cause numerical problems. Scaling by \( \frac{1}{\sqrt{d}} \) brings the variance of the scores back to \( 1 \), keeping them in a well-behaved range.
By scaling with \( \frac{1}{\sqrt{d}} \), the attention scores become less dependent on the dimensionality of the key and query vectors. This means the model's performance becomes more consistent and less sensitive to changes in the dimensionality of the input space.
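A quick Monte-Carlo experiment (illustrative only; the sample size and seed are arbitrary) confirms the calculation: the empirical variance of the raw dot product grows like $d$, while the scaled version stays close to $1$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 512):
    q = rng.standard_normal((10_000, d))   # zero-mean, unit-variance entries
    k = rng.standard_normal((10_000, d))
    dots = (q * k).sum(axis=1)             # <q, k> for 10,000 random pairs
    print(d, round(dots.var(), 1), round((dots / np.sqrt(d)).var(), 2))
# variance of <q, k> is roughly d; after dividing by sqrt(d) it is roughly 1
\end{verbatim}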
\section{Programming Task [6]}
Complete all the tasks in the notebook Task\_4.5.ipynb provided with the sheet.
\end{document}
\documentclass{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{amsmath}
\usepackage{amssymb}
\title{Very Deep Learning \\ Exercise 5 \\ Group 4}
\author{Niklas Eberts - 409829 \\
Frederick Phillips - 404986 \\
Muhammad Saad Najib - 423595 \\
Rea Fernandes - 426401 \\
Mayank Chetan Ahuja - 426518 \\
Caina Rose Paul - 426291
}
\date{\today}
\begin{document}
\maketitle
\section{Batch Normalization [5+5+5=15]}
We have seen that gradient management is important for very deep networks. Batch and layer normalizations are useful and widely applicable techniques that help people train networks with hundreds of layers while keeping the gradients under control. In this exercise we develop some intuition on how batch normalization manages gradients. \\\\
Given a batch $\mathcal{B}= \{a_1, \dots, a_N\}$ of some data, recall that a (simplified) batch normalization layer $BN_{\gamma, \beta}$ works as follows:
\begin{itemize}
\item Input: $a_i, 1 \leq i \leq N$. Learnable parameters $\gamma, \beta$.
\item Calculations
\begin{equation}
\mu = \frac{1}{N}\displaystyle\sum^N_{i=1}a_i
\end{equation}
\begin{equation}
\sigma^2 = \frac{1}{N}\displaystyle\sum^N_{i=1}(a_i-\mu)^2
\end{equation}
\begin{equation}
\hat{a_i} = \frac{a_i-\mu}{\sqrt{\sigma^2+\epsilon}}
\end{equation}
\item Output: $BN_{\gamma, \beta}[a_i]=\gamma\hat{a_i}+\beta$
\end{itemize}
\begin{figure}
\centering
\includegraphics[scale=1]{images/5.1.png}
\caption{Exercise 5.1}
\label{fig:5.1}
\end{figure}
For simplicity, we assume all variables to be one dimensional. Consider a hidden layer $g$
of some neural network with associated weight $w$ and let the previous layer activation be
$x$. For $1 \leq i \leq N$ , the layer performs $g_i = \sigma(BN_{\gamma, \beta}[wx_i])$
\subsection{Show that batch normalization is independent of weight scaling, i.e. if $w \neq 0, k > 0$ and $\epsilon$ negligible, we have $BN_{\gamma, \beta}[(kw)x_i]=BN_{\gamma, \beta}[wx_i]$.}
TODO
\subsection{Deduce that the gradient $\frac{\partial BN_{\gamma, \beta}[w x_i]}{\partial x_i}$ is independent of weight scaling. What can then be concluded for gradient propagation to earlier layers when weights are scaled?}
TODO
\subsection{Show that $\frac{\partial BN_{\gamma, \beta}[(kw) x_i]}{\partial(kw)} \rightarrow 0$ as $k \rightarrow + \infty$. What does this mean in terms of weight explosion prevention?}
TODO
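Although the proofs are still marked TODO, the scale-invariance claimed in 1.1 is easy to verify numerically. The sketch below is only an illustration of the property (it uses the simplified batch normalization from equations (1)--(3) with a tiny $\epsilon$ and arbitrary values for $\gamma$, $\beta$, $w$, and $k$), not a substitute for the derivation.
\begin{verbatim}
import numpy as np

def batch_norm(a, gamma=1.5, beta=0.3, eps=1e-12):
    # Simplified batch normalization from the exercise, eqs. (1)-(3).
    mu = a.mean()
    var = ((a - mu) ** 2).mean()
    return gamma * (a - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal(8)     # a batch of previous-layer activations
w, k = 0.7, 1000.0             # weight and a positive scaling factor

print(np.allclose(batch_norm(w * x), batch_norm(k * w * x)))  # True
# The normalized activations do not change when w is scaled by k > 0.
\end{verbatim}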
\section{Semantic Segmentation [5+5+5=15]}
\subsection{Perform transpose convolution on a $2 \times 2$ encoded feature map which needs to be upsampled to a $3 \times 3$ feature map. Kernel size is $2 \times 2$, stride is $1$ with no padding. \\
The $2 \times 2$ input feature map is: \\
\begin{center}$
\begin{array}{|c|c|}
- & - \\
5 & 9 \\
- & - \\
0 & 3 \\
- & -
\end{array}$
\end{center}
and the $2 \times 2$ kernel is: \\
\begin{center}$
\begin{array}{|c|c|}
- & - \\
1 & 6 \\
- & - \\
3 & 7 \\
- & -
\end{array}$
\end{center}}
\begin{center}$
\begin{array}{|c|c|c|}
- & - & -\\
5 & 39 & 54 \\
- & - & - \\
15 & 65 & 81 \\
- & - & -\\
0 & 9 & 21 \\
\end{array}$
\end{center}
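The result can also be reproduced programmatically; the sketch below (illustration only, function name chosen for this example) implements stride-1 transposed convolution by scattering a scaled copy of the kernel for every input entry and summing the overlaps.
\begin{verbatim}
import numpy as np

def transpose_conv2d(inp, kernel, stride=1):
    # Each input value stamps inp[r, c] * kernel into the output,
    # offset by (stride * r, stride * c); overlapping stamps are summed.
    ih, iw = inp.shape
    kh, kw = kernel.shape
    out = np.zeros(((ih - 1) * stride + kh, (iw - 1) * stride + kw))
    for r in range(ih):
        for c in range(iw):
            out[r*stride:r*stride+kh, c*stride:c*stride+kw] += inp[r, c] * kernel
    return out

inp = np.array([[5, 9], [0, 3]], dtype=float)
kernel = np.array([[1, 6], [3, 7]], dtype=float)
print(transpose_conv2d(inp, kernel))
# [[ 5. 39. 54.]
#  [15. 65. 81.]
#  [ 0.  9. 21.]]
\end{verbatim}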
\subsection{Consider a 1D transpose convolution scenario. The input vector is [a, b, c] and the filter is [w, x, y, z]. Assuming again stride 1 and no padding, write down the result of transpose convolution.}
\[
\begin{bmatrix}
aw \\
ax + bw \\
ay + bx + cw \\
az + by + cx \\
bz + cy \\
cz
\end{bmatrix}
\]
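The same symbolic result follows from viewing stride-1 transposed convolution in 1D as a full convolution, i.e. as the coefficients of a polynomial product; the short SymPy sketch below (illustrative only) recovers the vector above.
\begin{verbatim}
import sympy as sp

a, b, c, w, x, y, z, t = sp.symbols("a b c w x y z t")

# Treat the input [a, b, c] and the filter [w, x, y, z] as polynomials in t;
# the coefficients of their product are the transposed-convolution outputs.
product = sp.expand((a + b*t + c*t**2) * (w + x*t + y*t**2 + z*t**3))
print([product.coeff(t, n) for n in range(6)])
# [a*w, a*x + b*w, a*y + b*x + c*w, a*z + b*y + c*x, b*z + c*y, c*z]
# (the order of terms inside each sum may vary in SymPy's printing)
\end{verbatim}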
\subsection{In this part, we will apply the most widely used upsampling techniques.\\
Given the following $2 \times 2$ input feature map:
\begin{center}$
\begin{array}{|c|c|}
- & - \\
8 & 16 \\
- & - \\
24 & 32 \\
- & -
\end{array}$
\end{center}
and the Final output feature map of size $4 \times 4$:
\begin{center}$
\begin{array}{|c|c|c|c|}
- & - & - & - \\
&&& \\
- & - & - & - \\
&&& \\
- & - & - & - \\
&&& \\
- & - & - & - \\
&&& \\
- & - & - & -
\end{array}$
\end{center}
Apply the following up-sampling techniques to convert the given $2 \times 2$ feature map into a $4 \times 4$ output.
\begin{enumerate}
\item Nearest Neighbor
\item Bed of Nails
\item Max-Unpooling : For Max-Unpooling, the saved indices are marked with $x$ in the following matrix:
\begin{center}$
\begin{array}{|c|c|c|c|}
- & - & - & - \\
&&x& \\
- & - & - & - \\
&x&& \\
- & - & - & - \\
&&&x \\
- & - & - & - \\
x&&& \\
- & - & - & -
\end{array}$
\end{center}
\end{enumerate}}
1. Nearest Neighbor:
\begin{center}$
\begin{array}{|c|c|c|c|}
- & - & - & - \\
8 & 8 & 16 & 16 \\
- & - & - & - \\
8 & 8 & 16 & 16\\
- & - & - & - \\
24 & 24 & 32 & 32 \\
- & - & - & - \\
24 & 24 & 32 & 32 \\
- & - & - & -
\end{array}$
\end{center}
2. Bed of Nails:
\begin{center}$
\begin{array}{|c|c|c|c|}
- & - & - & - \\
8 & 0 & 16 & 0 \\
- & - & - & - \\
0 & 0 & 0 & 0\\
- & - & - & - \\
24 & 0 & 32 & 0 \\
- & - & - & - \\
0 & 0 & 0 & 0 \\
- & - & - & -
\end{array}$
\end{center}
3. Max-Unpooling:
\begin{center}$
\begin{array}{|c|c|c|c|}
- & - & - & - \\
0 & 0&16&0 \\
- & - & - & - \\
0&8&0&0 \\
- & - & - & - \\
0&0&0&32 \\
- & - & - & - \\
24&0&0&0 \\
- & - & - & -
\end{array}$
\end{center}
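The three answers above can be checked with a few lines of NumPy; the sketch is an illustration only, and the max-unpooling indices are hard-coded from the matrix given in the exercise.
\begin{verbatim}
import numpy as np

inp = np.array([[8, 16], [24, 32]])

# 1. Nearest neighbour: repeat every entry twice along both axes.
nearest = inp.repeat(2, axis=0).repeat(2, axis=1)

# 2. Bed of nails: put each value in the top-left corner of its 2x2 block.
nails = np.zeros((4, 4), dtype=int)
nails[::2, ::2] = inp

# 3. Max-unpooling: place each value at the index saved during max-pooling.
saved = {(0, 0): (1, 1), (0, 1): (0, 2), (1, 0): (3, 0), (1, 1): (2, 3)}
unpooled = np.zeros((4, 4), dtype=int)
for (r, c), (rr, cc) in saved.items():
    unpooled[rr, cc] = inp[r, c]

print(nearest, nails, unpooled, sep="\n\n")
\end{verbatim}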
\end{document}