# Programming Exercise 4: Transformers and Attention

## Very Deep Learning (VDL) - Winter Semester 2023/24

---

### Group Details:

- **Group Name:** Group 4

### Members:

- Frederick Phillips, 404986
- Niklas Eberts, 409829
- Muhammad Saad Najib, 423595
- Rea Fernandes, 426401
- Mayank Chetan Ahuja, 426518
- Caina Rose Paul, 426291

---

**Instructions**: The tasks in this notebook are a part of Sheet 4. Look for `TODO` tags throughout the notebook and complete the sections with missing code. Once done, ensure all outputs are visible and correctly displayed. Save your notebook and submit the `.ipynb` file together with the exercise sheet PDF in a single ZIP file.

## Introduction to Transformers
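Transformers have become the dominant architecture in natural language processing and are now widely used in other areas such as computer vision. This notebook walks through the core building blocks of transformer models: scaled dot-product attention, multi-head attention, and positional encoding.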

Before diving into the practical aspects, familiarize yourself with the original Transformer paper: "Attention Is All You Need" by Vaswani et al. (2017). This will provide a solid theoretical foundation.

In [7]:
# Setup: Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import math

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Task 1: Implementing Scaled Dot-Product Attention
**Objective**: Implement the scaled dot-product attention mechanism as described in the Transformer paper.

- **Subtask 1**: Define a function for scaled dot-product attention (1).
- **Subtask 2**: Test the function with a small example (1).
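
For reference, the operation from Vaswani et al. (2017) can be written as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the keys:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The paper motivates the division by $\sqrt{d_k}$ as a way to keep the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.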

In [8]:
# TODO: Implement Scaled Dot-Product Attention

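class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, query, key, value, mask=None):
        # query: (..., len_q, d_k), key: (..., len_k, d_k), value: (..., len_k, d_v)
        d_k = query.size(-1)
        # Similarity of every query with every key, scaled by sqrt(d_k)
        scores = query.matmul(key.transpose(-2, -1)) / math.sqrt(d_k)

        if mask is not None:
            # Positions where mask == 0 get a large negative score,
            # so the softmax assigns them (almost) zero weight
            scores = scores.masked_fill(mask == 0, -1e9)

        # Normalize over the key dimension and take the weighted sum of values
        p_attn = F.softmax(scores, dim=-1)
        return p_attn.matmul(value), p_attn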

# Initialize the attention mechanism
attention = ScaledDotProductAttention()

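# Define query, key, value; the last two dimensions are (seq_len=1, d_k=512)
# With a single key per query, every attention weight is 1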
query = torch.rand(10, 1, 512)
key = torch.rand(10, 1, 512)
value = torch.rand(10, 1, 512)

# Forward pass through the attention mechanism
output, attention_weights = attention(query, key, value)

print("Output shape:", output.shape)
print("Attention Weights shape:", attention_weights.shape)

Output shape: torch.Size([10, 1, 512])
Attention Weights shape: torch.Size([10, 1, 1])
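
The test above uses a sequence length of 1, so the effect of the `mask` argument is not visible. The following is a minimal, dependency-free sketch of the same computation on plain Python lists, small enough to check by hand. The helper names `softmax` and `toy_attention` and the three-token input are illustrative assumptions, not part of the exercise template.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(q, k, v, mask=None):
    """Scaled dot-product attention on nested lists (hypothetical helper).

    q: len_q x d_k, k: len_k x d_k, v: len_k x d_v, mask: len_q x len_k of 0/1.
    """
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        # Dot product of query i with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        if mask is not None:
            # Same convention as masked_fill(mask == 0, -1e9)
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    d_v = len(v[0])
    # Each output row is the attention-weighted sum of the value rows
    out = [[sum(w[j] * v[j][c] for j in range(len(v))) for c in range(d_v)]
           for w in weights]
    return out, weights

# Three tokens with d_k = d_v = 2 (made-up values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Causal mask: token i may attend only to tokens 0..i
causal_mask = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

out, weights = toy_attention(x, x, x, mask=causal_mask)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1, and the entries above the diagonal are 0, so the first token can only attend to itself.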


## Task 2: Multi-Head Attention
**Objective**: Understand and implement Multi-Head Attention.

- **Subtask 1**: Implement the Multi-Head Attention module (1).
- **Subtask 2**: Test the function with a small example (1).
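
For reference, multi-head attention in Vaswani et al. (2017) runs $h$ attention operations in parallel on learned projections of the inputs and concatenates the results:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
$$

Each head works in dimension $d_k = d_{\text{model}} / h$. With $d_{\text{model}} = 512$ and $h = 8$, this gives $d_k = 64$, so the scores inside each head are scaled by $\sqrt{64} = 8$.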

In [9]:
# TODO: Implement Multi-Head Attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        N = query.shape[0]  # batch size

        # Get Q, K, V
        Q = self.q_linear(query)
        K = self.k_linear(key)
        V = self.v_linear(value)

        # Split the last dimension into (num_heads, head_dim)
        Q = Q.reshape(N, -1, self.num_heads, self.head_dim)
        K = K.reshape(N, -1, self.num_heads, self.head_dim)
        V = V.reshape(N, -1, self.num_heads, self.head_dim)

        # Compute scaled dot-product attention per head.
        # The scale is sqrt(head_dim), i.e. sqrt(d_k) of each head, not sqrt(d_model).
        energy = torch.einsum("nqhd,nkhd->nhqk", Q, K)
        attention = torch.softmax(energy / math.sqrt(self.head_dim), dim=3)
        # Weighted sum of values, then merge the heads back into d_model
        out = torch.einsum("nhql,nlhd->nqhd", attention, V).reshape(N, -1, self.d_model)

        # Pass through the final linear layer
        out = self.out(out)

        return out

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(64, 10, 512) # batch_size=64, sequence_length=10, d_model=512
out = mha(x, x, x)
print(out.shape) # Should print: torch.Size([64, 10, 512])

torch.Size([64, 10, 512])
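
The head-splitting `reshape` in the module can be pictured on a single token: the `d_model` features are cut into `num_heads` contiguous chunks of size `head_dim`, and merging them back restores the original vector. The sketch below uses plain Python lists; the helpers `split_heads` and `merge_heads` and the small sizes are hypothetical and chosen only for illustration.

```python
import math

def split_heads(vec, num_heads):
    """Split a length-d_model list into num_heads chunks of size head_dim."""
    d_model = len(vec)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head chunks back into one length-d_model list."""
    return [value for head in heads for value in head]

d_model, num_heads = 8, 2
token = [float(i) for i in range(d_model)]  # stand-in for one projected token

heads = split_heads(token, num_heads)
head_dim = len(heads[0])
scale = math.sqrt(head_dim)  # per-head scaling factor sqrt(d_k), as in the paper

print(heads)  # [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
print(scale)  # 2.0
```

Because the split and merge are exact inverses, the heads only differ in which slice of the projected features they attend over.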


## Task 3: Positional Encoding
**Objective**: Implement positional encoding to add information about the sequence order.

- **Subtask 1**: Implement the positional encoding module (1).
- **Subtask 2**: Test the function with a small example (1).
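
For reference, the sinusoidal encoding from Vaswani et al. (2017) assigns, for position $pos$ and feature index $i$:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The factor $1 / 10000^{2i/d_{\text{model}}}$ can equivalently be computed as $\exp\!\left(-2i \cdot \ln(10000) / d_{\text{model}}\right)$, which is the "log space" form.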

In [10]:
# TODO: Implement Positional Encoding

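class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        # Even feature indices get the sine, odd indices the cosine
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape (1, max_len, d_model) so it broadcasts over the batch dimension
        pe = pe.unsqueeze(0)
        # Buffer: stored with the module and moved by .to(device), but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add the first seq_len encodings
        x = x + self.pe[:, :x.size(1)]
        return x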

# Instantiate the class
pe = PositionalEncoding(20)

# Create a tensor of shape (1, 10, 20)
x = torch.zeros(1, 10, 20)

# Pass the tensor through the PositionalEncoding
y = pe(x)

print(y)


tensor([[[ 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00],
 [ 8.4147e-01, 5.4030e-01, 3.8767e-01, 9.2180e-01, 1.5783e-01,
 9.8747e-01, 6.3054e-02, 9.9801e-01, 2.5116e-02, 9.9968e-01,
 9.9998e-03, 9.9995e-01, 3.9811e-03, 9.9999e-01, 1.5849e-03,
 1.0000e+00, 6.3096e-04, 1.0000e+00, 2.5119e-04, 1.0000e+00],
 [ 9.0930e-01, -4.1615e-01, 7.1471e-01, 6.9942e-01, 3.1170e-01,
 9.5018e-01, 1.2586e-01, 9.9205e-01, 5.0217e-02, 9.9874e-01,
 1.9999e-02, 9.9980e-01, 7.9621e-03, 9.9997e-01, 3.1698e-03,
 9.9999e-01, 1.2619e-03, 1.0000e+00, 5.0238e-04, 1.0000e+00],
 [ 1.4112e-01, -9.8999e-01, 9.2997e-01, 3.6764e-01, 4.5775e-01,
 8.8908e-01, 1.8816e-01, 9.8214e-01, 7.5285e-02, 9.9716e-01,
 2.9995e-02, 9.9955e-01, 1.1943e-02, 9.9993e-01, 4.7547e-03,
 9.9999e-01, 1.8929e-03, 1.0000e+00, 7.5357e-04, 1.0000e+00],
 ...
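
The first rows of the printed tensor can be checked by hand. The sketch below recomputes the encoding with only the standard library; the helper `positional_encoding` is a hypothetical name, and it uses the direct power form of the formula rather than the log-space form in the module.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model nested list of sinusoidal encodings (hypothetical helper)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i plays the role of 2i in the paper's formula
            angle = pos / (10000.0 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

table = positional_encoding(max_len=10, d_model=20)
print([round(v, 4) for v in table[1][:4]])  # [0.8415, 0.5403, 0.3877, 0.9218]
```

Position 0 gives alternating 0 and 1 (sine and cosine of zero), and position 1 starts with $\sin(1) \approx 0.8415$ and $\cos(1) \approx 0.5403$, in line with the first two rows of the output above.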

## Additional Resources
Here are some additional resources to deepen your understanding:

- ["Illustrated Transformer" by Jay Alammar](https://jalammar.github.io/illustrated-transformer/).
- PyTorch official documentation and tutorials.