Fixed formatting issues

Commit aad7850b authored 1 year ago by eberts
--- a/exercises/exercise4/VDL_Exercise_4.pdf
+++ b/exercises/exercise4/VDL_Exercise_4.pdf
--- a/exercises/exercise4/exercise4.tex
+++ b/exercises/exercise4/exercise4.tex
@@ -214,12 +214,15 @@ Applying the softmax function to normalize the attention scores:
    -56 & 151 & -98 \\
    119 & -169 & 143
 \end{pmatrix}) 
+\end{align*}

 \section{Scaled Dot Product in Transformers [5]}
 \subsection{Given a random key, query pair $k, q \in \mathbb{R}^d$. Assume for simplicity that for any $1 \le i, j \le d, k_i$ and $q_j$ are independent random variables with mean zero and variance 1. Determine the mean and variance of the dot product:
 \begin{equation*}
    <k,q> = \sum^d_{i=1}k_iq_i
 \end{equation*}
+Then explain why we would scale $<k,q> \rightarrow \frac{<k,q>}{\sqrt{d}}$ in the transformer architecture.}
+
 \begin{equation*}
 \begin{aligned}
 E[q \cdot k] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] \\
@@ -241,7 +244,6 @@ E[q \cdot k] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] \\



-Then explain why we would scale $<k,q> \rightarrow \frac{<k,q>}{\sqrt{d}}$ in the transformer architecture.}

 Control Variance: As calculated above, the variance of the dot product \( \langle k, q \rangle \) is \( d \), where \( d \) is the dimensionality of the key and query vectors. When \( d \) is large, the dot product therefore tends to have large magnitude (its standard deviation is \( \sqrt{d} \)), which pushes the softmax toward saturation, where gradients become very small. Scaling by \( \frac{1}{\sqrt{d}} \) divides the variance by \( d \), bringing it back to 1 regardless of \( d \) and keeping the softmax inputs in a numerically stable range.
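
Reviewer note: the variance claim above can be checked empirically. The following is a minimal sketch, not part of the exercise sheet. It assumes the entries of $k$ and $q$ are standard normal (the sheet only requires mean zero and variance 1), and the function name `dot_product_variance` and its parameters are hypothetical. It uses only the Python standard library.

```python
import random
import statistics


def dot_product_variance(d, n_samples=4000, scale=False, seed=0):
    """Estimate Var(<k, q>) for k, q in R^d with i.i.d. N(0, 1) entries.

    Assumption: Gaussian entries; any zero-mean, unit-variance
    distribution should give the same variance in expectation.
    If scale is True, each dot product is divided by sqrt(d).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # One sample of the dot product: sum of d products k_i * q_i.
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d))
        samples.append(dot / d ** 0.5 if scale else dot)
    return statistics.pvariance(samples)


for d in (4, 16, 64):
    unscaled = dot_product_variance(d)
    scaled = dot_product_variance(d, scale=True)
    print(f"d={d:3d}  unscaled variance={unscaled:7.2f}  scaled variance={scaled:5.2f}")
```

Under these assumptions the unscaled estimate should grow roughly linearly with `d`, while the scaled estimate should stay close to 1. Both are Monte Carlo estimates, so they fluctuate with `n_samples` and `seed`.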

@@ -253,3 +255,4 @@ Complete all the tasks in the notebook Task\_4.5.ipynb provided with the sheet.


 \end{document}
+