@@ -214,12 +214,15 @@ Applying the softmax function to normalize the attention scores:
-56 & 151 & -98 \\
119 & -169 & 143
\end{pmatrix})
\end{align*}
\section{Scaled Dot Product in Transformers [5]}
\subsection{Consider a random key-query pair $k, q \in \mathbb{R}^d$. Assume for simplicity that for any $1 \le i, j \le d$, $k_i$ and $q_j$ are independent random variables with mean zero and variance 1. Determine the mean and variance of the dot product:
\begin{equation*}
\langle k, q \rangle = \sum_{i=1}^{d} k_i q_i
\end{equation*}
Then explain why we would scale $\langle k, q \rangle \rightarrow \frac{\langle k, q \rangle}{\sqrt{d}}$ in the transformer architecture.}
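\begin{equation*}
\begin{aligned}
E[\langle k, q \rangle] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] = \sum_{i=1}^{d} E[q_i]\, E[k_i] = 0,\\
\mathrm{Var}(\langle k, q \rangle) &= \sum_{i=1}^{d} \mathrm{Var}(q_i k_i) = \sum_{i=1}^{d} E[q_i^2]\, E[k_i^2] = d.
\end{aligned}
\end{equation*}
The mean follows from linearity of expectation and the independence of $q_i$ and $k_i$. The variance step additionally assumes that the products $q_i k_i$ are uncorrelated across $i$ (for example, when all components are mutually independent), and uses $\mathrm{Var}(q_i k_i) = E[q_i^2]\, E[k_i^2] - (E[q_i]\, E[k_i])^2 = 1$.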
Control variance: As calculated above, the variance of the dot product \(\langle k, q \rangle\) is \( d \), where \( d \) is the dimensionality of the key and query vectors, so its typical magnitude grows like \(\sqrt{d}\). For large \( d \), the attention scores can therefore become very large in magnitude, which pushes the softmax toward a nearly one-hot output with very small gradients. Scaling by \(\frac{1}{\sqrt{d}}\) gives \(\mathrm{Var}\left(\frac{\langle k, q \rangle}{\sqrt{d}}\right) = \frac{d}{d} = 1\), which keeps the scores in a manageable range regardless of \( d \) and helps maintain numerical stability.
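The variance claim can be checked empirically. The following is a minimal sketch, not part of the exercise sheet: it assumes standard normal entries (one distribution with mean zero and variance 1; the exercise does not fix a distribution), and the helper name \texttt{dot\_product\_stats} and its parameters are hypothetical.

```python
import numpy as np


def dot_product_stats(d, n_samples=5000, seed=0):
    """Estimate mean and variance of <k, q> and of <k, q> / sqrt(d).

    Assumption: all entries of k and q are i.i.d. standard normal,
    i.e. mean 0 and variance 1, as in the exercise setup.
    """
    rng = np.random.default_rng(seed)
    k = rng.standard_normal((n_samples, d))
    q = rng.standard_normal((n_samples, d))
    raw = np.sum(k * q, axis=1)      # one dot product per sampled pair
    scaled = raw / np.sqrt(d)        # the transformer scaling
    return raw.mean(), raw.var(), scaled.mean(), scaled.var()


if __name__ == "__main__":
    for d in (4, 64, 256):
        mean_raw, var_raw, mean_scaled, var_scaled = dot_product_stats(d)
        print(
            f"d={d:4d}  raw: mean={mean_raw:+.3f}, var={var_raw:8.2f}  "
            f"scaled: mean={mean_scaled:+.3f}, var={var_scaled:.3f}"
        )
```

Under these assumptions, the estimated raw variance should grow roughly linearly with \( d \), while the scaled variance should stay close to 1 for every \( d \).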
...
...
@@ -253,3 +255,4 @@ Complete all the tasks in the notebook Task\_4.5.ipynb provided with the sheet.