Commit aad7850b authored by eberts

Fixed formatting issues

parent 77bf9c1b
@@ -214,12 +214,15 @@ Applying the softmax function to normalize the attention scores:
-56 & 151 & -98 \\
119 & -169 & 143
\end{pmatrix})
\end{align*}
\section{Scaled Dot Product in Transformers [5]}
\subsection{Given a random key-query pair $k, q \in \mathbb{R}^d$, assume for simplicity that for any $1 \le i, j \le d$, $k_i$ and $q_j$ are independent random variables with mean zero and variance 1. Determine the mean and variance of the dot product:
\begin{equation*}
\langle k, q \rangle = \sum_{i=1}^{d} k_i q_i
\end{equation*}
Then explain why we would scale $\langle k, q \rangle \rightarrow \frac{\langle k, q \rangle}{\sqrt{d}}$ in the transformer architecture.}
\begin{equation*}
\begin{aligned}
E[q \cdot k] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] \\
@@ -241,7 +244,6 @@ E[q \cdot k] &= E \left[ \sum_{i=1}^{d} q_i k_i \right] \\
Then explain why we would scale $\langle k, q \rangle \rightarrow \frac{\langle k, q \rangle}{\sqrt{d}}$ in the transformer architecture.}
Control Variance: As calculated above, the variance of the dot product \( \langle k, q \rangle \) is \( d \), where \( d \) is the dimensionality of the key and query vectors, so its standard deviation grows as \( \sqrt{d} \). When \( d \) is large, the dot products can therefore take values of very large magnitude. Fed into the softmax, such scores concentrate almost all attention weight on a single key and leave very small gradients for the rest. Scaling by \( \frac{1}{\sqrt{d}} \) brings the variance back to a manageable range (specifically to 1, independent of \( d \)), which keeps the attention scores numerically stable and the softmax out of its saturated regime.
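
The effect of the scaling can be checked with a short worked calculation. This is a sketch that relies only on the assumptions of the problem statement (all coordinates $k_i$, $q_j$ mutually independent, with mean zero and variance 1), so that the products $k_i q_i$ are uncorrelated and their variances add:
\begin{equation*}
\begin{aligned}
\mathrm{Var}(k_i q_i) &= E[k_i^2] \, E[q_i^2] - \left( E[k_i] \, E[q_i] \right)^2 = 1 \cdot 1 - 0 = 1 \\
\mathrm{Var}\left( \langle k, q \rangle \right) &= \sum_{i=1}^{d} \mathrm{Var}(k_i q_i) = d \\
\mathrm{Var}\left( \frac{\langle k, q \rangle}{\sqrt{d}} \right) &= \frac{1}{d} \, \mathrm{Var}\left( \langle k, q \rangle \right) = 1
\end{aligned}
\end{equation*}
For example, with $d = 64$ the unscaled scores have standard deviation $\sqrt{64} = 8$, while the scaled scores have standard deviation 1.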
@@ -253,3 +255,4 @@ Complete all the tasks in the notebook Task\_4.5.ipynb provided with the sheet.
\end{document}