LLMs From Scratch: Day 3

May 25, 2026

Today we get to start working on Multi-Headed Self-Attention! Time permitting, I may work on cross-attention as well, since the mechanism is fairly similar, but tbd.

Copying from my notes on AIAYN, the key difference between regular self-attention and multi-headed attention is that instead of performing a single attention function with $d_{model}$-dimensional queries, keys, and values, you linearly project them $h$ times with different learned linear projections to $d_k$ , $d_k$ , and $d_v$ dimensions, respectively.
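For reference, this is the multi-head formulation as I remember it from AIAYN, written with $W_O$ for the output projection to match the notation in the rest of this post (the paper writes it as a superscript):

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$

$$
W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{model} \times d_v}, \quad W_O \in \mathbb{R}^{h d_v \times d_{model}}
$$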

Thinking about the input after it passes through the norm (after embedding), the shape should be ( $batch\_size$ , $seq\_length$ , $d_{model}$ ). We validated that shape in the embedding file itself. For now, we'll just assume a batch size of 1 to keep the math easy. In AIAYN, they use $d_{model}=512$ and $h=8$ heads, with $d_k=d_v=d_{model} / h = 64$ . That means that after applying the linear projections, our input should go from the aforementioned shape to ( $batch\_size$ , $seq\_length$ , $64$ , $8$ ), although the ordering of the last two dimensions may change as we work through this.

It's probably also worth noting that in AIAYN, they maintain the large $K,V$ tensors as well, and use learned weight matrices to project those, rather than having completely separate $K,V$ matrices for each head. Naively, you could just initialize the $K$ and $V$ tensors, then initialize each head's weight matrices, loop through the heads during a forward pass, and concat the results after the loop. However, I think this is a good time to use vmap instead! We can stack the per-head weight matrices along a new leading axis and let vmap map over that axis, applying each head's weights to the same input tensor $Q$ .

Additionally, after we project and do attention with each head, we have to reshape the results to take the dimensions from (..., 64, 8) to (..., 512), and our last weight matrix $W_O$ will have shape (512, 512).
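To make the shapes concrete, here is a minimal JAX sketch of the vmap approach described above. It is not the code from my repo: the names (`init_params`, `single_head`, `all_heads`, `multi_head_self_attention`), the random-normal init, and the choice to put the head axis last before the reshape are all assumptions made for illustration, and there is no masking or dropout.

```python
import jax
import jax.numpy as jnp

D_MODEL, N_HEADS = 512, 8  # AIAYN base sizes; d_k = d_v = 64


def init_params(key, d_model=D_MODEL, n_heads=N_HEADS):
    # Hypothetical init: per-head weights stacked on a leading head axis.
    d_k = d_model // n_heads
    kq, kk, kv, ko = jax.random.split(key, 4)
    scale = d_model ** -0.5
    return {
        "w_q": jax.random.normal(kq, (n_heads, d_model, d_k)) * scale,
        "w_k": jax.random.normal(kk, (n_heads, d_model, d_k)) * scale,
        "w_v": jax.random.normal(kv, (n_heads, d_model, d_k)) * scale,
        "w_o": jax.random.normal(ko, (d_model, d_model)) * scale,
    }


def single_head(x, w_q, w_k, w_v):
    # x: (seq, d_model); each weight: (d_model, d_k)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])    # (seq, seq)
    return jax.nn.softmax(scores, axis=-1) @ v  # (seq, d_k)


# Share x across heads (None), map over the head axis of each weight stack (0),
# and place the head axis last in the output: (seq, d_k, h) = (seq, 64, 8).
all_heads = jax.vmap(single_head, in_axes=(None, 0, 0, 0), out_axes=2)


def multi_head_self_attention(params, x):
    heads = all_heads(x, params["w_q"], params["w_k"], params["w_v"])
    # Move heads in front of d_k so the reshape concatenates head by head.
    concat = jnp.swapaxes(heads, -1, -2).reshape(x.shape[0], -1)  # (seq, 512)
    return concat @ params["w_o"]                                 # (seq, 512)


# A second vmap handles the batch axis; params are shared across the batch.
batched_mhsa = jax.vmap(multi_head_self_attention, in_axes=(None, 0))

key_params, key_x = jax.random.split(jax.random.PRNGKey(0))
params = init_params(key_params)
x = jax.random.normal(key_x, (1, 10, D_MODEL))  # (batch, seq, d_model)
out = batched_mhsa(params, x)                   # (batch, seq, d_model)
```

The `swapaxes` before the reshape matters: reshaping (seq, 64, 8) directly would interleave the heads instead of laying them out side by side, which is the "ordering of the last two dimensions" question from above.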

All things considered, this was fairly easy to implement! Aside from figuring out how to correctly use the in_axes and out_axes arguments in vmap, the work so far has set me up well to implement this with relatively little new math.

The full code for Multi-Headed Self-Attention is here. There will have to be some modifications to the specific functions once we're ready to train, to incorporate things like dropout, but aside from that the bones for the full encoder and decoder are essentially in place. Multi-Headed Cross-Attention is effectively the same code, except that $K$ and $V$ are computed from the outputs of the encoder rather than from the same input as $Q$ .
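As a rough illustration of how small that change is, here is a hedged sketch of a cross-attention variant. The names (`cross_head`, `all_cross_heads`, `multi_head_cross_attention`) and the toy sizes are hypothetical, and masking and dropout are again left out. The only real difference from self-attention is that the queries come from the decoder input while the keys and values come from the encoder output, so the two sequence lengths can differ.

```python
import jax
import jax.numpy as jnp


def cross_head(x_dec, enc_out, w_q, w_k, w_v):
    q = x_dec @ w_q    # (tgt_len, d_k), queries from the decoder side
    k = enc_out @ w_k  # (src_len, d_k), keys from the encoder output
    v = enc_out @ w_v  # (src_len, d_k), values from the encoder output
    scores = q @ k.T / jnp.sqrt(q.shape[-1])    # (tgt_len, src_len)
    return jax.nn.softmax(scores, axis=-1) @ v  # (tgt_len, d_k)


# Both inputs are shared across heads; only the weight stacks are mapped.
all_cross_heads = jax.vmap(cross_head, in_axes=(None, None, 0, 0, 0), out_axes=0)


def multi_head_cross_attention(w_q, w_k, w_v, w_o, x_dec, enc_out):
    heads = all_cross_heads(x_dec, enc_out, w_q, w_k, w_v)  # (h, tgt_len, d_k)
    # (h, tgt_len, d_k) -> (tgt_len, h, d_k) -> (tgt_len, h * d_k)
    concat = jnp.transpose(heads, (1, 0, 2)).reshape(x_dec.shape[0], -1)
    return concat @ w_o


# Toy sizes to keep the example quick: d_model=16, h=4, d_k=4.
d_model, n_heads, d_k = 16, 4, 4
keys = jax.random.split(jax.random.PRNGKey(0), 6)
w_q = jax.random.normal(keys[0], (n_heads, d_model, d_k))
w_k = jax.random.normal(keys[1], (n_heads, d_model, d_k))
w_v = jax.random.normal(keys[2], (n_heads, d_model, d_k))
w_o = jax.random.normal(keys[3], (d_model, d_model))
x_dec = jax.random.normal(keys[4], (5, d_model))    # tgt_len = 5
enc_out = jax.random.normal(keys[5], (7, d_model))  # src_len = 7
cross_out = multi_head_cross_attention(w_q, w_k, w_v, w_o, x_dec, enc_out)
```

The output keeps the decoder's sequence length, since each query row attends over all encoder positions.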

See you on day 4.