
LLMs From Scratch: Day 5

May 31, 2026

I set up most of today's work yesterday. The bulk of the effort went into implementing the Trie structure for searching valid tokenizer substrings via the merge file, and the dynamic programming function that finds the optimal combination of substrings covering a full word. The Trie was fairly easy to implement, and was a good refresher on that type of data structure. The DP substring finder was more difficult, if only because I haven't practiced DP in a while, so some of the common tricks (like padding the array with an extra index) took me some time to get. Fortunately, testing the tokenizer functions is a little easier than testing the linear algebra ones.

One interesting quirk of the BPE files is that the merge ops use </w> to denote the end of a word, while the vocab file uses @@ at the end of a string to denote a token that is not the end of a word (e.g. the word "cat" would just be "cat", but the "cat" from "catastrophic" would be "cat@@"). This is easy enough to handle in the logic, but I found it interesting that the tokenizer makes a distinction there. I will need to add some minor functionality later for special tokens, like <start> and <EOS>, but those can easily be added after constructing the base tokenizer.

The last two days have set us up well. I realized it's worth implementing an optimizer in JAX, which we can test on the feed-forward network, but after that I'll be very close to implementing the full pipeline.
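As a rough illustration of the Trie lookup and the DP segmentation described above, here is a minimal Python sketch. It is not the code from the repository. The names (`Trie`, `match_lengths`, `segment`, `add_continuation_markers`) and the toy vocabulary are made up for this example, and it assumes that "optimal" means "fewest tokens", which the real tokenizer may define differently. It also works on plain substrings and ignores the </w> marker from the merge file.

```python
from typing import Dict, Iterable, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_token = False  # True if the path from the root spells a token


class Trie:
    def __init__(self, tokens: Iterable[str]) -> None:
        self.root = TrieNode()
        for token in tokens:
            self.insert(token)

    def insert(self, token: str) -> None:
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True

    def match_lengths(self, word: str, start: int) -> List[int]:
        """Lengths of every stored token that begins at word[start]."""
        lengths = []
        node = self.root
        for offset in range(start, len(word)):
            node = node.children.get(word[offset])
            if node is None:
                break
            if node.is_token:
                lengths.append(offset - start + 1)
        return lengths


def segment(word: str, trie: Trie) -> Optional[List[str]]:
    """Split word into the fewest tokens (assumed objective), or None."""
    n = len(word)
    inf = float("inf")
    # best[i] = fewest tokens covering word[:i]. The array has n + 1 slots:
    # the extra slot best[0] = 0 stands for the empty prefix.
    best = [inf] * (n + 1)
    back = [0] * (n + 1)  # back[j] = start index of the last token ending at j
    best[0] = 0
    for i in range(n):
        if best[i] == inf:
            continue  # word[:i] cannot be covered, so nothing can start here
        for length in trie.match_lengths(word, i):
            j = i + length
            if best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i
    if best[n] == inf:
        return None
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    j = n
    while j > 0:
        i = back[j]
        pieces.append(word[i:j])
        j = i
    pieces.reverse()
    return pieces


def add_continuation_markers(pieces: List[str]) -> List[str]:
    """Append '@@' to every piece except the last, as in the vocab file."""
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]


# Toy vocabulary, invented for this example.
toy_vocab = ["c", "a", "t", "ca", "cat", "as", "trophic"]
toy_trie = Trie(toy_vocab)
print(add_continuation_markers(segment("catastrophic", toy_trie)))
print(add_continuation_markers(segment("cat", toy_trie)))
```

With this toy vocabulary, "catastrophic" splits into "cat@@", "as@@", "trophic", while the standalone word "cat" stays as "cat", matching the convention described above. The padded `best` array is the "extra index" trick: slot 0 represents the empty prefix, so every real position can be filled in from an earlier one without special-casing the start of the word.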

The code for today is here. See you on Day 6.