Tokenization from scratch by Andrej Karpathy
Link: https://youtu.be/zduSFxRajkE
Context
What a beautiful piece of content. Archive it and store it in a museum. The depth of the explanation, the low-level details, and the Pythonic bits make it fun and contagious to watch. I learnt a few tricks for interacting with LLMs and came to understand certain quirks. This could give an intuition for why certain LLMs won’t be able to give good completions for certain tasks. I also didn’t quite like the SentencePiece tokenization logic, but I can see where it could come in handy: in PDFs, for example, the scope of a sentence is well defined, whereas in an arbitrary piece of text on the internet it might not be.
Source: techstructive-weekly-61