1. Character
chatbot → c h a t b o t
✗ Too many tokens
2. Word
chatbot works → 2 tokens
✗ Rare words break
3. Subword
chatbot → chat + bot
✓ Best balance
1. Character tokenization
Text: chatbot → Tokens: ["c","h","a","t","b","o","t"]
✗ Too many tokens (one per character, so sequences get long). ✗ Hard to learn meaning, since a single character carries little on its own.
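A minimal Python sketch of the character-level split above. The function name `char_tokenize` is illustrative, and it assumes each Unicode code point counts as one token:

```python
def char_tokenize(text: str) -> list[str]:
    """Split text into single-character tokens."""
    return list(text)


tokens = char_tokenize("chatbot")
print(tokens)       # ['c', 'h', 'a', 't', 'b', 'o', 't']
print(len(tokens))  # 7 tokens for one 7-letter word
```

Even this one short word costs seven tokens, which is the "too many tokens" problem in miniature.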
2. Word tokenization
Text: chatbot works well → Tokens: ["chatbot","works","well"]
✗ Vocabulary explodes (every distinct word form needs its own entry). ✗ Rare or unseen words fall outside the vocabulary and cannot be represented.
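A rough Python sketch of word tokenization and of how rare words break it. It assumes plain whitespace splitting, a three-word toy vocabulary, and a hypothetical `<unk>` placeholder for anything the vocabulary does not contain:

```python
UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def word_tokenize(text: str) -> list[str]:
    """Split on whitespace; real word tokenizers also handle punctuation."""
    return text.split()


def encode(tokens: list[str], vocab: set[str]) -> list[str]:
    """Replace any word missing from the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


vocab = {"chatbot", "works", "well"}  # toy vocabulary, assumed for illustration

print(word_tokenize("chatbot works well"))
# ['chatbot', 'works', 'well']
print(encode(word_tokenize("chatbots work well"), vocab))
# ['<unk>', '<unk>', 'well']
```

Minor variants such as "chatbots" and "work" are lost entirely unless each gets its own vocabulary entry, which is what drives the vocabulary growth.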
3. Subword tokenization (modern standard)
Text: chatbot → Tokens: ["chat","bot"]
✓ Best balance between sequence length and vocabulary size. ✓ Handles rare words by splitting them into known pieces. ✓ Smaller vocabulary than word-level. Used by modern LLMs.
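One way to picture the subword split is a greedy longest-match lookup against a fixed vocabulary. The sketch below is a toy under stated assumptions: the vocabulary is hand-picked, the function name `subword_tokenize` is illustrative, and production tokenizers such as BPE or WordPiece learn their vocabulary from data instead.

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split of a word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens


vocab = {"chat", "bot", "s", "work"}  # toy vocabulary, assumed for illustration

print(subword_tokenize("chatbot", vocab))   # ['chat', 'bot']
print(subword_tokenize("chatbots", vocab))  # ['chat', 'bot', 's']
```

The unseen word "chatbots" still decomposes into known pieces, so a small vocabulary can cover rare words without an unknown-word placeholder.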