โ† Guide

🌱 Beginner - AI Fundamentals

Chapter 7 of 24

โœ‚๏ธ Chapter 7: Types of Tokenization

Character, word, and subword tokenization

[Figure: three panels comparing tokenization types. 1. Character: chatbot → c h a t b o t (❌ too many tokens). 2. Word: chatbot works → 2 tokens (❌ rare words break). 3. Subword ✓: chatbot → chat + bot (✅ best balance).]

1๏ธโƒฃ Character tokenization

Text: chatbot โ†’ Tokens: ["c","h","a","t","b","o","t"]

โŒ Too many tokens. โŒ Hard to learn meaning.

2๏ธโƒฃ Word tokenization

Text: chatbot works well โ†’ Tokens: ["chatbot","works","well"]

โŒ Vocabulary explodes. โŒ Rare words break system.

3๏ธโƒฃ Subword tokenization (modern standard)

Text: chatbot โ†’ Tokens: ["chat","bot"]

โœ… Best balance. โœ… Handles rare words. โœ… Smaller vocabulary. Used by modern LLMs.
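The chat + bot split can be imitated with a greedy longest-match-first search over a small, hand-written vocabulary. This is a simplified sketch: real LLM tokenizers learn their vocabulary from data (for example with byte-pair encoding), and the function name `subword_tokenize` and the contents of `TOY_VOCAB` are assumptions made for illustration.

```python
# Hand-written toy vocabulary (assumption); real vocabularies are learned.
TOY_VOCAB = {"chat", "bot", "work", "s"}


def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedily take the longest vocabulary piece at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


print(subword_tokenize("chatbot", TOY_VOCAB))   # ['chat', 'bot']
print(subword_tokenize("chatbots", TOY_VOCAB))  # ['chat', 'bot', 's']
print(subword_tokenize("works", TOY_VOCAB))     # ['work', 's']
```

With only four vocabulary entries, the sketch covers several words that a word-level vocabulary would need separate entries for, which is the trade-off the list above describes.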