โ† Guide


Chapter 9 of 24

โš™๏ธ Chapter 9: Popular Tokenization Algorithms

BPE, WordPiece, SentencePiece

| Algorithm | Used in | Key idea |
| --- | --- | --- |
| BPE | GPT | Merge frequent letter pairs |
| WordPiece | BERT | Merge by likelihood |
| SentencePiece | T5, multilingual | Handles no-space languages |

🔹 BPE (Byte Pair Encoding)

Used by GPT models. Starting from individual characters, BPE repeatedly merges the most frequent adjacent pair of symbols into a single new token. It is purely frequency-based.
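The merge loop can be sketched in a few lines. This is a toy illustration of the training step, not the actual GPT tokenizer (real implementations work on bytes and handle far larger corpora); the corpus and merge count here are made up:

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = []
for _ in range(3):                   # learn 3 merge rules
    best = pair_counts(corpus).most_common(1)[0][0]
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # [('w', 'e'), ('we', 'r'), ('l', 'o')]
```

Note that the pair ('w', 'e') wins the first round simply because it occurs most often (13 times across all three words) — no notion of likelihood is involved.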

🔹 WordPiece

Used by BERT. Instead of merging the raw most frequent pair, WordPiece merges the pair that most improves the likelihood of the training data — in effect, it favors pairs whose parts rarely appear apart. Slightly smarter merging.
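One common way to describe the WordPiece criterion is to score each pair by count(ab) / (count(a) × count(b)) and merge the highest-scoring one. The sketch below uses that scoring rule on a made-up corpus to show how it can disagree with pure frequency:

```python
from collections import Counter

def wordpiece_scores(words):
    """Score each adjacent pair by count(ab) / (count(a) * count(b)):
    a pair scores high when its parts rarely occur outside the pair."""
    pair_counts, unit_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return {p: c / (unit_counts[p[0]] * unit_counts[p[1]])
            for p, c in pair_counts.items()}

# Toy corpus: word -> frequency, split into characters.
corpus = {tuple("hugs"): 10, tuple("hug"): 4, tuple("pun"): 12}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best)  # ('g', 's')
```

Raw frequency would pick ('h', 'u') or ('u', 'g') (14 occurrences each), but WordPiece prefers ('g', 's'): 'g' and 's' almost never appear apart, so merging them helps the likelihood most.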

🔹 SentencePiece

Used in multilingual models (e.g. T5). SentencePiece treats the input as a raw character stream and learns subwords directly from it (using BPE or a unigram model underneath), so it needs no language-specific pre-tokenizer and handles languages written without spaces, such as Chinese and Japanese.
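The key trick is that whitespace is not special: spaces are turned into a visible meta symbol (conventionally U+2581 "▁") so tokenization is fully reversible, and text with no spaces needs no special handling at all. A minimal sketch of that idea (not the actual sentencepiece library):

```python
def to_units(text):
    """Treat text as a raw character stream; spaces become the visible
    meta symbol U+2581 '▁' so they survive tokenization."""
    return list(text.replace(" ", "\u2581"))

def decode(units):
    """Reverse the transformation exactly: concatenate and restore spaces."""
    return "".join(units).replace("\u2581", " ")

print(to_units("new era"))  # ['n', 'e', 'w', '▁', 'e', 'r', 'a']
print(to_units("我爱你"))    # no spaces needed: ['我', '爱', '你']
print(decode(to_units("new era")))  # 'new era'
```

Because the space is just another symbol, the learned merges can even cross what English would call word boundaries, and decoding is lossless by construction.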