Let's start simple. Input text:
"The cat sat on the mat"
1. Tokenization (Breaking Into Pieces)
The | cat | sat | on | the | mat
AI does NOT see text like we do. It sees chunks called tokens.
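The splitting step can be sketched in a few lines. This is a toy whitespace tokenizer for illustration only; real models use subword algorithms such as BPE, which can split a single word into several tokens.

```python
# Toy tokenizer: split on whitespace. Real tokenizers (e.g. BPE)
# work on subword chunks, not whole words.
def tokenize(text):
    return text.split()

tokens = tokenize("The cat sat on the mat")
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```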
2. Convert Tokens to IDs
IDs: 464 | 2828 | 3332 | 319 | 262 | 2603
These numbers are just labels (e.g. cat → 5, dog → 12). The numbers themselves mean nothing.
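The lookup is a plain dictionary. The vocabulary below is a tiny hand-made stand-in; a real tokenizer ships a fixed vocabulary with tens of thousands of entries.

```python
# Hypothetical mini-vocabulary mapping each token to an arbitrary ID.
vocab = {"The": 464, "cat": 2828, "sat": 3332, "on": 319, "the": 262, "mat": 2603}

ids = [vocab[tok] for tok in "The cat sat on the mat".split()]
print(ids)  # [464, 2828, 3332, 319, 262, 2603]
```

Note that "The" and "the" get different IDs: to the model they are simply different labels.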
3. Convert IDs into Embeddings
Each token becomes a vector like [0.234, -0.891, 0.445, ..., 0.672], usually 768–1536 dimensions. Now the token has mathematical meaning.
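An embedding table is just a lookup from ID to vector. In a real model these vectors are learned during training; here they are random, and the 8 dimensions are a stand-in for the 768–1536 a real model would use.

```python
import random

random.seed(0)
DIM = 8  # stand-in for 768-1536 dimensions in real models

# One vector per token ID (random here; learned in a real model).
embedding_table = {tid: [random.uniform(-1, 1) for _ in range(DIM)]
                   for tid in [464, 2828, 3332, 319, 262, 2603]}

cat_vec = embedding_table[2828]  # the vector for "cat"
print(len(cat_vec))  # 8
```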
4. The Model Understands Meaning
The model doesn't understand words. It understands vector geometry.
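"Vector geometry" means that related tokens end up pointing in similar directions, which can be measured with cosine similarity. The three vectors below are hand-made toy examples chosen so that "cat" and "dog" are close while "car" is not.

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for same direction, 0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, not real embeddings.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, -0.7, 0.9]

print(cosine(cat, dog) > cosine(cat, car))  # True: "cat" is nearer "dog"
```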