LLM Fundamentals | Visual Explainer

Similarity search needs a metric: how "close" are two vectors? The two main choices are cosine similarity (angle) and Euclidean distance (length of the line between them).

Cosine similarity

Measures the angle between two vectors. Range: -1 to 1 (or 0 to 1 when all components are non-negative). Same direction → 1; orthogonal → 0; opposite direction → -1. If vectors are normalized (unit length), cosine = dot product, so it needs no extra norm computation and is fast. Many vector DBs support cosine or dot product as their similarity metric.
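The definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names (cosine_similarity, normalize, dot) are chosen here for the example and do not come from any particular vector DB.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length (illustrative helper)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

# Same direction, different length: the angle is 0, so the cosine is 1.
same_direction = cosine_similarity([1, 2], [2, 4])

# Orthogonal vectors: the cosine is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])

# Once both vectors are unit length, a plain dot product gives the same value.
a, b = [0.6, 0.3], [0.4, 0.7]
via_dot = dot(normalize(a), normalize(b))
```

If embeddings are normalized once when they are stored, each query only needs the dot product, which is the shortcut the paragraph above refers to.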

Euclidean distance (L2)

Length of the straight line between two points. Lower = closer. Sensitive to vector length; cosine is not. Use when magnitude matters (e.g. counts); use cosine when only direction (meaning) matters.
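To make the sensitivity to vector length concrete, here is a small sketch that compares both metrics on two vectors pointing the same way but with very different magnitudes. The vectors and function names are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between points a and b (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short = (1.0, 2.0)
long = (10.0, 20.0)   # same direction as `short`, ten times the length

distance = euclidean_distance(short, long)   # large: the points are far apart
cosine = cosine_similarity(short, long)      # 1 (up to rounding): the angle is zero

print(f"Euclidean distance: {distance:.2f}")
print(f"Cosine similarity:  {cosine:.2f}")
```

Euclidean distance treats these two vectors as far apart, while cosine similarity treats them as identical in direction. That difference is why the choice depends on whether magnitude carries information in your data.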

Cosine similarity vs Euclidean distance

A: x = 0.6, y = 0.3

B: x = 0.4, y = 0.7

Cosine similarity (angle): 0.832 (83.2%). Same direction → 1; orthogonal → 0.

Euclidean distance (length of the line between A and B): 0.45

Normalized vectors: cosine = dot product. DBs often use cosine for similarity.
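The two example values for A = (0.6, 0.3) and B = (0.4, 0.7) can be checked by hand or with a short script. The sketch below uses only Python's math module (math.dist needs Python 3.8 or later).

```python
import math

A = (0.6, 0.3)
B = (0.4, 0.7)

# Dot product: 0.6*0.4 + 0.3*0.7 = 0.24 + 0.21 = 0.45
dot = sum(x * y for x, y in zip(A, B))

# Vector lengths: |A| = sqrt(0.45), |B| = sqrt(0.65)
norm_a = math.hypot(*A)
norm_b = math.hypot(*B)

# Cosine similarity: dot product divided by the product of the lengths.
cosine = dot / (norm_a * norm_b)

# Euclidean distance: sqrt((0.6 - 0.4)^2 + (0.3 - 0.7)^2) = sqrt(0.20)
distance = math.dist(A, B)

print(f"Cosine similarity:  {cosine:.1%}")    # 83.2%
print(f"Euclidean distance: {distance:.2f}")  # 0.45
```

Note that the dot product (0.45) and the rounded distance (0.45) match here only by coincidence; they are different quantities.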

Similarity search in practice

You choose: the metric (cosine similarity vs Euclidean distance), top-k (e.g. the top 5 results), a data type (text, code, mixed) for filtering, and optional weights and labels per document. (What are weights and labels? See the "Key terms" box in Chapter 12. For sampling, see the Temperature simulator and chapter.)

Similarity search: metric, top-k, data type & weights

Document labels & weights (top-5 results)

Rank	Label	Type	Weight	Similarity
1	Doc A	text	1	92%
2	Doc B	text	1	88%
3	Doc D	text	1	82%
4	Doc E	mixed	0.9	70%
5	Doc C	code	0.8	68%

Weights can boost or downweight docs (e.g. by source or freshness). The data type can be used to filter candidates before the search or to filter results after it.
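The pieces above (data type filter, weights, top-k) can be combined in a small brute-force search. This is a sketch under stated assumptions, not how any specific vector DB works: the Doc class, the search function, the 2-D vectors, and the rule "final score = weight × cosine similarity" are all hypothetical choices made for this example, and the vectors are not the ones behind the table above.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    label: str
    doc_type: str    # "text", "code", or "mixed"
    weight: float    # 1.0 = neutral, below 1.0 = downweighted
    vector: tuple    # hypothetical 2-D embedding

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query, docs, top_k=5, allowed_types=None):
    """Return the top_k (score, doc) pairs, best first.

    Assumption for this sketch: score = doc.weight * cosine similarity.
    The type filter is applied before scoring (pre-filtering).
    """
    candidates = [
        d for d in docs
        if allowed_types is None or d.doc_type in allowed_types
    ]
    scored = [(d.weight * cosine_similarity(query, d.vector), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

docs = [
    Doc("Doc A", "text", 1.0, (0.9, 0.1)),
    Doc("Doc B", "text", 1.0, (0.7, 0.3)),
    Doc("Doc C", "code", 0.8, (0.8, 0.2)),
    Doc("Doc D", "text", 1.0, (0.2, 0.8)),
    Doc("Doc E", "mixed", 0.9, (0.5, 0.5)),
]
query = (1.0, 0.0)

top3 = [doc.label for _, doc in search(query, docs, top_k=3)]
text_only = [doc.label for _, doc in search(query, docs, top_k=3, allowed_types={"text"})]

print(top3)
print(text_only)
```

In this made-up data, Doc C has a higher raw cosine similarity than Doc B, but its weight of 0.8 pushes it below Doc B in the final ranking. Restricting the search to text documents removes Doc C and Doc E before scoring, so Doc D enters the top 3 instead.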

📐 Chapter 14: Cosine Similarity