
📐 Chapter 14: Cosine Similarity

Industry standard for similarity

Similarity search needs a metric: how "close" are two vectors? The two main choices are cosine similarity (angle) and Euclidean distance (length of the line between them).

Cosine similarity

Measures the angle between two vectors. Range: -1 to 1 (or 0 to 1 when all components are non-negative). Same direction → 1; orthogonal → 0; opposite direction → -1. If the vectors are normalized (unit length), cosine similarity equals the dot product, so it is fast to compute: no norms need to be calculated at query time. Most vector DBs use cosine or dot product internally.
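A minimal sketch of the definition above, in plain Python with no external libraries. The helper names (dot, norm, normalize, cosine_similarity) and the sample vectors are made up for illustration; a real system would typically use a vectorized library instead.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(dot(a, a))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range -1 to 1."""
    denom = norm(a) * norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice as long
c = [3.0, 0.0, -1.0]    # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0
d = [-1.0, -2.0, -3.0]  # opposite direction to a

print(cosine_similarity(a, b))  # close to 1
print(cosine_similarity(a, c))  # close to 0
print(cosine_similarity(a, d))  # close to -1

# For unit-length vectors the dot product alone gives the cosine.
print(dot(normalize(a), normalize(b)))  # close to 1
```

Because of floating-point rounding, the printed values may differ from 1, 0, and -1 in the last few digits.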

Euclidean distance (L2)

Length of the straight line between two points. Lower = closer. Euclidean distance is sensitive to vector length; cosine is not. Use Euclidean distance when magnitude matters (e.g. counts); use cosine when only direction (meaning) matters.
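To make the "sensitive to vector length" point concrete, here is a small sketch comparing the two metrics on a pair of vectors that point the same way but differ in magnitude. The function names and the sample vectors are illustrative assumptions, not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between points a and b (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_vec = [1.0, 1.0]
long_vec = [10.0, 10.0]  # same direction as short_vec, ten times longer

# Euclidean distance sees these as far apart: sqrt(9^2 + 9^2), about 12.73.
print(euclidean_distance(short_vec, long_vec))

# Cosine similarity sees them as pointing the same way: close to 1.
print(cosine_similarity(short_vec, long_vec))
```

If the vectors were word counts, the large distance would be meaningful (one document is much longer); if they were embeddings where only direction carries meaning, the cosine result is the more useful one.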

Cosine similarity vs Euclidean distance

[Interactive figure: two vectors, A and B. Sample reading: cosine similarity (angle) 83.2%; Euclidean distance (length of the line between A and B) 0.45.]

For normalized vectors, cosine similarity equals the dot product. Vector DBs often use cosine for similarity.
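The statement about normalized vectors can be written out as a short derivation. The symbols a, b, and θ (the angle between the two vectors) are notation introduced here for illustration.

```latex
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\qquad\Longrightarrow\qquad
\cos\theta = a \cdot b \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1

\lVert a - b \rVert^{2}
  = \lVert a \rVert^{2} + \lVert b \rVert^{2} - 2\, a \cdot b
  = 2 - 2\cos\theta \quad \text{(unit-length vectors only)}
```

The second line shows that, for unit-length vectors, a higher cosine similarity always means a smaller Euclidean distance, so both metrics rank results in the same order. This does not hold for vectors that have not been normalized.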

Similarity search in practice

You choose: the metric (cosine similarity vs Euclidean distance), top-k (e.g. the top 5 results), a data type (text, code, mixed) to filter on, and optional weights and labels per document. (What are weights and labels? See the "Key terms" box in Chapter 12. For sampling, see the Temperature simulator and chapter.)

Similarity search: metric, top-k, data type & weights

[Interactive demo controls: Metric, Top-k (set to 5), Data type filter]

Document labels & weights (top-5 results)

Rank  Label  Type   Weight  Similarity
1     Doc A  text   1       92%
2     Doc B  text   1       88%
3     Doc D  text   1       82%
4     Doc E  mixed  0.9     70%
5     Doc C  code   0.8     68%

Weights can boost or downweight docs (e.g. by source or freshness). The data type lets you filter documents before or after the search.
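One way these choices could fit together is sketched below. Everything specific in it is an assumption made for illustration: the Doc class, the search function, the toy 2-dimensional vectors, and in particular the scoring rule (weight multiplied by cosine similarity), since this chapter does not say how weights are combined with similarity. The sketch filters by data type before scoring.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    label: str
    doc_type: str   # "text", "code", or "mixed"
    weight: float   # 1.0 = neutral, below 1.0 = downweighted
    vector: list    # toy embedding, invented for this example

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, docs, top_k=5, doc_type=None):
    """Return up to top_k (score, doc) pairs, best first.

    Assumed scoring rule: score = weight * cosine similarity.
    If doc_type is given, other types are filtered out before scoring.
    """
    candidates = [d for d in docs if doc_type is None or d.doc_type == doc_type]
    scored = [(d.weight * cosine_similarity(query, d.vector), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

docs = [
    Doc("Doc A", "text", 1.0, [1.0, 0.0]),
    Doc("Doc B", "text", 1.0, [0.6, 0.8]),
    Doc("Doc C", "code", 0.8, [1.0, 0.0]),   # same direction as Doc A, lower weight
    Doc("Doc E", "mixed", 0.9, [0.0, 1.0]),
]
query = [1.0, 0.0]

# Top 3 across all data types.
for score, doc in search(query, docs, top_k=3):
    print(doc.label, doc.doc_type, round(score, 2))

# Same query restricted to text documents.
for score, doc in search(query, docs, doc_type="text"):
    print(doc.label, doc.doc_type, round(score, 2))
```

In this toy data, Doc C points in exactly the same direction as Doc A but ranks below it because of its lower weight, which is the downweighting effect described above. Filtering after the search instead would mean scoring every document first and then dropping the unwanted types from the ranked list.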