Similarity search needs a metric: how "close" are two vectors? The two main choices are cosine similarity (the angle between them) and Euclidean distance (the length of the straight line between them).
Cosine similarity
Measures the angle between two vectors. Range: -1 to 1 (or 0 to 1 when all components are non-negative). Same direction → 1; orthogonal → 0; opposite direction → -1. If the vectors are normalized (unit length), cosine similarity equals the dot product, so it's fast to compute. Many vector DBs use cosine or dot product internally.
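The relationship between cosine similarity and the dot product can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names (`dot`, `norm`, `normalize`, `cosine_similarity`) and the example vectors are illustrative assumptions, not part of any particular vector DB.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]

# Hypothetical example vectors.
a = [3.0, 4.0]
b = [4.0, 3.0]

# Full formula: dot product divided by the product of the lengths.
print(cosine_similarity(a, b))

# After normalizing both vectors, a plain dot product gives the same value
# (up to floating-point rounding).
print(dot(normalize(a), normalize(b)))
```

Normalizing once at indexing time means each query only needs dot products, which is one reason the dot product is attractive for unit-length vectors.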
Euclidean distance (L2)
Length of the straight line between two points. Lower = closer. Sensitive to vector length (magnitude); cosine similarity is not. Use Euclidean distance when magnitude matters (e.g. counts); use cosine similarity when only direction (meaning) matters.
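To illustrate the sensitivity to vector length, the sketch below compares both metrics on a vector and on a longer copy pointing in the same direction. The function names and the example vectors are assumptions made for illustration, and the vectors are assumed to be non-zero.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between points a and b (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical example vectors.
a = [1.0, 2.0]
b = [2.0, 4.0]            # same direction as a, twice as long
b_scaled = [20.0, 40.0]   # same direction again, ten times longer than b

for other in (b, b_scaled):
    # The distance grows with the length of `other`;
    # the cosine similarity stays at (approximately) 1.
    print(euclidean_distance(a, other), cosine_similarity(a, other))
```

Scaling a vector changes its distance to everything else but leaves its angle to everything else unchanged, which is why cosine similarity is the usual pick when only direction carries meaning.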
Cosine similarity vs Euclidean distance
Interactive demo, example readout for vectors A and B: cosine similarity (angle) 83.2%; Euclidean distance (length of the line between A and B) 0.45.
Similarity search in practice
You choose the metric (cosine similarity or Euclidean distance), top-k (e.g. the top 5 results), a data type (text, code, mixed) for filtering, and optional weights and labels per document. (What are weights and labels? See the "Key terms" box in Chapter 12. For sampling, see the Temperature simulator and chapter.)
Similarity search: metric, top-k, data type & weights
Interactive demo controls: Metric, Top-k (5), Data type filter.
Document labels & weights (top-5 results)
| Rank | Label | Type | Weight | Similarity |
|---|---|---|---|---|
| 1 | Doc A | text | 1 | 92% |
| 2 | Doc B | text | 1 | 88% |
| 3 | Doc D | text | 1 | 82% |
| 4 | Doc E | mixed | 0.9 | 70% |
| 5 | Doc C | code | 0.8 | 68% |
Weights can boost or downweight documents (e.g. by source or freshness). The data type can be used to filter documents before or after the search.
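The pieces above (metric, top-k, data type filter, weights) can be combined in a small brute-force search. This is only a sketch under stated assumptions: the text does not say how a weight is combined with the similarity, so multiplying the two is an assumption, and the `Doc` class, the `search` function, the labels, and the toy 2-D vectors are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    label: str
    doc_type: str        # "text", "code", or "mixed"
    vector: list
    weight: float = 1.0  # 1.0 = neutral, below 1.0 = downweighted

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, docs, k=5, allowed_types=None):
    """Return up to k (score, doc) pairs, best first.

    Filtering by data type happens before scoring (pre-filtering).
    Assumption: score = weight * cosine similarity.
    """
    candidates = [
        d for d in docs
        if allowed_types is None or d.doc_type in allowed_types
    ]
    scored = [(d.weight * cosine_similarity(query, d.vector), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy corpus with 2-D vectors.
docs = [
    Doc("doc-1", "text", [1.0, 0.0]),
    Doc("doc-2", "text", [0.8, 0.6]),
    Doc("doc-3", "code", [1.0, 0.0], weight=0.5),   # downweighted
    Doc("doc-4", "mixed", [0.1, 1.0], weight=0.9),
]
query = [1.0, 0.0]

for score, doc in search(query, docs, k=3):
    print(doc.label, doc.doc_type, round(score, 2))
```

Here `doc-3` points in exactly the same direction as `doc-1`, but its weight of 0.5 pushes it below `doc-2`. Passing `allowed_types={"code"}` would restrict the results to code documents before any scoring happens; filtering after the search is also possible but can return fewer than k results.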