
Vector Similarity Search - Cosine and Euclidean Distance

Author: Venkata Sudhakar

Vector similarity search is the process of finding vectors in a database that are most similar to a query vector. In the context of AI and RAG pipelines, text is converted to high-dimensional embedding vectors (typically 768 to 3072 dimensions), and similarity search finds the documents whose embeddings are closest to the query embedding. The two most common distance metrics are cosine similarity and Euclidean (L2) distance. Understanding the difference between these metrics is important for choosing the right index type in your vector database and for interpreting search results.

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity 1.0, perpendicular vectors have similarity 0.0, and opposite vectors have similarity -1.0. Cosine similarity is the standard metric for text embeddings because two documents about the same topic have similar directional orientation in embedding space regardless of their length. Euclidean (L2) distance measures the straight-line distance between two points and is sensitive to vector magnitude. For normalised embeddings (unit vectors), cosine similarity and L2 distance give equivalent rankings.
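
The equivalence for unit vectors is exact: ||a - b||^2 = 2(1 - cos(a, b)). A quick numpy check with two made-up vectors (not real embeddings) confirms the identity:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2(a, b):
    # Euclidean (L2) distance: straight-line distance between the points
    return float(np.linalg.norm(a - b))

# Two arbitrary vectors, normalised to unit length like typical text embeddings
a = np.array([0.3, 0.8, 0.5]); a /= np.linalg.norm(a)
b = np.array([0.6, 0.1, 0.9]); b /= np.linalg.norm(b)

# For unit vectors: l2(a, b)**2 == 2 * (1 - cosine(a, b))
print(l2(a, b) ** 2, 2 * (1 - cosine(a, b)))
```

Because the squared L2 distance is a monotonic function of cosine similarity for unit vectors, ranking by either metric returns results in the same order.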

The example below shows how to compute cosine similarity manually, then demonstrates semantic search using ChromaDB and pgvector with metadata filtering - the two most common production patterns.
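The manual computation can be sketched as follows. Toy 4-dimensional vectors stand in for real model embeddings here, so the scores will differ from the output shown below, but the ranking logic is the same:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors (1.0 = same direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_dist(a, b):
    # Euclidean distance between two vectors (0.0 = identical)
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Toy embeddings; a real pipeline would call an embedding model instead
docs = {
    "Debezium reads the MySQL binlog for change data capture": [0.9, 0.7, 0.1, 0.2],
    "Flyway manages database schema migrations with SQL":      [0.6, 0.3, 0.2, 0.1],
    "Apache Kafka is a distributed event streaming platform":  [0.5, 0.6, 0.4, 0.1],
}
query = [0.8, 0.8, 0.1, 0.2]  # stands in for the embedded query text

# Rank documents by descending cosine similarity to the query
ranked = sorted(docs.items(), key=lambda kv: cosine_sim(query, kv[1]), reverse=True)
for i, (text, vec) in enumerate(ranked, 1):
    print(f"[{i}] cos={cosine_sim(query, vec):.4f} "
          f"l2={l2_dist(query, vec):.4f} | {text[:50]}...")
```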


It gives the following output,

Query: How do I capture database changes in real time?

[1] cos=0.7823 l2=0.6598 | Debezium reads the MySQL binlog for change data cap...
[2] cos=0.6241 l2=0.8672 | Flyway manages database schema migrations with SQL ...
[3] cos=0.5912 l2=0.9034 | Apache Kafka is a distributed event streaming platf...
[4] cos=0.5103 l2=0.9891 | pgvector adds vector similarity search to PostgreSQ...
[5] cos=0.4231 l2=1.0712 | LangChain builds LLM-powered applications and agent...
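
The filtered search pattern can be sketched without a vector database: restrict the candidate set on metadata first, then rank the survivors by cosine distance (1 minus cosine similarity, the score ChromaDB reports for a cosine collection). In ChromaDB itself the filter is the `where={"category": "migration"}` argument to `collection.query`; the vectors below are illustrative toys, not real embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    # ChromaDB-style cosine distance: 1 - cosine similarity (0.0 = identical)
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy corpus of (text, embedding, metadata); real embeddings come from a model
corpus = [
    ("Debezium reads the MySQL binlog to capture CDC events",
     [0.9, 0.7, 0.1], {"source": "cdc_guide.pdf", "category": "cdc"}),
    ("Flyway applies versioned SQL migrations tracked in flyway_schema_history",
     [0.5, 0.2, 0.8], {"source": "schema_guide.pdf", "category": "migration"}),
    ("Blue-Green deployment switches load balancer from old to new",
     [0.2, 0.3, 0.7], {"source": "deploy_guide.pdf", "category": "migration"}),
]

def search(query_vec, where=None, n_results=3):
    # Filter on metadata BEFORE computing similarity, as the vector DB does
    candidates = [c for c in corpus
                  if where is None or all(c[2].get(k) == v for k, v in where.items())]
    # Rank the surviving candidates by ascending cosine distance
    return sorted(candidates, key=lambda c: cosine_distance(query_vec, c[1]))[:n_results]

query = [0.8, 0.8, 0.2]  # stands in for the embedded query text
for text, _, meta in search(query, where={"category": "migration"}):
    print(f"[{meta['source']}] {text[:50]}...")
```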

It gives the following output,

=== Basic search ===
  [0.089] Debezium reads the MySQL binlog to capture CDC events and ...
  [0.312] Flyway applies versioned SQL migrations tracked in flyway_...
  [0.391] Kafka Streams processes events in real time using the KStr...

=== Filtered (category=migration) ===
  [deploy_guide.pdf] Blue-Green deployment switches load balancer fr...
  [schema_guide.pdf] Flyway applies versioned SQL migrations tracked...

Note that metadata filtering narrows the search space before similarity is computed. This is essential when your vector database holds documents from multiple domains and you want to restrict results to a specific category, product, or tenant.

Choosing the right similarity metric and index:

Cosine similarity (hnsw:space=cosine in ChromaDB, vector_cosine_ops in pgvector) - Best for text embeddings. Invariant to vector length so it compares semantic direction rather than magnitude. Use this for RAG, semantic search, and document similarity.

L2 / Euclidean distance (hnsw:space=l2, vector_l2_ops in pgvector) - Best for image embeddings and other cases where magnitude carries meaning. For normalised text embeddings, L2 and cosine give equivalent rankings, but cosine is the convention.

HNSW index vs IVFFlat index - HNSW (Hierarchical Navigable Small World) builds a multi-layer graph as vectors are inserted and gives fast, accurate queries with high recall. IVFFlat clusters vectors into buckets around centroids computed from existing data; it is faster to build, but recall drops when a query's true nearest neighbours sit in buckets that were not probed. For most RAG applications with under 10 million vectors, HNSW is the recommended choice.
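The IVFFlat trade-off is easy to see in a toy version of the idea - this is an illustration of bucketed search, not pgvector's actual implementation. Vectors are assigned to the nearest of a few centroids at build time, and a query probes only the closest bucket(s):

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(200, 8))   # 200 toy 8-dimensional "embeddings"
nlist = 4                             # number of buckets (IVF lists)

# Build: pick centroids from the data, assign each vector to its nearest one
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]
assignments = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
buckets = {i: np.where(assignments == i)[0] for i in range(nlist)}

def ivf_search(query, nprobe=1):
    # Query: probe only the nprobe closest buckets, brute-force inside them
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([buckets[i] for i in order])
    return cand[np.argmin(np.linalg.norm(vectors[cand] - query, axis=1))]

query = rng.normal(size=8)
exact = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))  # brute force
print(ivf_search(query, nprobe=1), ivf_search(query, nprobe=nlist), exact)
```

Probing every bucket (`nprobe = nlist`) always matches brute-force search; probing one bucket is much cheaper but can miss the true nearest neighbour when it lands in an unprobed bucket - exactly the recall trade-off described above.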

