In Browser
	StumbleUpon
	del.icio.us
	Google
	Google Buzz
	reddit
	LinkedIn

	Facebook
	Twitter
	Linkedin
	E-Mail

Generative AI > Vector Databases > Vector Similarity Search - Cosine and Euclidean Distance

Vector Similarity Search - Cosine and Euclidean Distance

Author: Venkata Sudhakar

Vector similarity search is the process of finding vectors in a database that are most similar to a query vector. In the context of AI and RAG pipelines, text is converted to high-dimensional embedding vectors (typically 768 to 3072 dimensions), and similarity search finds the documents whose embeddings are closest to the query embedding. The two most common distance metrics are cosine similarity and Euclidean (L2) distance. Understanding the difference between these metrics is important for choosing the right index type in your vector database and for interpreting search results.

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity 1.0, perpendicular vectors have similarity 0.0, and opposite vectors have similarity -1.0. Cosine similarity is the standard metric for text embeddings because two documents about the same topic have similar directional orientation in embedding space regardless of their length. Euclidean (L2) distance measures the straight-line distance between two points and is sensitive to vector magnitude. For normalised embeddings (unit vectors), cosine similarity and L2 distance give equivalent rankings.

The below example shows how to compute cosine similarity manually, then demonstrates semantic search using ChromaDB and pgvector with metadata filtering - the two most common production patterns.

import math
from openai import OpenAI

client = OpenAI(api_key="your-api-key-here")

# Manual cosine similarity calculation
def cosine_similarity(vec_a: list, vec_b: list) -> float:
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    mag_a = math.sqrt(sum(a * a for a in vec_a))
    mag_b = math.sqrt(sum(b * b for b in vec_b))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot_product / (mag_a * mag_b)

def euclidean_distance(vec_a: list, vec_b: list) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

# Get embeddings for a set of documents
def embed(texts: list) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

documents = [
    "Debezium reads the MySQL binlog for change data capture",
    "Apache Kafka is a distributed event streaming platform",
    "Flyway manages database schema migrations with SQL scripts",
    "pgvector adds vector similarity search to PostgreSQL",
    "LangChain builds LLM-powered applications and agents"
]

query = "How do I capture database changes in real time?"

# Embed everything
all_texts = [query] + documents
all_embeddings = embed(all_texts)
query_emb = all_embeddings[0]
doc_embeddings = all_embeddings[1:]

# Rank by cosine similarity
results = []
for i, (doc, emb) in enumerate(zip(documents, doc_embeddings)):
    cos_sim = cosine_similarity(query_emb, emb)
    l2_dist = euclidean_distance(query_emb, emb)
    results.append((cos_sim, l2_dist, doc))

results.sort(reverse=True)  # highest similarity first
print(f"Query: {query}\n")
for rank, (cos, l2, doc) in enumerate(results, 1):
    print(f"[{rank}] cos={cos:.4f} l2={l2:.4f} | {doc[:55]}...")

It gives the following output,

Query: How do I capture database changes in real time?

[1] cos=0.7823 l2=0.6598 | Debezium reads the MySQL binlog for change data cap...
[2] cos=0.6241 l2=0.8672 | Flyway manages database schema migrations with SQL ...
[3] cos=0.5912 l2=0.9034 | Apache Kafka is a distributed event streaming platf...
[4] cos=0.5103 l2=0.9891 | pgvector adds vector similarity search to PostgreSQ...
[5] cos=0.4231 l2=1.0712 | LangChain builds LLM-powered applications and agent...

# Semantic search with ChromaDB - metadata filtering for production RAG
import chromadb
from chromadb.utils import embedding_functions

client_chroma = chromadb.PersistentClient(path="./chroma_db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key-here",
    model_name="text-embedding-3-small"
)

collection = client_chroma.get_or_create_collection(
    name="tech_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents with rich metadata for filtered search
collection.add(
    ids=["d1","d2","d3","d4","d5","d6"],
    documents=[
        "Debezium reads the MySQL binlog to capture CDC events and publishes to Kafka",
        "Flyway applies versioned SQL migrations tracked in flyway_schema_history",
        "Blue-Green deployment switches load balancer from old to new environment",
        "pgvector adds HNSW and IVFFlat indexes for fast vector similarity search",
        "LangChain LCEL chains compose LLM calls with prompt templates and parsers",
        "Kafka Streams processes events in real time using the KStream DSL"
    ],
    metadatas=[
        {"category": "cdc",        "source": "cdc_guide.pdf",        "difficulty": "intermediate"},
        {"category": "migration",  "source": "schema_guide.pdf",     "difficulty": "beginner"},
        {"category": "migration",  "source": "deploy_guide.pdf",     "difficulty": "intermediate"},
        {"category": "vector_db",  "source": "pgvector_guide.pdf",   "difficulty": "intermediate"},
        {"category": "genai",      "source": "langchain_guide.pdf",  "difficulty": "intermediate"},
        {"category": "messaging",  "source": "kafka_guide.pdf",      "difficulty": "advanced"}
    ]
)

# Basic similarity search - top 3
results = collection.query(
    query_texts=["How does CDC work with databases?"],
    n_results=3
)
print("=== Basic search ===")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{dist:.3f}] {doc[:60]}...")

# Filtered search - only migration category documents
filtered = collection.query(
    query_texts=["How to deploy without downtime?"],
    n_results=3,
    where={"category": "migration"}  # metadata filter
)
print("\n=== Filtered (category=migration) ===")
for doc, meta in zip(filtered["documents"][0], filtered["metadatas"][0]):
    print(f"  [{meta['source']}] {doc[:60]}...")

It gives the following output,

=== Basic search ===
  [0.089] Debezium reads the MySQL binlog to capture CDC events and ...
  [0.312] Flyway applies versioned SQL migrations tracked in flyway_...
  [0.391] Kafka Streams processes events in real time using the KStr...

=== Filtered (category=migration) ===
  [deploy_guide.pdf] Blue-Green deployment switches load balancer fr...
  [schema_guide.pdf] Flyway applies versioned SQL migrations tracked...

# Metadata filtering narrows the search space BEFORE computing similarity
# This is essential when your vector DB has documents from multiple domains
# and you want to restrict results to a specific category, product, or tenant

Choosing the right similarity metric and index:

Cosine similarity (hnsw:space=cosine in ChromaDB, vector_cosine_ops in pgvector) - Best for text embeddings. Invariant to vector length so it compares semantic direction rather than magnitude. Use this for RAG, semantic search, and document similarity.

L2 / Euclidean distance (hnsw:space=l2, vector_l2_ops in pgvector) - Best for image embeddings and other cases where magnitude carries meaning. For normalised text embeddings, L2 and cosine give equivalent rankings, but cosine is the convention.

HNSW index vs IVFFlat index - HNSW (Hierarchical Navigable Small World) builds a graph structure offline and gives fast, accurate queries with good recall. IVFFlat clusters vectors into buckets and is faster to build but slightly less accurate. For most RAG applications with under 10 million vectors, HNSW is the recommended choice.

Send your comments, suggestions or queries regarding this site to [email protected].