Data Validation Patterns in Migration
Author: Venkata Sudhakar
Data validation is the step between "the migration ran" and "the migration succeeded." A migration tool can transfer every row without errors yet still produce incorrect data: truncated strings, mismatched NULL handling, incorrect type conversions, or missing rows from edge cases in the migration logic. Validation catches these problems before you cut over and before business users discover them. The cost of fixing a data issue before cutover is a fraction of the cost of fixing it in production after go-live.

Validation has two layers. Pre-migration validation checks source data quality: finding NULLs in required columns, duplicates in supposedly unique fields, and referential integrity violations before you even start migrating. Post-migration validation compares the target against the source: row counts per table, checksum or hash comparisons of key columns, and spot-checks of specific rows to verify correct transformation. Automated validation scripts that run after each migration batch give you continuous confidence rather than a single pass at the end.

The example below shows a Python validation script that compares row counts, checks for NULLs in key columns, and runs a column-level hash comparison between source MySQL and target PostgreSQL.
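A minimal sketch of such a script follows. It assumes you already hold two open DB-API cursors (e.g. from mysql.connector for the source and psycopg2 for the target); the function names, table list, and output format are illustrative, not part of any library.

```python
import hashlib

def validate_row_counts(src_cur, tgt_cur, tables):
    """Compare per-table row counts; return one result string per table."""
    results = []
    for table in tables:
        src_cur.execute(f"SELECT COUNT(*) FROM {table}")
        src_n = src_cur.fetchone()[0]
        tgt_cur.execute(f"SELECT COUNT(*) FROM {table}")
        tgt_n = tgt_cur.fetchone()[0]
        diff = tgt_n - src_n
        status = "OK" if diff == 0 else "MISMATCH"
        results.append(f"[{status}] {table}: source={src_n}, target={tgt_n}, diff={diff}")
    return results

def validate_not_null(tgt_cur, table, column):
    """Flag unexpected NULLs in a key column of the target."""
    tgt_cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    nulls = tgt_cur.fetchone()[0]
    if nulls:
        return f"[NULL ERROR] {table}.{column}: {nulls} unexpected NULLs"
    return f"[OK] {table}.{column}: no NULLs"

def column_hash(cur, table, key_col, columns):
    """Hash the selected columns in key order; differing digests reveal silent drift."""
    cols = ", ".join(columns)
    cur.execute(f"SELECT {cols} FROM {table} ORDER BY {key_col}")
    h = hashlib.sha256()
    for row in cur.fetchall():
        # Normalize NULL to empty string so both databases hash identically
        h.update("|".join("" if v is None else str(v) for v in row).encode())
    return h.hexdigest()
```

For large tables, the hash comparison is better done in chunks of the key range so a mismatch can be narrowed to a specific slice rather than re-scanning the whole table.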
It gives the following output:
[OK] customers: source=125000, target=125000, diff=0
[MISMATCH] orders: source=890000, target=889997, diff=-3
[OK] products: source=4500, target=4500, diff=0
[NULL ERROR] customers.email: 12 unexpected NULLs
# 3 missing orders and 12 NULL emails need investigation before cutover
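The row-level spot-check can be sketched as follows. The function and output format are illustrative; the `?` placeholder is sqlite3/DB-API qmark style, while mysql.connector and psycopg2 use `%s` instead.

```python
def spot_check_rows(src_cur, tgt_cur, table, key_col, columns, keys):
    """Fetch the same rows by key from source and target and compare them exactly."""
    cols = ", ".join(columns)
    results = []
    for key in keys:
        src_cur.execute(f"SELECT {cols} FROM {table} WHERE {key_col} = ?", (key,))
        src_row = src_cur.fetchone()
        tgt_cur.execute(f"SELECT {cols} FROM {table} WHERE {key_col} = ?", (key,))
        tgt_row = tgt_cur.fetchone()
        if src_row != tgt_row:
            results.append(f"[MISMATCH] {table} {key}: source={src_row} target={tgt_row}")
        else:
            results.append(f"[OK] {table} {key}")
    return results
```

Sampling a few hundred keys spread across the key range (oldest rows, newest rows, and known edge cases) usually surfaces transformation bugs that aggregate counts miss.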
Spot-checking individual rows surfaces subtler problems:
[MISMATCH] order 50000: source=(649.99,) target=(649.98,)
# Floating point rounding difference detected in amount column
# Root cause: MySQL DECIMAL(10,2) vs PostgreSQL NUMERIC precision difference
# Fix: cast amount to NUMERIC(10,2) explicitly in migration query
Run validation in three phases. Before migration, check source data quality: NULLs, duplicates, constraint violations. During migration, validate each batch as it completes and fail fast if counts diverge by more than 0.01%. After migration, run a full reconciliation: row counts, column checksums on key business fields, and a complete check of any tables with complex transformations. Automate all of it; manual SQL spot-checks miss edge cases that a systematic script catches every time.
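The fail-fast batch check reduces to a simple threshold function. This is a sketch; the function name and the 0.01% default tolerance follow the phase description above and are otherwise arbitrary.

```python
def batch_diverged(src_count, tgt_count, tolerance=0.0001):
    """Return True if batch row counts diverge by more than `tolerance` (0.01% default)."""
    if src_count == 0:
        # An empty source batch should produce an empty target batch
        return tgt_count != 0
    return abs(tgt_count - src_count) / src_count > tolerance
```

Wiring this into the batch loop means a bad batch stops the migration within minutes instead of being discovered during final reconciliation hours later.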