Data Validation Patterns in Migration
Author: Venkata Sudhakar
Data validation is the step between "the migration ran" and "the migration succeeded." A migration tool can transfer every row without errors yet still produce incorrect data: truncated strings, mismatched NULL handling, incorrect type conversions, or missing rows from edge cases in the migration logic. Validation catches these problems before you cut over and before business users discover them. The cost of fixing a data issue before cutover is a fraction of the cost of fixing it in production after go-live.

Validation has two layers. Pre-migration validation checks source data quality: finding NULLs in required columns, duplicates in supposedly unique fields, and referential integrity violations before you even start migrating. Post-migration validation compares the target against the source: row counts per table, checksum or hash comparisons of key columns, and spot-checks of specific rows to verify correct transformation. Automated validation scripts that run after each migration batch give you continuous confidence rather than a single pass at the end.

The example below shows a Python validation script that compares row counts, checks for NULLs in key columns, and runs a column-level hash comparison between source MySQL and target PostgreSQL.
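A minimal sketch of such a script follows. It assumes you already hold two open DB-API cursors (e.g. from mysql.connector for the source and psycopg2 for the target); the function names, table list, and output format are illustrative, not part of any library.

```python
import hashlib

def validate_row_counts(src_cur, tgt_cur, tables):
    """Compare per-table row counts; return one result string per table."""
    results = []
    for table in tables:
        src_cur.execute(f"SELECT COUNT(*) FROM {table}")
        src_n = src_cur.fetchone()[0]
        tgt_cur.execute(f"SELECT COUNT(*) FROM {table}")
        tgt_n = tgt_cur.fetchone()[0]
        diff = tgt_n - src_n
        status = "OK" if diff == 0 else "MISMATCH"
        results.append(f"[{status}] {table}: source={src_n}, target={tgt_n}, diff={diff}")
    return results

def validate_not_null(tgt_cur, table, column):
    """Flag unexpected NULLs in a key column of the target."""
    tgt_cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    nulls = tgt_cur.fetchone()[0]
    if nulls:
        return f"[NULL ERROR] {table}.{column}: {nulls} unexpected NULLs"
    return f"[OK] {table}.{column}: no NULLs"

def column_hash(cur, table, key_col, columns):
    """Hash the selected columns in key order; differing digests reveal silent drift."""
    cols = ", ".join(columns)
    cur.execute(f"SELECT {cols} FROM {table} ORDER BY {key_col}")
    h = hashlib.sha256()
    for row in cur.fetchall():
        # Normalize NULL to empty string so both databases hash identically
        h.update("|".join("" if v is None else str(v) for v in row).encode())
    return h.hexdigest()
```

For large tables, the hash comparison is better done in chunks of the key range so a mismatch can be narrowed to a specific slice rather than re-scanning the whole table.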
It gives the following output:
[OK] customers: source=125000, target=125000, diff=0
[MISMATCH] orders: source=890000, target=889997, diff=-3
[OK] products: source=4500, target=4500, diff=0
[NULL ERROR] customers.email: 12 unexpected NULLs
# 3 missing orders and 12 NULL emails need investigation before cutover
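The row-level spot-check can be sketched as follows. The function and output format are illustrative; the `?` placeholder is sqlite3/DB-API qmark style, while mysql.connector and psycopg2 use `%s` instead.

```python
def spot_check_rows(src_cur, tgt_cur, table, key_col, columns, keys):
    """Fetch the same rows by key from source and target and compare them exactly."""
    cols = ", ".join(columns)
    results = []
    for key in keys:
        src_cur.execute(f"SELECT {cols} FROM {table} WHERE {key_col} = ?", (key,))
        src_row = src_cur.fetchone()
        tgt_cur.execute(f"SELECT {cols} FROM {table} WHERE {key_col} = ?", (key,))
        tgt_row = tgt_cur.fetchone()
        if src_row != tgt_row:
            results.append(f"[MISMATCH] {table} {key}: source={src_row} target={tgt_row}")
        else:
            results.append(f"[OK] {table} {key}")
    return results
```

Sampling a few hundred keys spread across the key range (oldest rows, newest rows, and known edge cases) usually surfaces transformation bugs that aggregate counts miss.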
Spot-checking individual rows surfaces subtler problems:
[MISMATCH] order 50000: source=(649.99,) target=(649.98,)
# Floating point rounding difference detected in amount column
# Root cause: MySQL DECIMAL(10,2) vs PostgreSQL NUMERIC precision difference
# Fix: cast amount to NUMERIC(10,2) explicitly in migration query
Run validation in three phases. Before migration, check source data quality: NULLs, duplicates, constraint violations. During migration, validate each batch as it completes and fail fast if counts diverge by more than 0.01%. After migration, run a full reconciliation: row counts, column checksums on key business fields, and a complete check of any tables with complex transformations. Automate all of it; manual SQL spot-checks miss edge cases that a systematic script catches every time.
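The fail-fast batch check reduces to a simple threshold function. This is a sketch; the function name and the 0.01% default tolerance follow the phase description above and are otherwise arbitrary.

```python
def batch_diverged(src_count, tgt_count, tolerance=0.0001):
    """Return True if batch row counts diverge by more than `tolerance` (0.01% default)."""
    if src_count == 0:
        # An empty source batch should produce an empty target batch
        return tgt_count != 0
    return abs(tgt_count - src_count) / src_count > tolerance
```

Wiring this into the batch loop means a bad batch stops the migration within minutes instead of being discovered during final reconciliation hours later.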