GCP Datastream for CDC and Database Migration
Author: Venkata Sudhakar
Google Cloud Datastream is a managed, serverless change data capture (CDC) and replication service that continuously replicates data from operational databases into Google Cloud destinations such as BigQuery, Cloud Spanner, Cloud SQL, and Cloud Storage (GCS). Unlike self-managed Debezium, which requires you to run Kafka infrastructure, Datastream is fully managed: you configure it through the GCP Console or API, and Google handles the underlying replication infrastructure, scaling, and fault tolerance. Supported source databases are Oracle, MySQL, PostgreSQL, and SQL Server.
Datastream is particularly well suited to migrations from on-premises Oracle or MySQL to BigQuery or Cloud Spanner. It first captures a full snapshot of the existing data (the backfill), then switches to log-based CDC to capture ongoing changes with sub-second latency. This makes it the backbone of Google Cloud Database Migration Service (DMS) for continuous migration workflows. Datastream writes change events to GCS as Avro or JSON files, or streams them directly into BigQuery using the Storage Write API.
The example below shows how to set up a Datastream stream from MySQL to BigQuery using the gcloud CLI, then validate the replication and perform a cutover.
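A minimal sketch of the gcloud commands for this setup follows. The profile, stream, and dataset names match the output shown next; the hostname mysql-prod.example.com, the project ID my-project, the password handling, and the exact JSON config shapes are illustrative assumptions - check the gcloud reference for your environment.

```shell
# 1. Connection profile for the MySQL source.
#    Hostname, username, and password are placeholders.
gcloud datastream connection-profiles create mysql-prod-profile \
  --location=us-central1 \
  --type=mysql \
  --display-name="MySQL Prod" \
  --mysql-hostname=mysql-prod.example.com \
  --mysql-port=3306 \
  --mysql-username=datastream \
  --mysql-password=CHANGE_ME \
  --static-ip-connectivity

# 2. Connection profile for the BigQuery destination.
gcloud datastream connection-profiles create bigquery-analytics-profile \
  --location=us-central1 \
  --type=bigquery \
  --display-name="BigQuery Analytics"

# 3. Source and destination details go in small JSON config files.
#    The shapes below are a sketch; see the API docs for the full schema.
cat > mysql_source_config.json <<'EOF'
{"includeObjects": {"mysqlDatabases": [{"database": "appdb"}]}}
EOF
cat > bigquery_destination_config.json <<'EOF'
{"singleTargetDataset": {"datasetId": "my-project:analytics_dataset"}}
EOF

gcloud datastream streams create mysql-to-bq \
  --location=us-central1 \
  --display-name="MySQL to BigQuery Migration" \
  --source=mysql-prod-profile \
  --mysql-source-config=mysql_source_config.json \
  --destination=bigquery-analytics-profile \
  --bigquery-destination-config=bigquery_destination_config.json \
  --backfill-all

# 4. Streams are created in a not-started state; start one by
#    updating its desired state to RUNNING.
gcloud datastream streams update mysql-to-bq \
  --location=us-central1 \
  --state=RUNNING \
  --update-mask=state
```

The --backfill-all flag requests the initial historical snapshot before CDC begins.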
It gives the following output,
Creating connection profile mysql-prod-profile... done.
Creating connection profile bigquery-analytics-profile... done.
Creating stream mysql-to-bq... done.
Starting stream mysql-to-bq... done.
# Check stream status
gcloud datastream streams describe mysql-to-bq --location=us-central1
state: RUNNING
displayName: MySQL to BigQuery Migration
errors: []
# BigQuery tables created automatically with Datastream metadata:
# analytics_dataset.appdb_customers
# analytics_dataset.appdb_orders
#
# Each table includes extra columns added by Datastream:
# datastream_metadata.source_timestamp - when the change occurred in MySQL
# datastream_metadata.uuid - unique change event ID
# datastream_metadata.is_deleted - true for DELETE events
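The validation numbers below can be produced with a row-count comparison along these lines. This is a sketch: the appdb_orders table and the connection details are illustrative, and it assumes the metadata columns described above.

```shell
# Row count and newest change timestamp on the BigQuery side,
# excluding rows Datastream has flagged as deleted.
bq query --use_legacy_sql=false '
SELECT
  COUNT(*) AS bq_rows,
  MAX(datastream_metadata.source_timestamp) AS latest_change
FROM analytics_dataset.appdb_orders
WHERE datastream_metadata.is_deleted IS NOT TRUE'

# Matching row count on the MySQL source (credentials are placeholders;
# -N suppresses the column header so only the number is printed).
mysql -h mysql-prod.example.com -u datastream -p \
  -N -e 'SELECT COUNT(*) FROM appdb.orders'
```

Run both while writes are quiesced (or compare as of a fixed timestamp), otherwise in-flight changes can make the counts differ transiently.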
It gives the following output after cutover validation,
# BigQuery validation result:
bq_rows latest_change
892341 2024-01-15 14:30:00 UTC
# MySQL source count:
892341
# Row counts match - safe to cut over!
# After pausing the stream:
gcloud datastream streams pause mysql-to-bq --location=us-central1
Updating stream mysql-to-bq... done.
state: PAUSED
Datastream vs Debezium - when to choose each:
Choose Datastream when you are fully on GCP, want zero infrastructure to manage, need Oracle support (Datastream reads Oracle redo logs via LogMiner natively), and your target is BigQuery, Cloud Spanner, or Cloud SQL. Datastream integrates natively with GCP services and is the recommended path for GCP-native migrations.
Choose Debezium when you need to stream changes into Apache Kafka (for downstream microservices, not just data warehouses), require multi-cloud or on-premises deployment, need to fan out CDC events to multiple consumers simultaneously, or want the open-source ecosystem around Kafka Connect for transformations and routing.