
GCP Datastream for CDC and Database Migration

Author: Venkata Sudhakar

Google Cloud Datastream is a managed, serverless change data capture (CDC) and replication service that continuously replicates data from operational databases into Google Cloud. Its native destinations are BigQuery and Cloud Storage (GCS); targets such as Cloud Spanner and Cloud SQL are typically reached by pairing Datastream with Dataflow templates. Unlike self-managed Debezium, which requires Kafka infrastructure, Datastream is fully managed: you configure it through the GCP Console or API, and Google handles the underlying replication infrastructure, scaling, and fault tolerance. Datastream supports Oracle, MySQL, PostgreSQL, and SQL Server as source databases.

Datastream is particularly powerful for migrations from on-premises Oracle or MySQL to BigQuery or Cloud Spanner. It captures a full initial snapshot (backfill) of existing data first, then switches to log-based CDC to capture ongoing changes with low, typically seconds-level latency. This makes it the backbone of Google Cloud Database Migration Service (DMS) for continuous migration workflows. Datastream writes change events to GCS as Avro or JSON files, or streams directly into BigQuery using the Storage Write API.

The example below shows how to set up a Datastream stream from MySQL to BigQuery using the gcloud CLI, then validate the replication and perform a cutover.
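A minimal sketch of the setup commands is shown here; the hostname, credentials, and the two JSON config files are placeholders you would replace with your own values.

```shell
# Create the MySQL source connection profile
# (hostname, username, and password below are placeholders)
gcloud datastream connection-profiles create mysql-prod-profile \
    --location=us-central1 \
    --type=mysql \
    --display-name="MySQL Production" \
    --mysql-hostname=10.0.0.5 \
    --mysql-port=3306 \
    --mysql-username=datastream \
    --mysql-password=secret \
    --static-ip-connectivity

# Create the BigQuery destination connection profile
gcloud datastream connection-profiles create bigquery-analytics-profile \
    --location=us-central1 \
    --type=bigquery \
    --display-name="BigQuery Analytics"

# Create the stream; mysql_source.json and bq_destination.json
# (hypothetical file names) hold the table include lists and
# dataset settings, and --backfill-all enables the initial snapshot
gcloud datastream streams create mysql-to-bq \
    --location=us-central1 \
    --display-name="MySQL to BigQuery Migration" \
    --source=mysql-prod-profile \
    --mysql-source-config=mysql_source.json \
    --destination=bigquery-analytics-profile \
    --bigquery-destination-config=bq_destination.json \
    --backfill-all

# Start the stream (state changes go through "streams update")
gcloud datastream streams update mysql-to-bq \
    --location=us-central1 \
    --state=RUNNING --update-mask=state
```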


It gives the following output,

Creating connection profile mysql-prod-profile... done.
Creating connection profile bigquery-analytics-profile... done.
Creating stream mysql-to-bq... done.
Starting stream mysql-to-bq... done.

# Check stream status
gcloud datastream streams describe mysql-to-bq --location=us-central1
state: RUNNING
displayName: MySQL to BigQuery Migration
errors: []

# BigQuery tables created automatically with Datastream metadata:
# analytics_dataset.appdb_customers
# analytics_dataset.appdb_orders
#
# Each table includes extra columns added by Datastream:
# datastream_metadata.source_timestamp  - when change happened in MySQL
# datastream_metadata.uuid              - unique change event ID
# datastream_metadata.is_deleted        - true for DELETE events
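These metadata columns let you reconstruct the current state of a table on the BigQuery side. A sketch of such a query, assuming an append-only table layout and a `customer_id` primary key (the key column name is illustrative):

```shell
# Latest non-deleted version of each customer row:
# rank change events per key by source timestamp, keep the newest,
# and drop rows whose latest event was a DELETE
bq query --use_legacy_sql=false '
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY datastream_metadata.source_timestamp DESC
         ) AS rn
  FROM analytics_dataset.appdb_customers
)
WHERE rn = 1
  AND datastream_metadata.is_deleted IS NOT TRUE'
```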

It gives the following output after cutover validation,

# BigQuery validation result:
bq_rows    latest_change
892341     2024-01-15 14:30:00 UTC

# MySQL source count:
892341

# Row counts match - safe to cut over!
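The counts above can be produced with one query on each side. A sketch, assuming the table names from this example and a BigQuery write mode that keeps one row per source row:

```shell
# BigQuery side: row count and most recent change timestamp
bq query --use_legacy_sql=false '
SELECT COUNT(*) AS bq_rows,
       MAX(datastream_metadata.source_timestamp) AS latest_change
FROM analytics_dataset.appdb_customers'

# MySQL side: source row count (host and user are placeholders)
mysql -h 10.0.0.5 -u datastream -p appdb \
      -e "SELECT COUNT(*) FROM customers;"
```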

# Pause the stream before cutover (there is no dedicated pause
# subcommand; state changes go through "streams update"):
gcloud datastream streams update mysql-to-bq --location=us-central1 \
    --state=PAUSED --update-mask=state
Updating stream mysql-to-bq... done.
state: PAUSED

Datastream vs Debezium: when to choose each

Choose Datastream when you are fully on GCP, want zero infrastructure to manage, need Oracle support (Datastream supports Oracle LogMiner natively), and your target is BigQuery, Cloud Spanner, or Cloud SQL. Datastream integrates natively with GCP services and is the recommended path for GCP-native migrations.

Choose Debezium when you need to stream changes to Apache Kafka (for downstream microservices, not just data warehouses), require multi-cloud or on-premises deployment, need to fan out CDC events to multiple consumers simultaneously, or want to use the open-source ecosystem around Kafka Connect for transformations and routing.
