GCP Datastream for CDC and Database Migration
Author: Venkata Sudhakar
Google Cloud Datastream is a managed, serverless change data capture (CDC) and replication service that continuously replicates data from operational databases into Google Cloud destinations such as BigQuery, Cloud Spanner, Cloud SQL, and Cloud Storage (GCS). Unlike self-managed Debezium, which requires you to run Kafka infrastructure, Datastream is fully managed: you configure it through the GCP Console or API, and Google handles the underlying replication infrastructure, scaling, and fault tolerance. Supported source databases are Oracle, MySQL, PostgreSQL, and SQL Server.
Datastream is particularly well suited to migrations from on-premises Oracle or MySQL to BigQuery or Cloud Spanner. It first captures a full snapshot of the existing data (the backfill), then switches to log-based CDC to capture ongoing changes with sub-second latency. This makes it the backbone of Google Cloud Database Migration Service (DMS) for continuous migration workflows. Datastream writes change events to GCS as Avro or JSON files, or streams them directly into BigQuery using the Storage Write API.
The example below shows how to set up a Datastream stream from MySQL to BigQuery using the gcloud CLI, then validate the replication and perform a cutover.
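A minimal sketch of the gcloud commands for this setup follows. The profile, stream, and dataset names match the output shown next; the hostname mysql-prod.example.com, the project ID my-project, the password handling, and the exact JSON config shapes are illustrative assumptions - check the gcloud reference for your environment.

```shell
# 1. Connection profile for the MySQL source.
#    Hostname, username, and password are placeholders.
gcloud datastream connection-profiles create mysql-prod-profile \
  --location=us-central1 \
  --type=mysql \
  --display-name="MySQL Prod" \
  --mysql-hostname=mysql-prod.example.com \
  --mysql-port=3306 \
  --mysql-username=datastream \
  --mysql-password=CHANGE_ME \
  --static-ip-connectivity

# 2. Connection profile for the BigQuery destination.
gcloud datastream connection-profiles create bigquery-analytics-profile \
  --location=us-central1 \
  --type=bigquery \
  --display-name="BigQuery Analytics"

# 3. Source and destination details go in small JSON config files.
#    The shapes below are a sketch; see the API docs for the full schema.
cat > mysql_source_config.json <<'EOF'
{"includeObjects": {"mysqlDatabases": [{"database": "appdb"}]}}
EOF
cat > bigquery_destination_config.json <<'EOF'
{"singleTargetDataset": {"datasetId": "my-project:analytics_dataset"}}
EOF

gcloud datastream streams create mysql-to-bq \
  --location=us-central1 \
  --display-name="MySQL to BigQuery Migration" \
  --source=mysql-prod-profile \
  --mysql-source-config=mysql_source_config.json \
  --destination=bigquery-analytics-profile \
  --bigquery-destination-config=bigquery_destination_config.json \
  --backfill-all

# 4. Streams are created in a not-started state; start one by
#    updating its desired state to RUNNING.
gcloud datastream streams update mysql-to-bq \
  --location=us-central1 \
  --state=RUNNING \
  --update-mask=state
```

The --backfill-all flag requests the initial historical snapshot before CDC begins.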
It gives the following output,
Creating connection profile mysql-prod-profile... done.
Creating connection profile bigquery-analytics-profile... done.
Creating stream mysql-to-bq... done.
Starting stream mysql-to-bq... done.
# Check stream status
gcloud datastream streams describe mysql-to-bq --location=us-central1
state: RUNNING
displayName: MySQL to BigQuery Migration
errors: []
# BigQuery tables created automatically with Datastream metadata:
# analytics_dataset.appdb_customers
# analytics_dataset.appdb_orders
#
# Each table includes extra columns added by Datastream:
# datastream_metadata.source_timestamp - when the change occurred in MySQL
# datastream_metadata.uuid - unique change event ID
# datastream_metadata.is_deleted - true for DELETE events
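The validation numbers below can be produced with a row-count comparison along these lines. This is a sketch: the appdb_orders table and the connection details are illustrative, and it assumes the metadata columns described above.

```shell
# Row count and newest change timestamp on the BigQuery side,
# excluding rows Datastream has flagged as deleted.
bq query --use_legacy_sql=false '
SELECT
  COUNT(*) AS bq_rows,
  MAX(datastream_metadata.source_timestamp) AS latest_change
FROM analytics_dataset.appdb_orders
WHERE datastream_metadata.is_deleted IS NOT TRUE'

# Matching row count on the MySQL source (credentials are placeholders;
# -N suppresses the column header so only the number is printed).
mysql -h mysql-prod.example.com -u datastream -p \
  -N -e 'SELECT COUNT(*) FROM appdb.orders'
```

Run both while writes are quiesced (or compare as of a fixed timestamp), otherwise in-flight changes can make the counts differ transiently.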
It gives the following output after cutover validation,
# BigQuery validation result:
bq_rows latest_change
892341 2024-01-15 14:30:00 UTC
# MySQL source count:
892341
# Row counts match - safe to cut over!
# After pausing the stream:
gcloud datastream streams pause mysql-to-bq --location=us-central1
Updating stream mysql-to-bq... done.
state: PAUSED
Datastream vs Debezium - when to choose each:
Choose Datastream when you are fully on GCP, want zero infrastructure to manage, need Oracle support (Datastream reads Oracle redo logs via LogMiner natively), and your target is BigQuery, Cloud Spanner, or Cloud SQL. Datastream integrates natively with GCP services and is the recommended path for GCP-native migrations.
Choose Debezium when you need to stream changes into Apache Kafka (for downstream microservices, not just data warehouses), require multi-cloud or on-premises deployment, need to fan out CDC events to multiple consumers simultaneously, or want the open-source ecosystem around Kafka Connect for transformations and routing.