
Kafka Streams - Real-Time Stream Processing

Author: Venkata Sudhakar

Kafka Streams is a client library built into Apache Kafka for building real-time stream processing applications. Unlike batch processing systems that operate on stored data at scheduled intervals, Kafka Streams processes records as they arrive - one record at a time, not in micro-batches - with very low latency. A Kafka Streams application reads from one or more Kafka topics, applies transformations and aggregations, and writes the results to output topics. Its key advantage over other stream processing frameworks like Apache Flink or Spark Streaming is that it runs as a standard Java library inside your application - there is no separate cluster to deploy or manage.

The Kafka Streams API provides two levels of abstraction. The high-level DSL (Domain-Specific Language) offers stream operations familiar from functional programming: filter, map, flatMap, groupBy, aggregate, join, and windowed operations. The low-level Processor API gives you full control over record processing logic. KStream represents an unbounded sequence of records (like a Java Stream, but infinite). KTable represents a changelog stream - a table view of the latest value for each key, automatically updated as new records arrive. Grouping a KStream (via groupByKey() or groupBy()) yields a KGroupedStream, which enables per-key aggregations such as counts and sums; KTable.groupBy() does the same for tables.
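
A minimal sketch of these two abstractions (topic names here are placeholders for illustration):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class DslAbstractions {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: an unbounded sequence of independent records.
        KStream<String, String> events = builder.stream("events");

        // KTable: a changelog view - only the latest value per key is retained.
        KTable<String, String> latest = builder.table("state");

        // Grouping + aggregation: the per-key count is itself a KTable,
        // updated as new records arrive on the stream.
        KTable<String, Long> counts = events.groupByKey().count();

        // Printing the topology does not require a running broker.
        System.out.println(builder.build().describe());
    }
}
```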

The example below shows a Kafka Streams application that processes order events in real time, computing per-product order counts in a 1-minute tumbling window and flagging high-value orders.
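
A sketch of such an application, assuming a high-value threshold of 500.00 (implied by the sample output below), plain-String serdes with naive regex-based JSON field extraction (a production application would use a proper JSON Serde), and the Kafka 3.x windowing API; topic names follow the sample output:

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrderProcessingApp {

    // Naive field extraction from the JSON value; a real application
    // would deserialize with a JSON Serde instead.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\":\"?([^\",}]+)").matcher(json);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processing-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("order-events");

        // Flag orders above the 500.00 threshold and route them to a side topic.
        orders.filter((key, value) -> Double.parseDouble(field(value, "amount")) > 500.00)
              .mapValues(value -> value.replaceFirst("\\}$", ",\"flagged\":true}"))
              .to("high-value-orders");

        // Re-key by productId, then count per product in a 1-minute tumbling window.
        orders.groupBy((key, value) -> field(value, "productId"))
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
              .count()
              .toStream()
              .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(),
                      windowedKey.key() + ": " + count + " orders in last minute"))
              .to("product-order-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that groupBy changes the record key, so Kafka Streams inserts a repartition topic before the windowed count; groupByKey avoids that extra hop when the existing key is already correct.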


It gives the following output on the high-value-orders and product-order-counts topics,

# Input records on order-events topic:
{"orderId":"ORD-1001","productId":"PROD-A","amount":650.00,"status":"CREATED"}
{"orderId":"ORD-1002","productId":"PROD-B","amount":120.00,"status":"CREATED"}
{"orderId":"ORD-1003","productId":"PROD-A","amount":890.00,"status":"CREATED"}
{"orderId":"ORD-1004","productId":"PROD-A","amount":45.00,"status":"CREATED"}

# high-value-orders topic output:
{"orderId":"ORD-1001","productId":"PROD-A","amount":650.00,...,"flagged":true}
{"orderId":"ORD-1003","productId":"PROD-A","amount":890.00,...,"flagged":true}

# product-order-counts topic output (per 1-minute window):
PROD-A: 3 orders in last minute
PROD-B: 1 orders in last minute
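
Orders can also be enriched with customer data by joining the order KStream against a customer-updates KTable. A sketch under the same assumptions (String serdes, naive regex-based JSON handling); both topics must be keyed by customerId and co-partitioned for a KStream-KTable join:

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichmentApp {

    // Naive extraction of the "name" field; a real application would
    // deserialize with a JSON Serde instead.
    static String nameOf(String customerJson) {
        Matcher m = Pattern.compile("\"name\":\"([^\"]+)\"").matcher(customerJson);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // KTable: latest customer record per customerId.
        KTable<String, String> customers = builder.table("customer-updates");
        // KStream: order events, keyed by customerId.
        KStream<String, String> orders = builder.stream("order-events");

        // Inner join on the record key: each order is enriched with the
        // customer name from the latest matching KTable row.
        orders.join(customers, (order, customer) ->
                  order.replaceFirst("\\}$",
                          ",\"customerName\":\"" + nameOf(customer) + "\"}"))
              .to("enriched-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```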

Joining the order stream against a customer KTable gives the following output on the enriched-orders topic,

# Input on order-events (key=customerId):
C-500: {"orderId":"ORD-1001","customerId":"C-500","amount":650.00}

# Input on customer-updates (KTable, key=customerId):
C-500: {"customerId":"C-500","name":"Alice Johnson","email":"[email protected]"}

# Output on enriched-orders topic:
C-500: {"orderId":"ORD-1001","customerId":"C-500","amount":650.00,"customerName":"Alice Johnson"}

Kafka Streams vs Apache Flink vs Spark Streaming:

Choose Kafka Streams when your data is already in Kafka, your application is a Java/Kotlin service, and you want zero additional infrastructure. Kafka Streams is part of your application JAR - you deploy it the same way as any microservice, scale it by running more instances, and it handles partition rebalancing automatically.

Choose Apache Flink when you need complex event-time processing, very high throughput (billions of events/day), exactly-once semantics across multiple external systems, or support for non-Kafka sources. Flink requires its own cluster (or a managed service like Amazon Kinesis Data Analytics or Confluent Cloud).

Choose Spark Streaming (Structured Streaming) when you already run Apache Spark for batch workloads and want to share code, libraries, and cluster infrastructure between batch and streaming jobs; its micro-batch model trades some latency for throughput and ecosystem integration.

