
Kafka Streams - Real-Time Stream Processing

Author: Venkata Sudhakar

Kafka Streams is a client library built into Apache Kafka for building real-time stream processing applications. Unlike batch processing systems that operate on stored data at scheduled intervals, Kafka Streams processes records as they arrive - one record at a time, not in micro-batches - with very low latency. A Kafka Streams application reads from one or more Kafka topics, applies transformations and aggregations, and writes the results to output topics. Its key advantage over other stream processing frameworks like Apache Flink or Spark Streaming is that it runs as a standard Java library inside your application - there is no separate cluster to deploy or manage.

The Kafka Streams API provides two levels of abstraction. The high-level DSL (Domain-Specific Language) offers stream operations familiar from functional programming: filter, map, flatMap, groupBy, aggregate, join, and windowed operations. The low-level Processor API gives you full control over record processing logic. KStream represents an unbounded sequence of records (like a Java Stream, but infinite). KTable represents a changelog stream - a table view of the latest value for each key, automatically updated as new records arrive. Grouping a KStream (via groupByKey() or groupBy()) yields a KGroupedStream, which enables per-key aggregations such as counts and sums; KTable.groupBy() does the same for tables.
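
A minimal sketch of these two abstractions (topic names here are placeholders for illustration):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class DslAbstractions {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: an unbounded sequence of independent records.
        KStream<String, String> events = builder.stream("events");

        // KTable: a changelog view - only the latest value per key is retained.
        KTable<String, String> latest = builder.table("state");

        // Grouping + aggregation: the per-key count is itself a KTable,
        // updated as new records arrive on the stream.
        KTable<String, Long> counts = events.groupByKey().count();

        // Printing the topology does not require a running broker.
        System.out.println(builder.build().describe());
    }
}
```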

The example below shows a Kafka Streams application that processes order events in real time, computing per-product order counts in a 1-minute tumbling window and flagging high-value orders.
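
A sketch of such an application, assuming a high-value threshold of 500.00 (implied by the sample output below), plain-String serdes with naive regex-based JSON field extraction (a production application would use a proper JSON Serde), and the Kafka 3.x windowing API; topic names follow the sample output:

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrderProcessingApp {

    // Naive field extraction from the JSON value; a real application
    // would deserialize with a JSON Serde instead.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\":\"?([^\",}]+)").matcher(json);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processing-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("order-events");

        // Flag orders above the 500.00 threshold and route them to a side topic.
        orders.filter((key, value) -> Double.parseDouble(field(value, "amount")) > 500.00)
              .mapValues(value -> value.replaceFirst("\\}$", ",\"flagged\":true}"))
              .to("high-value-orders");

        // Re-key by productId, then count per product in a 1-minute tumbling window.
        orders.groupBy((key, value) -> field(value, "productId"))
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
              .count()
              .toStream()
              .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(),
                      windowedKey.key() + ": " + count + " orders in last minute"))
              .to("product-order-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that groupBy changes the record key, so Kafka Streams inserts a repartition topic before the windowed count; groupByKey avoids that extra hop when the existing key is already correct.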


It gives the following output on the high-value-orders and product-order-counts topics,

# Input records on order-events topic:
{"orderId":"ORD-1001","productId":"PROD-A","amount":650.00,"status":"CREATED"}
{"orderId":"ORD-1002","productId":"PROD-B","amount":120.00,"status":"CREATED"}
{"orderId":"ORD-1003","productId":"PROD-A","amount":890.00,"status":"CREATED"}
{"orderId":"ORD-1004","productId":"PROD-A","amount":45.00,"status":"CREATED"}

# high-value-orders topic output:
{"orderId":"ORD-1001","productId":"PROD-A","amount":650.00,...,"flagged":true}
{"orderId":"ORD-1003","productId":"PROD-A","amount":890.00,...,"flagged":true}

# product-order-counts topic output (per 1-minute window):
PROD-A: 3 orders in last minute
PROD-B: 1 orders in last minute
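
Orders can also be enriched with customer data by joining the order KStream against a customer-updates KTable. A sketch under the same assumptions (String serdes, naive regex-based JSON handling); both topics must be keyed by customerId and co-partitioned for a KStream-KTable join:

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichmentApp {

    // Naive extraction of the "name" field; a real application would
    // deserialize with a JSON Serde instead.
    static String nameOf(String customerJson) {
        Matcher m = Pattern.compile("\"name\":\"([^\"]+)\"").matcher(customerJson);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // KTable: latest customer record per customerId.
        KTable<String, String> customers = builder.table("customer-updates");
        // KStream: order events, keyed by customerId.
        KStream<String, String> orders = builder.stream("order-events");

        // Inner join on the record key: each order is enriched with the
        // customer name from the latest matching KTable row.
        orders.join(customers, (order, customer) ->
                  order.replaceFirst("\\}$",
                          ",\"customerName\":\"" + nameOf(customer) + "\"}"))
              .to("enriched-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```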

Joining the order stream against a customer KTable gives the following output on the enriched-orders topic,

# Input on order-events (key=customerId):
C-500: {"orderId":"ORD-1001","customerId":"C-500","amount":650.00}

# Input on customer-updates (KTable, key=customerId):
C-500: {"customerId":"C-500","name":"Alice Johnson","email":"[email protected]"}

# Output on enriched-orders topic:
C-500: {"orderId":"ORD-1001","customerId":"C-500","amount":650.00,"customerName":"Alice Johnson"}

Kafka Streams vs Apache Flink vs Spark Streaming:

Choose Kafka Streams when your data is already in Kafka, your application is a Java/Kotlin service, and you want zero additional infrastructure. Kafka Streams is part of your application JAR - you deploy it the same way as any microservice, scale it by running more instances, and it handles partition rebalancing automatically.

Choose Apache Flink when you need complex event-time processing, very high throughput (billions of events/day), exactly-once semantics across multiple external systems, or support for non-Kafka sources. Flink requires its own cluster (or a managed service like Amazon Kinesis Data Analytics or Confluent Cloud).

Choose Spark Streaming (Structured Streaming) when you already run Apache Spark for batch workloads and want to share code, libraries, and cluster infrastructure between batch and streaming jobs; its micro-batch model trades some latency for throughput and ecosystem integration.

