What You Need to Know About Kafka Stream Processing in Python
Instant notifications, product recommendations and updates, and fraud detection are practical use cases of stream processing. With stream processing, data streaming and analytics happen in real time, which helps drive fast decision-making. However, building an effective streaming architecture can be challenging because of the many data sources, destinations, and formats involved in event streaming.

The basics of Apache Kafka

Apache Kafka is an open-source, distributed publish/subscribe (pub/sub) messaging and event streaming platform that lets systems communicate and exchange data efficiently. Kafka can be deployed on virtual machines (VMs), on bare-metal hardware, or in the cloud, and consists of clients and servers that communicate over TCP. With Apache Kafka, producers publish messages to topics and consumers subscribe to those topics to read them. Kafka decouples the two sides: a producer only needs to publish events to a topic, not track how or when consumers read them.

Currently, over 39,000 companies use Kafka for their event streaming needs, a 7.24% increase in adoption from 2021; they include eBay, Netflix, and Target, among others. Apache Kafka can process millions of messages daily and is highly scalable, with high throughput and low latency.

However, because Kafka is written in Java and Scala, Python engineers found it challenging to process data and send it to Kafka streams. This challenge led to the development of the kafka-python client, which lets engineers work with Kafka directly from Python. This article introduces Kafka and its benefits, shows how to set up a Kafka cluster, and walks through stream processing with kafka-python.

Essential Kafka concepts

Some essential concepts you’ll need to know for stream processing with Kafka in Python include:

Topics

Topics act as a store for events. An event is an occurrence or record, like a product update or launch. Topics are like folders, with events as the files inside them. Unlike traditional messaging systems that delete messages after consumption, Kafka lets topics retain events for as long as a per-topic configuration defines. Kafka topics are divided into partitions and replicated across multiple brokers so that various consumers can read from a topic simultaneously without hurting throughput.
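
As a hedged illustration using the kafka-python client introduced later in this article, a topic's partition count and retention can also be set programmatically when the topic is created; the topic name and settings below are examples, and a broker is assumed on localhost:9092.

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to a broker assumed to be running locally
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Example topic: 3 partitions, 1 replica, events retained for 7 days
admin.create_topics(new_topics=[
    NewTopic(
        name="product-updates",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
admin.close()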

Producers

Producers are applications that publish messages to Kafka. Kafka decouples producers from consumers, which helps with scalability: producers only need to publish messages to topics, with no concern for how or when those messages are consumed.
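
A minimal sketch with kafka-python (the topic name is illustrative and a local broker is assumed): the producer simply publishes an event and moves on.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Publish one event; the producer doesn't know or care who consumes it
producer.send("product-updates", value=b'{"event": "product_launched"}')
producer.flush()  # block until buffered messages are delivered
producer.close()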

Consumers

Consumers are applications that subscribe to topics and read the messages producers publish to them.
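
A matching sketch with kafka-python, assuming the same illustrative topic and a local broker:

from kafka import KafkaConsumer

# Subscribe to the topic and read events from the beginning;
# consumer_timeout_ms stops the loop after 10 s of inactivity
consumer = KafkaConsumer(
    "product-updates",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
)
for message in consumer:
    print(message.value)
consumer.close()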

Clusters

A cluster refers to a collection of brokers. Clusters organize Kafka topics, brokers, and partitions to ensure equal workload distribution and continuous availability.

Brokers

Brokers exist within a cluster. A cluster contains one or more brokers working together. A single broker represents a Kafka server.

Zookeeper

ZooKeeper is the coordination service that stores metadata about a Kafka cluster. It houses information such as broker and user details, naming and configuration data, and access control lists (ACLs) for topics.

Partition

Brokers contain one or more partitions, and each partition holds a portion of a topic that is replicated across other brokers. Each partition has a leader replica, which coordinates writes; if the leader fails, one of the follower replicas takes over. This partitioning and replication across brokers helps make Kafka fault-tolerant.
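
As a small, hedged illustration with kafka-python (names are examples), a producer can pin a message to a specific partition, and a client can list the partitions a topic has:

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Send this message to partition 0 of the topic explicitly
producer.send("product-updates", value=b'{"event": "restock"}', partition=0)
producer.flush()

# List the partition ids Kafka created for the topic, e.g. {0, 1, 2}
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.partitions_for_topic("product-updates"))
consumer.close()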

Kafka provides three primary services to its users:

  • Publishing messages to subscribers
  • Storing messages durably, in the order they arrive
  • Processing and analyzing data streams in real time

Benefits of Apache Kafka

Fault tolerance and high availability

In Kafka, a replication factor specifies how many copies of each partition a topic keeps. If a broker fails, replicas of its partitions on other brokers pick up the load and ensure messages are still delivered. This helps ensure fault tolerance and high availability.

High throughput and low-latency

Kafka sustains high throughput with latency as low as 2 ms while handling high-volume, high-velocity data streams.

Scalability and reliability

The decoupled, independent nature of producers and consumers makes it easy to scale Kafka seamlessly. Additionally, brokers can be added to or removed from a cluster to accommodate growing demand.

Real-time data streaming and processing

Kafka can quickly identify data changes by using change data capture (CDC) approaches such as triggers and logs. This reduces compute load compared with traditional batch systems, which reload the data each time to detect changes. Kafka can also perform aggregations, joins, and other processing on event streams.

Ease of integration through multiple connections

Through its connectors, Kafka integrates with many event sources and sinks, such as Elasticsearch, Amazon S3, and PostgreSQL.

Setting up a Kafka instance

A Kafka instance represents a single broker in a cluster. Let’s set up Kafka on our local machine and create a Kafka instance.

1. Download Kafka from the Apache Kafka downloads page and extract the archive.

Note: You must have Java 8+ installed to use Kafka.

2. Navigate to the extracted Kafka folder.
3. Start the ZooKeeper service. The command below assumes the default configuration that ships with Kafka (on Windows, use the equivalent .bat scripts under bin\windows):
bin/zookeeper-server-start.sh config/zookeeper.properties
ZooKeeper starts and listens on port 2181 by default.
4. Open another terminal session and start the broker service:
bin/kafka-server-start.sh config/server.properties
This starts up a single broker server. Our broker uses localhost:9092 for connections.

Create a Kafka topic

To publish messages to consumers, we need to create a topic to store our messages.

  1. Start another terminal session.
  2. Create two topics called sample-topic and second-topic:
bin/kafka-topics.sh --create --topic sample-topic --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic second-topic --bootstrap-server localhost:9092
A message confirms that each topic has been successfully created.
  3. Run the following command to list all the topics in your cluster:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  4. Use the --describe argument to get more details on a topic, such as its partition count and leader:
bin/kafka-topics.sh --describe --topic sample-topic --bootstrap-server localhost:9092

Kafka-Python processing

For most data scientists and engineers, Python is a go-to language for data processing and machine learning because it is:

  1. Easy to learn and use: Python’s simple syntax keeps the learning curve short.
  2. Flexible: Python is dynamically typed, making it quick to build and adapt applications.
  3. Supported by a large open-source community and excellent documentation: the Python community ranges from beginners to experts, and the language’s thorough documentation helps developers get the most out of it.

As mentioned earlier, because Kafka is written in Java and Scala, it can pose a challenge for data scientists who want to process data and stream it to brokers from Python. kafka-python was built to solve this problem: it behaves much like the Java client but adds Pythonic interfaces, such as consumer iterators.

Setting up Kafka-Python stream processing

Prerequisites

  1. Python 3. If you don’t have it installed, follow the official Python installation guide.
  2. Kafka, already installed in the previous section.

Installing Kafka-Python

Let’s set up kafka-python for streaming.

  1. Open another terminal session.
  2. Run the command to install kafka-python:
pip install kafka-python
pip reports a successful installation of the kafka-python package.
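
Before writing any producer code, a quick sanity check (a sketch, assuming the broker and topics created in the earlier steps) confirms the client can reach the cluster:

from kafka import KafkaConsumer

# Connect without subscribing, just to inspect cluster metadata
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())                              # should include sample-topic and second-topic
print(consumer.partitions_for_topic("sample-topic"))  # partition ids for sample-topic
consumer.close()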

Next, we read and process a CSV file with a producer and then read the messages back with a consumer.

Writing to a Producer

KafkaProducer functions similarly to the Java client’s producer and sends messages to the cluster.

For our sample, we’ll use Python to convert each line of a CSV file (data.csv) to JSON and send it to a topic (sample-topic).

Python code
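
A minimal sketch, assuming data.csv has a header row and the broker from the earlier steps is running on localhost:9092:

import csv
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each Python dict to a JSON-encoded byte string
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

with open("data.csv", newline="") as csv_file:
    for row in csv.DictReader(csv_file):
        # Each CSV row becomes one JSON message on the topic
        producer.send("sample-topic", value=row)

producer.flush()  # block until every buffered message has been delivered
producer.close()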

When the script runs, every row of data.csv is published to sample-topic as a JSON message.

Writing to a Kafka Consumer

Next, we use a Kafka consumer to check whether the messages were successfully posted to the topic.

Python code
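
Again a minimal sketch, assuming the producer above has already published the rows of data.csv to sample-topic:

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sample-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",      # read the topic from the beginning
    value_deserializer=lambda value: json.loads(value.decode("utf-8")),
    consumer_timeout_ms=10000,         # stop iterating after 10 s with no new messages
)

for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")

consumer.close()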

Running the consumer prints the contents of our file, now in JSON format.

Summary

The services provided by Kafka enable organizations to act instantly on event streams, build data pipelines and seamlessly distribute messages, all of which drive vital business decision-making. Kafka is scalable, fault-tolerant and can stream millions of messages daily. It employs multiple connectors, allowing for various event sources to connect to it.

The kafka-python client is a Python client for Kafka that helps data scientists and engineers process data streams and send them to Kafka for consumption or storage, improving data integration. The StreamSets data integration platform takes this further by providing an interface that lets organizations stream data from multiple origins to various destinations.

Where StreamSets, Kafka and Python come together

StreamSets allows organizations to make the most of Kafka services. Using StreamSets, organizations can build effective data pipelines that process messages from producers and send them to multiple consumers without writing code. For instance, with the Kafka origin, engineers can process and send large volumes of messages to multiple destinations like Amazon S3, Elasticsearch, Google Cloud Storage, and TimescaleDB.

StreamSets also eases the burden of building stream processing pipelines from scratch by providing an interactive interface that allows engineers to connect to Kafka, stream to destinations and build effective and reusable data pipelines. This feature empowers your engineers and developers to do more with your data.
