The basics of Apache Kafka
Apache Kafka is an open-source, distributed publish/subscribe (pub/sub) messaging and event streaming platform that lets systems exchange data reliably and at scale. Kafka can be deployed on bare-metal hardware, virtual machines (VMs), or in the cloud, and consists of clients and servers that communicate over TCP. With Apache Kafka, producers publish messages to topics and consumers subscribe to those topics to read them. Kafka decouples the two sides: producers operate and exist independently of consumers, so they only need to worry about publishing to topics, not about how those messages reach the consumers.
Currently, over 39,000 companies use Kafka for their event streaming needs, a 7.24% increase in adoption from 2021. These companies include eBay, Netflix, and Target, among others. Apache Kafka is fast, can process millions of messages daily, and is highly scalable, with high throughput and low latency.
However, because Kafka is built on Java and Scala, Python engineers found it challenging to process data and send it to Kafka streams. This challenge led to the development of the kafka-python client, which lets engineers work with Kafka from Python. This article introduces Kafka and its benefits, then walks through setting up a Kafka cluster and stream processing with kafka-python.
Essential Kafka concepts
Some essential concepts you’ll need to know for streaming Python with Kafka include:
Topics
Topics act as a store for events. An event is an occurrence or record, like a product update or launch. Topics are like folders, with events as the files inside them. Unlike traditional messaging systems that delete messages after consumption, Kafka keeps events for as long as a per-topic retention configuration allows. Kafka topics are divided into partitions, which are replicated across multiple brokers so that several consumers can read a topic simultaneously without affecting throughput.
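For example, once the kafka-python client covered later in this article is installed, a per-topic retention period can be set when creating a topic. This is only a sketch; the topic name and broker address are assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a single local broker listening on localhost:9092.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A hypothetical topic with 3 partitions whose events are kept for 7 days.
topic = NewTopic(
    name="product-updates",
    num_partitions=3,
    replication_factor=1,  # limited to 1 on a single-broker cluster
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)
admin.create_topics(new_topics=[topic])
```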
Producers
Producers are the applications that publish messages to Kafka. Kafka decouples producers from consumers, which helps with scalability: producers only need to publish messages to a topic, with no concern for how consumers read them.
Consumers
Consumers subscribe to the topics produced by the producers and read their events, tracking their position in each partition with an offset.
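For instance, with the kafka-python client introduced later in this article, a consumer subscribes to one or more topics, and consumers that share a group_id split a topic's partitions between them. The topic, group, and broker names below are assumptions:

```python
from kafka import KafkaConsumer

# Consumers with the same group_id share the partitions of the subscribed topic.
consumer = KafkaConsumer(
    "product-updates",
    group_id="inventory-service",
    bootstrap_servers="localhost:9092",
)

for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```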
Clusters
A cluster refers to a collection of brokers. Clusters organize Kafka topics, brokers, and partitions to ensure equal workload distribution and continuous availability.
Brokers
Brokers exist within a cluster. A cluster contains one or more brokers working together. A single broker represents a Kafka server.
ZooKeeper
ZooKeeper stores metadata about a Kafka cluster. It houses information like user details, naming and configuration data, and access control lists (ACLs) for topics.
Partition
Each broker hosts one or more partitions. A topic is split into partitions, and each partition is replicated across brokers. One replica acts as the leader and coordinates writes for that partition; if the leader fails, a follower replica takes over. This partitioning and replication across brokers is what makes Kafka fault-tolerant.
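To illustrate, kafka-python's default partitioner hashes the message key, so events that share a key land on the same partition and keep their relative order. A minimal sketch, assuming a hypothetical orders topic and a local broker:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# All three events share the key "customer-42", so they hash to the same
# partition and stay in order relative to each other.
for i in range(3):
    producer.send("orders", key=b"customer-42", value=f"order-{i}".encode("utf-8"))

producer.flush()
```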
Kafka provides three primary services to its users:
- Publishing messages to subscribers
- Storing messages durably, in the order they arrive
- Processing and analyzing real-time data streams
Benefits of Apache Kafka
Fault tolerance and high availability
In Kafka, a replication factor specifies how many copies of a topic's partitions are kept. If a broker fails, replicas on other brokers pick up the load and ensure messages still get delivered. This is what gives Kafka fault tolerance and high availability.
High throughput and low latency
Kafka delivers high throughput with latency as low as 2 ms while handling high-volume, high-velocity data streams.
Scalability and reliability
Because producers and consumers are decoupled and independent, Kafka scales seamlessly. Additionally, brokers can be added to or removed from a cluster to accommodate growing demands.
Real-time data streaming and processing
Kafka can quickly identify data changes by using change data capture (CDC) approaches like triggers and logs. This reduces compute load compared with traditional batch systems, which reload the data each time to identify changes. Kafka can also drive aggregations, joins, and other processing on event streams, as sketched below.
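As a sketch of what a simple stream aggregation can look like with the kafka-python client, the snippet below counts events per page and forwards the running totals to a downstream topic. The topic names, broker address, and message fields are assumptions:

```python
import json
from collections import Counter

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

counts = Counter()
for message in consumer:
    # Aggregate: keep a running count of views per page as events arrive.
    page = message.value["page"]
    counts[page] += 1
    # Forward the running total to a downstream topic.
    producer.send("page-view-counts", {"page": page, "views": counts[page]})
```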
Ease of integration through multiple connections
Kafka uses its connectors to connect and integrate with multiple event sources and sinks, such as Elasticsearch, AWS S3, and Postgres.
Setting up a Kafka instance
A Kafka instance represents a single broker in a cluster. Let’s set up Kafka on our local machine and create a Kafka instance.
1. Download Kafka and extract the folder.
Note: You must have Java 8+ installed to use Kafka.
2. From the extracted directory, start ZooKeeper by running bin/zookeeper-server-start.sh config/zookeeper.properties.
3. In another terminal session, start the Kafka broker by running bin/kafka-server-start.sh config/server.properties.
Create a Kafka topic
To publish messages to consumers, we need to create a topic to store our messages.
- Start another terminal session.
- Create two topics called sample-topic and second-topic, for example with the kafka-topics.sh script bundled with Kafka, or from Python as sketched below.
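If you prefer to create the topics from Python once the kafka-python client (installed later in this article) is available, a minimal sketch might look like this, assuming the broker listens on localhost:9092:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the local broker started earlier.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create the two topics used in this article.
admin.create_topics(new_topics=[
    NewTopic(name="sample-topic", num_partitions=1, replication_factor=1),
    NewTopic(name="second-topic", num_partitions=1, replication_factor=1),
])
```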
Kafka-Python processing
For most data scientists and engineers, Python is a go-to language for data processing and machine learning because it is:
- Easy to learn and use: Python uses a simplified syntax, making it easier to learn and use.
- Flexible: Python is dynamically typed, which makes it quick to build and iterate on applications.
- Supported by a large open-source community and excellent documentation: The Python community ranges from beginners to expert-level users, and Python's robust documentation helps developers get the most out of the language.
As mentioned earlier, because Kafka is built on Java and Scala, it can pose a challenge for data scientists who want to process data and stream it to brokers with Python. kafka-python was built to solve this problem: it behaves much like the official Java client but adds Pythonic interfaces, such as consumer iterators.
Setting up Kafka-Python stream processing
Prerequisites
- Python 3. Follow this guide to help install Python 3.
- Kafka—already installed from the previous step.
Installing Kafka-Python
Let’s set up Kafka for streaming in Python.
- Open another terminal session.
- Run pip install kafka-python to install the client, then verify the install as sketched below.
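A quick way to confirm the client installed correctly is to import it and print its version:

```python
# If this runs without an ImportError, kafka-python is ready to use.
import kafka

print(kafka.__version__)
```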
Next, we'll read and process a CSV file with a producer and then read the messages back with a consumer.
Writing to a Producer
KafkaProducer functions much like the Java client and sends messages to the cluster.
For our sample, we'll use Python to convert each line of a CSV file (data.csv) to JSON and send it to a topic (sample-topic).
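Here is a minimal producer sketch. It assumes a broker on localhost:9092, the sample-topic created earlier, and a data.csv file with a header row; the column names from that header become the JSON keys:

```python
import csv
import json

from kafka import KafkaProducer

# Serialize every message as UTF-8 encoded JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read data.csv and publish each row as a JSON message to sample-topic.
with open("data.csv", newline="") as csv_file:
    reader = csv.DictReader(csv_file)  # uses the header row as field names
    for row in reader:
        producer.send("sample-topic", value=row)

# Block until all buffered messages have been delivered.
producer.flush()
```

Because the value_serializer handles the JSON conversion, the loop only has to hand each row dictionary to send().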
Writing to a Kafka Consumer
Next, we use a Kafka consumer to check that the messages were successfully published to the topic.
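A minimal consumer sketch, again assuming the local broker and the sample-topic used above:

```python
import json

from kafka import KafkaConsumer

# Read sample-topic from the beginning and decode each message back into a dict.
consumer = KafkaConsumer(
    "sample-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each message value is one row of data.csv as a dictionary.
    print(message.value)
```

If each row of data.csv prints as a dictionary, the messages were published successfully.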
Summary
The services provided by Kafka enable organizations to act instantly on event streams, build data pipelines and seamlessly distribute messages, all of which drive vital business decision-making. Kafka is scalable, fault-tolerant and can stream millions of messages daily. It employs multiple connectors, allowing for various event sources to connect to it.
kafka-python is a Python client for Kafka that helps data scientists and engineers process data streams and send them to Kafka for consumption or storage, improving data integration. The StreamSets data integration platform improves on this further by providing an interface that lets organizations stream data from multiple origins to various destinations.
Where StreamSets, Kafka and Python come together
StreamSets allows organizations to make the most of Kafka's services. Using StreamSets, organizations can build effective data pipelines that process messages from producers and send them to multiple consumers without writing code. For instance, with the Kafka origin, engineers can process and send large volumes of messages to multiple destinations like AWS S3, Elasticsearch, Google Cloud Storage, and TimescaleDB.
StreamSets also eases the burden of building stream processing pipelines from scratch by providing an interactive interface that allows engineers to connect to Kafka, stream to destinations and build effective and reusable data pipelines. This feature empowers your engineers and developers to do more with your data.