Question 1

What is streaming data?

Accepted Answer

Streaming data is the continuous flow of information from disparate sources to a destination for real-time processing and analytics.
What is a data stream example?
Real-time data streaming is beneficial when new data is generated continually. For example, credit card companies can use streaming transaction data to detect irregularities and stop fraud before it happens. Applications can also present recommendations to customers based on their real-time choices, leading to a better customer experience (à la Netflix, Amazon or YouTube).
Personalizing a web experience like this, calculating optimal truck routes or reporting on sleep patterns are examples of real-time analytics. Streaming data used to promote a product add-on during checkout, auto-drive the truck or soothe a baby back to sleep are examples of real-time applications.
For the purposes of this article, we will focus on streaming data used for analytics, including sentiment analysis, predictive analytics and machine learning/AI.
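The credit card example above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `(card_id, amount)` record shape, the `flag_irregular` name, and the "five times the running average" rule are all hypothetical, and real fraud detection uses far richer models. The point is that each event is evaluated as it arrives rather than after a batch is collected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Transaction = Tuple[str, float]  # (card_id, amount); hypothetical record shape


def flag_irregular(transactions: Iterable[Transaction],
                   factor: float = 5.0,
                   min_history: int = 3) -> Iterator[Transaction]:
    """Yield transactions far above the card's running average amount."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for card_id, amount in transactions:
        seen = counts[card_id]
        if seen >= min_history and amount > factor * (totals[card_id] / seen):
            # Flag the event as it arrives; do not fold it into the average.
            yield card_id, amount
            continue
        totals[card_id] += amount
        counts[card_id] += 1


stream = [("card-1", 20.0), ("card-1", 25.0), ("card-2", 300.0),
          ("card-1", 15.0), ("card-1", 900.0), ("card-1", 22.0)]
flagged = list(flag_irregular(stream))
print(flagged)
```

Because `flag_irregular` is a generator, it works the same way on a finite list and on an unbounded source that never ends.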

Question 2

Streaming data and real-time analytics

Accepted Answer

To put streaming data into perspective, roughly 2.5 quintillion bytes of data are created worldwide every day, according to a widely cited estimate. And data isn’t just coming from people. IDC estimates that there will be 41.6 billion devices connected to the “Internet of Things” by 2025. From airplanes to soil sensors to fitness bands, devices generate a continuous flow of streaming data for real-time analytics and applications.
Everyone wants their slice of that data to do what they do better:

Sales and marketing can offer real-time suggestions for a next best action
Operations and customer service can cut repair and build times by working more efficiently
Security and compliance can detect fraud and take action before damage is done

They depend on a continuous flow of data coming from sources that are subject to change and often out of IT’s control. On the destination side of the data value chain, data consumers use many different systems, designed for particular types of analysis. In the middle is the data engineer, tasked with creating connections and making sure data stays correct and consistent.
So, what is the benefit of streaming data?
To put it simply, the real-time nature of stream data processing allows data teams to deliver continuous insights to business users across the organization.

Question 3

How data processing works: Preparing data for analytics

Accepted Answer

Before data can be used for analysis, the destination system has to understand what the data is and how to use it. Data flows through a series of zones with different requirements and functions:
Raw zone
The raw zone stores large amounts of data in its originating state, usually in its original format (Avro, JSON or CSV, for example). Data comes into the raw zone through ingestion as streaming data, as a batch of data, or through a change data capture process in which only changes to previously loaded data are applied.
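Change data capture is easier to picture with a small sketch. The one below assumes a hypothetical change-event shape, `(operation, primary_key, row)`, and an in-memory dictionary standing in for previously loaded data; real CDC tools define their own event formats.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change event: (operation, primary_key, row or None)
Change = Tuple[str, int, Optional[dict]]


def apply_changes(table: Dict[int, dict],
                  changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply insert/update/delete events to a previously loaded table."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            # Merge only the changed columns into the existing row.
            table[key] = {**table.get(key, {}), **(row or {})}
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


loaded = {1: {"name": "Ada", "city": "Oslo"},
          2: {"name": "Bo", "city": "Bergen"}}
changes = [("update", 1, {"city": "Trondheim"}),
           ("delete", 2, None),
           ("insert", 3, {"name": "Cy", "city": "Oslo"})]
result = apply_changes(loaded, changes)
print(result)
```

Only the three change events travel through the pipeline; the unchanged columns of row 1 are never re-sent.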
Clean zone
The clean zone (or the refined zone) is a filter zone where transformations may be used to improve data quality or enrich data. Common transformations include data type definition and conversion, removing unnecessary columns and masking identifiable data. The organization of this zone is determined by the business needs of the end users; for example, the zone may be organized by region, date or department.
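The transformations named above (type conversion, dropping columns, masking) might look like the following sketch. The field names (`order_id`, `email`, `debug_info` and so on) are invented for illustration, and the truncated SHA-256 digest is only one simple way to mask a value; it is pseudonymization, not strong anonymization.

```python
import hashlib
from datetime import datetime
from typing import Dict


def clean_record(raw: Dict[str, str]) -> Dict[str, object]:
    """Convert types, drop unneeded columns and mask identifiable data."""
    email_digest = hashlib.sha256(raw["email"].encode("utf-8")).hexdigest()
    return {
        # Type definition and conversion: raw values arrive as strings.
        "order_id": int(raw["order_id"]),
        "amount": float(raw["amount"]),
        "ordered_at": datetime.strptime(raw["ordered_at"], "%Y-%m-%d"),
        "region": raw["region"].strip().upper(),
        # Masking: keep a stable token instead of the address itself.
        "email_hash": email_digest[:12],
        # Unnecessary columns such as "debug_info" are simply not copied.
    }


raw = {"order_id": "42", "amount": "19.90", "ordered_at": "2024-03-01",
       "region": " emea ", "email": "a@example.com", "debug_info": "x"}
cleaned = clean_record(raw)
print(cleaned)
```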
Curated zone
The curated zone is the consumption zone, optimized for analytics rather than data processing. This zone stores data in denormalized data marts and is best suited for analysts or data scientists who want to run ad hoc queries, analysis or advanced analytics.
Conformed zone
The conformed zone houses data transformed and structured for business intelligence and analytics queries.

Question 4

From Apache Kafka to object stores

Accepted Answer

Apache Kafka is an open source distributed event streaming platform, often described as a “pub/sub” messaging system. A streaming data source publishes data to the stream, and a destination system subscribes to receive it. The publisher doesn’t wait for subscribers, and subscribers jump into the stream when they need it. Kafka is fast, scalable and durable, and it was a pillar of on-premises big data deployments.
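The publish/subscribe idea can be sketched in a few lines. This is a toy in-memory model, not the Kafka client API: the `Topic` class and its `publish`, `subscribe` and `poll` methods are hypothetical names chosen for the illustration. It shows the two properties described above: the publisher never waits, and each subscriber keeps its own position in the stream.

```python
from typing import Dict, List


class Topic:
    """Append-only log; each subscriber tracks its own read offset."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, message: str) -> None:
        # The publisher appends and returns; it never waits for subscribers.
        self._log.append(message)

    def subscribe(self, name: str, from_beginning: bool = False) -> None:
        # A subscriber can join at the current end or replay retained data.
        self._offsets[name] = 0 if from_beginning else len(self._log)

    def poll(self, name: str) -> List[str]:
        # Return everything published since this subscriber last polled.
        start = self._offsets[name]
        self._offsets[name] = len(self._log)
        return self._log[start:]


topic = Topic()
topic.publish("event-1")
topic.subscribe("dashboard")                    # joins mid-stream
topic.publish("event-2")
topic.subscribe("audit", from_beginning=True)   # replays the retained log
topic.publish("event-3")
dashboard_msgs = topic.poll("dashboard")
audit_msgs = topic.poll("audit")
print(dashboard_msgs, audit_msgs)
```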
Cloud platforms introduced a new way of storing unstructured data called an object store. Producers became decoupled from consumers, and the cost of storage became negligible. You could keep all the data you wanted as objects to be accessed when needed. For example, Amazon Kinesis integrates directly with Amazon Redshift (an analytics database) and Amazon S3 for streaming data.

Question 5

Stream processing vs batch processing

Accepted Answer

Making streaming data useful requires a different approach than traditional batch-oriented data integration techniques. Think of batch processing as producing a movie. The production has a beginning, a middle and an end. When the work is complete, there is a whole, finished product that will not change in the future. Stream processing is more like an episodic show. All of the production tasks still happen, but on a rolling time frame with endless permutations.
In batch processing, data sets are extracted from sources, processed or transformed to make them useful, and loaded into a destination system. ETL processing creates snapshots of the business in time, stored in data warehouses or data marts for reporting and analytics. Batch processing works for reporting and applications that can tolerate latency of hours or even days before data becomes available downstream.
With the demand for more timely information, batches grew smaller and smaller until a batch became a single event and stream processing emerged. Because a stream has no beginning or end, sliding window processing was developed so you could run analytics on any time interval across the stream.
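A sliding window can be sketched as follows. The sketch assumes events arrive in timestamp order as `(timestamp_seconds, value)` pairs, which is a simplification; production frameworks also deal with late and out-of-order events. The function name `sliding_sum` is illustrative.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp_seconds, value)


def sliding_sum(events: Iterable[Event],
                window_seconds: float) -> Iterator[Tuple[float, float]]:
    """For each event, yield (timestamp, sum of values in the trailing window)."""
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that fell out of the window (ts - window_seconds, ts].
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total


events = [(0, 1.0), (4, 2.0), (9, 3.0), (12, 4.0), (30, 5.0)]
results = list(sliding_sum(events, window_seconds=10))
print(results)
```

The window moves with every event, so the stream never has to end before a result is available.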
Handling both stream and batch processing has become essential to a modern approach to data engineering. At DNB, Norway’s largest financial services group, data engineers use streaming instead of batch wherever possible as a data engineering best practice.

Question 6

Stream processing frameworks

Accepted Answer

Stream processing frameworks give developers stream abstractions on which they can build applications. There are at least 5 major open source stream processing frameworks and a managed service from Amazon. Each one implements its own streaming abstraction with trade-offs in latency, throughput, code complexity, programming language, etc. What do they have in common? Developers use these environments to implement business logic in code.

Apache Flink for stateful computing over data streams
Apache Ignite for high performance computing with in-memory speed
Apache Samza for stateful applications that process data in real-time
Apache Spark for scalable, fault-tolerant streaming applications
Apache Storm for distributed real-time computations
Amazon Kinesis Data Streams for real-time managed data streaming

Apache Spark is the most commonly used of these frameworks due to its native language support (SQL, Python, Scala and Java), distributed processing power, performance at scale, and sleek in-memory architecture. Apache Spark processes data in micro-batches.
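The micro-batch idea can be illustrated without Spark itself. The sketch below is not Spark's API; it groups a stream into small fixed-size batches, whereas a micro-batch engine typically groups by a trigger interval. The `micro_batches` helper is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a (possibly unbounded) stream into small consecutive batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


batches = list(micro_batches(range(7), batch_size=3))
totals = [sum(batch) for batch in batches]  # one small job per micro-batch
print(batches, totals)
```

Each batch is processed with ordinary batch logic, which trades a little latency for throughput and simpler fault recovery.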

Question 7

Streaming data pipeline examples

Accepted Answer

A data pipeline is the series of steps required to make data from one system useful in another. A streaming data pipeline flows data continuously from source to destination as it is created, making it useful along the way. Streaming data pipelines are used to populate data lakes or data warehouses, or to publish to a messaging system or data stream.
The following examples are streaming data pipelines for analytics use cases.
Sending Kafka messages to S3
Where your data comes from and where it goes can quickly become a criss-crossing tangle of streaming data pipelines. Streaming data pipelines that can handle multiple sources and destinations allow you to scale your deployment both horizontally and vertically, without the complexity. Find out how to manage large workloads and scale Kafka messages to S3.
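The definition of a streaming data pipeline as a series of steps between source and destination can be sketched with chained generators. Everything here is illustrative: the sensor records, the stage names (`parse`, `keep_valid`, `enrich`, `load`) and the list standing in for a destination are assumptions, not part of any particular product.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse(lines: Iterable[str]) -> Iterator[Dict]:
    for line in lines:
        yield json.loads(line)


def keep_valid(records: Iterable[Dict]) -> Iterator[Dict]:
    # Drop records that lack the fields downstream steps rely on.
    for record in records:
        if "device" in record and "temp_c" in record:
            yield record


def enrich(records: Iterable[Dict]) -> Iterator[Dict]:
    # Make the data more useful along the way: add a derived field.
    for record in records:
        yield {**record, "temp_f": record["temp_c"] * 9 / 5 + 32}


def load(records: Iterable[Dict], destination: List[Dict]) -> None:
    for record in records:
        destination.append(record)


source = ['{"device": "s1", "temp_c": 20}',
          '{"device": "s2"}',
          '{"device": "s3", "temp_c": 100}']
sink: List[Dict] = []
load(enrich(keep_valid(parse(source))), sink)
print(sink)
```

Each record flows through every stage before the next one is read, so the same code works whether the source is a short list or an endless feed.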

Question 8

Challenges to streaming data

Accepted Answer

Before you choose a tool or start hand-coding streaming data pipelines for mission-critical analytics, consider these decision points.
The tyranny of change
Data will drift and you need a plan to handle it. Schemas change, semantics change and infrastructure changes. When your analytics depend on real-time data, you can’t take a pipeline out of production to update it. You need to make updates and preview changes without stopping and starting the data flow. Better yet, you need the ability to automate data drift handling as much as possible to ensure continuous data.
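Detecting schema drift is the first step toward handling it automatically. The sketch below compares each incoming record with an expected schema and reports what changed; the `detect_drift` function, the field names and the three drift categories are assumptions made for the example, not a description of any specific tool.

```python
from typing import Dict, Set


def detect_drift(expected: Dict[str, type],
                 record: Dict[str, object]) -> Dict[str, Set[str]]:
    """Report fields that are missing, newly added, or of an unexpected type."""
    missing = {field for field in expected if field not in record}
    added = {field for field in record if field not in expected}
    retyped = {field for field, kind in expected.items()
               if field in record and not isinstance(record[field], kind)}
    return {"missing": missing, "added": added, "retyped": retyped}


expected = {"id": int, "amount": float, "currency": str}
record = {"id": "17", "amount": 9.5, "channel": "web"}
drift = detect_drift(expected, record)
print(drift)
```

A pipeline could route drifted records to a side channel or update the expected schema instead of stopping the flow.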
What about hand coding?
While technologies like Kafka and Spark simplify many aspects of stream processing, working with any one of them still requires specialized coding skills and plenty of experience with Java, Python, Scala and more. Finding skilled developers in any single stream processing technology is difficult, but building a team with expertise in more than one? Not for everyone’s budget. Hand coding limits your team’s ability to scale and democratize data access.
The wild innovation ride
As new stream processing frameworks solve streaming data challenges, you need to be able to adapt and optimize your data pipelines. Cloud-based solutions work well natively, but what about streaming data across platforms or to multiple destinations? You might have to go back to hand coding your own connectors, or end up with multiple, separate systems to monitor and maintain.
Follow the business logic
These questions focus on the “how” of the data pipeline implementation details. How will data get from point A to point B and be useful? What happens when there are lots of As, lots of Bs, and the data never stops flowing? How do you stay ahead of the “what” without getting bogged down in the “how”?
The majority of business logic that drives the modern enterprise resides in the integration between thousands of specialized applications across multiple platforms. Your analytics and operations become the most vulnerable points in modern business operations.
A data engineering approach to building smart data pipelines allows you to focus on the what of the business logic instead of the how of implementation details. Ideally, your streaming data pipeline platform makes it easy to scale out a dynamic architecture and read from any processor and connect to multi-cloud destinations.

The StreamSets and webMethods platforms have now been acquired by IBM

What are stream processing, streaming data, and streaming data pipelines?

How stream processing and streaming data pipelines turn digital actions into real-time analytics


“We encourage our data engineers to use streaming mode wherever possible. The downstream pipeline can be run as per the requirement, but it always gives us the option of running it more frequently than once a day to a near real-time by using this approach.”
