IBM's Self-Service Data Transformation with StreamSets

The StreamSets platform has enabled IBM to handle data at scale, allowing teams to access data as and when they need it, boosting innovation and overall operations.

With over 400,000 employees across the world, IBM has one of the largest corporate networks in existence. This network acts as a foundation service with many unique and complex technologies running across it. Network health is critical for both ongoing operations and IBM’s financial health.

Charged with the health of the network, the CIO Network Engineering team delivers automation and tooling services that provide visibility into the environment and keep it operating reliably. To do this, they need continuous, reliable, and transparent operational data that teams around the world can use in real time.

“StreamSets was one of the tools we looked at and, in the end, it naturally filled the space for us. People had already started using it; the adoption rate was very high, and, guess what, they built pipelines which actually worked. So, it became a de facto ingredient in our DataOps practice.”

- Divya Yashwanth, Software Developer, CIO Network Engineering, IBM

Challenge

IBM’s global network has over 1,000 sites in more than 160 countries and processes terabytes of data and metrics each day. Some devices generate 10,000 to 12,000 records per second: high-volume, high-velocity data that the team needs in real time. In addition, the global network is heterogeneous; there are thousands of different hardware and software systems from hundreds of vendors, with hundreds of different data formats.
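To put the per-device rate in perspective, the arithmetic below converts records per second into records per day. It is a rough sketch that assumes a device sustains the quoted rate around the clock, which the figures above do not state; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def records_per_day(records_per_second: int) -> int:
    """Daily record count, assuming the rate is sustained for a full day."""
    return records_per_second * SECONDS_PER_DAY


# The 10,000 to 12,000 records-per-second range quoted for some devices.
low = records_per_day(10_000)
high = records_per_day(12_000)

print(f"{low:,} to {high:,} records per day per device")
```

Under that assumption, a single such device alone would produce on the order of a billion records per day.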

To parse and get meaning out of all this network equipment and operations data, the team needs to ingest, transform, and move the data to the nearest data lake. Local teams work with data on regional data lakes to reduce bandwidth costs, while corporate visibility ensures oversight and reporting.
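The ingest, transform, and move steps can be sketched in a few lines of plain Python. This is an illustration of the idea only, not IBM's pipelines or the StreamSets API: the two input formats, the field names (host, site, status), the country-to-region table, and the lake names are all assumptions made up for the example.

```python
import json
from typing import Callable, Dict


def parse_json(raw: str) -> dict:
    """Parse a device record delivered as a JSON object."""
    return json.loads(raw)


def parse_key_value(raw: str) -> dict:
    """Parse a record such as 'host=rtr1 site=FR status=up'."""
    return dict(pair.split("=", 1) for pair in raw.split())


# One parser per source format (hypothetical formats).
PARSERS: Dict[str, Callable[[str], dict]] = {
    "json": parse_json,
    "kv": parse_key_value,
}

# Hypothetical mapping from a site's country code to a regional data lake.
REGION_BY_COUNTRY = {"US": "americas", "FR": "emea", "JP": "apac"}
DEFAULT_LAKE = "corporate"


def normalize(raw: str, fmt: str) -> dict:
    """Ingest and transform: turn any supported format into one common shape."""
    fields = PARSERS[fmt](raw)
    return {
        "host": str(fields["host"]),
        "site": str(fields["site"]).upper(),
        "status": str(fields["status"]).lower(),
    }


def route(record: dict) -> str:
    """Move: choose the regional lake for the record's site, or the fallback."""
    return REGION_BY_COUNTRY.get(record["site"], DEFAULT_LAKE)


if __name__ == "__main__":
    samples = [
        ('{"host": "rtr1", "site": "fr", "status": "UP"}', "json"),
        ("host=sw9 site=JP status=down", "kv"),
    ]
    for raw, fmt in samples:
        record = normalize(raw, fmt)
        print(route(record), record)
```

Keeping the per-format parsers separate from the routing rule means a new vendor format only needs one new parser entry, which is the kind of change a local team could make without touching the rest of the flow.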

The team uses a lambda architecture, which allows for both batch and stream processing. Initially, they ran an ELK-based log collection platform, using Elasticsearch, Logstash, and Kibana. However, Logstash’s code-intensive interface, with no visualization component, made integration with various sources and destinations exceedingly difficult. The team began a proof of concept (POC) to look at alternative tools to replace Logstash “because it lacked DataOps capabilities.” In IBM’s case, that means the ability to enable local teams to easily build and manage their pipelines by operationalizing continuous data management and integration.

Solution

The team uses StreamSets to develop pipelines, create jobs, and make sure the jobs run on specific StreamSets pods. They have done a lot of work to automate and operationalize their StreamSets deployment, setting up automatic provisioning agents, pipeline template fragments, and topologies, and integrating StreamSets into CI/CD with Docker and GitHub.

Results

By using StreamSets, the team can empower colleagues around the world to innovate from the edge. There are no waiting lines for data. Smart data pipelines enable self-service data for everyone.

StreamSets’s easy-to-use, drag-and-drop UI helps people with less coding experience gain confidence by allowing them to easily build pipelines and deliver something better than they could with traditional programming. The HQ team now has global visibility into 20,000+ data pipelines and billions of streaming records, while local teams can immediately access the data they need to keep local network operations running smoothly. “We use StreamSets because it’s the only technology that handles volume at scale,” shared Stephan Barabasi, big data, cloud architect, and data scientist, IBM, who has plans to scale StreamSets beyond the CIO Network Engineering team.

IBM adopts self-service data to support operational excellence

Immediate access to data sparks innovation
