Introduction
The data dependency pressure cooker
Data is at the very core of digital transformation. Businesses are dependent on insights from data to meet strategic and operational goals. Without data, enterprises cannot make smart real-time decisions, stay competitive, or accelerate innovation. Data leaders and practitioners know that to meet business demand for digitalization, information assets must move seamlessly and at speed throughout an organization.
But this is easier said than done. The modern data ecosystem is enormous, complex, and dynamic, and it is constantly evolving as data architectures become increasingly fluid. Building pipelines that connect data from source to destination requires rules to integrate, transform, and process data across multiple environments. The data supply chain is not fixed: it stretches from cloud applications and services to on-premises mainframe and legacy systems. All of this has made the job of building resilient data pipelines considerably harder.
Under-resourced technical teams are struggling to keep up with the volume of requests for data from the business without ceding control. And business teams simply want data on demand to inform their operations and advance their digitalization initiatives. The end result is frustration on both sides of the aisle.
StreamSets wanted to lift the lid on the hidden problem of data integration friction and find out what it means for today’s modern enterprises. And who better to ask than those on the data front lines? We surveyed 653 data decision makers and practitioners from large enterprises in the US, UK, Germany, France, Spain, Italy and Australia to understand the challenges of delivering data to the business. In this report, we explore the results and shine a light on the burden data leaders and practitioners face.
Demand for data is outstripping supply
Access to data is critical to every aspect of an organization’s digital and strategic objectives. Whether navigating turbulent economic headwinds and volatile supply chains, launching new products and services, or simply staying competitive, organizations require real-time data analytics fueled by large volumes of accurate and timely data.
While technical teams would traditionally expect departments like finance and sales to request data frequently, this research shows that all lines of business are consuming more data as digital transformation continues at pace. Almost half (48%) of admin and operations departments, and of customer service departments, request data at least weekly. They are followed by accounting and finance (44%), other IT and digital teams (43%), and sales and marketing (40%).
The increase in requests means a classic supply and demand problem exists. The demand for data is higher than the ability of most technical teams to provide it. More than half (59%) of respondents say the acceleration of digital transformation priorities has created major data supply chain challenges.
The problem of meeting demand for data is compounded by the complexity of enterprises’ ecosystems. Data engineers must take many steps to connect, transform and process data to build pipelines that meet the individual needs of different departments. But when data is siloed in multiple systems with inconsistent formats, creating bespoke data pipelines at scale is a huge challenge. Almost two-thirds of respondents (65%) say this data complexity and friction can have a crippling impact on digital transformation.
As a result, there is often a disconnect between the expectations of line of business teams and what can actually be delivered. Data leaders and practitioners are frustrated that non-experts expect data on demand and have little understanding of the scale of the data integration challenge.
Data demand: the IT and business disconnect
Data chaos is holding businesses back
In a bid to mitigate data integration friction, many businesses have invested heavily in “enabling” technologies to increase agility and drive digital transformation. These include everything from moving to the cloud and implementing AI to adopting elastic and hyper-scalable data platforms. However, keeping up with constant change introduced by technology is hard and can, in fact, add to the difficulties of data integration friction.
As data sources and technology platforms proliferate, businesses end up with a patchwork of systems where data becomes increasingly siloed. Whether legacy systems, point solutions, custom-built tools or solutions from a cloud service provider, the result is a fragmented and chaotic data environment.
As a result, what should be a simple pipeline-building task becomes a complex job requiring expensive expert skills. Inevitably, this hampers technical teams and slows them down. The research finds that over two-thirds (68%) of data leaders say data friction is preventing them from delivering data at the speed the business requests it. And more than four-in-ten (43%) say data friction is a “chronic problem” in their organization.
Several factors contribute to this friction. The issue most cited by respondents was the variety of data formats, both structured and unstructured (38%), followed by the speed at which data is created (36%) and the presence of legacy technologies (30%).
Legacy technologies
To further underline the point around legacy technologies, 51% of respondents say data in legacy systems, such as mainframes or on-premises databases, is hard to access for cloud analytics, so they often “don’t bother” to include it when creating data pipelines. That is a considerable risk.
Legacy systems typically have many decades of valuable business insights held within them. This statistic also highlights that cloud analytics are not a panacea. Many analytics products focused on sourcing data from SaaS applications are unable to extract data completely from complex multi- and hybrid cloud environments, let alone data trapped in legacy systems.
No modern enterprise today can afford to simply ignore legacy data, especially because legacy systems typically hold years of proprietary data and much of a company’s IP. This data is the “secret sauce” that gives a business its edge. This specific, transactional, and granular data ensures that the insights from machine learning and AI models drive optimal business decisions. For many, getting the data “out” of legacy systems is the biggest obstacle, but it’s also the biggest gain when it comes to business insight.
Without the assurance that data from all sources is collated, businesses cannot fully trust their data. And in today’s world, data must come without caveats. However chaotic the data ecosystem, technical teams need the capacity to run dynamic data pipelines in any cloud or on-premises environment to unlock insights that drive innovation.
Beyond friction: Cracks in the pipelines
As we’ve discussed, the chaos of modern data ecosystems makes building smart and resilient data pipelines hugely difficult. For many businesses, establishing a pipeline is labor intensive and requires expert data engineers to hand-code one-off solutions that can’t be templatized or reused. These pipelines are not automatically insulated from unexpected shifts in the environment, resulting in brittle pipelines that are vulnerable to breakage.
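To make the contrast concrete, the sketch below compares a hypothetical hand-coded, one-off load job with a parameterized version of the same work. It is illustrative Python only, not StreamSets code; the file paths, column names, and table names are invented for the example.

```python
# Illustrative only: a hypothetical hand-coded, one-off pipeline.
# Every assumption (file path, column names, target table) is hard-wired,
# so any change in the source or destination means editing and redeploying code.
import csv
import sqlite3

def load_orders_one_off():
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    with open("/exports/orders_2023.csv", newline="") as f:        # fixed path
        for row in csv.DictReader(f):
            conn.execute(
                "INSERT INTO orders VALUES (?, ?)",
                (row["OrderID"], float(row["Amount"])),             # fixed columns
            )
    conn.commit()

# The same work expressed as a reusable, parameterized template:
# source, column mapping, and destination are configuration, not code.
def load_csv(path, table, mapping, db="warehouse.db"):
    """Load any CSV into any table using a column mapping supplied as config.
    Assumes the target table already exists."""
    conn = sqlite3.connect(db)
    cols = ", ".join(mapping)                       # destination columns
    placeholders = ", ".join("?" for _ in mapping)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            conn.execute(
                f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
                [row[src] for src in mapping.values()],  # source columns from config
            )
    conn.commit()
```

The point of the second form is that once sources, mappings, and destinations become configuration rather than code, one pipeline definition can be templatized and reused across many departmental requests instead of being rebuilt each time.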
The research finds that 39% of data leaders and practitioners admit their pipelines are too brittle and crack at the first bump in the road. A noteworthy 87% of respondents have experienced data pipeline breaks at least once a year, with more than a third (36%) saying their pipelines break every week, and worryingly, 14% say they break at least once a day.
Considering that large enterprises can have thousands of critical data pipelines in place, this represents a huge amount of disruption. It leaves line of business teams working with outdated information and technical teams dealing with a mountain of repair work. The business impact of broken data pipelines can be significant. For example, a supply chain director working with old data may over- or under-order goods. Customer-facing teams in a haulage and logistics company cannot accurately tell customers when to expect their orders. And a trader in a financial services firm may be left to make stock picks on out-of-date intel.
Pipelines break when they are not resilient to changes in the environment. The most cited reasons for breakage by data leaders and practitioners in our research include bugs and errors being introduced during a change (44%), infrastructure changes such as moving to a new cloud (33%), and credentials changing or expiring (31%).
It is not surprising to see cloud so high on this list. Migrating to the cloud can cause as many problems as it can solve if businesses don’t have a clear data strategy for multi- and hybrid cloud environments. When companies opt for a basic “lift-and-shift” approach to a cloud move, technical teams are forced to carry out extensive rework to orchestrate systems and connect them to data pipelines.
Without this remedial work, every change to how data is stored and consumed heightens the risk of breaking the data flow. But this data drift, the unexpected changes to data structure, semantics, and infrastructure, is a fact of modern data architectures. When businesses can’t evolve their data architecture in tandem with the infrastructure and platform choices they make, sub-optimal data pipelines are the result.
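In rough terms, drift tolerance means a pipeline keeps flowing and flags the change when a field is renamed, added, or dropped upstream, rather than failing outright. The Python sketch below is a simplified illustration of that idea; the field names and the handling policy are hypothetical assumptions, not a description of any particular product.

```python
# Illustrative only: a minimal sketch of how a pipeline might tolerate data drift
# (unexpected schema changes) instead of breaking. The "expected schema" and the
# handling policy here are hypothetical.
EXPECTED_FIELDS = {"order_id", "amount", "currency"}

def apply_record(record: dict, expected=EXPECTED_FIELDS):
    """Process a record despite drift: keep known fields, surface unknown ones."""
    known = {k: v for k, v in record.items() if k in expected}
    extra = {k: v for k, v in record.items() if k not in expected}
    missing = expected - set(record)

    if missing:
        # A brittle pipeline would raise here and stop the flow;
        # a drift-aware one fills defaults and flags the change for review.
        known.update({field: None for field in missing})
    if extra:
        # New columns appeared upstream: park them for inspection rather than failing.
        print(f"drift detected, unexpected fields: {sorted(extra)}")

    return known

# Example: the source team renamed "amount" to "amount_usd" without warning.
print(apply_record({"order_id": "A-100", "amount_usd": 42.0, "currency": "USD"}))
```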
Businesses need to be able to introduce changes without having to worry about the stability of their data pipelines. They need the capability to ingest more data without needing to build more infrastructure. When they can do this, data engineers are freed up to complete high value work, empowering line of business teams with data that allows them to innovate. But the research shows that today, many technical teams are not equipped with the right tools to satisfy this need. Almost half (46%) say their ability to tackle broken data pipelines lags behind other areas of data engineering, and 43% say they struggle to fix data pipelines in motion.
The human impact of data integration friction
Stop apologizing for your data.
Trust in the data you use is essential. Imagine forecasting P&L, predicting sales, analyzing marketing campaigns, and reporting financial results to the board with complete confidence in the data.
No one wants to have to apologize for their data. Or justify why the last quarter’s figures aren’t included. Or caveat their data sets with explanations of why they’re not quite up to date. It makes recommendations less powerful and can even damage a department’s reputation.
Yesterday’s data is not the same as today’s. Businesses must be powered by resilient data pipelines that automatically ingest the most up-to-date information and serve it on demand, wherever it is needed. This gives data users complete confidence in the trustworthiness and accuracy of data, so they can stop apologizing for it.
The true cost of data integration friction
Unleash the power of data across the enterprise
Data is a critical success factor in the modern enterprise. It drives digital transformation, experimentation and prototyping, and real-time analytics to keep businesses competitive and thriving. But as the results of this research have demonstrated, data integration friction is holding technical teams back and stopping them from keeping up with “need-it-now” business demands. Data leaders are aware of the scale of the problem, with 70% of respondents believing that smarter data pipelines would enable them to deliver data to the business at pace.
Instead, the complex and brittle architectures that most businesses are working with are driving up costs and impeding agility. This leaves enterprises struggling to meet the demand for data and empower all lines of business with data insights that accelerate innovation and real-time decision making.
Businesses can reduce the load on already stretched technical data teams by enabling individual business units and end-users to do more “last mile” data collection and analysis themselves. The clear majority of respondents agree: 70% of data decision makers and practitioners are currently responsible for the last mile of data delivery, but 86% would prefer that line of business teams be empowered to do this independently.
GSK (formerly GlaxoSmithKline plc)
Powering drug discovery with self-service data
Solution
Bringing a new drug to market costs billions of dollars and can take up to 20 years. At every phase of drug discovery research, from compound investigation to post-market monitoring, data is critical. Scientists require multidisciplinary data that is accurate, relevant, and trusted to inform their research. GSK wanted to give its 10,000+ scientists engaged in R&D around the world access to such data. To achieve this, it needed to de-silo its data sources and deliver the data through a single enterprise-wide platform that allowed individual teams to consume data on demand. GSK worked with StreamSets to bring this vision to fruition, building a Data Center of Excellence to accelerate the delivery of clean data from thousands of data sources. Using StreamSets, GSK has automated data pipeline creation and data drift handling without interrupting the critical flow of self-service data for scientists.
Result
- Onboarding time for new data sources was reduced by 98%.
- New product discovery time was reduced by 96%.
- Accelerated time to market for new drugs by almost 3 years.