The StreamSets and webMethods platforms have now been acquired by IBM 

The Data Integration Advantage: Building a Foundation for Scalable AI

Authored by Girish Pancha, Co-Founder, Chief Executive Officer & Arvind Prabhakar, Co-Founder, Chief Product Officer


For the last decade or so, data has been the business world’s darling. Curious why your customers are unhappy? Look at the data. Wondering what your next market should be? The data will tell you. Want to find out who your best-performing employees are? You know what to do.

Now, there’s a (not so) new kid in town that’s dominating the conversation: AI. Generative AI has ignited imaginations across the world. As the first widely available technology that lets anyone talk to an AI about anything and get coherent, even clever answers, it has moved AI from the abstract to an everyday reality.

But while AI may be overtaking public discourse, data is (of course) not going anywhere. That’s because the success of AI projects is not simply a result of innovative algorithms or machine learning models; it fundamentally relies on mass quantities of accessible, reliable data. AI, ML, and analytics output are meaningful only if the data they operate on is valid and observable across the whole lifecycle—sample data for exploration, test and training data for experimentation, and production data for evaluation.

As AI initiatives become more ambitious and scale across organizations, the demand for connected, quality, governed data increases in parallel. Modern data integration is the critical backbone for successfully scaling AI. And with 72% of Fortune 500 business leaders planning to incorporate generative AI within the next three years,¹ it’s time to get data integration right.

In this piece, we’ll explore:

  • The state of AI in the enterprise
  • Challenges of scaling AI
  • How modern data integration can remove AI scaling challenges
  • Moving beyond data integration for even better AI results

Read on to learn about data integration’s vital role in the quest to scale AI.

72% of technology executives say that should their companies fail to achieve their AI goals, data issues are more likely than not to be the reason.

- CIO Vision 2025: Bridging the Gap Between BI and AI, MIT Technology Review Insights

The State of AI in the Enterprise

For years, enterprises have been using AI in pockets across the organization. It has made great strides in:

  • Improving customer experience through chatbots and virtual assistants powered by natural language processing (NLP) that provide instant, personalized customer service 24x7. 
  • Optimizing supply chain processes by predicting demand, optimizing delivery routes, and identifying potential disruptions.
  • Identifying when machinery is likely to fail (predictive maintenance) so that maintenance can be carried out before a breakdown occurs.
  • Expediting research and development processes, reducing the time to market for products and services.
  • Detecting fraud, evaluating credit risk, and anticipating market changes with machine learning algorithms that identify patterns in historical data.

However, most enterprise AI usage is limited to very specific use cases and departments. BCG found that only 11% of companies have realized significant value from AI initiatives, and most have failed to scale AI beyond pilots.²

BCG’s 2022 Digital Acceleration Index, a survey of 2,700 companies, paints a picture of AI initiatives stuck in the early stages.³

However, there were ‘leaders’ in scaling and generating AI value among this group. BCG found that one of the primary characteristics of those leaders was making “data and technology accessible across the organization, avoiding siloed and incompatible tech stacks and standalone databases that impede scaling.”

Maturity of AI use cases across industries

Source: BCG Digital Acceleration Index global study, 2022

78% of enterprise technology leaders said that scaling AI and machine learning use cases to create business value is the top priority of their enterprise data strategy over the next three years.⁴

- CIO Vision 2025: Bridging the Gap Between BI and AI, MIT Technology Review Insights

The Challenges of Scaling AI

While there are many challenges in scaling AI (cost, lack of talent, trust and ethics), data quality and availability are arguably the biggest hurdles. In fact, 72% of technology executives surveyed in a recent MIT Technology Review Insights study say that, should their companies fail to achieve their AI goals, data issues are more likely than not to be the reason,⁵ and 61% of respondents in an IBM survey said their data is not ready for AI.⁶

AI models rely on a constant influx of high-quality data for training and inference. But organizations often grapple with data quality issues such as incomplete and inaccurate data. Another problem is integrating relevant data from different sources across the organization, such as mainframes, customer relationship management (CRM) systems, enterprise data warehouses and data lakes, business intelligence platforms, external systems, third-party data, and more.

To make matters even more complex, AI/ML models are not static; they require ongoing monitoring and maintenance to ensure performance and reliability. Monitoring for concept drift, model decay, and performance degradation is essential. Regular updates and retraining may be necessary to adapt models to evolving data patterns or changes in the operational environment. As such, organizations must establish processes to manage version control, model updates, and performance tracking.
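
As a rough illustration of what automated drift monitoring could look like, the sketch below compares the distribution of one numeric feature in training data against recent production data using the Population Stability Index (PSI). This is a minimal example under stated assumptions, not a description of any particular product: the function names, the ten-bucket layout, and the 0.2 retraining threshold are all illustrative choices.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature; higher means more drift."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width buckets over the baseline range; fall back to 1.0 if constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def should_retrain(
    baseline: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # illustrative cut-off, tune per use case
) -> bool:
    """Flag a feature whose production distribution has moved away from training."""
    return population_stability_index(baseline, current) > threshold
```

Running a check like this on a schedule for each important feature replaces a best guess about retraining time with a measurable signal, although the right metric and threshold depend on the model and the data.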

Today, most organizations handle these processes manually. They create manual workflows around retraining data, use new datasets, identify boundary conditions or fringe predictions that don’t match the norm, and then make the best guess as to the right time to retrain the model. Clearly, this is an imprecise science that can lead to subpar outcomes.

Given these challenges, a solid data foundation is essential for AI/ML models to function properly over the long term. Building an AI-powered application that’s relevant, accurate, and scalable depends on the ability to easily and securely access and share high-quality data, real-time or batch, across the organization.

How Modern Data Integration Solutions Can Remove AI Scaling Challenges

A recent PwC survey⁷ found that the top tech-related challenge for AI is identifying, collecting, or aggregating data from across the company, ensuring its completeness and accuracy in preparation for use in AI. This was followed closely by making sure all data in AI systems meets regulatory requirements for privacy and data protection and integrating AI and analytics systems to gain business insights.

As you upgrade your technology and architecture, the survey’s authors suggest focusing on two imperatives: integration and data. “With technology tools that help you overcome your data challenges, you can achieve much faster (and much more cost-effective) operationalizing of AI.”

Let’s look at how data integration technology can help with challenges specific to scaling AI/ML.

AI scaling challenges and how modern data integration helps:

Data silos
  • Challenge: Data gets trapped in departmental silos, legacy systems, and cloud apps in varying formats. This data fragmentation makes it hard to aggregate the large, diverse datasets needed to train accurate AI models.
  • How modern data integration helps: A modern data integration solution will provide connectors to gather data from various data stores and infrastructure, including legacy systems like mainframes. It can then transform disparate data formats into a consistent, analysis-ready format.

Data quality and availability
  • Challenge: AI systems rely heavily on vast amounts of high-quality and relevant data for training and making accurate predictions. Data often has issues like missing fields, outliers, duplicates, inconsistencies, and lack of context. Low-quality data leads to poor model performance.
  • How modern data integration helps: With data integration, businesses can automate data cleansing tasks like handling nulls, deduplication, normalization, and validation. Cleaning the data used for AI training and decision-making reduces the risk of biased or inaccurate models.

Data security and privacy
  • Challenge: Training data may contain personal and sensitive information requiring protections like encryption, anonymization, and access control.
  • How modern data integration helps: Data integration tools can secure data movement with encryption and anonymize data by masking fields. They should be compatible with data access and LDAP tools for extra security.

Data context
  • Challenge: AI models rely on metadata like data definitions, datatypes, hierarchical relationships, etc., to function optimally. Lack of context can lead to misinterpretations.
  • How modern data integration helps: A modern data integration platform ingests and manages metadata to provide richer context and meaning to data for AI models.

Observability, Monitoring, and Explainability
  • Challenge: Many AI models, such as deep neural networks, are considered “black boxes” because their decision-making processes are difficult to interpret. Lack of interpretability can cause trust issues and ethical questions, especially in highly regulated industries or when making critical decisions. Lack of transparency poses challenges for observing and monitoring the behavior of AI models, which can lead to performance degradation.
  • How modern data integration helps: Data integration tools can ensure that input data used for AI models is reliable, accurate, and representative of real-world scenarios. These tools also help explainability by providing complete visibility into where AI model data came from and what changes happened before entering the model.

Integration with Existing Infrastructure
  • Challenge: AI often needs to be integrated with existing systems to be effective. This can be complex and time-consuming, particularly for large enterprises with legacy systems.
  • How modern data integration helps: Data integration platforms provide tools to easily integrate diverse data, allowing AI systems to securely access and analyze the needed data while respecting existing IT policies and systems.

Scalable Infrastructure
  • Challenge: Scaling AI models necessitates substantial compute resources, especially during the training and inference phases. The complexity and workload of AI models can vary, requiring dynamic allocation and optimization of resources. The challenge lies in optimizing the allocation based on the varying needs of different AI models and managing the operational costs associated with it.
  • How modern data integration helps: Modern data integration platforms facilitate the uniform distribution of data across compute clusters and cloud infrastructure. This ensures that AI models have the necessary resources for training and inference. By optimizing data storage, processing, and transfer, data integration solutions let organizations allocate resources more efficiently, manage costs, and improve the overall efficiency of AI development.

Governance and Regulation
  • Challenge: The adoption of AI often raises legal and regulatory concerns, particularly regarding privacy, security, and data protection. Businesses must navigate a complex landscape of regulations such as the General Data Protection Regulation (GDPR) and ensure compliance to avoid legal consequences and reputational damage.
  • How modern data integration helps: Modern data integration tools are governance-ready. They provide topologies that show organizations how systems are connected and data flows across the enterprise. A centralized “mission control” console delivers deep visibility into pipelines, enabling organizations to consistently apply governance and security controls to create, process, and distribute data according to policy. They should also integrate with data lineage, governance, access, and policy control systems.

Cost and ROI
  • Challenge: Scaling AI involves substantial data storage, processing, and transfer costs. As the volume of data grows, organizations face the challenge of managing these escalating costs while ensuring the efficiency and effectiveness of AI models. The costs are not just associated with hardware or cloud services but also with the operational management of data, such as ensuring data availability, reliability, and security.
  • How modern data integration helps: Modern data integration solutions optimize data storage, facilitate efficient data processing, and minimize data transfer costs. This allows organizations to focus on innovation and development rather than operational management. It can also minimize data acquisition, storage, processing, and maintenance costs.
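
To make the cleansing and masking steps above more concrete, here is a minimal sketch in plain Python of null handling, normalization, validation, deduplication, and field masking applied to a batch of records. Everything specific in it is an assumption made for illustration: the field names (`customer_id`, `email`, `country`), the validation rule, and the use of a truncated SHA-256 digest as the masking method. A production pipeline would typically rely on the integration platform's own processors rather than hand-written code.

```python
import hashlib
from typing import Iterable, Optional

REQUIRED_FIELDS = ("customer_id", "email", "country")  # hypothetical schema


def clean_record(record: dict) -> Optional[dict]:
    """Normalize and validate one record; return None if it fails validation."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            # Normalization: trim whitespace and treat empty strings as missing.
            value = value.strip() or None
        cleaned[key] = value

    # Null handling: drop records that lack any required field.
    if any(cleaned.get(field) is None for field in REQUIRED_FIELDS):
        return None

    cleaned["email"] = cleaned["email"].lower()
    cleaned["country"] = cleaned["country"].upper()

    # Validation: a deliberately simple check, for illustration only.
    if "@" not in cleaned["email"]:
        return None
    return cleaned


def mask_email(email: str) -> str:
    """Replace an email with a short digest so the raw value never reaches training.

    An unsalted hash is only pseudonymization; a real deployment would more
    likely use a keyed hash or a tokenization service.
    """
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:16]


def prepare_for_training(records: Iterable[dict]) -> list[dict]:
    """Clean, deduplicate on customer_id, and mask the sensitive field."""
    seen_ids = set()
    prepared = []
    for record in records:
        cleaned = clean_record(record)
        if cleaned is None:
            continue
        if cleaned["customer_id"] in seen_ids:
            continue  # deduplication: keep the first occurrence
        seen_ids.add(cleaned["customer_id"])
        cleaned["email"] = mask_email(cleaned["email"])
        prepared.append(cleaned)
    return prepared
```

Passing in four rows where two share a `customer_id`, one has an empty email, and one has a malformed email would yield a single cleaned, masked record.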

Beyond Modern Data Integration

The right modern data integration solution provides a solid foundation for scaling your AI initiatives. It supplies the consistent, quality, explainable data AI/ML models need for reliable and trustworthy results. Other essential components include data governance and access control solutions, which the right data integration solution will support.

But you can take your foundation to the next level with an enterprise integration platform, which adds application, API, B2B, and event integration to data integration. This new category of integration is called Super iPaaS, and it ensures that all the data in an organization is clean, correct, and accessible for AI/ML models. It establishes a common data structure so AI systems can use diverse data types and sources.
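
One way to picture a common data structure is a single canonical record type that every source is mapped into before the data reaches a model. The sketch below is purely illustrative: the `CustomerRecord` type, the CRM field names, and the fixed-width mainframe layout are hypothetical and are not taken from any specific platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerRecord:
    """Hypothetical canonical shape shared by every downstream consumer."""

    customer_id: str
    full_name: str
    signup_date: date
    source: str


def from_crm(row: dict) -> CustomerRecord:
    # Assumed CRM export: separate name fields and an ISO 8601 date string.
    return CustomerRecord(
        customer_id=str(row["Id"]),
        full_name=f'{row["FirstName"]} {row["LastName"]}'.strip(),
        signup_date=date.fromisoformat(row["CreatedDate"]),
        source="crm",
    )


def from_mainframe(line: str) -> CustomerRecord:
    # Assumed fixed-width layout: id (8 chars), name (20 chars), date as YYYYMMDD.
    raw_date = line[28:36]
    return CustomerRecord(
        customer_id=line[0:8].strip(),
        full_name=line[8:28].strip(),
        signup_date=date(int(raw_date[0:4]), int(raw_date[4:6]), int(raw_date[6:8])),
        source="mainframe",
    )
```

Because both mappers return the same type, training and inference code can consume records without knowing which system they came from, while the `source` field preserves a basic trace of origin.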

Super iPaaS will also improve visibility into how data flows into various AI models and should have:

  • Develop anywhere, deploy anywhere capabilities so teams can work how they like and eliminate duplicate efforts
  • Central control with distributed execution for faster time-to-market, simpler compliance, and better control of your integration landscape
  • Closed loop app and data integration so organizations can capitalize on past, present, and future data with connectivity from apps to analytics
  • A unified experience across all iPaaS components to simplify learning, managing, and collaboration across APIs, apps, data, B2B, and events
  • Composable business architecture with APIs and events that gives your team a flexible set of building blocks to deliver faster
  • Generative AI throughout the integration lifecycle to make the most common integration activities 10x faster, from creation to operation

It’s time for a new way to think about integration

Say hello to the Super iPaaS

A Super iPaaS finally brings together application, data, APIs, B2B, and events integrations in the same unified platform. It is powerful enough for integration specialists, but easy enough for citizen integrators. It is built for the future of business.

Data Integration + AI = Enterprise-wide Success

As artificial intelligence and machine learning become more pervasive across industries, organizations must build a solid foundation to support enterprise-wide initiatives. Getting results you can trust from AI requires ensuring the integrity and consistency of the data coming into your AI infrastructure.

The right modern data integration solution provides critical functionality to overcome these hurdles and enable AI success at scale. With a focus on agility, automation, and observability, data integration streamlines and optimizes data flows to deliver high-quality, trustworthy data to AI models. With the right data foundation, AI models can deliver continuous value across the business through accurate predictions, automated decision-making, and data-driven optimization.

Getting Started

If you’re ready to build your foundation for scalable AI, the StreamSets platform provides an easy on-ramp. Data-driven organizations like Humana, IBM, GSK, and many more use the StreamSets data integration and transformation platform to rapidly deliver high-quality data for analytics, reporting, and data science.

1 Beyond Hypotheticals: Understanding the Real Possibilities of Generative AI, Insight
2 Artificial Intelligence, Ready to Ride the Wave?, BCG
3 Scaling AI Pays Off, No Matter the Investment, BCG
4 CIO Vision 2025: Bridging the Gap Between BI and AI, MIT Technology Review Insights
5 CIO Vision 2025: Bridging the Gap Between BI and AI, MIT Technology Review Insights
6 AI in the Enterprise, IBM
7 To operationalize AI, reorganize in these three ways, PwC