What are machine learning and MLOps? Key concepts, tools, and data challenges

Why going from model to production in machine learning depends on DataOps.

What is machine learning?

Machine learning (ML) is a type of artificial intelligence (AI) that uses algorithms to become more accurate over time without human intervention. Instead of hard-coding or defining the outcome, a machine learning model uses data to learn how to make a decision and then incorporates feedback to improve its accuracy over time.

The more data the algorithm has to work with and the faster it can process the feedback, the more accurate the results will be. This article focuses on what machine learning means for data teams, the people responsible for making sure there is a continuous flow of fresh, reliable data for the machine learning analysts and engineers to use.

Machine learning use cases

Use of machine learning and AI to learn, predict, and automate responses has transformed many industries.

  • Healthcare and life sciences companies train crash-cart systems to predict when someone is about to have a heart attack, before it happens.
  • Financial services and insurance companies enable rapid approval of loan applications and credit cards, root out fraud, and protect against cyberattacks.
  • Gaming and entertainment companies generate real-time leaderboards and interactions, flag objectionable text and interactions, as well as keep people engaged with recommended next activities.
  • Logistics and transportation companies use machine learning to optimize routes, prevent fraud, and reduce fuel costs.

Anytime a decision needs to be made, machine learning can help: data scientists train algorithms to make classifications or predictions, uncovering key insights in data mining projects. The promise of machine learning is to make our lives richer, solve big problems like climate change and global poverty, and cure cancer. But the outcomes depend on the inputs and the way the model works. Bad data and bias can take hold without intention or understanding.

Before we get to the challenges of machine learning and how to improve data, let’s take a deeper look at how machine learning works.

How machine learning works

There are three basic types of machine learning, and each one uses data in a different way.

Supervised learning

A model is constructed based on input-output pairs using historical data with known labels. Once the model is trained, it can be used in production on similar datasets. Supervised learning works well on structured data where you can control the inputs.

Common business problems addressed by supervised learning include:

  • Will a customer buy a particular product or not?
  • Is the tumor malignant or benign?
  • Is a piece of text insulting, threatening, or obscene?
  • What is the predicted selling price of a house?
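
A minimal sketch of the supervised idea, using a 1-nearest-neighbor classifier in plain Python (the features, values, and the buy/no-buy scenario are hypothetical, chosen to mirror the first question above):

```python
# Minimal supervised learning sketch: a 1-nearest-neighbor classifier
# built from labeled historical data (hypothetical feature values).
import math

# Historical input-output pairs: (age, income_k) -> bought_product (1 = yes)
train = [((25, 40), 0), ((32, 60), 1), ((47, 80), 1),
         ((51, 30), 0), ((62, 95), 1), ((23, 20), 0)]

def predict(features):
    """Label a new example with the label of its closest training point."""
    nearest = min(train, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

# Apply the trained model to similar, unseen data
print(predict((45, 70)))  # -> 1 (closest to (47, 80), which bought)
```

The key property of supervised learning is visible here: the known labels in `train` are what let the model answer the question for new inputs.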

Unsupervised learning

When labels on past data are unavailable or unknown, the model is constructed by clustering data based on relationships between the variables present in the data. Unsupervised learning allows machine learning to be applied to problems where you have little or no idea what the outputs should look like. It might be used on sensor data or web logs: unstructured or continuous data coming from inside or outside your organization.

Questions that might be answered by unsupervised learning models include:

  • Which customers will provide the highest lifetime value?
  • How likely is it that this customer will pay back a loan if we approve it?
  • Which trucks in our fleet should be brought in for maintenance?
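
As a sketch of clustering without labels, here is a tiny one-dimensional k-means (k=2) in plain Python; the sensor readings are invented for illustration, and the implementation assumes both clusters stay non-empty:

```python
# Minimal unsupervised learning sketch: k-means clustering (k=2) on
# unlabeled 1-D sensor readings, grouping points by similarity alone.
def kmeans_1d(points, iters=10):
    c1, c2 = min(points), max(points)      # initialize centroids at extremes
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

readings = [2.1, 1.9, 2.4, 8.0, 7.7, 8.3, 2.0, 7.9]
low, high = kmeans_1d(readings)
# The two groups emerge from the data itself; nothing told the
# algorithm which readings were "normal" and which were "high".
```

In practice the clusters would then be inspected (for example, "trucks whose readings fall in the high group may need maintenance"), which is where the human interpretation comes back in.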

Neural networks and deep learning

Instead of pairing or clustering, neural networks use one or more hidden layers between input and output to create weighted connections. As the neural network learns, the weights are refined and the network becomes better at predicting outcomes.
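
A forward pass through one hidden layer can be sketched in a few lines of plain Python; the weights below are hypothetical stand-ins for values that training would actually refine:

```python
# Sketch of a single hidden-layer forward pass: inputs flow through
# weighted connections to hidden units, then to an output unit.
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical weights; learning consists of adjusting these from data.
y = forward([1.0, 0.5], [[0.4, -0.6], [0.3, 0.8]], [1.2, -0.7])
```

Training (omitted here) compares `y` against the known outcome and nudges the weights to reduce the error, which is the "refinement" the paragraph above describes.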

Deep learning has many hidden layers of complex neural networks and is used to solve highly complex problems.

Common neural networks and deep learning applications include:

  • Computer vision, image recognition, and object detection
  • Speech recognition and natural language processing
  • Recommendation systems from next best product to matchmaking
  • Anomaly detection for cybersecurity, medical diagnosis, and more

Neural networks depend on data processing to turn non-numerical information into numbers so that the algorithms can be applied.
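
That preprocessing step can be as simple as one-hot encoding; this small sketch (with an invented color vocabulary) shows how a non-numeric category becomes a vector a network can consume:

```python
# Turning non-numeric information into numbers: one-hot encoding maps
# each categorical value to a vector with a single 1 in its position.
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector."""
    return [1 if v == value else 0 for v in vocabulary]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # -> [0, 1, 0]
```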

Challenges to machine learning

Without data, machine learning is like a balloon without air. Data integration and machine learning go hand-in-hand because all three types of machine learning depend on a continuous, reliable flow of trusted data. And the data you depend on constantly changes, not just the data itself but also its structure, meaning, and infrastructure.

Getting the data

Data scientists spend about 45% of their time just getting data. Although that number has declined from roughly 80%, loading and cleansing data is still a significant drag on rapid innovation and on the expansion of machine learning initiatives.

The proliferation of sources, data platforms, and evolving technologies may require data pipelines for batch, streaming, CDC, ETL, or ELT processing. Advanced processing engines like Spark and cloud platforms built for machine learning, such as Databricks, require specialized skills.

Maintaining the data

Data quality is an input problem, but more and more data issues are a data drift problem. When the data's schema, semantics, or infrastructure change in unplanned or unexpected ways, data can be dropped or lost, with cascading effects that are non-linear and hard to trace.

For example, a bank might have a billion rows of transactions used to train a model. A change to the data schema might result in an entire group of data being dropped. The models continue to learn, but miss an entire population of data.
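
The silent-drop failure described above can be illustrated in a few lines; the field names and records are hypothetical:

```python
# Schema drift illustration: an upstream system renames a field, and a
# downstream filter silently discards every record after the change.
old_rows = [{"txn_id": 1, "amount": 120.0}, {"txn_id": 2, "amount": 45.5}]
new_rows = [{"txn_id": 3, "amt": 300.0}]   # "amount" was renamed to "amt"

def usable(rows):
    """Keep only rows that have the expected 'amount' field."""
    return [r for r in rows if "amount" in r]

print(len(usable(old_rows + new_rows)))  # -> 2: all post-change rows are lost
```

Nothing crashes and no error is raised, which is exactly why the model keeps training while an entire population of data goes missing.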

Monitoring the data

As data scientists put their models into production, eventually someone will ask why their algorithms performed as they did. Monitoring machine learning models for traceability and compliance is a significant challenge and beyond the scope of this article.

One place to start is a clear understanding of the data value chain, with visibility into all the data pipelines feeding your model. Instrumenting and automating data pipelines is essential, as is having a single pane of glass to monitor and manage them across all design patterns and ecosystems.
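
As a minimal sketch of what "instrumenting a pipeline" can mean in practice (the stage name, metrics structure, and sample rows are our own illustration, not a particular product's API), each stage records its input and output row counts so a sudden drop becomes traceable:

```python
# Minimal pipeline instrumentation: wrap each stage so it records how
# many rows went in and came out, making silent data loss visible.
def instrumented(stage_name, fn, rows, metrics):
    out = fn(rows)
    metrics[stage_name] = {"in": len(rows), "out": len(out)}
    return out

metrics = {}
rows = [{"amount": 10}, {"amount": -3}, {"amount": 7}]
rows = instrumented("drop_negative",
                    lambda rs: [r for r in rs if r["amount"] >= 0],
                    rows, metrics)
# metrics["drop_negative"] now records 3 rows in, 2 rows out, so an
# unexpected drop at any stage shows up in monitoring rather than in
# a degraded model weeks later.
```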

Getting the right data into the models is absolutely critical. And because the models are probabilistic, machine learning engineers must convince leadership that data can be trusted in order to win buy-in for their initiatives.

What is MLOps?

MLOps applies DevOps practices such as automation, versioning, and continuous monitoring to the machine learning lifecycle. ML, AI, and analytics deliver value only if the data they operate on is valid. Noise in the data disrupts learning and leads to unreliable outcomes. Traditional data integration invested heavily in data quality as a way of ensuring that only the cleanest data made it into the models, but the scale and complexity of today's sprawling data architectures make that approach risky on its own. As companies operationalize ML, they increasingly depend on strong data integration frameworks being in place.

No matter what business you were in 10 years ago, today you are in the data business. But before your data scientists and machine learning experts can change the world with their models, they have to have data to train them and data to sustain them.

They must be proficient in dealing with multi-modal data: structured and unstructured, at scale. Depending on the source and destination of the data, data pipelines might need to support batch or stream processing, or change data capture (CDC), across hybrid and multi-cloud platforms.

Are you ready to unlock your data?
Resilient data pipelines help you integrate your data, without giving up control, to power your cloud analytics and digital innovation.