WHITE PAPER

A Data Integration Journey to Hybrid/Multi-Cloud

Multi-cloud modernization is critical

In today’s competitive landscape, the pressure has never been higher to use data to create insights that guide your business and to develop rich data applications that deliver value directly to your customers. To create these new insights and customer experiences, companies must think differently about the data they capture, store, and deliver to end customers. That’s because, increasingly, the data needed to complete this picture sits outside the company’s control and outside its data centers.

Bringing all this data in-house doesn’t usually make sense, which is why many companies have targeted these new workloads to run in the cloud from day one. This requires a design approach that takes into account the abundance of cloud storage and compute resources. It means that scale must not only be built in but also tested. However, it does not mean never dealing with the relational and core systems data that remains under your control. In fact, you get the most value from any successful customer 360 or advanced analytics use case when you marry internal and external data. This is why many companies will take a hybrid/multi-cloud approach for many years to come.

Moving to multi-cloud is a journey with no set answers for every scenario. Each company must decide which workloads to target for the cloud, which data they feel confident storing in the cloud, and the level of data access they want people in their organizations to have. So, if many companies are still dealing with moving their on-premises workloads to the cloud, then why are we here today to talk about multi-cloud?

By leading with a multi-cloud mindset, you can truly build a systematic and scalable data pipeline architecture that will adjust to your needs at every stage of the journey without costly redesign and months or years of lost productivity. The foundation of this data pipeline architecture is a set of principles that provide agility, automation, and scalability called DataOps. DataOps systems operate across data centers and public clouds utilizing the best tools for storage and transformations. Companies that lead with a DataOps approach can weather the shifts in their cloud strategy. 

Benefits of leading with a multi-cloud approach

Ensuring the best-fit architecture for your workload

On-premises and cloud ecosystems each have their benefits. For instance, you would be hard-pressed to find public cloud disk and IO performance that compares to bare metal servers. On the other hand, you could easily build a cloud server with more memory than you could ever conceive of deploying on a server in your data center. Different public cloud vendors also offer varying prices and performance levels across their services. Truly hybrid/multi-cloud companies leverage the best of these form factors to make strategic decisions about where they run their workloads. Being multi-cloud gives companies the leverage to run their workloads in the place that will bring the biggest value, whether that is cost- or performance-focused. They can do this because data is replicated to each environment, ensuring the workload always has reliable data wherever it runs.

Agility to manage costs and leverage managed services

Additionally, all cloud providers now offer managed services for common workload execution, from data ingestion to machine learning. These managed services run on optimized architectures specifically configured to provide optimal performance for things like stream processing, data transformations, and analytic query execution. They often require no specialized skills to operate, allowing developers to focus on the data-specific workloads and easing the management burden on the IT team. A company would be hard-pressed to design these purpose-built architectures for everything it needs to do with data, which is why cloud managed services have become increasingly popular. However, they come with one area of caution that every company should be mindful of: vendor lock-in.

Avoiding vendor lock-in and reducing development time

Vendor lock-in can often be hard to detect, and if you don’t understand the degree to which you are locked in, the problems only worsen as you scale. Cloud ecosystems evolve rapidly. A single cloud’s offerings and pricing model may align with your needs today but not tomorrow. Frequently, data and data platform products are the sources of lock-in. Data has gravity, even with the impressive network speeds that modern cloud providers offer, and moving and migrating data reliably between clouds requires a mature understanding of data engineering for clouds. Lock-in also happens when companies rely heavily on managed services provided by a single cloud provider. Some of these solutions create their own catalogs, mappings, and custom executions, which may be difficult to recreate in new cloud environments. When choosing core data integration tools, it’s important to understand how portable your data and pipelines are. What would be the cost to the team to redevelop?

Support innovation by meeting your team where they already are

Despite even the most well-intended mandates, data workloads are finding their way to the cloud. With analytics permeating every sector of the modern company, it’s often impossible to fully understand the scope of cloud usage across the business. A multi-cloud strategy mitigates this risk by building foundations that can be applied as best practices across clouds while meeting the needs of diverse teams. Data teams already have diverse requirements for data structure and delivery, and trying to apply those universally across different cloud environments can be a daunting task. Cloud-neutral tooling allows data and pipeline artifacts to be shared across teams and applied across different cloud providers. This preserves productivity by eliminating the need for costly redevelopment.

Roadblocks to multi-cloud success

Obviously, there are some real advantages to multi-cloud operation. So why have only a fraction of companies attained this goal?

Spoiler Alert: It’s NOT as simple as moving data across clouds.

This section will discuss a couple of common pain points of operating in multi-cloud environments. But the simple answer is that the dynamic nature of data and data workloads presents very specific challenges in reliably moving, syncing, and transforming data across clouds. The key term in that sentence is “reliably.” There is a plethora of tools for the simple migration of data to the cloud or from one cloud to another, but very few offer the reassurance that data is migrated in its intended form and meaning. If not addressed at its foundation, this problem will wreak havoc as you scale. To mitigate the risk of data corruption, it is important to avoid these common pitfalls on the road to multi-cloud adoption.

Plan for data drift

Data drift is the unexpected and unending change that happens to data naturally as it flows through modern data systems. Data systems have long been disciplined and controlled by the master schema; it was really the only way to make sense of large data sets. Companies defined their schema approach and enforced it globally across the systems they controlled. In multi-cloud operation, a schema-led approach is often a recipe for disaster because much of the data is outside the company’s control. The dynamic nature of cloud data platforms provides flexibility beyond the schema’s confines. Data drift becomes increasingly prevalent as you migrate to the cloud and across clouds. Simply addressing data drift after it happens will only result in continuous delays and diminishing trust in your data. Multi-cloud companies take a proactive approach: they plan for drift, address it in real time, and act before it pollutes downstream datasets.
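To make the idea of proactive drift handling concrete, here is a minimal, hypothetical Python sketch: compare each incoming record against the schema you expect, flag new fields and type changes, and quarantine drifted records rather than letting them break the pipeline or pollute downstream datasets. The field names and expected schema are assumptions for illustration, not part of any product.

```python
# Illustrative only: a minimal schema-drift check for incoming records.
# Field names and the "expected schema" are hypothetical.
from typing import Any

EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def detect_drift(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable drift findings for one record."""
    findings = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            findings.append(
                f"type change: {field} is {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    for field in record.keys() - EXPECTED_SCHEMA.keys():
        findings.append(f"new field: {field}")  # drift, not necessarily an error
    return findings

def route(record: dict[str, Any]) -> str:
    """Send drifted records to a quarantine location instead of breaking the pipeline."""
    return "quarantine" if detect_drift(record) else "deliver"
```

The point of the sketch is the posture, not the code: drift is detected as data arrives and handled by routing, so the pipeline keeps running while drifted records are set aside for review.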

Avoid costly re-work where possible

Here, a word of caution about point solutions in the cloud. Companies have many options when designing components of their full data lifecycle. Cloud providers almost universally provide tools for simple ingest, easy transformation, and quick visualization. These tools allow teams to start gaining value quickly in their cloud data environments. The solutions are often cheap and easy to get started with, but heed the warnings of companies that have come before you. Often, these point solutions are proprietary and limited in functionality. Providers offer them at a compelling price point to get more of your data into their systems. Data engineers often report that while these tools may be satisfactory for the preliminary movement of data, they fall short of addressing the comprehensive needs of the data engineering team. This forces engineers to cobble together solutions within the constraints of the tool rather than around the goals of the company or project. That tax results in diminished productivity and further delays in realizing value. Unfortunately, many companies only realize how costly this redesign will be once they have decided to move to another cloud. This is why taking a proactive approach, with multi-cloud in mind from the outset, will give you agility when you need it most.

Cloud data ecosystem overview

Go beyond the hype to create your modern data stack

Cloud ecosystems are vast, move at a feverish pace, and increasingly require a wide range of skills. Cloud platforms typically provide services for storage, compute, data warehousing, data lakes, and streaming data. These services run on optimized architectures and remove the burden of managing both infrastructure and operations. You can build a modern data stack with these cloud services in concert with third-party tools. The modern data stack is unique to each company and represents how data needs to be stored, treated, and delivered to analytics teams.

Designing a modern data stack can be daunting because it is often hard to navigate past the marketing and hype surrounding the tool descriptions. To simplify this exercise, here's a simple breakdown:

Storage

Amazon S3
What is it? Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
What do companies use it for? Companies use Amazon S3 to land, store, and protect data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

Microsoft Azure ADLS
What is it? Azure Data Lake Storage (ADLS) stores and processes data for applications that use structured, semi-structured, or unstructured data produced by sources including social networks, relational databases, sensors, videos, web apps, and mobile or desktop devices.
What do companies use it for? Companies use ADLS to land, store, and protect raw data for use in downstream analytics or operational data applications.

Google Cloud Storage
What is it? Google Cloud Storage is Google’s object storage platform, with an ever-growing list of storage bucket locations where you can store your data with multiple automatic redundancy options.
What do companies use it for? Companies use Google Cloud Storage to land raw data that can be used by the rest of Google Cloud’s services.

Compute

Amazon EMR
What is it? Amazon EMR is a managed cluster platform for running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze data.
What do companies use it for? Companies use EMR as the engine for their ETL workloads, allowing them to apply data transformations to big data sets.

Microsoft Azure HDInsight
What is it? Azure HDInsight provides users with popular open-source frameworks, including Apache Hadoop, Spark, Hive, Kafka, and more.
What do companies use it for? Companies use HDInsight to migrate their big data workloads and processing to the cloud.

Google Dataproc
What is it? Google Dataproc is a fully managed and scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.
What do companies use it for? Companies use Dataproc to transform data from their raw object storage landing zones and load it into platforms like Google BigQuery.

Data lake / data warehousing

Amazon Redshift
What is it? Amazon Redshift provides fast, simple, cost-effective data warehousing that scales as your data grows. Redshift is similar to a traditional data warehouse without the constraints on scaling and schema flexibility.
What do companies use it for? Companies store their data in a conformed format in Redshift for easy access by analytics professionals. ETL operations are often needed to convert data from its raw format into a conformed format.

Microsoft Azure Synapse
What is it? Azure Synapse Analytics is an analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives users either serverless or dedicated options.
What do companies use it for? Companies use Azure Synapse to quickly query and visualize data in their object storage and database systems.

Google BigQuery
What is it? Google BigQuery is a serverless, scalable, and cost-effective “multi-cloud” data warehouse.
What do companies use it for? Companies use Google BigQuery to query and visualize data that has been cleansed after landing in object storage.

Snowflake Data Cloud
What is it? Snowflake’s Data Cloud platform supports multiple data workloads, from data warehousing and data lakes to data engineering, data science, and data application development, across multiple cloud providers.
What do companies use it for? Companies use Snowflake to land raw data, transform and manipulate data, and provide self-service access to analytics professionals.

Databricks Delta Lake
What is it? Delta Lake is an open-source storage layer that provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing based on Apache Spark. Databricks Delta Lake works on multiple cloud platforms.
What do companies use it for? Companies use Databricks to land raw data, transform data using Apache Spark, and load the transformed data into the Delta Lake platform for access.

Streaming

Amazon Kinesis
What is it? Amazon Kinesis helps users collect, process, and analyze real-time streaming data.
What do companies use it for? Companies use Amazon Kinesis as the streaming ingest agent to ensure delivery of streaming data for analytics.

Google Pub/Sub
What is it? Google Pub/Sub is used for streaming analytics and data integration pipelines to ingest and distribute data. It is equally effective as messaging-oriented middleware for service integration or as a queue to parallelize tasks.
What do companies use it for? Companies use Google Pub/Sub to publish and subscribe to data streams that land raw data in landing zones.

Rethinking multi-cloud deployment

Before you begin building pipelines, it’s important to solidify your deployment strategy. In fact, if you get the deployment wrong, no amount of pipelines will save you. This includes designing for multiple cloud environments. Legacy tools for building data pipelines relied on tight mappings to the deployment environment to execute pipeline transformations. These mappings were discrete and unique, which required rework when changes to the pipeline logic were needed. Modern data integration tools offer a reimagined approach to the design of data pipelines and can generally be grouped into two categories.

Ecosystem centric

Many modern, cloud-native data integration solutions rely heavily on the resources of the ecosystem where a workload or team operates. This can provide several advantages in terms of cost and performance because the design minimizes the amount of data that needs to be moved in order to process it. Also, by utilizing the underlying storage of the data platform, these solutions have the most direct connection to the data, minimizing throughput and IO limitations. They are commonly branded along with the ecosystem under names like ELT or SQL solutions.

Control plane and data plane decoupling

Another approach decouples the deployment and pipeline operations to give maximum agility and workload portability. In this scenario, pipeline development, collaboration, design, and automation are all executed via a single Control Plane. This plane is focused on pipeline design, engine deployment, and cross-team collaboration. The Control Plane can be offered as a SaaS platform with broad access across the data and analytics teams. The pipeline engines themselves can be deployed directly in the data environment or in multiple data environments. These engines represent the Data Plane layer.

This decoupled design is often superior because data can be processed locally by the data plane engines, even utilizing ecosystem resources like ELT and Apache Spark-based processing. No data travels into the control plane, and pipelines can easily be applied to a new cloud ecosystem and changed with minimal operational effort. This approach offers the best of both worlds and a scalable model that is proven in many enterprises.
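A rough sketch of what this decoupling means in practice is shown below, under the simplified, hypothetical assumption that a pipeline is just a named specification plus per-environment parameters. This is not the StreamSets API, only the shape of the idea: one definition lives in the control plane, while each data plane deployment supplies its own destination parameters.

```python
# Illustrative only: one pipeline definition (the "intent") parameterized so the
# same logic can run on different data plane engines. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class PipelineSpec:
    name: str
    source: str             # logical source, e.g. a CDC feed
    transform: str          # logical transform, e.g. "mask_pii"
    destination_param: str  # resolved per deployment

# Control plane: one spec, shared by every team.
spec = PipelineSpec(name="orders_to_warehouse",
                    source="orders_cdc",
                    transform="mask_pii",
                    destination_param="${WAREHOUSE_URI}")

# Data plane: per-environment parameter sets; only these change across clouds.
deployments = {
    "aws":   {"WAREHOUSE_URI": "snowflake://acct/analytics/orders"},
    "azure": {"WAREHOUSE_URI": "abfss://lake@account.dfs.core.windows.net/orders"},
    "gcp":   {"WAREHOUSE_URI": "bq://project.analytics.orders"},
}

def render(spec: PipelineSpec, env: str) -> dict:
    """Resolve the shared spec against one data plane's parameters."""
    uri = deployments[env]["WAREHOUSE_URI"]
    return {"name": spec.name,
            "source": spec.source,
            "transform": spec.transform,
            "destination": spec.destination_param.replace("${WAREHOUSE_URI}", uri)}

print(render(spec, "azure"))  # same intent, Azure-specific destination
```

The design choice to notice is that the pipeline logic never mentions a concrete cloud; moving a workload is a matter of rendering the same spec against a different deployment.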

Now, with our multi-cloud deployment strategy defined, we can start building data pipelines. 

Building data pipelines

3 Pipeline Designs for Multi-Cloud Success

  1. On-premises (legacy) to cloud
  2. Migrating from one cloud to another
  3. Multi-cloud operational


The move to the cloud and across clouds is often a journey.

For some companies, operating in the cloud from day one may have been an option; these companies may have enough agility baked into their business models to mitigate any risks the public cloud might present. Many others are still trying to map out how to migrate their existing systems to a single cloud. No stage in the journey is final, and the path is not always linear. Other companies may choose to make their leap to the cloud a multi-cloud journey. While there is no one template for everyone, we have outlined the critical inflection points that drove the design of a multi-cloud environment and provided pipeline demonstrations to visualize the concepts.

 

3 PHASES FOR MULTI-CLOUD SUCCESS
On-premises (legacy) to cloud

So let’s start with a common springboard to a multi-cloud journey: migrating data from on-premises to the cloud.

The first step in a multi-cloud journey is often to migrate on-premises data to one or multiple cloud object store locations. This is usually executed in two stages: an initial batch load and incremental updates to ensure that legacy and cloud environments are in sync. For many companies, this may be the starting point. Others may choose a multi-cloud design from day one and proactively send data to multiple destinations. There are no “right” answers, only the answer that best fits your business.

Pipeline considerations:
 

  • Rethinking a schema approach: Traditional data warehouse and data lake migrations can take months or even years because of the tendency to map out the schema design both on the source system and in the target destination. However, most cloud data warehouses and data lakes have flexible schema designs that will infer schema from your incoming data, so there is no need to painstakingly rebuild the schema design from the source system side. Choose a modern data integration tool that supports multi-table inserts, then just press play and watch your schema build reliably in the destination.
  • Think beyond batch: While the initial scope of your data migration may only cover the initial batch load, it is important to shift your thinking toward delivering continuous data. Even the most curated datasets need to change over time. Multi-cloud companies do not think of data migration as a single activity but as a system that constantly delivers new data and enriches existing data and data relationships. Streaming brokers such as Apache Kafka and Amazon Kinesis can deliver periodic and streaming data continuously with little oversight and management.
  • Prepare for data drift: As we stated earlier, as data moves from highly regulated on-premises systems to the cloud, you are likely to encounter data drift. Multi-cloud companies do not fight drift; they plan for it. They make sure their integration tools are equipped to handle drift without breaking pipeline connections. They also invest in tools for data quality so they can ensure that data lands in its intended form, because analytics are only as reliable as the data behind them.
  • Consider writing to two destinations: One way to mitigate being locked into a cloud vendor or a less-than-ideal workload design is to replicate critical data to multiple cloud object stores. These might be in two different clouds or two managed services in the same cloud. Since storage costs are generally very manageable in the cloud, you gain the freedom to apply compute to data in the most suitable way, as well as the freedom to change your mind over time.

Check out the following pipeline example, and explore this pipeline further on GitHub.

Figure 1. This example pipeline first connects and queries MySQL with an offset on primary key, removes the index, adds a timestamp and finally sends records based on a condition either to a central Snowflake warehouse or to small document storage on Amazon S3 as an Avro data format.
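For readers who prefer code to diagrams, here is a rough, hand-rolled Python sketch of the same flow. It is illustrative only, not the StreamSets pipeline itself: connection details, table and column names, and the routing condition are hypothetical, and the object-store write uses JSON rather than Avro for brevity.

```python
# Illustrative only: offset-based read from MySQL, drop the index, add a timestamp,
# then route each record to Snowflake or Amazon S3 based on a condition.
import datetime
import json

import boto3                 # AWS SDK for Python
import pymysql.cursors       # MySQL client
import snowflake.connector   # Snowflake Python connector

LAST_OFFSET = 0  # highest primary key already migrated (hypothetical bookkeeping)

mysql = pymysql.connect(host="mysql.internal", user="etl", password="...",
                        database="sales", cursorclass=pymysql.cursors.DictCursor)
sf = snowflake.connector.connect(account="...", user="etl", password="...",
                                 database="ANALYTICS", schema="PUBLIC")
s3 = boto3.client("s3")

with mysql.cursor() as cur:
    # Offset on primary key: read only rows that have not been migrated yet.
    cur.execute("SELECT * FROM orders WHERE id > %s ORDER BY id", (LAST_OFFSET,))
    for row in cur.fetchall():
        row.pop("id", None)                                           # remove the index
        row["ingested_at"] = datetime.datetime.utcnow().isoformat()   # add a timestamp
        if row.get("order_total", 0) >= 1000:                         # routing condition
            # Large orders go to the central Snowflake warehouse.
            sf.cursor().execute(
                "INSERT INTO ORDERS_LARGE (ORDER_NUMBER, ORDER_TOTAL, INGESTED_AT) "
                "VALUES (%s, %s, %s)",
                (row["order_number"], row["order_total"], row["ingested_at"]))
        else:
            # Small orders go to object storage on Amazon S3.
            s3.put_object(Bucket="raw-orders",
                          Key=f"small/{row['order_number']}.json",
                          Body=json.dumps(row, default=str).encode("utf-8"))
```

A declarative pipeline tool expresses the same steps as configurable stages, which is what makes them portable and easy to change; the sketch simply makes the logic visible.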
3 PHASES FOR MULTI-CLOUD SUCCESS
Migrating from one cloud to another

Another common pattern is moving workloads and migrating data between public cloud environments. As with the legacy (hybrid) approach, you can’t simply throw data at another cloud and expect your workloads to perform the same. According to recent research, it can take from 6 to 24 months to migrate fully from one cloud to another. On the lower end of that spectrum are the organizations that have planned for multi-cloud and adopted data pipeline tools that decouple the pipeline logic from the infrastructure. In these scenarios, a cloud migration for a data workload may be as simple as changing the destination and some configuration settings. Companies and data teams can use pipeline fragments to make destination changes to hundreds of pipelines at once.

Pipeline considerations:
 

  • Design for new compute platforms and managed services: As important as it is to understand the design considerations for moving to a single cloud, data professionals must also identify and plan for differences across clouds. While most cloud systems offer operating system and coding language flexibility at the application layer, data-layer components are often more specialized and proprietary in their controls and experience. At the pipeline level, you want to make sure you can apply the same data pipeline logic across clouds and managed services with minimal redesign (see the sketch after Figure 2 below). When relying on ecosystem-specific tools, you may need to migrate off the technology completely, which will result in further delay and decreased agility.
  • Avoid vendor lock-in at the metadata and schema level: When using cloud data services, consider the additional risks you take on by standardizing certain elements of your core data design. Should you reconsider your cloud strategy, some cloud tools’ metadata services and schema-specific design requirements make switching solutions even more painful. Some degree of specificity is inherent in any technology solution, but your team should evaluate whether the risk of lock-in is acceptable in exchange for the value the service provides. This is especially important to consider early, as making these changes at scale only becomes more complex.
  • Sync environments and migrate over time: The key to being multi-cloud is not having to scramble every time technology or business changes occur. Having data sets synced across multiple environments is an increasingly common practice for modern companies. Given that cloud storage is priced at commodity levels, companies can be more liberal about how much data they replicate and how often. This is not to encourage you to replicate all your data to multiple clouds, but to evaluate the core data used for analytics and applications and make that strategic data available as close to the workload as possible. Modern data pipelines can deliver duplicate data to multiple destinations, a core capability for multi-cloud operation.
  • Implement operational visibility across clouds: It is common for data integration services and data platforms to offer their own tools for management, configuration, and workload visibility. Generally, data engineers are quite talented at navigating these numerous control experiences to debug and recover when pipelines or jobs break. When operating in multi-cloud, this management burden becomes even greater. One common approach is to bolt on observability solutions after the data has landed to identify potential concerns with the data pipelines. While this is indeed needed for certain scenarios, health and visibility into execution at the pipeline level are just as impactful and complementary. Consider data pipeline solutions that provide as much visibility in as few panes as possible. This will ensure your team spends minimal time debugging and break-fixing.
Figure 2. This example pipeline first connects to the data science data mart described in the first pipeline, then enriches the records from a lookup table, renames fields for precision, and finally sends the records onward as CSV files: first to an Azure Data Lake and then, with a quick switch, to Google Cloud Storage. This view in StreamSets compares version 1 and version 2 of the same pipeline.
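The “quick switch” in Figure 2 is easier when the pipeline logic is kept separate from the destination. The sketch below illustrates that idea in plain Python rather than in StreamSets: account, container, bucket, and path names are hypothetical, and the Azure and Google client libraries are assumed to be installed for whichever path is used.

```python
# Illustrative only: the pipeline logic stays the same; switching clouds is a
# configuration change. All resource names are hypothetical.
import csv
import io

def records_to_csv(records: list[dict]) -> bytes:
    """The cloud-neutral part of the pipeline: turn enriched records into CSV bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")

def write_to_adls(data: bytes, path: str) -> None:
    from azure.storage.filedatalake import DataLakeServiceClient
    service = DataLakeServiceClient(
        account_url="https://myaccount.dfs.core.windows.net", credential="...")
    fs = service.get_file_system_client("datalake")
    fs.get_file_client(path).upload_data(data, overwrite=True)

def write_to_gcs(data: bytes, path: str) -> None:
    from google.cloud import storage
    bucket = storage.Client().bucket("my-datalake")
    bucket.blob(path).upload_from_string(data, content_type="text/csv")

DESTINATIONS = {"azure": write_to_adls, "gcp": write_to_gcs}

def run_pipeline(records: list[dict], destination: str = "azure") -> None:
    # Enrichment and field renaming would happen here, as in Figure 2; only the
    # destination lookup changes between version 1 and version 2 of the pipeline.
    DESTINATIONS[destination](records_to_csv(records), "enriched/orders.csv")
```

Because the destination is a lookup rather than hard-coded logic, redirecting hundreds of pipelines becomes a configuration change rather than a redevelopment project.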
3 PHASES FOR MULTI-CLOUD SUCCESS
Multi-cloud operational

This next phase is where modern companies increasingly find themselves. It’s impossible to know every shift in ecosystem, business strategy, or competitive pressure, but you can be prepared for it. Multi-cloud operational companies sync data across clouds and execute workloads in the environments that best fit the requirements for cost and performance. While having duplicate data environments does require initial investment in duplicate storage and oversight, the resulting savings are a clear testament to the power of making this shift. If you plan correctly and lean into multi-cloud considerations, your move into multi-cloud operation can be painless and swift.

Pipeline considerations:
 

  • Design portability across compute and storage platforms: The value of multi-cloud may never be fully realized if organizations don’t have the agility to move workloads liberally and strategically. Choosing platforms built on open standards like Apache Kafka and Apache Spark ensures that some compute and streaming workloads can operate in a common form on different clouds. For data integration, you can either leverage multiple tools across ecosystems, with the acknowledged tax of added development, or use a data integration solution that operates across clouds. For example, the pipeline shown above looks largely the same across clouds.
  • Consider development at scale: It is one thing to build a simple pipeline and preview a set of test data through it. It is a very different experience running a pipeline in production. If your team has varying familiarity and skills in each tool, are you effectively able to load balance the work? Your ability to establish best practices and reusable processes will determine the speed with which you can develop data pipelines and ultimately deliver new functionality.  
  • Leverage reusable components: Our pipeline example shows a pipeline fragment, an artifact built from a section of a pipeline. Fragments are generally used in two ways. First, developers and data engineers use pipeline fragments to capture common processing stages, and even sources and endpoints, to speed up their development. They can reuse these fragments and have any pipeline that uses a fragment updated automatically, which is one way to mitigate the risks of infrastructure drift. Second, data engineers use fragments to accelerate development across their teams: one data engineer can create multiple fragments that can be used by many ETL developers and analysts.
  • Programmatic development and testing: While it is entirely possible to do a complete cloud migration without writing a single line of code, capable modern data integration systems and pipeline tools also support extensibility into a programmatic approach. In our pipeline example, an SDK for Python is leveraged to reuse the pipeline fragment from our earlier example to create an entirely new pipeline with either a Google Cloud Storage or an Azure Data Lake destination. These extensible frameworks can also help test pipeline function, which is critical to operating pipelines at scale (a generic testing sketch follows this list).
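As a generic illustration of the programmatic approach, independent of any specific SDK, the sketch below unit-tests a field-renaming step of the kind used in these pipelines. Function names, field names, and the test framework (pytest) are assumptions for illustration.

```python
# Illustrative only: unit-testing one piece of pipeline logic before promoting the
# pipeline to production. Function and field names are hypothetical; run with pytest.

def rename_fields(record: dict, mapping: dict) -> dict:
    """The field-renaming step, expressed as a pure function so it is easy to test."""
    return {mapping.get(key, key): value for key, value in record.items()}

def test_rename_fields_is_lossless():
    record = {"cust_id": 42, "amt": 19.99}
    renamed = rename_fields(record, {"cust_id": "customer_id", "amt": "amount_usd"})
    assert renamed == {"customer_id": 42, "amount_usd": 19.99}
    assert len(renamed) == len(record)  # no fields silently dropped

def test_rename_fields_ignores_unknown_fields():
    assert rename_fields({"extra": 1}, {"cust_id": "customer_id"}) == {"extra": 1}
```

Keeping transformation logic in small, testable units like this is what makes it practical to validate pipelines automatically before they run against production data.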

To explore this pipeline further, visit our GitHub.

Figure 3. This is a fragment created from three stages from the previous pipeline. Fragments are reusable elements from pipelines.
Figure 4. This pipeline uses the fragment in Figure 3 and checks for duplicates in all of the fields of the data before sending the unique data to the central Snowflake warehouse and the duplicate data to trash.
Figure 5. Updating multiple pipelines by editing fragments is simple. Edit the fragment and check it in. In the Update Pipelines menu, all of the pipelines that use that fragment will be visible. Check the pipelines you wish to edit and save.

Federated data engineering across multi-cloud

Modern companies are federating their data engineering and data integration practices to deliver self-service data access, automated data transformations, and real-time data delivery that feeds analytics and data products, with the help of StreamSets.

StreamSets is a modern data integration platform that enables companies to achieve the benefits of DataOps in multi-cloud environments. The platform has a unique architecture that lets enterprises build, run, monitor, and manage smart data pipelines at scale across hybrid and multi-cloud environments.

How do we do this? For pipeline development, StreamSets smart data pipelines are decoupled and intent-driven, making them far more resilient to constantly changing data schema, semantics, and infrastructure. StreamSets’ smart data pipelines are fully instrumented to detect all forms of data drift and decoupled as much as possible from the source and destination systems, so that 80% of data drift can be handled automatically, avoiding broken pipelines. They are also intent-driven, meaning they separate the “what” of the data from the “how,” so that data engineers can focus on the meaning of the data rather than wasting time on underlying technical implementation details that are irrelevant to the business.

For pipeline operations, StreamSets separates the data plane from the control plane so you can process data anywhere and control it all from a single pane of glass. The data plane engines, such as the Data Collector and Transformer engines, execute the data pipelines that move and transform data. These engines run in your environment: in an on-premises VM, on an on-premises cluster, in a VPC, or in a public cloud. Many enterprises have multiple instances of data plane engines, each running hundreds or thousands of data pipelines distributed across their hybrid/multi-cloud environment. Control Hub, a StreamSets-managed SaaS, can deploy, manage, and monitor all the data pipelines and all the data plane engines, no matter where they are, providing a single pane of glass for cross-enterprise transparency and control. This decoupled architecture keeps enterprise data fully secure and allows for global control and visibility across a distributed hybrid architecture.

This intent-driven design and decoupled approach allow data engineers to focus on building the pipeline logic. If new requirements from the business or more optimized cloud solutions arise, data engineers can make minor adjustments to the pipeline, design and share those artifacts across the team, and apply them to pipelines and fragments automatically.

Putting your plan into action

Now is the time to start thinking like a multi-cloud company and take action. You are aware of the precautions, design considerations, and best practices of the companies that came before you. The rest is up to you and your team of data-driven peers.  

  • Create a plan: Identify the skills and tools needed to operate with the teams and skills you have today. Leveraging self-service training and certification for cloud ecosystems can help your team better understand and plan for multi-cloud adoption. Identify early the data you want to replicate, and ensure that replication will not interfere with existing operations. Lastly, don’t be afraid to experiment using open data sets and automation.
  • Ensure that systems and pipelines can scale: Even if you are starting with your first data pipeline, testing and automation can help you make sure that the pipeline will operate as expected at scale. Previewing pipeline runs can help you test how data moves through each stage and how long those stages are taking. Identifying the right test data and criteria for evaluating the test runs will allow you to move from concept to production quickly and reliably.
  • Build foundational artifacts: Building foundational artifacts is key to scaling data pipelines across your team. It allows a single data engineer to enable multiple ETL developers, who in turn deliver value to many data analysts. During the concept phase, pipelines, pipeline fragments, and connections to data systems can all be built, saved, and shared to enable parallel development. It is a great way to bootstrap your team on the new tools and expose only the level of complexity appropriate to each role and level of system knowledge.
  • Identify your methods for keeping data anonymous and secure: Handling customer data in the cloud requires additional considerations to ensure customer trust is not violated. A common practice in data pipelines is masking or anonymizing customer or other sensitive data so that data engineers and data scientists can work freely on the data without the risk of violating compliance. Have a plan for addressing your customer data, and begin by identifying the fields where PII lives (a minimal sketch follows this list).
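Below is a minimal illustration of the masking idea. The field names, salt handling, and hashing choice are hypothetical; in practice, follow your organization’s approved masking or tokenization policy.

```python
# Illustrative only: pseudonymizing PII fields before data leaves your control.
# Field names and salt handling are hypothetical, not a recommended security design.
import hashlib

PII_FIELDS = {"email", "phone", "ssn"}
SALT = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Deterministic, irreversible token so joins still work across datasets."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

def mask_record(record: dict) -> dict:
    """Replace PII fields with tokens; pass everything else through unchanged."""
    return {field: (pseudonymize(str(value)) if field in PII_FIELDS and value is not None
                    else value)
            for field, value in record.items()}

# Usage: mask before writing to any cloud destination.
print(mask_record({"customer_id": 42, "email": "ada@example.com", "amount": 10.0}))
```

Deterministic tokens preserve joinability across datasets; if your compliance requirements prohibit any re-identification risk, prefer full redaction or a managed tokenization service instead.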

Conclusion: Multi-cloud in the real world

It’s always helpful to hear how others put theory into action, so we’ll wrap up with a customer story. We have a customer who started their journey with the firm conviction that they would never move data to the cloud. Just shy of a year later, they are operating across multiple clouds. The reason they could pivot in response to both technical and business considerations was largely the approach and tools they used, which allowed them to decouple the pipeline design process from the implementation. This approach minimized rework and ensured reliable pipeline operations that accounted for data drift. They were able to avoid vendor lock-in, ensure the best-fit architecture for their workloads, and change when their requirements changed. Most importantly, they were able to federate the practice of data engineering with tools like StreamSets, which helped them create pipelines and pipeline artifacts that apply to any cloud ecosystem.