Multi-cloud modernization is critical
In today’s competitive landscape, the pressure has never been higher to use data to create insights that guide your business and to build rich data applications that deliver value directly to your customers. To create these new insights and customer experiences, companies must think differently about the data they capture, store, and deliver to end customers. That’s because, increasingly, the data needed to complete this picture lives outside the company’s control and outside its data centers.
Bringing all this data in-house doesn’t usually make sense, which is why many companies have targeted these new workloads to run in the cloud from day one. This requires a design approach that accounts for the abundance of storage and compute resources in the cloud. It means that scale needs to be not only built in but also tested. However, it does not mean you can ignore the relational and core systems data that remains under your control. In fact, you get the most value from any successful customer 360 or advanced analytics use case when you marry internal and external data. This is why many companies will take a hybrid/multi-cloud approach for many years to come.
Moving to multi-cloud is a journey, and no single answer fits every scenario. Each company must decide which workloads to target for the cloud, which data it feels confident storing in the cloud, and the level of data access it wants people in the organization to have. So, if many companies are still working on moving their on-premises workloads to the cloud, then why are we here today to talk about multi-cloud?
By leading with a multi-cloud mindset, you can build a systematic and scalable data pipeline architecture that adjusts to your needs at every stage of the journey without costly redesign and months or years of lost productivity. The foundation of this data pipeline architecture is a set of principles, called DataOps, that provide agility, automation, and scalability. DataOps systems operate across data centers and public clouds, using the best tools for storage and transformations. Companies that lead with a DataOps approach can weather the shifts in their cloud strategy.
Benefits of leading with a multi-cloud approach
Ensuring the best-fit architecture for your workload
On-premises and cloud ecosystems each have their benefits. For instance, you would be hard-pressed to find public cloud disk and I/O performance that compares to bare metal servers. On the other hand, you could easily build a cloud server with more memory than you could ever conceive of deploying in your own data center. Different public cloud vendors also vary in price and performance across their services. Truly hybrid/multi-cloud companies leverage the best of these form factors to make strategic decisions about where they run their workloads. Being multi-cloud gives companies the leverage to run each workload in the place that brings the biggest value, whether the priority is cost or performance. They can do this because data is replicated to each environment, ensuring every workload has reliable data wherever it runs.
Agility to manage costs and leverage managed services
Additionally, all cloud providers now offer managed services for common workloads, from data ingestion to machine learning. These managed services run on optimized architectures specifically configured to deliver strong performance for things like stream processing, data transformations, and analytic query execution. They often require no specialized skills to operate, allowing developers to focus on the data-specific workloads and easing the management burden on the IT team. A company would be hard-pressed to design these purpose-built architectures for everything it needs to do with data, which is why cloud managed services have become increasingly popular. However, they carry one risk that every company should be mindful of: vendor lock-in.
Avoiding vendor lock-in and reducing development time
Vendor lock-in can be hard to detect, and if you don’t understand the degree to which you are locked in, the problems only worsen as you scale. Cloud ecosystems evolve rapidly. A single cloud’s offerings and pricing model may align with your needs today but not in the future. Frequently, data and data platform products are the sources of lock-in. Data has gravity, even with the impressive network speeds that modern cloud providers offer. Moving and migrating data reliably between clouds requires a mature understanding of data engineering for clouds. Lock-in also happens when companies rely heavily on managed services provided by a single cloud provider. Some of these solutions create their own catalogs, mappings, and custom execution logic, which may be difficult to recreate in new cloud environments. When choosing core data integration tools, it’s important to understand how portable your data and pipelines are. What would it cost the team to redevelop them?
Support innovation by meeting your team where they already are
Despite even the most well-intended mandates, data workloads are finding their way to the cloud. With analytics permeating every sector of the modern company, it’s often impossible to fully understand the scope of cloud usage across the business. A multi-cloud strategy mitigates this risk by building foundations that can be applied as best practices across clouds while meeting the needs of diverse teams. Data teams already have diverse requirements for data structure and delivery; trying to apply those universally across different cloud environments can be a daunting task. Cloud-neutral tooling allows data and pipeline artifacts to be shared across teams and applied across different cloud providers. This boosts productivity by eliminating the need for costly redevelopment.
Roadblocks to multi-cloud success
Spoiler Alert: It’s NOT as simple as moving data across clouds.
This section discusses a couple of common pain points in operating in multi-cloud environments. But the short answer is that the dynamic nature of data and data workloads presents very specific challenges in reliably moving, syncing, and transforming data across clouds. The key word is “reliably.” There is a plethora of tools for the simple migration of data to the cloud or from one cloud to another, but very few offer the reassurance that data is migrated in its intended form and meaning. If not addressed at its foundation, this problem will wreak havoc as you scale. To mitigate the risk of data corruption, it is important to avoid these common pitfalls on the road to multi-cloud adoption.
Plan for data drift
Data drift is the unexpected and unending change that happens to data naturally as it flows through modern data systems. Data systems have long been disciplined and controlled by the master schema; it was really the only way to make sense of large data sets. Companies defined their schema approach and enforced it globally across the systems they controlled. In multi-cloud operation, a schema-led approach is often a recipe for disaster because much of the data is outside the company’s control. The dynamic nature of cloud data platforms provides flexibility beyond the schema’s confines. Data drift becomes increasingly prevalent as you migrate to the cloud and across clouds. Simply addressing data drift after it happens will only result in continuous delays and diminishing trust in your data. Multi-cloud companies take a proactive approach: they plan for drift, detect it in real time, and act before it pollutes downstream datasets.
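To make the idea concrete, here is a minimal, tool-agnostic sketch of drift-aware ingestion: instead of rejecting records that don’t match a fixed schema, the consumer tracks the schema it has observed and surfaces changes so they can be handled before they reach downstream datasets. The field names and record shapes are illustrative only.

```python
# Minimal illustration of drift-aware ingestion: rather than rejecting records
# that don't match a fixed schema, track the observed schema and evolve it.
# This is a generic sketch, not tied to any specific integration tool.
from typing import Any

observed_schema: dict[str, type] = {}  # running view of fields seen so far

def reconcile(record: dict[str, Any]) -> list[str]:
    """Fold a record into the observed schema and report what drifted."""
    changes = []
    for field, value in record.items():
        if field not in observed_schema:
            observed_schema[field] = type(value)
            changes.append(f"new field: {field}")
        elif value is not None and not isinstance(value, observed_schema[field]):
            changes.append(f"type change: {field} -> {type(value).__name__}")
    return changes

# The second record adds a field and changes a type -- drift we can surface
# and handle (e.g., alter the target table) instead of breaking the pipeline.
for rec in [{"id": 1, "amount": 9.99}, {"id": "2", "amount": 4.5, "coupon": "SAVE10"}]:
    drift = reconcile(rec)
    if drift:
        print("drift detected:", drift)
```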
Avoid costly re-work where possible
Here, we broach the subject of point solutions in the cloud and their pitfalls. Companies have many options when designing components of their full data lifecycle. Cloud providers almost universally provide tools for simple ingest, easy transformation, and quick visualization. These tools allow teams to start gaining value in their cloud data environments quickly. The solutions are often cheap and easy to get started with—but be sure to heed the warnings of companies that have come before you. Often these point solutions are proprietary and limited in functionality; providers offer them at a compelling price point to get more of your data into their systems. Data engineers often report that while these tools may be satisfactory for the preliminary movement of data, they fall short of addressing the comprehensive needs of the data engineering team. This forces engineers to cobble together solutions within the tools’ limitations rather than toward the goals of the company or project. That tax diminishes productivity and further delays realizing value. Unfortunately, many companies only realize how costly this redesign will be once they have decided to move to another cloud. This is why taking a proactive approach, with multi-cloud in mind from the outset, will give you agility when you need it most.
Cloud data ecosystem overview
Go beyond the hype to create your modern data stack
Cloud ecosystems are vast, move at a feverish pace, and increasingly require a wide range of skills. Cloud platforms provide services for storage, compute, data warehousing, data lakes, and streaming data. These services run on optimized architectures and remove the management burden of both infrastructure and operations. You can build a modern data stack with these cloud services in concert with third-party tools. The modern data stack is unique to each company and represents how data needs to be stored, treated, and delivered to analytics teams.
Designing a modern data stack can be daunting because it is often hard to navigate past the marketing and hype surrounding the tool descriptions. To simplify this exercise, here's a simple breakdown:
Rethinking multi-cloud deployment
Before you begin building pipelines, it’s important to solidify your deployment strategy. In fact, if you get the deployment wrong, no amount of pipelines will save you. This includes designing for multiple cloud environments. Legacy tools for building data pipelines relied on tight mappings to the deployment environment to execute pipeline transformations. These mappings were discrete and unique, which required rework when changes to the pipeline logic were needed. Modern data integration tools offer a reimagined approach to the design of data pipelines and can generally be grouped into two categories.
Ecosystem-centric
Many modern, cloud-native data integration solutions rely heavily on the resources of the ecosystem where a workload or team operates. This can provide advantages in cost and performance because the design minimizes the amount of data that needs to be moved in order to process it. Also, by utilizing the underlying storage of the data platform, these solutions have the most direct connection to the data, minimizing throughput and I/O limitations. They are commonly branded alongside the ecosystem as ELT or SQL solutions.
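As a rough illustration of the ELT pattern, the sketch below pushes the transformation down to the engine that already holds the data rather than extracting it first. SQLite stands in for a cloud warehouse connection, and the table names are placeholders.

```python
# Illustrative ELT pushdown: the transformation runs as SQL inside the engine
# where the data already lives, rather than pulling the data out to transform it.
# SQLite is a stand-in for a cloud warehouse; table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount REAL, region TEXT);
    INSERT INTO raw_orders VALUES (1, 10.0, 'EU'), (2, 25.5, 'US'), (3, 5.0, 'EU');
    -- The "transform" step is pushed down to the engine holding the data:
    CREATE TABLE orders_by_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY region;
""")
print(conn.execute("SELECT * FROM orders_by_region").fetchall())
conn.close()
```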
Control plane and data plane decoupling
Another approach decouples deployment and pipeline operations to give maximum agility and workload portability. In this scenario, pipeline development, collaboration, design, and automation are all executed via a single control plane. This plane is focused on pipeline design, engine deployment, and cross-team collaboration. The control plane can be offered as a SaaS platform with broad access across the data and analytics teams. The pipeline engines themselves can be deployed directly in the data environment, or in multiple data environments; these engines represent the data plane layer.
This decoupled design is often superior because data can be processed locally by the data plane engines, even utilizing ecosystem resources like ELT and Apache Spark-based processing. This means that no data travels into the control plane, and pipelines can easily be applied to a cloud ecosystem and changed with minimal operational effort. This approach offers the best of both worlds and a scalable model that is proven in many enterprises.
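To illustrate the separation, here is a minimal sketch (with hypothetical names and fields) in which the pipeline definition is a portable artifact and each data plane engine entry is just deployment configuration; the same logical pipeline binds to engines in different clouds without redesign.

```python
# A minimal sketch of control plane / data plane decoupling: the pipeline
# definition (the "what") is a portable artifact, while each engine entry
# (the "where") is deployment configuration. Names and fields are hypothetical.
from dataclasses import dataclass

@dataclass
class PipelineSpec:
    name: str
    source: str            # logical source, e.g. "kafka:orders"
    transform_sql: str     # intent of the transformation
    destination: str       # logical destination, e.g. "object_store:curated"

@dataclass
class EngineDeployment:
    engine_url: str        # where the data plane engine runs
    destination_uri: str   # physical binding for the logical destination

spec = PipelineSpec(
    name="orders_curation",
    source="kafka:orders",
    transform_sql="SELECT region, SUM(amount) FROM orders GROUP BY region",
    destination="object_store:curated",
)

# The same spec can be bound to engines in different clouds without redesign.
deployments = [
    EngineDeployment("https://engine.aws.internal", "s3://curated-orders/"),
    EngineDeployment("https://engine.azure.internal", "abfss://curated@lake.dfs.core.windows.net/"),
]

for d in deployments:
    print(f"deploy '{spec.name}' to {d.engine_url} -> {d.destination_uri}")
```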
Now, with our multi-cloud deployment strategy defined, we can start building data pipelines.
3 Pipeline Designs for Multi-Cloud Success
- On-premises (legacy) to cloud
- Migrating from one cloud to another
- Multi-cloud operational
The move to the cloud and across clouds is often a journey.
For some companies, being in the cloud from day one was an option. These companies may have had enough agility baked into their business models to absorb whatever risks the public cloud presented. Many others are still mapping out how to migrate their existing systems to a single cloud. No stage in the journey is final, and the path is not always linear. Other companies may choose to make their leap to the cloud a multi-cloud journey. While there is no single template for everyone, we have outlined the critical inflection points that drive the design of a multi-cloud environment and provided pipeline demonstrations to visualize the concepts.
So let’s start with a common springboard to a multi-cloud journey: migrating data from on-premises to the cloud.
The first step in a multi-cloud journey is often to migrate on-premises data to one or multiple cloud object store locations. This is usually executed in two stages: an initial batch load and incremental updates to ensure that legacy and cloud environments are in sync. For many companies, this may be the starting point. Others may choose a multi-cloud design from day one and proactively send data to multiple destinations. There are no “right” answers, only the answer that best fits your business.
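The two-stage pattern can be sketched in a few lines. In the illustration below, SQLite stands in for the legacy source, the table and column names are hypothetical, and the upload step is a placeholder for your object store client; an initial full export is followed by incremental syncs keyed on a watermark column.

```python
# A minimal sketch of the two-stage pattern: an initial batch load, then
# incremental updates keyed on a watermark column so legacy and cloud copies
# stay in sync. SQLite and the table/column names are illustrative stand-ins.
import csv
import io
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, '2024-01-01T00:00:00'),
        (2, 25.5, '2024-01-02T00:00:00');
""")

def export_rows(rows, label: str) -> None:
    """Placeholder for handing serialized rows to a cloud object store client."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    print(f"would upload {len(rows)} rows as {label}.csv ({buf.tell()} chars)")

# Stage 1: initial batch load of everything that exists today.
initial = conn.execute("SELECT id, amount, updated_at FROM orders").fetchall()
export_rows(initial, "orders_full")
watermark = max((row[2] for row in initial), default="1970-01-01T00:00:00")

# Stage 2: incremental updates, run on a schedule, shipping only new or changed rows.
def incremental_sync(watermark: str) -> str:
    changed = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    if changed:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        export_rows(changed, f"orders_delta_{stamp}")
        watermark = max(row[2] for row in changed)
    return watermark

watermark = incremental_sync(watermark)
```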
Pipeline considerations:
- Rethinking a schema approach: Traditional data warehouse and data lake migrations can take months or even years because of the tendency to map out the schema design both on the source system and in the target destination. However, most cloud data warehouses and data lakes have flexible schema designs that infer schema from your incoming data, so there is no need to painstakingly rebuild the source-side schema by hand. Choose a modern data integration tool that supports multi-table inserts, then press play and watch your schema build reliably in the destination.
- Think beyond batch: While the initial scope of your data migration may focus only on the initial batch loading of data, it is important to shift your thinking toward delivering continuous data. Even the most curated datasets need to change over time. Multi-cloud companies do not think of data migration as a single activity but as a system that constantly delivers new data and enriches existing data and data relationships. Streaming brokers such as Apache Kafka and Amazon Kinesis can deliver periodic and streaming data in a continuous fashion with little oversight and management.
- Prepare for data drift: As we stated earlier, as data moves from highly regulated on-premises systems to the cloud, you are likely to encounter data drift. Multi-cloud companies do not fight drift; they plan for it. They make sure their integration tools are equipped to handle drift without breaking pipeline connections. They also invest in data quality tools so they can ensure that data lands in its intended form, because analytics are only as reliable as the data behind them.
- Consider writing to two destinations: One way to mitigate being locked into a cloud vendor or a less-than-ideal workload design is to replicate critical data to multiple cloud object stores. These might be in two different clouds, or possibly two managed services in the same cloud. Since storage costs are generally very manageable in the cloud, you have the freedom to apply compute to data wherever it makes the most sense, as well as the freedom to change your mind over time (see the sketch after this list).
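Here is a minimal sketch of the dual-destination idea: the same payload is written to object stores in two clouds. The bucket names are hypothetical, and both clients assume credentials are already configured in the environment (for example via AWS environment variables and GOOGLE_APPLICATION_CREDENTIALS).

```python
# Write the same payload to two cloud object stores so a workload is never
# tied to a single provider. Bucket names are hypothetical; credentials are
# assumed to be configured in the environment.
import json

import boto3
from google.cloud import storage

payload = json.dumps([{"id": 1, "amount": 9.99}, {"id": 2, "amount": 25.50}])
key = "curated/orders/2024-06-01.json"

# Destination 1: Amazon S3
boto3.client("s3").put_object(Bucket="example-curated-aws", Key=key, Body=payload)

# Destination 2: Google Cloud Storage
gcs_bucket = storage.Client().bucket("example-curated-gcp")
gcs_bucket.blob(key).upload_from_string(payload, content_type="application/json")

print(f"wrote {len(payload)} bytes to both destinations under {key}")
```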
Check out the following pipeline example, and explore this pipeline further on GitHub.
Another common pattern is moving workloads and migrating data between public cloud environments. As with the legacy (hybrid) approach, you can’t simply throw data at another cloud and expect your workloads to perform the same. According to recent research, it can take from 6 to 24 months to migrate fully from one cloud to another. On the lower end of that spectrum are the organizations that have planned for multi-cloud and adopted data pipeline tools that decouple the pipeline logic from the infrastructure. In these scenarios, a cloud migration for a data workload may be as simple as changing the destination and a few configuration settings. Companies and data teams can use pipeline fragments to make destination changes to hundreds of pipelines at once.
Pipeline considerations:
- Design for new compute platforms and managed services: As important as it is to understand the design considerations for moving to a single cloud, data professionals must also identify and plan for differences across clouds. While most cloud systems offer operating system and coding language flexibility at the application layer, data-layer components are often more specialized and proprietary in how they are controlled and experienced. At the pipeline level, you want to make sure you can apply the same data pipeline logic across clouds and managed services with minimal redesign. When using ecosystem-specific tools, you may need to migrate off the technology entirely, resulting in further delays and decreased agility.
- Avoid vendor lock-in at the metadata and schema level: When using cloud data services, consider the additional risks you take on by standardizing certain elements of your core data design. Some cloud tools impose metadata services and schema-specific design requirements that make switching solutions even more painful should you reconsider your cloud strategy. Some degree of specificity exists in any technology solution, but your team should evaluate whether the risk of lock-in is acceptable in exchange for the value the service provides. This is especially important to consider early, because making these changes at scale only becomes more complex.
- Sync environments and migrate over time: The key to being multi-cloud is not having to execute a fire drill every time technology or business changes occur. Keeping data sets synced across multiple environments is an increasingly common practice for modern companies. Given that cloud storage is priced at commodity levels, companies can be more liberal about how much data they replicate and how often. This is not to encourage you to replicate all your data to multiple clouds, but to evaluate the core data used for analytics and applications and make that strategic data available as close to the workload as possible. Modern data pipelines can deliver duplicate data to multiple destinations, a core capability for operating in a multi-cloud operational model.
- Implement operational visibility across clouds: Data integration services and data platforms commonly offer their own tools for management, configuration, and workload visibility. Data engineers are generally talented at navigating these numerous consoles to debug and recover when pipelines or jobs break, but when operating in multi-cloud, this management burden becomes even greater. A common approach is to bolt on observability solutions after the data has landed to identify potential concerns with the data pipelines. While this is needed for certain scenarios, health and visibility into execution at the pipeline level are just as impactful and complementary. Consider data pipeline solutions that provide as much visibility in as few panes as possible, so your team spends minimal time debugging and break-fixing (a simple aggregation sketch follows this list).
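As a rough sketch of single-pane visibility, the snippet below polls a health endpoint on each data plane engine and aggregates the results into one view. The engine URLs, endpoint path, and JSON fields are hypothetical placeholders rather than any specific product’s API.

```python
# Sketch of single-pane visibility across engines in different clouds: poll
# each data plane engine's metrics endpoint and aggregate the results.
# URLs, paths, and JSON fields are hypothetical placeholders.
import json
from urllib.request import urlopen

ENGINES = {
    "aws-engine": "https://engine.aws.example.com/metrics",
    "azure-engine": "https://engine.azure.example.com/metrics",
    "gcp-engine": "https://engine.gcp.example.com/metrics",
}

def collect_status() -> list[dict]:
    """Gather per-engine pipeline status into one list for a single dashboard."""
    rows = []
    for name, url in ENGINES.items():
        try:
            with urlopen(url, timeout=5) as resp:
                metrics = json.load(resp)
        except OSError as exc:  # engine unreachable: surface it, don't hide it
            rows.append({"engine": name, "status": "unreachable", "error": str(exc)})
            continue
        rows.append({
            "engine": name,
            "status": "healthy" if metrics.get("errors", 0) == 0 else "degraded",
            "running_pipelines": metrics.get("running_pipelines"),
        })
    return rows

if __name__ == "__main__":
    for row in collect_status():
        print(row)
```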
This next phase is where modern companies increasingly find themselves. It’s impossible to know every shift in ecosystem, business strategy, or competitive pressure, but you can be prepared for it. Multi-cloud operational companies sync data across clouds and execute workloads in the environments that best fit their cost and performance requirements. While maintaining duplicate data environments does require initial investments in duplicate storage and oversight, the resulting savings are a clear testament to the power of making this shift. If you plan correctly and lean into multi-cloud considerations, your move to multi-cloud operations can be painless and swift.
Pipeline considerations:
- Design portability across compute and storage platforms: The value of multi-cloud may never be fully realized if organizations don’t have the agility to move workloads liberally and strategically. Choosing platforms built on open standards like Apache Kafka, Apache Spark, and others ensures that some compute and streaming workloads can operate in a common form on different clouds. For data integration, you can either leverage multiple tools across ecosystems, accepting the tax of added development, or choose a data integration solution that operates across clouds. For example, the pipeline example above looks mostly the same across clouds.
- Consider development at scale: It is one thing to build a simple pipeline and preview a set of test data through it. It is a very different experience running a pipeline in production. If your team has varying familiarity and skills in each tool, are you effectively able to load balance the work? Your ability to establish best practices and reusable processes will determine the speed with which you can develop data pipelines and ultimately deliver new functionality.
- Leverage reusable components: Our pipeline example shows a pipeline fragment, an artifact representing a section of a pipeline. Fragments are generally used in two ways. First, developers and data engineers use pipeline fragments to capture common processing stages, and even sources and endpoints, to speed up their development. Any pipeline that uses a fragment can pick up updates to it automatically, which helps mitigate the risks of infrastructure drift. Second, data engineers use fragments to accelerate development across their teams: one data engineer can create multiple fragments that can be used by many ETL developers and analysts.
- Programmatic development and testing: While it is entirely possible to complete a cloud migration without writing a single line of code, capable modern data integration systems and pipeline tools also support extensibility through a programmatic approach. In our pipeline example, an SDK for Python is used to reuse the pipeline fragment from the earlier example and create an entirely new pipeline with either a Google Cloud Storage or an Azure Data Lake destination. These extensible frameworks can also help test pipeline function, which is critical to operating pipelines at scale (see the sketch after this list).
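The sketch below illustrates the shape of that programmatic approach: a shared fragment is reused to stamp out pipelines with different cloud destinations. The PipelineClient class and its methods are hypothetical stand-ins for a vendor SDK, shown only to convey the pattern, not an actual API.

```python
# Illustrative programmatic pipeline creation: reuse a shared fragment and
# generate new pipelines with different cloud destinations. PipelineClient and
# its methods are hypothetical, not a real SDK's API.
from dataclasses import dataclass, field

@dataclass
class Fragment:
    name: str
    stages: list[str]

@dataclass
class Pipeline:
    name: str
    stages: list[str] = field(default_factory=list)

class PipelineClient:
    """Hypothetical client; a real SDK would talk to a control plane."""

    def build_from_fragment(self, name: str, fragment: Fragment, destination: str) -> Pipeline:
        return Pipeline(name=name, stages=[*fragment.stages, f"write:{destination}"])

    def publish(self, pipeline: Pipeline) -> None:
        print(f"published '{pipeline.name}': {' -> '.join(pipeline.stages)}")

client = PipelineClient()
curation = Fragment("orders_curation", ["read:kafka_orders", "mask:pii", "aggregate:region"])

# The same fragment backs two pipelines with different cloud destinations.
for dest in ("gs://example-curated", "abfss://curated@examplelake.dfs.core.windows.net"):
    scheme = dest.split(":")[0]
    client.publish(client.build_from_fragment(f"orders_to_{scheme}", curation, dest))
```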
To explore this pipeline further, visit our GitHub.
Federated data engineering across multi-cloud
With the help of StreamSets, modern companies are federating their data engineering and data integration practices to deliver self-service data access, automated data transformations, and real-time data delivery that feeds analytics and data products.
StreamSets is a modern data integration platform that enables companies to achieve the benefits of DataOps in multi-cloud environments. The platform has a unique architecture that lets enterprises build, run, monitor, and manage smart data pipelines at scale across hybrid and multi-cloud environments.
How do we do this? For pipeline development, StreamSets smart data pipelines are decoupled and intent-driven, making them far more resilient to constantly changing data schemas, semantics, and infrastructure. StreamSets’ smart data pipelines are fully instrumented to detect all forms of data drift and decoupled as much as possible from the source and destination systems, so that 80% of data drift can be handled automatically, avoiding broken pipelines. They are also intent-driven, meaning they separate the “what” of the data from the “how,” so that data engineers can focus on the meaning of the data rather than wasting time on underlying technical implementation details that are irrelevant to the business.
For pipeline operations, StreamSets separates the data plane from the control plane so you can process data anywhere and control it all from a single pane of glass. The data plane engines, such as Data Collector and Transformer, execute the data pipelines that move and transform data. They run in your environment: in an on-premises VM or cluster, in a VPC, or in a public cloud. Many enterprises run multiple instances of data plane engines, each executing hundreds or thousands of data pipelines distributed across their hybrid/multi-cloud environment. Control Hub, a StreamSets-managed SaaS, deploys, manages, and monitors all the data pipelines and all the data plane engines, no matter where they are, providing a single pane of glass for cross-enterprise transparency and control. This decoupled architecture keeps enterprise data fully secure while allowing global control and visibility across a distributed hybrid architecture.
This intent-driven design and decoupled approach allow data engineers to focus on building the pipeline logic. If new business requirements or better-optimized cloud services arise, data engineers can make minor adjustments to the pipeline design, share those artifacts across the team, and apply them to pipelines and fragments automatically.
Putting your plan into action
Now is the time to start thinking like a multi-cloud company and take action. You are aware of the precautions, design considerations, and best practices of the companies that came before you. The rest is up to you and your team of data-driven peers.
- Create a plan: Identify the skills and tools needed to operate, starting from the teams and skills you have today. Self-service training and certification for cloud ecosystems can help your team better understand and plan for multi-cloud adoption. Identify early the data you want to replicate, and ensure that replication will not interfere with existing operations. Lastly, don’t be afraid to experiment using open data sets and automation.
- Ensure that systems and pipelines can scale: Even if you are starting with your first data pipeline, testing and automation can help you make sure that the pipeline will operate as expected at scale. Previewing pipeline runs can help you see how data moves through each stage and how long each stage takes. Identifying the right test data and the criteria for evaluating test runs will allow you to move from concept to production quickly and reliably.
- Build foundational artifacts: Building foundational artifacts is key to scaling data pipelines across your team. It allows a single data engineer to enable multiple ETL developers, who in turn deliver value to many data analysts. During the concept phase, pipelines, pipeline fragments, and connections to data systems can all be built, saved, and shared to enable parallel development. It is a great way to bootstrap your team on new tools and expose only the level of complexity appropriate for each role and level of system knowledge.
- Identify your methods for keeping data anonymous and secure: Handling customer data in the cloud requires additional considerations to ensure customer trust is not violated. A common practice in data pipelines is masking or anonymizing customer or otherwise sensitive data so that data engineers and data scientists can work freely on the data without the risk of violating compliance. Have a plan for addressing your customer data and begin identifying the fields where PII lives (a minimal masking sketch follows this list).
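Here is a minimal, generic sketch of field-level masking applied inside a pipeline: PII fields are replaced with salted hashes so downstream users never see raw values. The field names are hypothetical, and a production setup would manage the salt as a secret and choose masking rules per compliance requirements.

```python
# Minimal field-level masking sketch: PII fields are hashed (or redacted) before
# data leaves the pipeline, so downstream engineers and data scientists never
# see raw values. Field names are hypothetical; manage the salt as a secret.
import hashlib

PII_FIELDS = {"email", "phone", "ssn"}
SALT = b"replace-with-a-secret-salt"

def mask_record(record: dict) -> dict:
    """Return a copy of the record with PII fields replaced by salted hashes."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and value is not None:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            masked[key] = f"masked:{digest[:16]}"  # stable token, still usable for joins
        else:
            masked[key] = value
    return masked

print(mask_record({"id": 7, "email": "jane@example.com", "amount": 42.0}))
```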