The building blocks of AWS lakehouse architecture
The data lakehouse is a relatively recent evolution of data lakes and data warehouses. Amazon was one of the first to offer a lakehouse as a service. In 2017, it launched Amazon Redshift Spectrum, which lets users of the Amazon Redshift data warehouse service query data stored in Amazon S3. In this piece, we’ll dive into all things AWS lakehouse.
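
To make the Redshift Spectrum idea more concrete, here is a minimal sketch of the two SQL statements typically involved: one that registers an external schema backed by a data catalog, and one that joins an S3-backed external table with a local Redshift table. The schema, database, table, and column names and the IAM role ARN are hypothetical placeholders. The code only builds SQL text and does not connect to AWS.

```python
# Sketch only: builds Redshift Spectrum SQL as text; nothing is sent to AWS.
# Every identifier and the IAM role ARN used here is a hypothetical placeholder.


def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """SQL that exposes a data catalog database to Redshift as an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def lake_join_sql(schema: str) -> str:
    """SQL that joins an external (S3-backed) table with a local Redshift table."""
    return (
        "SELECT c.customer_id, COUNT(*) AS click_count\n"
        f"FROM {schema}.clickstream AS e  -- external table, files live in S3\n"
        "JOIN customers AS c             -- local Redshift table\n"
        "  ON c.customer_id = e.customer_id\n"
        "GROUP BY c.customer_id;"
    )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"  # placeholder
    print(external_schema_sql("lake", "example_catalog_db", role))
    print(lake_join_sql("lake"))
```

The point of the sketch is the shape of the query: the warehouse and the lake are addressed in the same SQL statement, which is the core of the lakehouse idea.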

How AWS defines lakehouse architecture

The AWS lakehouse architecture connects your data lake, data warehouse, and other purpose-built analytics services. 

According to Amazon, this enables you to “Have a single place where you can run analytics across most of your data while the purpose-built analytics services provide the speed you need for specific use cases.” 

AWS represents lakehouse architecture in the following five logical layers:

Data Source(s) > Data Ingestion > Data Storage and Shared Catalog > Data Processing > Consumption

The goal of the AWS data lakehouse is to create a single place where:

  1. You can run analytics across most of your data
  2. You can use your purpose-built analytics services for specific use cases

If you compare AWS’s take on the lakehouse concept to that of Databricks, another lakehouse provider, you can see the similarities. From the Databricks white paper on the lakehouse platform:

“Lakehouses combine the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter.”

A common thread across these providers is that lakehouses take the best of both data warehouses and data lakes, blending accessibility with speed and power.

Building an AWS lakehouse architecture

Lakehouse architecture is a concept, and AWS lakehouse architecture is a branded version of that concept. Lakehouse architectures don’t require one particular brand of technology. A successful lakehouse architecture could incorporate a wide variety of tools: Databricks, Dremio, and, of course, AWS, to name a few. But lakehouses do, generally speaking, require the following layers:

  1. Data sources like applications, databases, file shares, web, sensors, and social media.
  2. Ingestion capable of migrating, replicating, and delivering data from the data sources. In the lakehouse reference architecture on AWS:
    • Application data is ingested with Amazon AppFlow
    • Data from operational databases is ingested with AWS Database Migration Service (AWS DMS)
    • Files from file shares are ingested with AWS DataSync
    • Streaming data is ingested via Amazon Kinesis
  3. Integrated storage of data lake and warehouse with open file formats and a common catalog layer. In the lakehouse reference architecture on AWS:
    • Amazon Redshift and Amazon S3 offer data warehouse and data lake storage, respectively
    • AWS Lake Formation offers a common catalog for data stored in Amazon S3 and Amazon Redshift
  4. Processing with multiple components that match the lakehouse’s data structures and velocity and enable a variety of data processing use cases, such as Spark data processing, near-real-time ETL, and SQL-based ELT.
  5. Consumption that allows various users to access data via SQL, BI dashboards, and ML models for a variety of analytics use cases.
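
As a rough illustration, the ingestion and storage services named in the list above can be written down as a small lookup table. The service names come from the list; the dictionary layout, the role labels, and the helper name `service_for` are assumptions made for this sketch, not part of any AWS API.

```python
# Sketch only: service names are taken from the list above; the structure,
# role labels, and helper name are assumptions made for illustration.
REFERENCE_ARCHITECTURE = {
    "ingestion": {
        "application data": "Amazon AppFlow",
        "operational databases": "AWS Database Migration Service",
        "file shares": "AWS DataSync",
        "streaming data": "Amazon Kinesis",
    },
    "storage": {
        "data warehouse": "Amazon Redshift",
        "data lake": "Amazon S3",
        "common catalog": "AWS Lake Formation",
    },
}


def service_for(layer: str, role: str) -> str:
    """Return the service the list above names for a given layer and role."""
    try:
        return REFERENCE_ARCHITECTURE[layer][role]
    except KeyError:
        raise KeyError(f"no service listed for {layer!r} / {role!r}") from None


if __name__ == "__main__":
    for layer, roles in REFERENCE_ARCHITECTURE.items():
        for role in roles:
            print(f"{layer:<10} {role:<22} {service_for(layer, role)}")
```

Writing the mapping out this way makes it easy to see which slots are filled by AWS services and which could be filled by other tools in a mixed-vendor lakehouse.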

Lakehouse on AWS vs. other lakehouses

Framing the choice as AWS lakehouse versus another lakehouse is itself a problem: it treats one approach as optimal and the other as suboptimal. In practice, there are simply too many variables to follow one path blindly.

Lakehouse architecture allows for, even encourages, a variety of components assembled to deliver flexibility and performance tailored to an individual organization’s current and future use cases.

StreamSets and the AWS lakehouse architecture

StreamSets helps customers bring even more agility, power, and scale to the AWS ecosystem. It does this through native AWS integration with many of the components of an AWS lakehouse architecture, including Redshift, S3, Kinesis, and EMR. These native integrations give more users easy access to key capabilities, such as real-time analytics.

By opening up predictive analytics to more users, organizations can begin to tackle use cases like:

  • Predicting customer churn
  • Anticipating network connectivity problems
  • Identifying fraud
  • Analyzing marketing spend
  • Discovering associated products and services

StreamSets provides a modern data integration platform for building smart data pipelines. You can design pipelines for any paradigm and change course when the strategy changes, an important capability for data engineers who spend most of their time reworking pipelines when destinations change.
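
One generic way to picture a pipeline that survives a change of destination is to have the transformation logic depend only on a small destination interface. The sketch below is plain Python and is not the StreamSets API; every class and function name in it is hypothetical, and the two destinations are in-memory stand-ins rather than real warehouse or object-storage clients.

```python
import json
from typing import Iterable, Protocol


class Destination(Protocol):
    """Anything that can accept one record at a time."""

    def write(self, record: dict) -> None: ...


class InMemoryWarehouse:
    """Stand-in for a warehouse table: keeps records as rows."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, record: dict) -> None:
        self.rows.append(record)


class InMemoryLake:
    """Stand-in for object storage: keeps records as serialized JSON lines."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def write(self, record: dict) -> None:
        self.lines.append(json.dumps(record, sort_keys=True))


def run_pipeline(records: Iterable[dict], destination: Destination) -> int:
    """Normalize field names to lowercase and deliver each record."""
    count = 0
    for record in records:
        cleaned = {key.lower(): value for key, value in record.items()}
        destination.write(cleaned)
        count += 1
    return count


if __name__ == "__main__":
    source = [{"ID": 1, "Plan": "basic"}, {"ID": 2, "Plan": "pro"}]
    # Switching destinations is a one-argument change; the pipeline is untouched.
    run_pipeline(source, InMemoryWarehouse())
    run_pipeline(source, InMemoryLake())
```

Because `run_pipeline` never refers to a concrete destination, moving from a warehouse-first to a lake-first strategy in this sketch means swapping one argument instead of rewriting the pipeline.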

For a more concrete idea of what this could look like, check out our AWS reference architecture guide for StreamSets or take a look at our docs for supported Amazon stages.
