
Database to data warehouse: How it all began
Companies have long recognized the value of building large-scale data infrastructure to support their analytics teams and goals. Operational databases are well suited to collecting data from applications and storing it for reference, but poorly suited to running analytical queries on that same data. That is why many companies invest in separate data platforms for analytics. In 1996, the practice was forever changed when Ralph Kimball published the first edition of The Data Warehouse Toolkit. Earlier modeling practices were adequate for recording and updating transactional data but lacked the ability to represent the complex relationships between data that analysis requires. This is where dimensional modeling excelled, and it became a foundational principle for building analytics data platforms.
A data warehouse, also known as an enterprise data warehouse or EDW, is a central repository of information that can be analyzed to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications. It is a canonical architecture for any company that aspires to be data driven.

Software giants like IBM and Oracle built large, expensive data warehouse offerings that bundled the servers, software, and services needed for descriptive analytics. At the same time, companies rushed to hire EDW administrators who were in charge of building the schema and deciding how all data should flow into the data warehouse. However, as companies grew hungrier to leverage more and more data, the data warehouse became associated with its limitations: it grew crowded and bogged down with requests, which killed its performance and strained its ability to deliver on service level agreements. Thus "enterprise data warehouse" became a four-letter word. This largely fueled the invention of the data lake, which aimed to solve the problem of scale. But was the data lake really a replacement for the data warehouse?
The rise of the data lake
The problem with data warehouse vs data lake
The problem with this paradigm is that it treats one approach as wrong and the other as right, when in practice companies may choose to leverage a data lake, a data warehouse, or both for foundationally sound reasons. Here are some guidelines, with a brief sketch after the lists below.
When to use a data warehouse
- Query performance
- Transactional reporting
- Dashboards
- Structured data
- Data integrity
When to use a data lake
- Large data volumes
- Unstructured and semi-structured data
- Streaming and time relevant data
- Data archive
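To illustrate the contrast in the lists above, here is a minimal sketch of how data typically lands on each side. All names, paths, and values are hypothetical: sqlite3 stands in for a warehouse and the local filesystem stands in for a data lake's object storage, purely for illustration.

```python
import json
import os
import sqlite3
from datetime import date, datetime

# Warehouse side: structured rows land in a predefined schema, so dashboards,
# transactional reporting, and integrity constraints can rely on it.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE sales (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL    NOT NULL,
        order_date  TEXT    NOT NULL
    )
""")
warehouse.execute(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    (1001, 1, 59.90, date.today().isoformat()),
)

# Lake side: raw, semi-structured events are written as-is (here, JSON lines
# in a date-partitioned path), and structure is imposed later, at read time.
event = {"user": "anon-42", "action": "page_view", "ts": datetime.now().isoformat()}
lake_path = f"lake/clickstream/dt={date.today().isoformat()}/events.jsonl"
os.makedirs(os.path.dirname(lake_path), exist_ok=True)
with open(lake_path, "a") as f:
    f.write(json.dumps(event) + "\n")
```

The warehouse path favors query performance and data integrity; the lake path favors volume, flexible formats, and cheap archival of streaming data.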
Using a data lake and data warehouse for analytics
Another way to think about it is from the analytics viewpoint. Take the example of a retail store that wants to know more about its customers so it can provide personalized offers. To put together a customer profile, the company may use data like transaction history, purchase history, address, and name. These are all structured data sources that often live in the enterprise data warehouse (the System of Record) and might feed things like company dashboards. Other data, like website traffic, social media activity, geolocation data, and mobile app clickstream data, is unstructured or semi-structured and would likely live in the data lake (Systems of Engagement). Each set of data in a silo reveals only part of the story. For instance, it is great to know that people are talking favorably about you on social media, but knowing that John Smith specifically is talking about you favorably lets you act on it. To know this, you need to marry the siloed data together.
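To make that "marrying" step concrete, here is a minimal sketch. The tables, columns, and sample values are hypothetical, and pandas is used purely for illustration; in practice the structured side would be queried from the warehouse and the event side read from files landed in the lake, with the join done in SQL or a distributed engine.

```python
import pandas as pd

# Structured data from the warehouse (System of Record):
# purchase history keyed by a customer identifier.
purchases = pd.DataFrame([
    {"customer_id": 1, "name": "John Smith", "total_spend": 420.50, "orders": 7},
    {"customer_id": 2, "name": "Jane Doe",   "total_spend": 135.00, "orders": 2},
])

# Semi-structured data from the data lake (Systems of Engagement):
# clickstream and social events already resolved to the same identifier.
events = pd.DataFrame([
    {"customer_id": 1, "channel": "social",     "sentiment": "positive"},
    {"customer_id": 1, "channel": "mobile_app", "sentiment": None},
    {"customer_id": 2, "channel": "web",        "sentiment": "neutral"},
])

# Summarize the engagement data, then join it to the purchase data on the
# shared key to build a single customer profile.
engagement = (
    events.groupby("customer_id")
          .agg(touchpoints=("channel", "nunique"),
               positive_mentions=("sentiment", lambda s: (s == "positive").sum()))
          .reset_index()
)
profile = purchases.merge(engagement, on="customer_id", how="left")

print(profile)
# A high-spend customer with positive social mentions (here, John Smith)
# becomes a candidate for a personalized offer.
```

The key prerequisite is a shared identifier across both systems; without it, the lake and warehouse data remain separate stories about the same customer.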