Data federation and data virtualization are so similar that the terms are often used interchangeably. And in practice, you’re unlikely to run into trouble if you conflate them.
Even so, in the academic sense, virtualization and federation of data are not the same.
Because the terms are so often conflated, though, you’ll find many competing definitions of each. So before we dive into their differences, let’s define them clearly.
What is data virtualization?
Data virtualization is a technology that uses a logical data layer to integrate and transform data from various sources into the desired format, virtually and in near real time. Data virtualization technologies present integrated, transformed data to data consumers (both people and systems) by combining the following capabilities:
- Abstraction of technical data aspects like storage structure, access language, etc.
- Virtual data access, which makes various data sources accessible from one access point
- Data transformation to reformat, aggregate, and clean source data
Data virtualization solutions make the results of their transformations and integrations available for use on request by client applications.
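To make the idea concrete, here is a minimal sketch of a virtual data layer in Python. It uses an in-memory SQLite database to stand in for a source system; the `VirtualView` class, the `orders` table, and its schema are all hypothetical names invented for illustration, not part of any particular product.

```python
import sqlite3

# A stand-in source system with its own internal storage format.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, region TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 1250, "us"), (2, 900, "eu"), (3, 3100, "us")])

class VirtualView:
    """A logical layer: consumers request clean, transformed rows and
    never see the underlying schema, storage structure, or query dialect."""
    def __init__(self, conn):
        self._conn = conn

    def orders_by_region(self):
        # Transformation happens on request: cents become dollars,
        # region codes are normalized, and rows are aggregated.
        cur = self._conn.execute(
            "SELECT UPPER(region), SUM(amount_cents) / 100.0 "
            "FROM orders GROUP BY region ORDER BY region")
        return cur.fetchall()

view = VirtualView(source)
print(view.orders_by_region())  # [('EU', 9.0), ('US', 43.5)]
```

Note that only one data store is involved here: the layer abstracts and transforms, but nothing is federated. That distinction matters below.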
And what is data federation?
Outside of the data context, a federation is an umbrella organization made up of smaller member organizations. Those members are fully or partially autonomous, meaning they control some or all of their own operations. The United States and the European Union are good examples of federations.
Data federation is a similar concept. It’s a technology that “organizes” data from multiple autonomous data sources and makes it accessible under a uniform data model. The underlying data stores continue to operate autonomously, but data consumers can query the federated data on demand as though the stores were combined.
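The pattern can be sketched in a few lines of Python. Two in-memory SQLite databases stand in for independent regional systems; the `sales` schema and every function name here are illustrative assumptions, not a real federation product’s API. The federation layer leaves each store untouched and merely combines results on demand.

```python
import sqlite3

def make_store(rows):
    """Create an autonomous store; each one runs independently."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Two independent data stores, e.g. per-region systems.
na_store = make_store([("widget", 10), ("gadget", 4)])
eu_store = make_store([("widget", 7), ("gizmo", 2)])

def federated_units_by_product(stores):
    """Federation layer: presents the autonomous stores under one
    uniform model, merging their results at query time."""
    totals = {}
    for store in stores:
        for product, units in store.execute("SELECT product, units FROM sales"):
            totals[product] = totals.get(product, 0) + units
    return dict(sorted(totals.items()))

print(federated_units_by_product([na_store, eu_store]))
# {'gadget': 4, 'gizmo': 2, 'widget': 17}
```

To the consumer, the distribution across stores is invisible: the result looks like it came from a single combined database.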
Data virtualization and data federation
Data virtualization is a broad set of capabilities that includes data federation. Therefore, all federated data is also virtualized data. But since data virtualization includes capabilities beyond data federation, not all virtualized data is federated.
For example, data virtualization may abstract the technical details of a data source so you can query the data without requiring advanced technical knowledge. Virtualizing data in this way is not the same as federating it.
The other differences between data virtualization and data federation include:
- Data federation implies multiple data stores; data virtualization doesn’t.
- Federated data is always virtualized; virtualized data is not always federated.
- Data federation is a subset of data virtualization; data virtualization’s features include federation.
- Data virtualization includes abstracting the peculiarities of specific data sources; data federation presents multiple autonomous data stores as one large virtual data store.
- Data federation tools are often limited to integrating relational data stores. Data virtualization tools can connect data across any flavor of RDBMS, data appliances, NoSQL stores, web services, SaaS platforms, and enterprise applications.
As Rick van der Lans explains in Data Virtualization for Business Intelligence Systems:
If the data in one particular data store has to be virtualized for a data consumer, no need exists for data federation. But data federation always leads to data virtualization because if a set of data stores is presented as one, the aspect of distribution is hidden for the applications.
The act of “hiding the aspect of distribution” can be referred to broadly as data virtualization or, more specifically, as data federation. In the same way, the act of “slicing vegetables” can be accurately referred to (broadly) as cooking or (more specifically) as doing prep work.
Data federation, virtualization, and StreamSets
Data federation and virtualization help business users create one-off reports quickly without needing specialized knowledge.
But when it comes to facilitating the development of an enterprise-level data strategy, federation and virtualization tools fall short. Virtualized databases simply can’t match the flexibility, resiliency, and performance of fully integrated data stores connected via smart data pipelines. Data integration enables the reusability, recoverability, shareability, and performance optimizations that a scalable data operation requires.
Until now, data integration tools have required specialized coding knowledge. StreamSets changes that by enabling non-technical users, those who can’t code or don’t want to, to build smart data pipelines.