StreamSets

Prebuilt transformation processors

Key features
For both data practitioners and business users
UI-driven actions increase speed and consistency
Comprehensive suite of processors including SCDs and Pivots
Quicker time to insights with drag-and-drop capabilities
Each stage (processor) below is listed with its function, the benefit of using it and best practices, how it differentiates from hand-written SQL or competitors, and, where noted, its productivity impact (H/M/L).

AGGREGATE


Benefit of Using / Best Practice:

Allows users to calculate averages, sums, counts, or other statistical measures to provide quick insights.

Performs aggregate calculations within a data set
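
For reference, a minimal sketch of the equivalent hand-written Snowflake SQL, assuming a hypothetical orders table with region and amount columns:

    -- Aggregate order amounts by region: sum, average, and row count
    SELECT region,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY region;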

APPLY FUNCTIONS


Benefit of Using / Best Practice:

Allows users to apply a function based on a dynamically (run-time) selected set of columns; as new columns matching the criteria are added to the source tables, the function will be applied to them. This feature offers flexibility and ease of use, empowering users to streamline data processing without being limited to regular expressions.

Differentiates Us? (compared to SQL or competitor)
StreamSets can apply functions to a set of columns chosen by a regular expression. For example, you could apply a user-defined function to columns with names ending in "_DATE". New columns would automatically get the function applied.

When it comes to dates, you'll never again get your interval wrong. How many times have you sat in code wondering whether 'month' takes a capital 'M', or whether that actually means minute? StreamSets reduces the need to search Stack Overflow or dig through documentation.

Productivity Impact (H/M/L):
H

Applies a function to selected columns, enabling users to define multiple columns at once or conveniently select columns within the tool.
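
For contrast, a sketch of the hand-coded Snowflake SQL this replaces, assuming hypothetical *_DATE string columns in an orders table; every new date column means editing the query, which is exactly the maintenance the processor removes:

    -- Apply TO_DATE to each *_DATE column, one by one
    SELECT order_id,
           TO_DATE(created_date,  'YYYY-MM-DD') AS created_date,
           TO_DATE(shipped_date,  'YYYY-MM-DD') AS shipped_date,
           TO_DATE(returned_date, 'YYYY-MM-DD') AS returned_date
           -- a new *_DATE column in the source requires another line here
    FROM orders;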

COLUMN ORDER


Benefit of Using / Best Practice:

Allows users to rearrange the order of columns in a dataset to improve data organization and readability.

Reorders columns to a specified order

COLUMN REMOVER


Benefit of Using / Best Practice:

Allows users to remove columns from a dataset to improve data organization and readability.

Differentiates Us? (compared to SQL or competitor)
In SQL, selecting columns to keep is easy, but selecting columns to remove forces you to spell out exactly the columns you need. That makes pipelines brittle: when a new column is added, manual intervention is required to bring it into the dataset. Pattern-based inclusion/exclusion is not practical in traditional SQL (it can be done, but it requires a lot of additional code and quickly becomes a maintenance nightmare!)

Productivity Impact (H/M/L):
L

Removes columns from a data set
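
A sketch of the brittle hand-written alternative described above, assuming a hypothetical customers table; to drop one column you must re-list everything you keep, so new source columns never flow through until the query is edited:

    -- Goal: drop only the ssn column
    SELECT customer_id,
           first_name,
           last_name,
           email
           -- any new source column is silently left behind
    FROM customers;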

COLUMN RENAME


Benefit of Using / Best Practice:

Allows users to easily rename columns.

Differentiates Us? (compared to SQL or competitor)
To rename a column in SQL, you have to list each individual column in your script and apply the change manually. You cannot do this based on a regular expression, and traditional approaches are brittle and slow to adapt to changing columns in the source tables (since you have to bring them in by hand).

Productivity Impact (H/M/L):
M

Renames columns in a data set
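
For comparison, a minimal sketch of the hand-written Snowflake SQL, assuming hypothetical column names; every rename must be enumerated explicitly:

    -- Rename columns one at a time with aliases
    SELECT cust_nm AS customer_name,
           ord_dt  AS order_date,
           amt     AS amount
    FROM orders;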

COLUMN TYPE CONVERTER


Benefit of Using / Best Practice:

Enables users to seamlessly convert data from one type to another, such as converting float to integer, enhancing data quality and improving downstream processing.

Differentiates Us? (compared to SQL or competitor)
StreamSets can convert a set of columns chosen by a regular expression. For example, you could convert string columns with names ending in "_ID" to integers. New columns matching the pattern would automatically get the conversion applied.

Converts data types for selected columns; a regular expression can be used to specify multiple columns at once.
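
For comparison, a sketch of the equivalent hand-written Snowflake SQL, assuming a hypothetical raw_orders table with string ID columns; each conversion has to be written out per column:

    -- Convert string IDs to integers column by column
    SELECT TRY_CAST(customer_id AS INTEGER) AS customer_id,
           TRY_CAST(product_id  AS INTEGER) AS product_id,
           order_total
    FROM raw_orders;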

CUBE


Benefit of Using / Best Practice:

Simplifies complex data aggregations, providing a powerful analytical capability for multidimensional analysis.

Performs a GROUP BY CUBE and aggregates data in specified columns
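
For reference, a minimal sketch of the equivalent Snowflake SQL, assuming a hypothetical sales table with region, product, and amount columns:

    -- Totals for every combination of region and product, plus subtotals and a grand total
    SELECT region,
           product,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY CUBE (region, product);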

DUPLICATES


Benefit of Using / Best Practice:

Enables users to efficiently remove duplicate records from datasets, ensuring data quality and reducing the storage footprint by eliminating redundant information.

Removes duplicate rows from a data set
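
For comparison, a sketch of the hand-written Snowflake SQL, assuming a hypothetical customers table with an updated_at column; exact duplicates can be handled with SELECT DISTINCT, while keeping one row per business key takes a window function:

    -- Keep only the most recent row per customer_id
    SELECT *
    FROM customers
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY updated_at DESC
    ) = 1;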

FILTER


Benefit of Using / Best Practice:

Allows users to selectively extract and manipulate data based on specific conditions or criteria.

Passes only rows that match all defined conditions

JOIN


Benefit of Using / Best Practice:

Simplifies the process of merging datasets based on common attributes, which enables non-technical users to efficiently integrate and analyze data. Saves time while reducing risk of error.

Differentiates Us? (compared to SQL or competitor)
The concept of joining is common to SQL and Transformer for Snowflake. The benefit lies in how quickly you can understand the join. In traditional SQL, bad naming practices creep in (how many derived tables are called A or B?), and understanding the impact of a join requires reading the WHERE clause alongside it. StreamSets simplifies the view, showing clearly which records go into the join and what the join means, which helps with debugging and audit compliance in regulated markets.

Joins data from two tables
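
For reference, a minimal sketch of the equivalent hand-written Snowflake SQL, assuming hypothetical orders and customers tables; note how readability depends entirely on the author choosing meaningful aliases:

    -- Join orders to customers on the shared key
    SELECT o.order_id,
           o.amount,
           c.customer_name
    FROM orders AS o
    INNER JOIN customers AS c
        ON o.customer_id = c.customer_id;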

JSON Parser


Benefit of Using / Best Practice:

Parsing JSON in SQL often requires writing complex queries. Using a dedicated JSON Parser function eliminates the need for manual JSON manipulation while reducing the complexity of SQL queries.

Differentiates Us? (compared to SQL or competitor)
With StreamSets, you don't need to know the structure of the JSON data in order to parse it.

Parses JSON data and extracts specific elements
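
For comparison, a sketch of the hand-written Snowflake SQL this replaces, assuming a hypothetical events table with a VARIANT column named payload; you need to know the JSON structure in advance to write the paths:

    -- Pull named fields out of the JSON and explode the items array
    SELECT payload:customer.name::STRING AS customer_name,
           item.value:sku::STRING        AS sku,
           item.value:qty::NUMBER        AS quantity
    FROM events,
         LATERAL FLATTEN(input => payload:items) item;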

NULL PROCESSOR


Benefit of Using / Best Practice:

Without this processor, users often have to make manual, iterative updates (it takes time, and it is low-value work that becomes highly visible when it is not done). It is particularly useful in conjunction with the Join processor, because joining on nulls is problematic both technically and for the business.

Null values happen for many reasons; this processor helps analysts automatically clean up the data (eliminate the null values) so it is ready for use in a pipeline.
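
For comparison, a sketch of the manual Snowflake SQL clean-up this automates, assuming a hypothetical orders table; each column's null replacement is coded by hand:

    -- Replace nulls before joining so keys and measures behave predictably
    SELECT order_id,
           COALESCE(customer_id, -1)    AS customer_id,
           COALESCE(discount, 0)        AS discount,
           COALESCE(channel, 'UNKNOWN') AS channel
    FROM orders;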

PIVOT


Benefit of Using / Best Practice:

Reshapes the data to provide a summarized and structured view of the dataset, making it easier to analyze and present information. Significantly reduces time required by manual data manipulation.

Differentiates Us? (compared to SQL or competitor)
Have you tried pivoting in SQL? One way to do it is to put all the results in a derived table, then write a set of sums and a group by. And if a new item to sum arrives, it simply does not appear in the dataset.

With Transformer for Snowflake, you select the field you want as a column, what you want to group by, and which math function to apply, and the result reacts to change.

Rotates a column by turning unique values into new columns and aggregates the results
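
For contrast, a sketch of the conditional-aggregation approach described above, assuming a hypothetical sales table; every pivoted value is hard-coded, so a new region never appears until the query is edited:

    -- Pivot regions into columns by hand
    SELECT product,
           SUM(IFF(region = 'EMEA', amount, 0)) AS emea_amount,
           SUM(IFF(region = 'APAC', amount, 0)) AS apac_amount,
           SUM(IFF(region = 'AMER', amount, 0)) AS amer_amount
    FROM sales
    GROUP BY product;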

ROLLUP


Benefit of Using / Best Practice:

Simplifies the process of producing subtotal results for various levels of grouping. It enables users to quickly obtain summaries and subtotals for multiple dimensions within their data, making it easier to identify patterns and trends across different levels of granularity.

Performs a GROUP BY ROLLUP and aggregates data in specified columns
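
For reference, a minimal sketch of the equivalent Snowflake SQL, assuming a hypothetical sales table with order_year, order_quarter, and amount columns:

    -- Subtotals by year, by year and quarter, plus a grand total row
    SELECT order_year,
           order_quarter,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY ROLLUP (order_year, order_quarter);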

SAMPLE


Benefit of Using / Best Practice:

Allows users to generate a representative sample of incoming data for analysis or testing purposes.

Generates a sample of incoming data
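
For reference, a minimal sketch of the equivalent Snowflake SQL, assuming a hypothetical orders table; sampling can be expressed as a percentage of rows or a fixed row count:

    -- Roughly 10 percent of rows
    SELECT * FROM orders SAMPLE (10);

    -- A fixed number of rows
    SELECT * FROM orders SAMPLE (1000 ROWS);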

SLOWLY CHANGING DIMENSION


Benefit of Using / Best Practice:

Simplifies the management of changing data over time. This allows users to track and analyze historical changes in data attributes without requiring extensive coding. Reduces time required to update changing data by manual coding.

Differentiates Us? (compared to SQL or competitor)
With StreamSets, history tables are made easy; by hand, the logic is really complex to write.

Generates updates for a slowly changing dimension based on master and change data
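
To illustrate why this logic is complex by hand, a deliberately simplified Type 2 sketch in Snowflake SQL, assuming hypothetical dim_customer and stg_customer tables that track a single address attribute; a real implementation also handles surrogate keys, deletes, and late-arriving data:

    -- Step 1: close out current versions whose tracked attribute changed
    UPDATE dim_customer
    SET    end_date   = CURRENT_DATE,
           is_current = FALSE
    FROM   stg_customer s
    WHERE  dim_customer.customer_id = s.customer_id
      AND  dim_customer.is_current  = TRUE
      AND  dim_customer.address    <> s.address;

    -- Step 2: insert new and changed records as the current versions
    INSERT INTO dim_customer (customer_id, address, start_date, end_date, is_current)
    SELECT s.customer_id, s.address, CURRENT_DATE, NULL, TRUE
    FROM   stg_customer s
    LEFT JOIN dim_customer d
           ON d.customer_id = s.customer_id
          AND d.is_current  = TRUE
    WHERE  d.customer_id IS NULL;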

SNOWFLAKE SQL EVALUATOR


Benefit of Using / Best Practice:

Enables users to leverage the full power and flexibility of Snowflake's SQL language for data manipulation. The SQL Evaluator function simplifies the process of incorporating complex SQL logic into the transformation pipeline without the need for separate SQL scripts or manual coding.

Applies Snowflake SQL expressions to a data set

SORT


Benefit of Using / Best Practice:

Enables efficient downstream processes such as joins or aggregations that require ordered data. Using SQL would require complex queries, which can take up a lot of time.

Sorts data based on specified columns

STREAM SELECTOR


Benefit of Using / Best Practice:

Isolating particular records to process is very messy in traditional SQL, especially when you need to meet the requirements of auditing and collaborative working. StreamSets reduces the time it takes to inspect and evaluate what a dataflow is doing.

Passes rows downstream based on conditions
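
For contrast, a sketch of what hand-coded routing tends to look like in Snowflake SQL, assuming a hypothetical orders table and an amount-based split; each branch becomes a separate object to write, name, and maintain:

    -- One filtered object per branch of the flow
    CREATE OR REPLACE VIEW orders_high_value AS
        SELECT * FROM orders WHERE amount >= 10000;

    CREATE OR REPLACE VIEW orders_standard AS
        SELECT * FROM orders WHERE amount < 10000 OR amount IS NULL;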

UNION


Benefit of Using / Best Practice:

Allows users to combine multiple datasets. As a best practice, ensure the datasets have compatible column types and order. Saves time while reducing the risk of error.

Differentiates Us? (compared to SQL or competitor)
Transformer for Snowflake streamlines data integration by automatically aligning columns, regardless of their inbound order. It eliminates time-consuming tasks like schema matching and null-value backfilling, and it offers control over column selection and duplicates, improving data management efficiency. You gain valuable time for more critical tasks.

Merges data from multiple inputs into a single output
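
For contrast, a sketch of the manual alignment this removes, assuming hypothetical orders_online and orders_retail tables; column order must match exactly and missing columns have to be backfilled with nulls by hand:

    -- Hand-aligned union of two sources
    SELECT order_id, customer_id, amount, NULL AS promo_code
    FROM   orders_online
    UNION ALL
    SELECT order_id, customer_id, amount, promo_code
    FROM   orders_retail;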

UNPIVOT


Benefit of Using / Best Practice:

Makes it easier for users to work with multi-dimensional datasets. Using SQL to accomplish this requires complex queries and manual transformations, which can take up a significant amount of time.

Rotates a table by transforming columns into rows
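
For reference, a minimal sketch of the equivalent Snowflake SQL, assuming a hypothetical monthly_sales table with one column per month; each column to rotate must be listed explicitly:

    -- Turn monthly columns into (month, amount) rows
    SELECT product, month_name, amount
    FROM monthly_sales
        UNPIVOT (amount FOR month_name IN (jan, feb, mar));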

WINDOW FUNCTIONS


Benefit of Using / Best Practice:

Window functions allow for advanced calculations and analysis, such as ranking, running totals, and moving averages, by operating on a specific subset of data based on defined window specifications.

Applies window functions
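
For reference, a minimal sketch of the equivalent hand-written Snowflake SQL, assuming a hypothetical orders table; the window specification (partition, ordering, frame) is what the processor lets you define visually:

    -- Sequence orders per customer and compute a running total
    SELECT customer_id,
           order_id,
           amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_seq,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM orders;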