Azure Data Flow - Cloud-based ETL
Data Flow is an Azure service that integrates with Data Factory to provide GUI-based components for developing complex transformation logic.
Data Flows, in turn, are executed as activities within Azure Data Factory pipelines. Their purpose is to transform massive amounts of data with zero coding. Behind the scenes, a Data Flow executes on a Spark cluster for scaled-out data processing.
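Because a Data Flow is just another activity, a pipeline that wraps one can be triggered and monitored programmatically. Below is a minimal sketch using the azure-mgmt-datafactory Python SDK; the resource group, factory, and pipeline names are hypothetical placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Authenticate and create a management client for the target subscription.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Trigger a pipeline that contains an Execute Data Flow activity.
run = adf_client.pipelines.create_run(
    resource_group_name="rg-demo",    # hypothetical resource group
    factory_name="adf-demo",          # hypothetical data factory
    pipeline_name="pl_dataflow_etl",  # hypothetical pipeline wrapping a Data Flow
)

# Check the run status (Queued / InProgress / Succeeded / Failed).
status = adf_client.pipeline_runs.get("rg-demo", "adf-demo", run.run_id).status
print(run.run_id, status)
```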
For one of our clients, the requirement was to provide a visual, step-by-step approach to doing ETL on their data. A stored procedure would not have achieved this, but Data Flow, integrated with Azure Data Factory, helped us create such an ETL pipeline with fast data processing.
A few of the transformations available in Data Flow are listed below (a PySpark sketch of several equivalents follows the list):
Joins – join data from two streams
Conditional Splits – split the data based on a particular condition
Union – works like SQL UNION
Lookup – look up values from another stream
Derived Columns – create new columns using formulas
Aggregates – works like SQL aggregate functions
Surrogate Keys – generate new surrogate key values
Exists – works like the SQL EXISTS clause
Select – remove or rearrange columns in the flow
Filter – works like a SQL WHERE clause
Sort – works like a SQL ORDER BY clause
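Since Data Flows execute on Spark behind the scenes, the intent of several of these transformations can be sketched with rough PySpark equivalents. This is illustrative only (the DataFrames and column names are made up), not the Data Flow authoring syntax itself:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dataflow-equivalents").getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 100.0), (2, "B", 250.0), (3, "A", 75.0)],
    ["order_id", "customer", "amount"],
)
customers = spark.createDataFrame(
    [("A", "Alice"), ("B", "Bob")],
    ["customer", "name"],
)

# Join - combine data from two streams
joined = orders.join(customers, on="customer", how="inner")

# Conditional Split - route rows down different branches by condition
large = joined.filter(F.col("amount") >= 100)
small = joined.filter(F.col("amount") < 100)

# Union - stack two streams with the same schema
all_orders = large.unionByName(small)

# Derived Column - create a new column from a formula
with_tax = all_orders.withColumn("amount_with_tax", F.col("amount") * 1.18)

# Aggregate - SQL-style aggregate functions with GROUP BY
totals = with_tax.groupBy("customer").agg(F.sum("amount").alias("total_amount"))

# Surrogate Key - generate an incrementing key column
keyed = with_tax.withColumn("sk", F.row_number().over(Window.orderBy("order_id")))

# Exists - keep orders whose customer exists in the other stream (SQL EXISTS)
existing = orders.join(customers, on="customer", how="left_semi")

# Select / Filter / Sort - SQL-style projection, WHERE, and ORDER BY
result = (
    keyed.select("sk", "name", "amount_with_tax")
    .filter(F.col("amount_with_tax") > 50)
    .orderBy(F.desc("amount_with_tax")))
result.show()
```

In an actual Data Flow, each of these steps would be a visual transformation box configured on the ADF canvas rather than hand-written code.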
Data Flow also has an extensive library of expression functions (derived from Spark functions) that can be used within these transformations. Thus, Data Flow provides a way of implementing good visual ETL pipelines in Azure Data Factory.
Azure Data Flow supports a wide range of connectors (or linked services), similar to Azure Data Factory.
More info: https://docs.microsoft.com/en-us/azure/data-factory/connector-overview
A variety of file formats are supported, such as Avro, Binary, CSV, Excel, XML, JSON, ORC, and Parquet.
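At the underlying Spark layer, these formats correspond to ordinary DataFrame readers and writers. A small illustrative sketch of reading CSV and writing Parquet; the storage paths are hypothetical and assume access to the account is already configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

# Read a CSV file with a header row, inferring column types.
df = spark.read.csv(
    "abfss://raw@mystorage.dfs.core.windows.net/orders.csv",
    header=True,
    inferSchema=True,
)

# Write the same data out as Parquet, a columnar format.
df.write.mode("overwrite").parquet(
    "abfss://curated@mystorage.dfs.core.windows.net/orders/"
)
```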
It has options to execute on cloud or local integration runtimes (essentially Spark clusters), with cluster sizes (core counts) configurable as required.