Azure Data Flow - Cloud-Based ETL

Data Flow is an Azure service that integrates with Data Factory to provide GUI-based components for developing complex transformation logic.

Data Flows, in turn, are executed as activities within Azure Data Factory pipelines. Their purpose is to transform massive amounts of data with zero coding. Behind the scenes, each Data Flow executes on a Spark cluster for scaled-out data processing.
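
As a rough illustration, the JSON definition of a pipeline that runs a Data Flow through an Execute Data Flow activity looks roughly like the sketch below, expressed here as a Python dict. The pipeline, activity, and Data Flow names, along with the core count, are placeholder assumptions, not values from a real project:

```python
# Sketch of an Azure Data Factory pipeline that invokes a Data Flow.
# Mirrors the ADF JSON authoring format; all names are hypothetical.
pipeline_definition = {
    "name": "PL_TransformSales",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "RunSalesDataFlow",  # hypothetical activity name
                "type": "ExecuteDataFlow",   # activity type that runs a Data Flow
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "TransformSales",  # hypothetical Data Flow name
                        "type": "DataFlowReference",
                    },
                    # Spark compute used behind the scenes (assumed values)
                    "compute": {"computeType": "General", "coreCount": 8},
                },
            }
        ]
    },
}
```

Deploying a definition like this (for example via the Azure portal, ARM templates, or the REST API) gives the pipeline a single step that spins up the underlying Spark compute and runs the visual transformation logic.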

For one of our clients, the requirement was a visual, step-by-step approach to running ETL on their data. A stored procedure would not have achieved this, but Data Flow, integrated with Azure Data Factory, let us build such an ETL pipeline while still processing the data quickly.

A few of the Data Flow activities available are listed below (a rough PySpark sketch of some of them follows the list):

Joins – join data from two streams

Conditional Splits – split the data based on a particular condition

Union – functions similar to SQL UNION

Lookup – look up values from other streams

Derived Columns – create new columns using formulas

Aggregates – functions similar to SQL aggregate functions

Surrogate Keys – generate new surrogate keys

Exists – functions similar to the SQL EXISTS clause

Select – remove / rearrange columns in the flow

Filter – functions similar to the SQL WHERE clause

Sort – functions similar to the SQL ORDER BY clause
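
Since Data Flows run on Spark, the intent of several of these activities can be approximated with PySpark equivalents. The sketch below is illustrative only (the file, column, and table names are made up), not the Data Flow implementation itself:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

# Hypothetical source streams
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Filter: similar to a SQL WHERE clause
recent = orders.filter(F.col("order_date") >= "2021-01-01")

# Derived Column: create a new column using a formula
recent = recent.withColumn("total", F.col("quantity") * F.col("unit_price"))

# Join: combine two streams on a key
joined = recent.join(customers, on="customer_id", how="inner")

# Aggregate: similar to SQL aggregate functions with GROUP BY
summary = joined.groupBy("country").agg(F.sum("total").alias("revenue"))

# Sort: similar to a SQL ORDER BY clause
summary.orderBy(F.col("revenue").desc()).show()
```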

It also has an extensive library of functions (derived from Spark functions) that can be used in expressions. Thus, Data Flow provides a way of implementing visual ETL pipelines in Azure Data Factory.

Azure Data Flow has a wide range of connectors (or linked services), similar to Azure Data Factory.

More info: https://docs.microsoft.com/en-us/azure/data-factory/connector-overview
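
To make the idea concrete, a linked service (connector) definition for Azure Blob Storage might look like the following sketch, again written as a Python dict mirroring the ADF JSON. The name and connection string are placeholders:

```python
# Hypothetical linked service definition for Azure Blob Storage.
blob_linked_service = {
    "name": "LS_BlobStorage",  # placeholder name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder; in practice this is often a Key Vault reference
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```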

A variety of file formats are supported, such as Avro, Binary, CSV, Excel, XML, JSON, ORC, and Parquet.
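
The file format is chosen on the dataset that sits on top of a linked service. A hedged sketch of a Parquet dataset pointing at the hypothetical linked service above (all names and paths are placeholders):

```python
# Hypothetical dataset that reads Parquet files through the linked service above.
parquet_dataset = {
    "name": "DS_SalesParquet",  # placeholder name
    "properties": {
        "type": "Parquet",  # file format of the dataset
        "linkedServiceName": {
            "referenceName": "LS_BlobStorage",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "sales",         # placeholder container
                "fileName": "orders.parquet", # placeholder file
            }
        },
    },
}
```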

There are options to choose the integration runtime environment on which a Data Flow executes (essentially a Spark cluster) and its size (number of cores) as required.
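
For example, the Spark compute behind a Data Flow is configured on the integration runtime. A sketch of such a definition follows; the name, core count, and time-to-live are assumed values for illustration:

```python
# Hypothetical Azure integration runtime with Data Flow compute settings.
azure_ir = {
    "name": "IR_DataFlowMedium",  # placeholder name
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",  # compute family
                    "coreCount": 16,           # Spark cluster size (assumed)
                    "timeToLive": 10,          # minutes to keep the cluster warm (assumed)
                },
            }
        },
    },
}
```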
