Building data pipeline on Apache Nifi

Dhruv Saksena
3 min readAug 12, 2021

Apache Nifi is great tool to build data pipeline at scale. It manages data in form on flow-files which go from one processor to another processor.

Here, in this example we are looking into processing pending transactions manually in small batches.

Ex: Let’ say we have 10,000 pending transactions to be executed into a system and you want to break this in batch-size of 100 each and process them in a pipeline. Here’s where Apache Nifi comes to rescue-

Let’s assume our api response is-

{
"data":
{
"body":[
{
"txnId":22,
"value": 700
}, {
"txnId":23,
"value": 5676
}
]

}
}

In this scenario, we will pull the pending transactions from a source system’s GET API and transform that into a batch size of each and send it sequentially on sink which will process it-

Components

Following components will be used in setting up the pipeline-

  1. InvokeHTTP(Periodically fetch unprocessed transactions from transaction service)
  2. SplitJSON(Splitting the bulk transactions into single transaction)
  3. MergeContent(Create batches of transaction of configurable size)

Data pipeline

Component Details

Invoke Http

This component will invoke the source GET API to fetch the pending transactions in bulk at a regular interval of 1hr. In “Properties” tab we will add the remote API url(in this case for security reasons Im adding a dummy url)

Split Json

In this we will split the incoming array using json-path “$.data.body”. So, every element in the body array will become a separate flow file-

MergeContent

Here, we will merge the content in batch size of 100 and also set delimiter as text having header as “[“ and footer as “]” and separator as “,” so that it becomes a json array-

InvokeHttp

This will be the final component to Invoke the remote service which will process these transactions in batches.

Once you play all the components the data pipeline will start the process of first splitting and later merging the transactions and data will be transferred from source to sink.(without any coding effort)

--

--

No responses yet