RDD and DAG in Spark

Dhruv Saksena
2 min read · Aug 29, 2021

A major drawback of MapReduce jobs is the heavy disk I/O they perform between stages. This led to the development of Spark, whose jobs keep intermediate data in RAM instead of repeatedly reading from and writing to disk.

Two very important concepts to study in Spark are -

  1. RDD (Resilient Distributed Dataset)
  2. DAG (Directed Acyclic Graph)

RDD (Resilient Distributed Dataset)

RDD is a distributed in-memory abstraction that lets programmers perform in-memory computations on a Hadoop cluster. It is an immutable dataset holding a collection of objects -

RDD for Fruits and Cars

The above two are the RDDs for Fruits and Cars. An RDD is partitioned across the cluster onto multiple nodes. Let's say our cluster has 3 nodes and we have the above two RDDs.

Partition distribution across nodes
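As a minimal sketch in Scala (the fruit and car values are illustrative, and a local[3] master stands in for the 3-node cluster), you can control and inspect how an RDD is partitioned:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local master with 3 threads, standing in for the 3-node cluster above
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[3]")
val sc = new SparkContext(conf)

// Two small RDDs, each split into 3 partitions
val fruits = sc.parallelize(
  Seq("Apple", "Banana", "Mango", "Orange", "Grapes", "Kiwi"), numSlices = 3)
val cars = sc.parallelize(
  Seq("Audi", "BMW", "Honda", "Jaguar", "Toyota"), numSlices = 3)

println(fruits.getNumPartitions) // 3

// glom() turns each partition into an array, so we can see
// which elements landed in which partition
fruits.glom().collect().foreach(p => println(p.mkString(", ")))
```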

The following operations can be performed on an RDD -

  1. Transformation
  2. Action

In very simple terms, if an operation results in another RDD, then it's called a transformation. Ex: filter the fruits whose names start with 'A', which gives "Apple". So, "Apple" will be in another RDD.

If the result of an operation is not an RDD, then it's an action. Ex: count the number of cars in the initial RDD, which comes out to be 5.
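Continuing the illustrative sketch above, filter is a transformation (it returns another RDD), while count is an action (it returns a plain number and actually triggers execution):

```scala
// Transformation: returns a new RDD; nothing executes yet
val aFruits = fruits.filter(_.startsWith("A"))

// Action: returns a Long rather than an RDD, so Spark runs the job
val numCars = cars.count() // 5

println(aFruits.collect().mkString(", ")) // Apple
println(numCars)
```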

DAG (Directed Acyclic Graph)

In Spark, all the transformations are recorded in a graph and are executed only when an action is called. Whenever you execute an action on an RDD, an execution plan in the form of a DAG is sent to the cluster; the DAGScheduler then splits the job into stages and distributes the tasks to different nodes in the cluster. It keeps track of the RDDs and finds a minimal schedule to run the job.
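One way to see this laziness, continuing the same sketch: transformations merely extend the recorded lineage, which toDebugString prints, and nothing runs until an action like count is called:

```scala
// Transformations only extend the lineage graph; no job runs here
val lineage = fruits
  .map(_.toUpperCase)
  .filter(_.startsWith("A"))

// Print the recorded lineage (the RDD side of the DAG)
println(lineage.toDebugString)

// Only now does the DAGScheduler turn the graph into stages and tasks
println(lineage.count()) // 1 ("APPLE")
```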
