Spark Components
In this section, we are going to familiarise ourselves with the various components of Spark:
Driver
This is the program which orchestrates the entire end-to-end execution of a Spark job. It negotiates resources with the cluster manager and oversees the parallel execution of Spark jobs in the cluster. It declares all the transformations and actions on RDDs and then submits the serialized DAG to the master. This is the program which creates the SparkContext, and it coordinates with the worker nodes to oversee the execution of tasks.
Whenever we write our own program, irrespective of the language (Java, Scala, or Python), we create our own Driver program.
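To make this concrete, here is a minimal sketch of a Driver program in Scala. The object name, input path and master URL are illustrative placeholders, not part of any real application.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // The Driver creates the SparkContext: the entry point to the cluster
    val conf = new SparkConf().setAppName("WordCountDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations: declared lazily, only recorded in the DAG
    val words  = sc.textFile("input.txt").flatMap(_.split("\\s+"))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Action: triggers actual execution on the Executors
    counts.collect().foreach(println)

    sc.stop()
  }
}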
Executors
These are the worker processes which execute the tasks assigned by the Driver. A Spark application usually runs multiple Executors, each executing its share of the tasks in parallel.
So, whenever we execute any program in the Spark shell, that shell acts as the Driver program, which delegates tasks to the Executors in the Spark cluster. (refer: shell-programs)
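For instance, a short session like the one below, run in spark-shell (which pre-creates a SparkContext bound to the variable sc), shows the shell playing the role of the Driver; the numbers are arbitrary:

scala> val nums = sc.parallelize(1 to 100)
scala> nums.map(_ * 2).reduce(_ + _)   // the shell (Driver) ships these tasks to the Executors
res0: Int = 10100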
Master
Spark follows a master-slave architecture. The Master is the node where the Driver program runs, and the slaves (workers) are the nodes where the Executors run.
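When submitting an application, we tell Spark where the Master is via the master URL. A typical submission against a standalone cluster might look like this (the host name and jar name are placeholders):

spark-submit \
  --class WordCountDriver \
  --master spark://master-host:7077 \
  word-count.jar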
Directed Acyclic Graph
Simply speaking, it is the lineage of all RDDs. A Directed Acyclic Graph (DAG) is a graph structure in which every vertex denotes an RDD and every edge denotes a transformation performed on those RDDs.
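Spark lets us inspect this lineage directly: every RDD has a toDebugString method that prints the chain of parent RDDs it was derived from. Continuing the word-count sketch above:

scala> println(counts.toDebugString)

The output lists each RDD in the lineage (from textFile through flatMap, map and reduceByKey), with indentation marking where shuffles occur.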
Spark Context
Whenever we submit a job to a Spark cluster, the SparkContext creates a graph of all the transformations and submits this graph to the DAGScheduler. The SparkContext then coordinates with the workers, whose Executors execute the resulting tasks.
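This two-phase flow is easy to observe: transformations merely extend the graph, and only an action makes the SparkContext submit a job. A small illustration (the values are arbitrary):

scala> val doubled = sc.parallelize(1 to 10).map(_ * 2)   // nothing executes yet
scala> doubled.count()                                    // action: SparkContext submits the job
res1: Long = 10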
DAGScheduler
Once a DAG has been submitted to the scheduler, it splits the job into stages of tasks at shuffle boundaries and then hands each stage to the TaskScheduler, which launches the tasks on the different Executors.
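As a rough sketch of how stages come about, consider the pipeline below (the file paths are placeholders). The narrow map transformation is pipelined into the same stage as the read, while reduceByKey requires a shuffle and therefore starts a new stage:

val pairs  = sc.textFile("logs.txt")              // Stage 1 begins with the read
               .map(line => (line.take(4), 1))    // narrow: pipelined into Stage 1
val counts = pairs.reduceByKey(_ + _)             // shuffle boundary: Stage 2
counts.saveAsTextFile("out")                      // action: triggers both stages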