Quick look at Spark UI
Spark ships with a web UI that lets you monitor the status of your application, its resource consumption and its configuration. It is part of the standard Spark package.
By default it runs on port 4040: http://localhost:4040/jobs/
The UI only shows data once you execute an action. If we are just creating and transforming RDDs, nothing will appear, because transformations are evaluated lazily until an action triggers a job.
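For example (a small sketch, assuming a running spark-shell session), the first two lines below are pure setup and transformation and show nothing in the UI; only the final count() action registers a job under the Jobs tab-
scala> val nums = sc.parallelize(1 to 100)       // creates the RDD; nothing in the UI yet
scala> val evens = nums.filter(_ % 2 == 0)       // transformation; still nothing in the UI
scala> evens.count()                             // action; a job now appears under the Jobs tab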
Any piece of work we submit to Spark is termed a “job”. Each job is broken down into stages, and each stage is further broken down into tasks that run in parallel on the executors.
To begin with, let’s execute a very simple job from the Spark shell-
scala> val sampleRDD = sc.parallelize(1 to 10000)
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> sampleRDD.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
Here we are simply collecting the data. As soon as you call sampleRDD.collect(), the Spark cluster comes into action and processes the data-
Here, 8 is the number of cores on the machine, which determines how many tasks run in parallel to complete the job.
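You can confirm this from the shell itself; on an 8-core machine the defaults would look roughly like this (exact values depend on your hardware and configuration)-
scala> sc.defaultParallelism          // 8 on this machine; one task slot per core by default
scala> sampleRDD.getNumPartitions     // also 8, so collect() ran as 8 parallel tasks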
To check the stages of the job, just click on the description above-
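A job with a shuffle makes the stage view more interesting. As a rough illustration (the file path below is just a placeholder), a word-count style run like this would typically show up as a single job split into two stages, because reduceByKey introduces a shuffle boundary-
scala> val words = sc.textFile("/tmp/sample.txt").flatMap(_.split(" "))
scala> words.map(w => (w, 1)).reduceByKey(_ + _).collect()   // one job, two stages in the UI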
Now, let’s take one more example, where we load a file into Spark memory-
scala> val fileRDD = sc.textFile("/Users/dhruv/Documents/Personal/Learning/Learning/Spark/abc.rtf")
fileRDD: org.apache.spark.rdd.RDD[String] = /Users/dhruv/Documents/Personal/Learning/Learning/Spark/abc.rtf MapPartitionsRDD[2] at textFile at <console>:23

scala> fileRDD.cache
res2: fileRDD.type = /Users/dhruv/Documents/Personal/Learning/Learning/Spark/abc.rtf MapPartitionsRDD[2] at textFile at <console>:23

scala> fileRDD.collect
res3: Array[String] = Array({\rtf1\ansi\ansicpg1252\cocoartf2578, \cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fmodern\fcharset0 Courier;}, {\colortbl;\red255\green255\blue255;\red0\green0\blue0;}, ..., Quod equidem non reprehendo;\, Lorem ipsum dolor sit amet, consectetur adipiscing elit. ...
Now, if we go to the Storage tab in the Spark UI, we can see how this file is held in Spark memory, along with its partitions-
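To illustrate what the Storage tab tracks, here is a small, hypothetical sketch (the file path is a placeholder) that persists an RDD with an explicit storage level and then removes it again; the entry appears on the Storage tab once an action materialises it and disappears after unpersist-
scala> import org.apache.spark.storage.StorageLevel
scala> val diskBackedRDD = sc.textFile("/tmp/other.txt").persist(StorageLevel.MEMORY_AND_DISK)
scala> diskBackedRDD.count()       // action materialises the cache; an entry shows up on the Storage tab
scala> diskBackedRDD.unpersist()   // removes the entry from the Storage tab again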
To see the environment details for Spark, just click on the Environment tab-
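The same configuration is also accessible from the shell; for example, sc.getConf exposes the properties the Environment tab lists (the values in the comments are just what you might see on a local setup)-
scala> sc.getConf.get("spark.master")              // e.g. local[*]
scala> sc.getConf.getAll.take(5).foreach(println)  // a sample of the properties the Environment tab shows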
The Executors tab provides information about memory, cores and other resources used by each executor. For debugging purposes, you can also download a thread dump from here.
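A rough programmatic counterpart, if you prefer the shell, is sc.getExecutorMemoryStatus, which maps each executor to its maximum and remaining storage memory-
scala> sc.getExecutorMemoryStatus.foreach(println)   // (executor address, (max storage memory, remaining memory)) in bytes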