Spark SQL

Dhruv Saksena
Dec 12, 2021

If you are coming from a Hadoop background, you will understand the performance implications of Hive with ~180 GB datasets. Hive executes its queries as MapReduce jobs, which write intermediate results to disk and are therefore slower than Spark's in-memory execution; hence the performance drop we see in Apache Hive compared to Spark SQL.

Spark by itself doesn't have storage. It relies on an underlying data store such as HDFS, a NoSQL database, or an RDBMS.
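For instance, the same read API pulls data from any of these backends. A minimal sketch, assuming the pre-built spark session from the Spark shell; the path, host names, and credentials below are placeholders:

// Reading from HDFS (placeholder path)
val fromHdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

// Reading from an RDBMS over JDBC (placeholder connection details)
val fromRdbms = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()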

When we come across Spark, we hear three terms quite often: RDD, DataFrame, and Dataset. Let's look at the fundamental differences between the three.

RDD APIs

As I mentioned in my previous article, an RDD is a read-only (immutable) collection of data that Spark uses for in-memory computation across a cluster. Under the hood, these are Java/Scala objects holding partitions of data on different machines. We can convert any RDD to a DataFrame using the toDF() method. Note that the conversion is not fully reversible: calling .rdd on the resulting DataFrame returns an RDD of generic Row objects, not the original typed objects.
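Here is a minimal sketch in the Spark shell (where spark and sc are pre-defined); the fruit data is made up for illustration:

// An RDD of tuples, partitioned across the cluster
val rdd = sc.parallelize(Seq(("Apple", "Red"), ("Banana", "Green")))

// toDF() comes from import spark.implicits._ (already in scope in the shell)
val df = rdd.toDF("fruit", "color")

// Going back yields generic Rows, not the original tuples
val rows = df.rdd   // RDD[org.apache.spark.sql.Row]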

DataFrame APIs

A DataFrame imposes a structure on the distributed collection of data, giving a higher-level abstraction. We can think of a DataFrame as a table in a relational database. It is meant only for structured and semi-structured data, and it organises data into named columns.
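For example, building a DataFrame from local tuples attaches column names and a schema (the column names here are illustrative):

// toDF with explicit column names; needs import spark.implicits._
val fruits = Seq(("Apple", "Red", "Large"), ("Banana", "Green", "Large"))
  .toDF("fruit", "color", "size")

fruits.printSchema()          // shows fruit, color, size as string columns
fruits.select("fruit").show() // column access by name, like a SQL table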

Dataset APIs

A Dataset provides a rich set of APIs as an extension of the DataFrame API, adding a type-safe, object-oriented programming interface.
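A short sketch, assuming a simple case class; unlike string-based DataFrame columns, field access on a Dataset is checked at compile time:

// The case class gives the Dataset its element type
case class Fruit(fruit: String, color: String, size: String)

// toDS() needs import spark.implicits._
val ds = Seq(Fruit("Apple", "Red", "Large")).toDS()

// f.fruit is verified by the compiler; a typo like f.frut would not compile
ds.map(f => f.fruit.toUpperCase).show()

// A DataFrame is just Dataset[Row]; .as[Fruit] converts back to the typed view
val typed = ds.toDF().as[Fruit]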

Working Example

SparkSession is the entry point to all functionality in Spark.

Open the Spark shell and execute the following commands:

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> import spark.implicits._
import spark.implicits._

scala> val spark = SparkSession.builder().appName("Spark SQL Hello World").getOrCreate()
21/12/12 21:08:36 WARN SparkSession$Builder: Using an existing SparkSession; some spark core configurations may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@14e79d6

scala> val df = spark.read.option("multiline","true").json("/Users/dhruv/Documents/Personal/Learning/Learning/Spark/example_1.json")
df: org.apache.spark.sql.DataFrame = [color: string, fruit: string ... 1 more field]

scala> df.createOrReplaceTempView("data")

scala> val sqlDF = spark.sql("SELECT * FROM data")
sqlDF: org.apache.spark.sql.DataFrame = [color: string, fruit: string ... 1 more field]

scala> sqlDF.show()
+-----+------+-----+
|color| fruit| size|
+-----+------+-----+
|  Red| Apple|Large|
|Green|Banana|Large|
+-----+------+-----+
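The contents of example_1.json are not shown above; a file like the following, inferred from the query output, would produce the same result (the multiline option is needed because the file is a single JSON document rather than one object per line):

[
  { "fruit": "Apple",  "color": "Red",   "size": "Large" },
  { "fruit": "Banana", "color": "Green", "size": "Large" }
]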

It's a simple example that loads data into Spark, after which one can query it with plain SQL.

On the Spark UI, you can see the details of the executed query under the SQL tab.
