Spark SQL
If you come from a Hadoop background, you will understand the performance implications of running Hive on large (~180 GB) datasets. Hive executes its queries as MapReduce jobs, which is inherently slower than Spark's in-memory execution, and hence we see a performance drop in Apache Hive compared to Spark SQL.
Spark by itself doesn't have storage. It relies on underlying data storage such as HDFS, a NoSQL database, or an RDBMS.
When working with Spark, we come across three terms quite often: RDD, DataFrame and Dataset. Let's look at the fundamental differences between these three.
RDD APIs
As I mentioned in my previous article, an RDD is a read-only, distributed collection of data, which Spark uses to do in-memory computation across a cluster. RDDs are essentially JVM (Java/Scala) objects holding data, hosted on different machines in the cluster. We can convert any RDD to a DataFrame using the toDF() method. Note that this conversion is one-way in terms of type information: calling df.rdd afterwards gives back an RDD[Row], not an RDD of the original element type.
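A minimal sketch of the RDD-to-DataFrame conversion, intended for the Spark shell (where spark is the pre-created SparkSession); the column name "fruit" is illustrative:

```scala
import spark.implicits._ // brings toDF() into scope for RDDs of common types

// An RDD: a distributed, read-only collection of Scala objects
val rdd = spark.sparkContext.parallelize(Seq("Apple", "Banana", "Cherry"))

// Convert to a DataFrame with a named column
val df = rdd.toDF("fruit")

// Going back yields an RDD[Row], not the original RDD[String]
val rowRdd = df.rdd
```

In a standalone application you would build the SparkSession yourself before importing spark.implicits._; in the shell it already exists.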
DataFrame APIs
DataFrame imposes a structure on the distributed collection of data, thereby providing a higher-level abstraction. We can think of a DataFrame as a table in a relational database. It is only meant for structured/semi-structured data, and it organises data into named columns.
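The table analogy can be sketched in the Spark shell as follows; the data and column names are made up for illustration:

```scala
import spark.implicits._

// A DataFrame behaves like a relational table with named columns
val fruits = Seq(("Apple", "Red"), ("Banana", "Green")).toDF("fruit", "color")

// Column-oriented operations, much like SQL projections and filters
fruits.select("fruit").show()
fruits.filter($"color" === "Red").show()
```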
DataSet APIs
Dataset provides a rich set of APIs as an extension to the DataFrame API, offering a type-safe, object-oriented programming interface.
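The type safety comes from binding the Dataset to a case class, so field access is checked at compile time. A minimal sketch for the Spark shell (the Fruit case class is illustrative):

```scala
import spark.implicits._

// The case class gives the Dataset its compile-time element type
case class Fruit(name: String, color: String)

val ds = Seq(Fruit("Apple", "Red"), Fruit("Banana", "Green")).toDS()

// Typed access: ds is Dataset[Fruit], so f.color is checked by the compiler
ds.filter(f => f.color == "Red").show()
```

By contrast, a DataFrame is just Dataset[Row], so a mistyped column name only fails at runtime.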
Working Example
SparkSession is the entry point for all functionality in Spark.
Open Spark shell and execute the following commands-
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> import spark.implicits._
import spark.implicits._

scala> val spark = SparkSession.builder().appName("Spark SQL Hello World").getOrCreate()
21/12/12 21:08:36 WARN SparkSession$Builder: Using an existing SparkSession; some spark core configurations may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@14e79d6

scala> val df = spark.read.option("multiline","true").json("/Users/dhruv/Documents/Personal/Learning/Learning/Spark/example_1.json")
df: org.apache.spark.sql.DataFrame = [color: string, fruit: string ... 1 more field]

scala> df.createOrReplaceTempView("data")

scala> val sqlDF = spark.sql("SELECT * FROM data")
sqlDF: org.apache.spark.sql.DataFrame = [color: string, fruit: string ... 1 more field]

scala> sqlDF.show()
+-----+------+-----+
|color| fruit| size|
+-----+------+-----+
|  Red| Apple|Large|
|Green|Banana|Large|
+-----+------+-----+
It's a simple example which loads the data into Spark, after which one can query it with plain SQL.
On the Spark UI, you can see the details of the executed query in the SQL tab.