Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009 as an academic project. The initial idea was to build a cluster management tool that could support different kinds of cluster computing systems.
Apache Spark is an open source data processing framework for performing big data analytics on a distributed computing cluster. It is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast iterative access to datasets. It provides reusability, fault tolerance, real-time stream processing and more. Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive. It supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.

The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user does not have to define where specific files are sent or what computational resources are used to store or retrieve files.

Spark RDD: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing. RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), a logical collection of data partitioned across machines. RDDs can be created in two ways: by referencing datasets in external storage systems, or by applying transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API, which simplifies programming because applications manipulate RDDs much as they would manipulate local collections of data.
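To make the RDD idea concrete, here is a minimal, illustrative Scala sketch (not taken from the Spark documentation); it assumes a SparkContext named sc is already available, for example from spark-shell, and the HDFS path is invented for illustration:

// Minimal RDD sketch. Assumes a SparkContext `sc` (e.g. from spark-shell);
// the HDFS path is made up for illustration.
val lines = sc.textFile("hdfs:///data/events.txt")   // RDD[String], partitioned across the cluster
val lengths = lines.map(line => line.length)          // transformation: lazily describes a new RDD
val longLines = lines.filter(_.length > 80)           // another lazy transformation
val totalChars = lengths.reduce(_ + _)                // action: triggers the distributed computation
println(s"total characters: $totalChars, long lines: ${longLines.count()}")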
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use. At a high level, MLlib provides tools such as:
- ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
- Featurization: feature extraction, transformation, dimensionality reduction, and selection.
- Pipelines: tools for constructing, evaluating, and tuning ML pipelines.
- Persistence: saving and loading algorithms, models, and pipelines.
- Utilities: linear algebra, statistics, data handling, etc.

GraphX is Apache Spark's API for graphs and graph-parallel computation. It comes with a selection of distributed algorithms for processing graph structures, including an implementation of Google's PageRank. These algorithms use Spark Core's RDD approach to modeling data; the GraphFrames package additionally allows you to run graph operations on DataFrames, including taking advantage of the Catalyst optimizer for graph queries.

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used when computing a result, independent of which API or language you use to express the computation, so developers can easily switch back and forth between APIs based on which provides the most natural way to express a given transformation. Beyond SQL support, Spark SQL provides a standard interface for reading from and writing to other data stores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores, such as Apache Cassandra, MongoDB, Apache HBase, and many others, can be used by pulling in separate connectors from the Spark Packages ecosystem.

A DataFrame can be considered as a distributed collection of data organized into named columns. It can be compared to a relational table, a CSV file, or a data frame in R or Python. The DataFrame functionality is made available as an API in Scala, Java, Python, and R.
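As a rough illustration of the Spark SQL and DataFrame APIs just described, the following Scala sketch assumes Spark 2.x; the input path and the column names (name, age) are invented for illustration:

import org.apache.spark.sql.SparkSession

// Minimal Spark SQL / DataFrame sketch; the JSON path and column names are hypothetical.
val spark = SparkSession.builder().appName("SqlSketch").getOrCreate()
val people = spark.read.json("hdfs:///data/people.json")   // DataFrame: distributed data with named columns
people.printSchema()
people.createOrReplaceTempView("people")                    // register the DataFrame for SQL queries
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()                                               // same optimized engine as the DataFrame API

The same query could equally be written with the DataFrame API, for example people.filter("age >= 18").select("name", "age"), and the Catalyst optimizer compiles both forms to the same execution plan.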
RDD is the acronym for Resilient Distributed Dataset, a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDDs:
- Parallelized collections: created by calling parallelize() on an existing collection in the driver program, whose elements then run in parallel with one another.
- Hadoop datasets: created from files in HDFS or another storage system, with a function applied to each file record.

There are two methods to persist data: persist() and cache(). cache() stores the RDD with the default MEMORY_ONLY storage level, while persist() lets you choose among different storage level options such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more; which to use depends on the task.

The Spark framework supports three major types of cluster managers:
- Standalone: a basic manager to set up a cluster.
- Apache Mesos: a generalized, commonly used cluster manager that can also run Hadoop MapReduce and other applications.
- YARN: responsible for resource management in Hadoop.

Spark provides two special kinds of operations on RDDs, called transformations and actions. Transformations follow lazy evaluation and hold the data until an action is called; each transformation generates and returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct and sample are common Spark transformations.

As in Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

A Dataset is a newer addition to the Spark API. It is an interface, introduced as experimental in Spark 1.6, that tries to provide the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine.

An action brings data back from the RDD to the local machine. Execution of an action triggers all of the previously created transformations. Examples of actions: reduce() executes the function passed to it again and again until only one value is left (the function should take two arguments and return one value); take(n) brings the first n values from the RDD back to the local node.

The map() transformation takes a function as input and applies that function to each element in the RDD. The output of the function is a new element (value) for each input element. Example:
val rdd1 = sc.parallelize(List(10, 20, 30, 40))
val rdd2 = rdd1.map(x => x * x)
println(rdd2.collect().mkString(","))

count() is an action on a Spark RDD; it returns the number of elements in the RDD. Example:
val rdd1 = sc.parallelize(List(11, 22, 33, 44))
println(rdd1.count())
Output: 4

Q: Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
import math
def toInt(str):
    return int(str)
nums = numsAsText.map(toInt)
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x + y*y)
total = nums.reduce(sqrtOfSumOfSq)
print(total)
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: applying it pairwise yields the square root of the sum of squares of all the numbers, since sqrt(sqrt(x*x + y*y)**2 + z*z) = sqrt(x*x + y*y + z*z).

mapPartitions() and mapPartitionsWithIndex() are both transformations. mapPartitions() can be used as an alternative to map() and foreach(): it is called once for each partition, while map() and foreach() are called for each element in the RDD, so initialization can be done on a per-partition basis rather than a per-element basis. mapPartitionsWithIndex() is similar to mapPartitions(), but it provides a second parameter, the partition index, which keeps track of the partition.
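To illustrate the per-partition behaviour described above, a small Scala sketch (again assuming a SparkContext sc; the numbers and partition count are arbitrary):

// mapPartitions vs mapPartitionsWithIndex; assumes a SparkContext `sc`.
val rdd = sc.parallelize(1 to 10, 3)           // an RDD with 3 partitions

// The function runs once per partition, so per-partition setup (e.g. opening a
// connection) would happen 3 times here instead of 10 times.
val partialSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(partialSums.collect().mkString(","))   // one partial sum per partition

// mapPartitionsWithIndex additionally passes the partition index.
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => s"partition $idx -> $x"))
tagged.collect().foreach(println)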
Hive contains significant support for Apache Spark, wherein Hive execution is configured to use Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.

To connect Spark with Mesos, the Spark driver program is configured to connect to the Mesos cluster, the Spark binary package is placed in a location accessible to Mesos, and the master URL is set to a mesos:// URL.

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process that reconstructs lost data partitions: an RDD always remembers how it was built from other datasets.

PageRank is a unique feature and algorithm in GraphX that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v's importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Accumulators are variables that are only added to through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators, and we can create named or unnamed accumulators (a short example of both broadcast variables and accumulators appears below).

Transformations are functions applied to an RDD that result in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the current RDD that pass the function argument.

When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from SparkContext are transferred to executors for their execution.

SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

A worker node refers to any node that can run application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes. The worker node is basically the slave node: the master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master; based on resource availability, the master schedules tasks.
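Returning to broadcast variables and accumulators, here is a minimal Scala sketch of both (assuming a SparkContext sc; the lookup table and record keys are invented). Note that accumulator updates made inside transformations may be re-applied if tasks are re-executed, so accumulators are most reliable when updated inside actions:

// Broadcast variable and accumulator sketch; assumes a SparkContext `sc`.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bLookup = sc.broadcast(lookup)                  // shipped to each executor once, read-only there

val unknownKeys = sc.longAccumulator("unknownKeys") // tasks add to it, the driver reads it

val records = sc.parallelize(Seq("a", "b", "x", "c"))
val values = records.map { key =>
  bLookup.value.get(key) match {                    // local lookup, no extra data transfer
    case Some(v) => v
    case None    => unknownKeys.add(1); 0           // count misses via the accumulator
  }
}
println(values.collect().mkString(","))             // 1,2,0,3
println(s"unknown keys seen: ${unknownKeys.value}") // read on the driver after the action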
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using broadcast variables: broadcast variables enhance the efficiency of joins between small and large RDDs (see the sketch below).
- Using accumulators: accumulators help update the values of variables in parallel while executing.
- Avoiding shuffles: the most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.

The best part of Apache Spark is its compatibility with Hadoop, which makes for a very powerful combination of technologies. Using Spark and Hadoop together lets us leverage Spark's processing while utilizing the best of Hadoop's HDFS and YARN.

Broadcast variables are read-only variables kept in an in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help store a lookup table inside memory, which enhances retrieval efficiency compared to an RDD lookup().

Due to in-memory processing, Spark executes jobs around 10-100x faster than Hadoop MapReduce, which makes use of persistent storage for all of its data processing tasks. Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only supports batch processing. Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage. Spark is capable of performing computations multiple times on the same dataset; this is called iterative computation, and there is no iterative computing implemented by Hadoop. (Source: Wikipedia and the Spark documentation)
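As a concrete illustration of using a broadcast variable to avoid a shuffle, the sketch below replaces an RDD join with a map-side join; it assumes a SparkContext sc and uses made-up data:

// Map-side join sketch: broadcast the small side instead of shuffling both RDDs.
val countries = sc.parallelize(Seq((1, "US"), (2, "DE")))          // small lookup RDD
val orders    = sc.parallelize(Seq((1, 10.0), (2, 5.5), (1, 7.2))) // large RDD (toy-sized here)

val countryById = sc.broadcast(countries.collectAsMap())           // collect small side, broadcast it once

// A plain map replaces orders.join(countries): no shuffle of the large RDD.
val joined = orders.map { case (id, amount) =>
  (countryById.value.getOrElse(id, "unknown"), amount)
}
joined.collect().foreach(println)   // (US,10.0), (DE,5.5), (US,7.2)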
These are some of the popular questions asked in an Apache Spark interview. Always be prepared to answer all types of questions: technical skills, interpersonal, leadership or methodology. If you have recently started your career in big data, you can always get certified in Apache Spark to gain the techniques and skills required to be an expert in the field.
What is Apache Spark?
What are the main Components of Spark?
What are the features of Apache Spark?
How does Apache Spark work?
Can you explain Spark RDD?
Can you explain Spark Core?
Can you explain Spark MLlib?
Can you explain Spark GraphX?
Can you explain Spark Streaming?
Can you explain Spark SQL?
What are DataFrames?
What is RDD?
How does an RDD persist data?
What are the types of Cluster Managers in Spark?
What is a Transformation in Spark?
Can you define Yarn?
What are common Spark Ecosystems?
What is Dataset?
What are Actions?
Explain Spark map() transformation?
Explain the action count() in Spark RDD?
Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer?
Explain mapPartitions() and mapPartitionsWithIndex()
What is Hive on Spark?
Explain how Spark can be connected to Apache Mesos.
Can you define Parquet file?
Can you define RDD Lineage?
Can you define PageRank?
Can you explain broadcast variables?
Can you explain accumulators in Apache Spark?
What do you know about Transformations in Spark?
What is Spark Executor?
What do you know about SchemaRDD?
Can you explain worker node?
Explain how you can minimize data transfers when working with Spark.
Explain how Apache Spark can be used alongside Hadoop.
Why is there a need for broadcast variables when working with Apache Spark?
Can you explain benefits of Spark over MapReduce?