Apache Spark Interview Questions and Answers

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
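To make the programming model concrete, here is a minimal word-count sketch in Scala, the classic Spark example. The master URL and input path are placeholders for this sketch, not part of any specific deployment:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process; a real cluster would use its own master URL.
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")    // hypothetical input path
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word across the cluster

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```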

Spark was started by Matei Zaharia at UC Berkeley’s AMPLab in 2009 as an academic project. The initial idea was to build a cluster management tool that could support different kinds of cluster computing systems.

What is Apache Spark?

What are the main Components of Spark?

What are the features of Apache Spark?

How does Apache Spark work?

Can you explain Spark RDD?

Can you explain Spark Core?

Can you explain Spark MLlib?

Can you explain Spark GraphX?

Can you explain Spark Streaming?

Can you explain Spark SQL?

What are DataFrames?
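As a sketch of what an answer might demonstrate, a DataFrame can be built from a local collection. This assumes a SparkSession named spark is already in scope; the data is illustrative:

```scala
import spark.implicits._ // assumes an existing SparkSession named `spark`

// A DataFrame is a distributed collection of rows with a known schema.
val df = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")

df.printSchema()              // name: string, age: int
df.filter($"age" > 30).show() // declarative query, optimized by Catalyst
```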

What is RDD?

How does an RDD persist data?
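A minimal sketch of RDD persistence, assuming an existing SparkContext sc and a placeholder input path:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt") // hypothetical path
val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

cached.count() // first action materializes the RDD and caches its partitions
cached.count() // later actions reuse the cached partitions instead of re-reading
```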

What are the types of Cluster Managers in Spark?

What is a Transformation in Spark?

Can you define YARN?

What are common Spark Ecosystems?

What is a Dataset?

What are Actions?

Explain the Spark map() transformation.
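A minimal sketch of map(), assuming an existing SparkContext sc; the values are illustrative:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// map() is a transformation: it is lazy and returns a new RDD.
val squares = nums.map(x => x * x)

squares.collect() // Array(1, 4, 9, 16) — collect() is the action that runs the job
```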

Explain the action count() in Spark RDD.
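A short sketch of count(), again assuming sc is in scope and using illustrative data:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark"))

// count() is an action: it triggers a job and returns the number of elements.
val n = words.count() // 3
```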

Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?

Explain mapPartitions() and mapPartitionsWithIndex().
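A hedged sketch of both operators, assuming sc is in scope; the partition count and data are illustrative:

```scala
val rdd = sc.parallelize(1 to 8, numSlices = 4)

// mapPartitions(): the function receives an Iterator over one whole partition,
// so per-partition setup (e.g. opening a connection) happens once per partition.
val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))

// mapPartitionsWithIndex(): same idea, but the partition index is also passed in.
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))

partitionSums.collect() // Array(3, 7, 11, 15)
tagged.collect()        // Array((0,1), (0,2), (1,3), (1,4), ...)
```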

What is Hive on Spark?

Explain how Spark can be connected to Apache Mesos.

Can you define a Parquet file?

Can you define RDD Lineage?

Can you define PageRank?

Can you explain broadcast variables?
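A minimal sketch of a broadcast variable, assuming sc is in scope; the lookup table is illustrative:

```scala
// Broadcast a read-only lookup table once per executor instead of
// shipping it with every task.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val rdd = sc.parallelize(Seq("a", "b", "a"))
val resolved = rdd.map(key => lookup.value.getOrElse(key, 0))

resolved.collect() // Array(1, 2, 1)
```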

Can you explain accumulators in Apache Spark?
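A short sketch of an accumulator used as an error counter, assuming sc is in scope; the sample records are illustrative:

```scala
// An accumulator aggregates values from tasks back to the driver,
// typically for counters and metrics.
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "oops", "3")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()            // action: runs the job and updates the accumulator
println(badRecords.value) // 1
```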

What do you know about Transformations in Spark?

What is a Spark Executor?

What do you know about SchemaRDD?

Can you explain a worker node?

Explain how you can minimize data transfers when working with Spark.

Explain how Apache Spark can be used alongside Hadoop.

Why is there a need for broadcast variables when working with Apache Spark?

Can you explain the benefits of Spark over MapReduce?