Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009 as an academic project. The initial idea was to build a cluster management tool that could support different kinds of cluster computing systems.
Apache Spark is an open source data processing framework for performing big data analytics on a distributed computing cluster. It is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast iterative access to datasets. It provides reusability, fault tolerance, real-time stream processing and more. Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive. It supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.

The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user does not have to define where specific files are sent or what computational resources are used to store or retrieve files.

Spark RDD: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing. RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), a logical collection of data partitioned across machines. RDDs can be created in two ways: by referencing datasets in external storage systems, or by applying transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API, which simplifies programming because applications manipulate RDDs much as they would manipulate local collections of data.
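To make the RDD idea concrete, here is a minimal, illustrative Scala sketch (not taken from the Spark documentation); it assumes a SparkContext named sc is already available, for example from spark-shell, and the HDFS path is invented for illustration:

// Minimal RDD sketch. Assumes a SparkContext `sc` (e.g. from spark-shell);
// the HDFS path is made up for illustration.
val lines = sc.textFile("hdfs:///data/events.txt")   // RDD[String], partitioned across the cluster
val lengths = lines.map(line => line.length)          // transformation: lazily describes a new RDD
val longLines = lines.filter(_.length > 80)           // another lazy transformation
val totalChars = lengths.reduce(_ + _)                // action: triggers the distributed computation
println(s"total characters: $totalChars, long lines: ${longLines.count()}")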
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use. At a high level, MLlib provides tools such as:
- ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
- Featurization: feature extraction, transformation, dimensionality reduction, and selection.
- Pipelines: tools for constructing, evaluating, and tuning ML pipelines.
- Persistence: saving and loading algorithms, models, and pipelines.
- Utilities: linear algebra, statistics, data handling, etc.

GraphX is Apache Spark's API for graphs and graph-parallel computation. It comes with a selection of distributed algorithms for processing graph structures, including an implementation of Google's PageRank. These algorithms use Spark Core's RDD approach to modeling data; the GraphFrames package additionally allows you to run graph operations on DataFrames, including taking advantage of the Catalyst optimizer for graph queries.

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used when computing a result, independent of which API or language you use to express the computation, so developers can easily switch back and forth between APIs based on which provides the most natural way to express a given transformation. Beyond SQL support, Spark SQL provides a standard interface for reading from and writing to other data stores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores, such as Apache Cassandra, MongoDB, Apache HBase, and many others, can be used by pulling in separate connectors from the Spark Packages ecosystem.

A DataFrame can be considered as a distributed collection of data organized into named columns. It can be compared to a relational table, a CSV file, or a data frame in R or Python. The DataFrame functionality is made available as an API in Scala, Java, Python, and R.
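As a rough illustration of the Spark SQL and DataFrame APIs just described, the following Scala sketch assumes Spark 2.x; the input path and the column names (name, age) are invented for illustration:

import org.apache.spark.sql.SparkSession

// Minimal Spark SQL / DataFrame sketch; the JSON path and column names are hypothetical.
val spark = SparkSession.builder().appName("SqlSketch").getOrCreate()
val people = spark.read.json("hdfs:///data/people.json")   // DataFrame: distributed data with named columns
people.printSchema()
people.createOrReplaceTempView("people")                    // register the DataFrame for SQL queries
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()                                               // same optimized engine as the DataFrame API

The same query could equally be written with the DataFrame API, for example people.filter("age >= 18").select("name", "age"), and the Catalyst optimizer compiles both forms to the same execution plan.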
RDD is the acronym for Resilient Distributed Dataset, a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDDs:
- Parallelized collections: created by calling parallelize() on an existing collection in the driver program, whose elements then run in parallel with one another.
- Hadoop datasets: created from files in HDFS or another storage system, with a function applied to each file record.

There are two methods to persist data: persist() and cache(). cache() stores the RDD with the default MEMORY_ONLY storage level, while persist() lets you choose among different storage level options such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more; which to use depends on the task.

The Spark framework supports three major types of cluster managers:
- Standalone: a basic manager to set up a cluster.
- Apache Mesos: a generalized, commonly used cluster manager that can also run Hadoop MapReduce and other applications.
- YARN: responsible for resource management in Hadoop.

Spark provides two special kinds of operations on RDDs, called transformations and actions. Transformations follow lazy evaluation and hold the data until an action is called; each transformation generates and returns a new RDD. map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct and sample are common Spark transformations.

As in Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

A Dataset is a newer addition to the Spark API. It is an interface, introduced as experimental in Spark 1.6, that tries to provide the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine.

An action brings data back from the RDD to the local machine. Execution of an action triggers all of the previously created transformations. Examples of actions: reduce() executes the function passed to it again and again until only one value is left (the function should take two arguments and return one value); take(n) brings the first n values from the RDD back to the local node.

The map() transformation takes a function as input and applies that function to each element in the RDD. The output of the function is a new element (value) for each input element. Example:
val rdd1 = sc.parallelize(List(10, 20, 30, 40))
val rdd2 = rdd1.map(x => x * x)
println(rdd2.collect().mkString(","))

count() is an action on a Spark RDD; it returns the number of elements in the RDD. Example:
val rdd1 = sc.parallelize(List(11, 22, 33, 44))
println(rdd1.count())
Output: 4

Q: Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
import math
def toInt(str):
    return int(str)
nums = numsAsText.map(toInt)
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x + y*y)
total = nums.reduce(sqrtOfSumOfSq)
print(total)
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: applying it pairwise yields the square root of the sum of squares of all the numbers, since sqrt(sqrt(x*x + y*y)**2 + z*z) = sqrt(x*x + y*y + z*z).

mapPartitions() and mapPartitionsWithIndex() are both transformations. mapPartitions() can be used as an alternative to map() and foreach(): it is called once for each partition, while map() and foreach() are called for each element in the RDD, so initialization can be done on a per-partition basis rather than a per-element basis. mapPartitionsWithIndex() is similar to mapPartitions(), but it provides a second parameter, the partition index, which keeps track of the partition.
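To illustrate the per-partition behaviour described above, a small Scala sketch (again assuming a SparkContext sc; the numbers and partition count are arbitrary):

// mapPartitions vs mapPartitionsWithIndex; assumes a SparkContext `sc`.
val rdd = sc.parallelize(1 to 10, 3)           // an RDD with 3 partitions

// The function runs once per partition, so per-partition setup (e.g. opening a
// connection) would happen 3 times here instead of 10 times.
val partialSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(partialSums.collect().mkString(","))   // one partial sum per partition

// mapPartitionsWithIndex additionally passes the partition index.
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => s"partition $idx -> $x"))
tagged.collect().foreach(println)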
Hive contains significant support for Apache Spark, wherein Hive execution is configured to use Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.

To connect Spark with Mesos, the Spark driver program is configured to connect to the Mesos cluster, the Spark binary package is placed in a location accessible to Mesos, and the master URL is set to a mesos:// URL.

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process that reconstructs lost data partitions: an RDD always remembers how it was built from other datasets.

PageRank is a unique feature and algorithm in GraphX that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v's importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Accumulators are variables that are only added to through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators, and we can create named or unnamed accumulators (a short example of both broadcast variables and accumulators appears below).

Transformations are functions applied to an RDD that result in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the current RDD that pass the function argument.

When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from SparkContext are transferred to executors for their execution.

SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

A worker node refers to any node that can run application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes. The worker node is basically the slave node: the master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master; based on resource availability, the master schedules tasks.
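Returning to broadcast variables and accumulators, here is a minimal Scala sketch of both (assuming a SparkContext sc; the lookup table and record keys are invented). Note that accumulator updates made inside transformations may be re-applied if tasks are re-executed, so accumulators are most reliable when updated inside actions:

// Broadcast variable and accumulator sketch; assumes a SparkContext `sc`.
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)
val bLookup = sc.broadcast(lookup)                  // shipped to each executor once, read-only there

val unknownKeys = sc.longAccumulator("unknownKeys") // tasks add to it, the driver reads it

val records = sc.parallelize(Seq("a", "b", "x", "c"))
val values = records.map { key =>
  bLookup.value.get(key) match {                    // local lookup, no extra data transfer
    case Some(v) => v
    case None    => unknownKeys.add(1); 0           // count misses via the accumulator
  }
}
println(values.collect().mkString(","))             // 1,2,0,3
println(s"unknown keys seen: ${unknownKeys.value}") // read on the driver after the action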
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using broadcast variables: broadcast variables enhance the efficiency of joins between small and large RDDs (see the sketch below).
- Using accumulators: accumulators help update the values of variables in parallel while executing.
- Avoiding shuffles: the most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.

The best part of Apache Spark is its compatibility with Hadoop, which makes for a very powerful combination of technologies. Using Spark and Hadoop together lets us leverage Spark's processing while utilizing the best of Hadoop's HDFS and YARN.

Broadcast variables are read-only variables kept in an in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help store a lookup table inside memory, which enhances retrieval efficiency compared to an RDD lookup().

Due to in-memory processing, Spark executes jobs around 10-100x faster than Hadoop MapReduce, which makes use of persistent storage for all of its data processing tasks. Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only supports batch processing. Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage. Spark is capable of performing computations multiple times on the same dataset; this is called iterative computation, and there is no iterative computing implemented by Hadoop. (Source: Wikipedia and the Spark documentation)
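As a concrete illustration of using a broadcast variable to avoid a shuffle, the sketch below replaces an RDD join with a map-side join; it assumes a SparkContext sc and uses made-up data:

// Map-side join sketch: broadcast the small side instead of shuffling both RDDs.
val countries = sc.parallelize(Seq((1, "US"), (2, "DE")))          // small lookup RDD
val orders    = sc.parallelize(Seq((1, 10.0), (2, 5.5), (1, 7.2))) // large RDD (toy-sized here)

val countryById = sc.broadcast(countries.collectAsMap())           // collect small side, broadcast it once

// A plain map replaces orders.join(countries): no shuffle of the large RDD.
val joined = orders.map { case (id, amount) =>
  (countryById.value.getOrElse(id, "unknown"), amount)
}
joined.collect().foreach(println)   // (US,10.0), (DE,5.5), (US,7.2)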
These are some of the popular questions asked in an Apache Spark interview. Always be prepared to answer all types of questions: technical skills, interpersonal, leadership or methodology. If you have recently started your career in big data, you can always get certified in Apache Spark to gain the techniques and skills required to be an expert in the field.
What is Apache Spark?
What are the main Components of Spark?
What are the features of Apache Spark?
How does Apache Spark work?
Can you explain Spark RDD?
Can you explain Spark Core?
Can you explain Spark MLlib?
Can you explain Spark GraphX?
Can you explain Spark Streaming?
Can you explain Spark SQL?
What are DataFrames?
What is RDD?
How does an RDD persist data?
What are the types of Cluster Managers in Spark?
What is a Transformation in Spark?
Can you define Yarn?
What are common Spark Ecosystems?
What is Dataset?
What are Actions?
Explain Spark map() transformation?
Explain the action count() in Spark RDD?
Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer?
Explain mapPartitions() and mapPartitionsWithIndex()
What is Hive on Spark?
Explain how Spark can be connected to Apache Mesos.
Can you define Parquet file?
Can you define RDD Lineage?
Can you define PageRank?
Can you explain broadcast variables?
Can you explain accumulators in Apache Spark?
What do you know about Transformations in Spark?
What is Spark Executor?
What do you know about SchemaRDD?
Can you explain worker node?
Explain how you can minimize data transfers when working with Spark.
Explain how Apache Spark can be used alongside Hadoop.
Why is there a need for broadcast variables when working with Apache Spark?
Can you explain benefits of Spark over MapReduce?