Apache Hadoop Interview Questions and Answers

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. (Source: Apache Hadoop official site)

What is Apache Hadoop?

What are the Main Components of Hadoop?

Why do we need Hadoop?

What are the four characteristics of Big Data?

What are the modes in which Hadoop runs?

Can you explain the indexing process in HDFS?

How many daemon processes run on a Hadoop cluster?

What happens to a NameNode that has no data?

Can you explain Hadoop streaming?

Can you define a block and block scanner in HDFS?

Can you define a checkpoint?

Can you explain commodity hardware?

Can you explain heartbeat in HDFS?

What happens when a DataNode fails?

Can you explain TextInputFormat?

Can you define Sqoop in Hadoop?

What are the data components used by Hadoop?

Can you explain rack awareness?

How do ‘map’ and ‘reduce’ work?

Can you explain Combiner?

How many input splits will be made by the Hadoop framework?

Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

What are the problems with small files in HDFS?

What does the ‘jps’ command do?

How do you restart the NameNode?

What does /etc/init.d do?

What is the purpose of the Context object?

What is the default partitioner in Hadoop?

What is the use of a RecordReader in Hadoop?

What is the best way to copy files between HDFS clusters?

Can you explain Speculative Execution?

What is the difference between an Input Split and HDFS Block?

How can native libraries be included in YARN jobs?

Can you explain Apache HBase?

Can you define SerDe in Hive?

Can you explain WAL in HBase?

Can you explain Apache Spark?

Can you define UDF?

Can you explain SMB Join in Hive?

How can you connect an application if you run Hive as a server?

Is YARN a replacement for Hadoop MapReduce?

Can you explain Record Reader?

Can you explain sequence file in Hadoop?

What does conf.setMapperClass do?

How do you override the replication factor?

How do you do a file system check in HDFS?

Is the NameNode also commodity hardware?

Can you define InputSplit in Hadoop?

How many InputSplits are made by the Hadoop framework?

What is the difference between an InputSplit and a Block?

What is the difference between SORT BY and ORDER BY in Hive?

In which directory is Hadoop installed?

What are the port numbers of the NameNode, JobTracker, and TaskTracker?

What are the Hadoop configuration files at present?

What is Cloudera and why is it used?

How can we check whether the NameNode is working or not?

Which files are used by the startup and shutdown commands?

Can we create a Hadoop cluster from scratch?

How can you transfer data from Hive to HDFS?

What is the JobTracker's role in Hadoop?

What are the core methods of a Reducer?

Can I access Hive without Hadoop?

How does Spark use Hadoop?

What is Spark SQL?

What additional benefits does YARN bring to Hadoop?

Can you explain Sqoop metastore?

What are the elements of Kafka?

Can you explain Apache Kafka?

What is the role of ZooKeeper?

What are the key benefits of using Storm for Real-Time Processing?

What are the different stream groupings in Apache Storm?

Which operating system(s) are supported for production Hadoop deployment?

What is the best practice to deploy the Secondary NameNode?

What are the side effects of not running a Secondary NameNode?

What daemons run on Master nodes?

Can you explain BloomMapFile?

What is the use of the foreach operation in Pig scripts?

Explain the different complex data types in Pig.

What is the difference between Pig Latin and HiveQL?

Is the Pig Latin language case-sensitive or not?

What are the use cases of Apache Pig?

What does Apache Mahout do?

What are some machine learning algorithms exposed by Mahout?

What is Apache Flume?

Explain the different channel types in Flume. Which channel type is fastest?

Why are we using Flume?

Which Scala library is used for functional programming?

What do you understand by Unit and () in Scala?

What do you understand by a closure in Scala?

List some use cases where classification machine learning algorithms can be used.

What is data cleansing?

What are some of the best tools for data analysis?

What are some common problems faced by data analysts?

What are the tools used in Big Data?

Which language is more suitable for text analytics, R or Python?

Can you explain logistic regression?

Can you list a few commonly used Hive services?

Can you explain indexing?

What are the components used in the Hive query processor?

If you run a SELECT * query in Hive, why does it not run MapReduce?

What is the purpose of exploding in Hive?