Top Hadoop HDFS Interview Questions and Answers: Below, we have covered detailed answers to the Hadoop HDFS interview questions, which will be helpful to freshers and experienced professionals. All the best for your interview preparation.

What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
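
As a quick, hedged illustration of working with HDFS from the command line, the commands below copy a local file into HDFS and read it back; the directory /user/demo and the file sample.txt are placeholder names, not part of the original article.
$ hadoop fs -mkdir /user/demo              # create a directory in HDFS
$ hadoop fs -put sample.txt /user/demo/    # copy a local file into HDFS
$ hadoop fs -ls /user/demo                 # list it; the output shows replication, size and owner
$ hadoop fs -cat /user/demo/sample.txt     # stream the file contents back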

What are the key features of HDFS?
HDFS is highly fault-tolerant, provides high throughput, and is suitable for applications with large data sets. It offers streaming access to file system data and can be built out of commodity hardware.

What is a block and block scanner in HDFS?
Block – the minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB (128 MB in Hadoop 2.x and later).
Block Scanner – the block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum error. Block scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
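
If you want to see the block scanner at work, on Hadoop 1.x/2.x each DataNode exposes a block scanner report through its web interface on port 50075; the host name below is a placeholder and the port differs on newer releases.
$ curl http://datanode01:50075/blockScannerReport               # summary of scanned blocks and checksum failures
$ curl "http://datanode01:50075/blockScannerReport?listblocks"  # per-block verification status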

What is the difference between NameNode, Backup Node and Checkpoint NameNode?
NameNode: NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace:
fsimage file – it keeps track of the latest checkpoint of the namespace.
edits file – it is a log of changes that have been made to the namespace since the last checkpoint.
BackupNode: the Backup Node provides checkpointing functionality like the Checkpoint Node, but it also maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
Checkpoint Node: the Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
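
To peek inside the fsimage and edits files, Hadoop ships offline viewers; this is only a sketch, and the file names below merely follow the usual fsimage_N / edits_N-M naming pattern rather than coming from the article.
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml            # dump the namespace checkpoint as XML
$ hdfs oev -i edits_0000000000000000043-0000000000000000099 -o edits.xml   # dump the edit log as XML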

Is NameNode also a commodity?
No. The NameNode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so the NameNode has to be a high-availability machine.

What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware includes RAM because there are specific services that need to be executed in RAM. Hadoop can be run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.

What is metadata?
Metadata is the information about the data stored in DataNodes, such as the location of the file, the size of the file, and so on.

What is a Datanode?
DataNodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients.

What is a daemon?
A daemon is a process or service that runs in the background. In general, the word is used in the UNIX environment. The equivalent of a daemon in Windows is a “service”, and in DOS a “TSR”.

What is a job tracker?
The JobTracker is a daemon that runs on the master node (often the same machine as the NameNode) for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different TaskTrackers. In a Hadoop cluster there is only one JobTracker but many TaskTrackers. It is the single point of failure for the Hadoop MapReduce service: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which the JobTracker decides whether an assigned task is completed or not.
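
On Hadoop 1.x, where the JobTracker runs, the jobs it is tracking can be listed from the command line (later releases replace this with mapred job -list); the job id below is a placeholder.
$ hadoop job -list                          # jobs currently known to the JobTracker
$ hadoop job -status job_201903041200_0001  # progress and counters for one job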

What is a task tracker?
The TaskTracker is also a daemon, and it runs on DataNodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes the job, divides the work, and assigns the pieces to different TaskTrackers to perform the MapReduce tasks. While performing these tasks, a TaskTracker simultaneously communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.

What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or the JobTracker does not receive heartbeats, it concludes that there is some problem with the DataNode or that the TaskTracker is unable to perform its assigned task.
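
The practical effect of heartbeats can be seen in the dfsadmin report, which lists which DataNodes the NameNode currently considers live or dead (on Hadoop 1.x the equivalent command is hadoop dfsadmin -report).
$ hdfs dfsadmin -report    # live/dead DataNodes, capacity and usage per node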

What is a ‘block’ in HDFS?
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, primarily to minimize the cost of seeks. If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just the maximum size of the unit in which data is stored. In this situation, only 50 MB will be consumed by the HDFS block and 14 MB will be free to store something else. It is the NameNode (master node) that allocates blocks in an efficient manner.
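
To see how a particular file is actually split into blocks and where those blocks live, fsck and stat can be used; /user/demo/sample.txt is a placeholder path.
$ hdfs fsck /user/demo/sample.txt -files -blocks -locations   # block ids, sizes and the DataNodes holding them
$ hadoop fs -stat "%n %b %o %r" /user/demo/sample.txt         # name, size in bytes, block size, replication factor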

How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending on the block size, once the data is stored, HDFS keeps on storing the last part of the data, which points to where the next part of the data will be. In fact, this is the basis of HDFS.

How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in two ways:
$ hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
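
After changing the replication factor you can verify it with the commands below, using the same paths as the examples above; the second column of the ls output is the file's replication factor.
$ hadoop fs -ls /my/test_file     # second column shows the replication factor
$ hdfs fsck /my/test_dir -files   # reports the replication of every file under the directory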
Explain the difference between NAS and HDFS?

What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file or multiple writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
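
Because writes can only go to the end of a file, Hadoop 2.x and later expose this directly through appendToFile; local.txt and the HDFS path are placeholders.
$ hadoop fs -appendToFile local.txt /user/demo/sample.txt   # appends the local file to the end of the HDFS file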

Who is a ‘user’ in HDFS?
A user is someone like you or me, who has some query or who needs some kind of data.

Is client the end user in HDFS?
No. A client is an application which runs on your machine and is used to interact with the NameNode (JobTracker) or the DataNodes (TaskTrackers).

What is ‘Key value pair’ in HDFS?
A key-value pair is the intermediate data generated by the maps and sent to the reducers for generating the final output.

What is the difference between MapReduce engine and HDFS cluster?
The HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming module which is used to retrieve and analyze the data.
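
A minimal sketch of how key-value pairs flow through the MapReduce engine over an HDFS cluster, using Hadoop Streaming so the mapper and reducer are ordinary shell commands; the streaming jar path and the /wordcount paths are assumptions, not from the article.
# map: split each line into words, one per line, so every word becomes a key
# reduce: keys arrive sorted from the shuffle, so uniq -c counts each distinct word
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /wordcount/in -output /wordcount/out \
    -mapper 'tr " " "\n"' -reducer 'uniq -c'
$ hadoop fs -cat /wordcount/out/part-00000   # each output line is a count followed by a word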