Apache Pig is a high-level platform, supported by the Hadoop framework, for processing large data sets stored in HDFS. Pig uses the Pig Latin language for processing.
Big Data processing has made many advances since MapReduce was first developed. The MapReduce programming model has a simple design: it breaks work down and recombines it in a series of parallelizable operations, making it incredibly scalable. Since MapReduce expects hardware failures, it can run on inexpensive commodity hardware, sharply lowering the cost of a computing cluster.
However, although MapReduce puts parallel programming within reach of most professional software engineers, developing MapReduce jobs isn’t easy:
- They require the programmer to think in terms of “map” and “reduce”.
- N-stage jobs can be difficult to manage.
- Common operations (such as filters, projections, and joins) and rich data types require custom code.
Thus, Apache Pig was developed. It automates the handling of MapReduce's low-level details and provides a high-level abstraction over MapReduce to users.
Every time a Pig Latin statement or a Pig script is executed, it is internally transformed into a MapReduce program and run over data stored in HDFS.
Pig Latin is a procedural version of SQL, and while Pig has certain similarities to SQL, it has more differences. SQL is a declarative query language: the user asks a question in query form, and SQL produces the answer without the user specifying how to compute it. If a user wants to perform multiple operations on tables, they have to write multiple queries and use temporary tables to store intermediate results. SQL supports subqueries, but intermediate results still often end up in temporary tables, and many SQL users find subqueries confusing and difficult to form properly.
Using subqueries creates an inside-out design, where the first step in the data pipeline is the innermost query. Pig, by contrast, is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
In MapReduce, the group-by operation is performed on the reducer side, while filters and projections can be implemented in the map phase. Pig Latin provides standard operations that map onto these, such as FILTER, GROUP BY, and ORDER BY. A Pig script can be analyzed to understand its data flow, and errors can be caught early. Pig Latin is also much lower cost to write and maintain than Java code for MapReduce.
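To make this concrete, here is a minimal Pig Latin sketch of such a pipeline; the input file, schema, and age threshold are assumptions for illustration. Each statement names its intermediate result, so nothing has to be inverted into subqueries or parked in temporary tables:

```
-- input file, schema, and threshold are assumed for illustration
users   = LOAD 'users.txt' USING PigStorage(',')
              AS (name:chararray, age:int, city:chararray);
adults  = FILTER users BY age >= 18;     -- filter: can run in the map phase
by_city = GROUP adults BY city;          -- group by: forces a reduce phase
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS n;
sorted  = ORDER counts BY n DESC;        -- order by
STORE sorted INTO 'adult_counts' USING PigStorage(',');
```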
To convert Pig Latin statements into MapReduce code, the Pig architecture comes into action. Pig compiles and executes a script through three successive plans:
- Logical Plan
- Physical Plan
- MapReduce Plan
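For example, the EXPLAIN operator in the Grunt shell prints the plans Pig generates for a relation; the tiny script below is a hypothetical illustration:

```
-- EXPLAIN prints the logical, physical, and MapReduce plans for B
A = LOAD 'data.txt' AS (name:chararray, age:int);
B = FILTER A BY age > 18;
EXPLAIN B;
```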
Two run modes are available in Pig:
- Local Mode (runs on the local filesystem)
- MapReduce Mode (runs on a Hadoop cluster)
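For example, the run mode is chosen with the `-x` flag when starting Pig: `pig -x local` runs against the local filesystem, while `pig -x mapreduce` (the default) submits jobs to the Hadoop cluster.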
There are three modes of execution available in Pig:
- Interactive Mode (Also known as Grunt Mode)
- Batch Mode
- Embedded Mode
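For example, running `pig` with no arguments opens the interactive Grunt shell, while `pig myscript.pig` runs a complete script in batch mode (the script name here is illustrative). Embedded mode drives Pig from a Java program through the `PigServer` class.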
Pig is easy to learn: it overcomes the need for writing complex MapReduce programs to some extent. Pig works in a step-by-step manner, so it is easy to write and, even better, easy to read.
It can handle heterogeneous data: Pig can handle all types of data, whether structured, semi-structured, or unstructured.
Pig is Faster: Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
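As an illustrative sketch (the input file and schema are assumed), the script below stores two results from a single load; when run as a batch script, Pig's multi-query execution combines the branches so the input is scanned only once:

```
-- both branches share one scan of 'logs' under multi-query execution
logs   = LOAD 'logs' AS (user:chararray, action:chararray);
clicks = FILTER logs BY action == 'click';
views  = FILTER logs BY action == 'view';
STORE clicks INTO 'out/clicks';
STORE views  INTO 'out/views';
```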
Pig does more with less: Pig provides the common data operations (filters, joins, ordering, etc.) and nested data types (e.g., tuples, bags, and maps) that can be used in processing data.
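A small sketch of these nested types in a load schema (the field names and layout are illustrative):

```
-- illustrative schema showing Pig's nested data types
A = LOAD 'data.txt' AS (
        t:  tuple(a:int, b:int),       -- tuple: an ordered set of fields
        bg: bag{pt:(x:int, y:int)},    -- bag: a collection of tuples
        m:  map[]                      -- map: a set of key/value pairs
    );
B = FOREACH A GENERATE t.a, m#'city';  -- tuple field access and map lookup
```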
Pig is Extensible: Pig is easily extended with user-defined functions (UDFs) written in languages including Java, Python, JavaScript, and Ruby, which you can use to load, aggregate, and analyze data. Pig also insulates your code from changes to the Hadoop Java API.
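A minimal sketch of wiring a Java UDF into a script; the jar name and class are hypothetical:

```
REGISTER myudfs.jar;                         -- hypothetical jar of Java UDFs
DEFINE Sanitize com.example.pig.Sanitize();  -- hypothetical UDF class

raw   = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
clean = FOREACH raw GENERATE Sanitize(name) AS name, age;
```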
Pig Latin has many built-in data analysis functions. Below are some of the categories:
- Math Functions
- Eval Functions
- String Functions
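A small sketch combining one function from each category; the input file and schema are assumed, while `UPPER` (string), `ABS` (math), and `COUNT`/`AVG` (eval) are Pig built-ins:

```
A = LOAD 'scores.txt' USING PigStorage(',') AS (name:chararray, delta:double);

B = FOREACH A GENERATE
        UPPER(name) AS name_upper,   -- string function
        ABS(delta)  AS magnitude;    -- math function

G = GROUP A ALL;
C = FOREACH G GENERATE COUNT(A) AS n, AVG(A.delta) AS mean;  -- eval functions
```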