What is Big data?			
		
		
			
“Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured, time sensitive, or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.”
“Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.”
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
		
 
		
					
				Why use Big Data?
		
		
			
The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable cost reductions, time reductions, new product development and optimized offerings, and smart decision making.
Healthcare: Data-driven medicine involves analyzing vast numbers of medical records and images for patterns that can help spot disease early and develop new medicines.
Predicting and responding to natural and man-made disasters: Sensor data can be analyzed to predict where earthquakes are likely to strike next, and patterns of human behavior provide clues that help organizations bring relief to survivors. Big data technology is also used to monitor and safeguard the flow of refugees away from war zones around the world.
Preventing crime: Police forces are increasingly adopting data-driven strategies based on their own intelligence and public data sets in order to deploy resources more efficiently and act as a deterrent where one is needed.
Big Data in communications: Gaining new subscribers, retaining customers, and expanding within current subscriber bases are top priorities for telecommunication service providers. The solutions to these challenges lie in the ability to combine and analyze the masses of customer-generated data and machine-generated data that is being created every day.
Big Data for Retail: Whether for a brick-and-mortar store or an online e-tailer, the answer to staying in the game and being competitive is understanding the customer better in order to serve them. This requires the ability to analyze all the disparate data sources that companies deal with every day, including weblogs, customer transaction data, social media, store-branded credit card data, and loyalty program data.
Thus, big data is used in a number of ways like:
Customer analytics
Compliance analytics
Fraud analytics
Operational analytics
		
 
		
					
				Can you define Big Data Analytics?			
		
		
			
Big data analytics refers to the process of analyzing large volumes of data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.
		
 
		
					
				What are the Characteristics of Big data? (10V’s of Big data)			
		
		
			
Volume: Big data first and foremost has to be “big,” and size in this case is measured as volume. Data used to be created by employees. Now that data is generated by machines, networks and human interaction on systems like social media, the volume of data to be analyzed is massive. Yet, Inderpal states that the volume of data is not as much the problem as other V’s like veracity.
Variety: It refers to the many sources and types of data both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storage, mining and analyzing data. Jeff Veis, VP Solutions at HP Autonomy presented how HP is helping organizations deal with big challenges including data variety. With big data technology we can now analyze and bring together data of different types such as messages, social media conversations, photos, sensor data, video or voice recordings.
Velocity: Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites, mobile devices, etc. Just think of social media messages going viral in seconds. Technology allows us now to analyze the data while it is being generated (sometimes referred to as in-memory analytics), without ever putting it into databases.
Velocity is the speed at which data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive an update from the database every night or even every week. Computers and servers required substantial time to process the data and update the databases.
Veracity: It refers to the biases, noise and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed? Inderpal feels veracity in data analysis is the biggest challenge when compared to things like volume and velocity. In scoping out your big data strategy, you need to have your team and partners work to help keep your data clean and put processes in place to keep ‘dirty data’ from accumulating in your systems.
Validity: Like veracity, validity concerns whether the data is correct and accurate for the intended use. Clearly, valid data is key to making the right decisions. Phil Francisco, VP of Product Management from IBM, spoke about IBM’s big data strategy and the tools they offer to help with data veracity and validity.
Volatility: It refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis. Big data clearly deals with issues beyond volume, variety and velocity, extending to other concerns like veracity, validity and volatility.
Venue: Distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud.
Varifocal: Big data and data science together allow us to see both the forest and the trees.
Varmint: As big data gets bigger, so can software bugs.
Vocabulary: Schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.
		
 
		
					
				What are the important tools useful for Big Data?			
		
		
			
- Apache Hadoop
- Cassandra
- Lumify
- MongoDB
- R Programming
- Talend
- Cloudera
- Polybase
- Teradata
- Data Mining
- Data Cleaning
- Data Visualization
- Data Analytics
- Big Data in Excel
- Tableau
		
 
		
					
				Can you explain data preparation?			
		
		
			
Data preparation is one of the crucial steps in big data projects. Data preparation (or data preprocessing) in this context means manipulation of data into a form suitable for further analysis and processing. It is a process that involves many different tasks and which cannot be fully automated. Many of the data preparation activities are routine, tedious, and time consuming. It has been estimated that data preparation accounts for 60%-80% of the time spent on a data mining project.
		
 
		
					
				Explain the steps to be followed to deploy a Big Data solution?			
		
		
			
The following three steps are followed to deploy a Big Data solution:
Data Ingestion: This is the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or other sources such as log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.
Data Storage: The next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
Data Processing: The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc. Data processing here means converting raw data to machine-readable form and then processing it further (such as storing, updating, rearranging, or printing out) by a computer.
		
 
		
					
				Can you explain Ingestion in Big Data?			
		
		
			
Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data. It’s about moving data – and especially the unstructured data – from where it originated into a system where it can be stored and analyzed. We can also say that Data Ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the Data Pipeline, where data is obtained or imported for immediate use.
Data Ingestion Parameters:
Data Velocity: This deals with the speed at which data flows in from different sources like machines, networks, human interaction, media sites, and social media. The movement of data can be massive or continuous.
Data Size: It implies an enormous volume of data. Data is generated from different sources, and the volume may increase over time.
Data Frequency (Batch, Real-Time): Data can be processed in real time or in batches. In real-time processing, data moves onward as soon as it is received. In batch processing, data is collected into batches at a fixed time interval and then moved onward.
Data Format (Structured, Semi-Structured, Unstructured): Data can be in different formats: structured (i.e., tabular), unstructured (i.e., images, audio, video), or semi-structured (i.e., JSON files, CSV files, etc.).
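The difference between the two ingestion frequencies can be illustrated with a minimal Python sketch. The function names `ingest_stream` and `ingest_batch` are hypothetical, and the batch is closed by record count rather than by a time interval purely to keep the example deterministic; a real ingestion tool would typically flush on a timer.

```python
def ingest_stream(events, sink):
    """Real-time style: forward each event to the sink as soon as it arrives."""
    for event in events:
        sink(event)

def ingest_batch(events, sink, batch_size):
    """Batch style: hold events back and forward them in groups.

    Assumption: a batch closes after `batch_size` events instead of after
    a fixed time interval, so the example has no timing dependence.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            sink(list(batch))
            batch.clear()
    if batch:  # forward the final, partially filled batch
        sink(list(batch))

streamed, batched = [], []
ingest_stream(range(1, 6), streamed.append)
ingest_batch(range(1, 6), batched.append, batch_size=3)
print(streamed)  # [1, 2, 3, 4, 5]
print(batched)   # [[1, 2, 3], [4, 5]]
```

The streaming sink sees five separate deliveries, while the batch sink sees two, which is the trade-off between latency and per-delivery overhead described above.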
		
 
		
					
				What is Hadoop?			
		
		
			
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. Hadoop is written in Java and is not an OLAP (online analytical processing) system. It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled out simply by adding nodes to the cluster.
 
		
					
				Can you define Data Lake?			
		
		
			
It is a new type of cloud-based enterprise architecture that structures data in a more scalable way, making it easier to experiment with. With a data lake, incoming data goes into the lake in raw form, or in whatever form the data source provides, and there we select and organize the data in its raw form. There are no assumptions about the schema of the data; every data source can use whatever schema it likes.
		
 
		
					
				How do big data solutions interact with the existing enterprise infrastructure?			
		
		
			 
Big data solutions work in parallel with the existing enterprise infrastructure leveraging all the unstructured raw data that cannot be processed and stored in a traditional data warehouse solution.
		
 
		
					
				Can you define Oozie?			
		
		
			
Oozie is a workflow scheduler for Hadoop. Oozie allows a user to create Directed Acyclic Graphs (DAGs) of workflows, whose actions can run in parallel or sequentially in Hadoop. It can also run plain Java classes and Pig workflows, and interact with HDFS.
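To make the idea of a DAG of workflow actions concrete, here is a small Python sketch using the standard library's `graphlib` module (Python 3.9+). It is not Oozie's workflow definition format or API; the action names and the `execution_waves` helper are hypothetical and only show how a scheduler can decide which actions may run in parallel and which must wait.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest_logs": set(),
    "ingest_orders": set(),
    "clean_logs": {"ingest_logs"},
    "join": {"clean_logs", "ingest_orders"},
    "report": {"join"},
}

def execution_waves(dag):
    """Group actions into waves; all actions in one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # actions whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(workflow)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The two ingest actions land in the first wave and can run side by side, while `clean_logs`, `join`, and `report` each wait for the wave before them.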
		
 
		
					
				Can you explain the common input formats in Hadoop?			
		
		
			
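Below are the common input formats in Hadoop:
Sequence File Input Format (SequenceFileInputFormat): Used to read sequence files, Hadoop's binary key/value file format.
Text Input Format (TextInputFormat): The default input format in Hadoop. Each line of a plain text file becomes a record whose key is the byte offset of the line and whose value is the line's content.
Key Value Input Format (KeyValueTextInputFormat): Used for plain text files (files broken into lines) in which each line is split into a key and a value at a separator, which is a tab by default.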
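The difference between the two text-based formats is easiest to see on a few lines of input. The following Python sketch only imitates the record view each format gives a mapper; it does not use Hadoop, and the function names are hypothetical.

```python
def text_input_records(data: bytes):
    """Imitate the Text Input Format view: (byte offset of line, line content)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

def key_value_records(data: bytes, separator="\t"):
    """Imitate the Key Value Input Format view: split each line at the first separator.

    A line without the separator becomes a key with an empty value.
    """
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        yield key, value

sample = b"apple\t3\nbanana\t5\nplain\n"
text_view = list(text_input_records(sample))
kv_view = list(key_value_records(sample))
print(text_view)  # [(0, 'apple\t3'), (8, 'banana\t5'), (17, 'plain')]
print(kv_view)    # [('apple', '3'), ('banana', '5'), ('plain', '')]
```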
		
 
		
					
				Can you explain the core methods of a Reducer?			
		
		
			
There are three core methods of a reducer. They are:
setup(): Called once at the start of a reducer task. It configures different parameters like distributed cache, heap size, and input data.
reduce(): A method that is called once per key with the values associated with that key in the reduce task.
cleanup(): Called only once, at the end of a reducer task, to clear all temporary files.
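The calling order of the three methods can be sketched in plain Python. This is an illustration of the lifecycle only, not Hadoop's Java Reducer API; `SumReducer` and `run_reducer` are hypothetical names, and the driver sorts the input itself to stand in for the shuffle and sort phase.

```python
from itertools import groupby
from operator import itemgetter

class SumReducer:
    """Toy reducer that sums the values of each key."""

    def setup(self):
        # Called once, before any key is processed.
        self.calls = ["setup"]
        self.output = []

    def reduce(self, key, values):
        # Called once per key with all values for that key.
        self.calls.append(f"reduce:{key}")
        self.output.append((key, sum(values)))

    def cleanup(self):
        # Called once, after the last key has been processed.
        self.calls.append("cleanup")

def run_reducer(reducer, pairs):
    """Drive the lifecycle: setup, one reduce call per key, then cleanup."""
    reducer.setup()
    ordered = sorted(pairs, key=itemgetter(0))  # stand-in for shuffle and sort
    for key, group in groupby(ordered, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group))
    reducer.cleanup()
    return reducer.output

reducer = SumReducer()
result = run_reducer(reducer, [("b", 1), ("a", 2), ("b", 3)])
print(result)         # [('a', 2), ('b', 4)]
print(reducer.calls)  # ['setup', 'reduce:a', 'reduce:b', 'cleanup']
```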
		
 
		
					
				How is big data analysis helpful in increasing business revenue?			
		
		
			
Big data analysis has become vital for companies. It helps businesses differentiate themselves from others and increase revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. Big data analytics also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, which is why companies are adopting big data analytics. Companies may see a significant revenue increase of 5-20% by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.
		
 
		
					
				Why do we need Hadoop for Big Data Analytics?			
		
		
			
In most cases, exploring and analyzing large unstructured data sets becomes difficult with the lack of analysis tools. This is where Hadoop comes in as it offers storage, processing, and data collection capabilities. Hadoop stores data in its raw forms without the use of any schema and allows the addition of any number of nodes.
Since Hadoop is open-source and is run on commodity hardware, it is also economically feasible for businesses and organizations to use it for the purpose of Big Data Analytics.
		
 
		
					
				Can you define a Combiner?			
		
		
			
A ‘Combiner’ is also termed a ‘Mini-reducer’. It performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help enhance the efficiency of MapReduce by reducing the volume of data that needs to be sent to the reducers.
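The effect of a combiner on the amount of shuffled data can be shown with a small word-count simulation in plain Python. This is a sketch, not Hadoop code: the two inner lists stand for the input of two hypothetical mapper nodes, and all function names are made up for the example.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts locally on one mapper node before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(pairs):
    """Reducer: sum all counts per word across nodes."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Input lines held by two hypothetical mapper nodes.
node_inputs = [["a b a", "a c"], ["b b", "a b"]]
mapped_per_node = [[pair for line in lines for pair in map_words(line)]
                   for lines in node_inputs]

shuffled_plain = [pair for node in mapped_per_node for pair in node]
shuffled_combined = [pair for node in mapped_per_node for pair in combine(node)]

print(len(shuffled_plain), len(shuffled_combined))  # 9 5
print(reduce_counts(shuffled_plain) == reduce_counts(shuffled_combined))  # True
```

Nine records shrink to five before the shuffle while the final counts stay the same. This only works because summation is associative and commutative; an operation such as a plain average cannot be used as a combiner without restructuring it.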
		
 
		
					
				Can you explain Edge Nodes in Hadoop?			
		
		
			
Edge nodes are gateway nodes in Hadoop that act as the interface between the Hadoop cluster and the external network. They run client applications and cluster administration tools and are used as staging areas for data transfers to the Hadoop cluster. Enterprise-class storage capabilities (like 900GB SAS drives with RAID HDD controllers) are required for edge nodes, and a single edge node usually suffices for multiple Hadoop clusters.
		
 
		
					
				Can you define a UDF?			
		
		
			
If some functions are unavailable among the built-in operators, we can programmatically create User Defined Functions (UDFs) in other languages like Java, Python, Ruby, etc. to provide that functionality, and embed them in the script file.
		
 
		
					
				Can you define TaskInstance?			
		
		
			
It is a specific Hadoop MapReduce work process that runs on any given slave node. Each task instance has its very own JVM process that is created by default for aiding its performance.
		
 
		
					
				Can you define FSCK?
		
		
			
FSCK (File System Check) is a command used to run a Hadoop summary report that describes the state of the Hadoop file system. This command is used to check the health of the distributed file system when one or more file blocks become corrupt or unavailable. The Hadoop FSCK only checks for errors in the system and does not correct them, unlike the traditional fsck utility in Linux. The command can be run on the whole system or on a subset of files.
The command is bin/hdfs fsck followed by the path to check, for example bin/hdfs fsck / for the whole file system.
		
 
		
					
				What are the different catalog tables in HBase?			
		
		
			
The two important catalog tables in HBase are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores all the regions in the system. (In HBase 0.96 and later, -ROOT- was removed and .META. was renamed hbase:meta.)
		
 
		
					
				Explain how file systems are checked in HDFS?
		
		
			
The “fsck” command is used for conducting file system checks in Linux as well as in Hadoop HDFS. It is helpful for reporting block names and locations, as well as for ascertaining the overall health of any given file system.
		
 
		
					
				What is the difference between Sqoop and DistCp?
		
		
			
The DistCp utility can be used to transfer data between Hadoop clusters, whereas Sqoop can be used to transfer data only between Hadoop and an RDBMS.
		
 
		
					
				Explain how “reducers” communicate with each other?
		
		
			
The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.
		
 
		
					
				Can you explain collaborative filtering?			
		
		
			
Collaborative filtering is a set of techniques that predict which items a particular consumer will like based on the preferences of many other individuals. It is essentially the technical term for asking other people for recommendations.
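One common variant, user-based collaborative filtering, can be sketched in a few lines of Python. The ratings, user names, and item names below are made up, and cosine similarity is just one of several possible similarity measures; treat this as an illustration of the idea rather than a production recommender.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 4},
    "carol": {"A": 1, "D": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, all_ratings):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    mine = all_ratings[user]
    scores = {}
    for other, theirs in all_ratings.items():
        if other == user:
            continue
        similarity = cosine(mine, theirs)
        for item, rating in theirs.items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

recommendations = recommend("alice", ratings)
print(recommendations)  # ['C', 'D']
```

Alice's ratings resemble Bob's far more than Carol's, so Bob's liking for item C outweighs Carol's liking for item D.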
		
 
		
					
				What are the important modes of Hadoop?			
		
		
			
There are three important modes of Hadoop:
Local Mode or Standalone Mode: Standalone mode is the default mode in which Hadoop runs. Standalone mode is mainly used for debugging, where you don’t really use HDFS. In standalone mode, you can use the local file system for both input and output.
Pseudo-distributed Mode: The pseudo-distributed mode is also known as a single-node cluster, where both the Name Node and the Data Node reside on the same machine.
Fully-Distributed Mode or Multi-Node Cluster: This is the production mode of Hadoop where multiple nodes will be running. Here data will be distributed across several nodes and processing will be done on each node.
		
 
		
					
				Explain some important features of Hadoop?			
		
		
			
Hadoop supports the storage and processing of big data. It is a widely used solution for handling big data challenges. Some important features of Hadoop are –
Open Source: It is an open source framework which means it is available free of cost. Also, the users are allowed to change the source code as per their requirements.
Distributed Processing: It supports distributed processing of data i.e. faster processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is responsible for the parallel processing of data.
Fault Tolerance: It is highly fault-tolerant. It creates three replicas for each block at different nodes, by default. This number can be changed according to the requirement. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data is done automatically.
Reliability: It stores data on the cluster in a reliable manner that is independent of any single machine. So, the data stored in the Hadoop environment is not affected by the failure of a machine.
Scalability: Hadoop is highly scalable. It is compatible with commodity hardware, and we can easily add new hardware to the cluster as nodes.
High Availability: The data stored in Hadoop remains accessible even after a hardware failure. In case of hardware failure, the data can be accessed from another path.
		
 
		
					
				Explain the responsibilities of a data analyst?			
		
		
			
Helping marketing executives understand which products are the most profitable by season, customer type, region and other features
Tracking external trends relative to geographies, demographics and specific products
Ensuring customers and employees relate well
Explaining the optimal staffing plans to cater to the needs of executives looking for decision support
		
 
		
					
				Can you explain the benefits of Big Data?			
		
		
			
Big Data is Timely: Knowledge workers spend 60% of each workday trying to find and manage data.
Big Data is Accessible: Half of senior executives report that accessing the right data is difficult.
Holistic: Information is currently kept in silos within the organization. Marketing data, for example, might be found in web analytics, mobile analytics, social analytics, CRMs, A/B testing tools, email marketing systems, and more, each focused on its own silo.
Trustworthy: 29% of companies measure the monetary cost of poor data quality. Things as simple as monitoring multiple systems for customer contact information updates can save millions of dollars.
Relevant: 43% of companies are dissatisfied with their tools’ ability to filter out irrelevant data. Something as simple as filtering customers from your web analytics can provide a lot of insight into your acquisition efforts.
Secure: The average data security breach costs $214 per customer. The secure infrastructures being built by big data hosting and technology partners can save the average company 1.6% of annual revenues.
Authoritative: 80% of organizations struggle with multiple versions of the truth depending on the source of their data. By combining multiple, vetted sources, more companies can produce highly accurate intelligence sources.