What is Flume?
Flume is a reliable, distributed service for collecting and aggregating large amounts of streaming data into HDFS. Many Big Data analysts use Apache Flume to push data from sources such as Twitter, Facebook, and LinkedIn into Hadoop, Storm, Solr, Kafka, and Spark.

Why do we use Flume?
Hadoop developers most often use this tool to collect log data from social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster.

What is Flume Agent?
A Flume agent is a JVM process that hosts the Flume core components (Source, Channel, Sink) through which events flow from an external source, such as a web server, to a destination, such as HDFS. The agent is the heart of Apache Flume.
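As a minimal sketch (the agent, source, channel, and sink names, host, port, and HDFS path below are placeholder assumptions), an agent is defined by wiring the three components together in its properties file:

agent.sources = webSource
agent.channels = memChannel
agent.sinks = hdfsSink

# Source: listens for incoming data (a netcat source is used here purely for illustration)
agent.sources.webSource.type = netcat
agent.sources.webSource.bind = localhost
agent.sources.webSource.port = 44444
agent.sources.webSource.channels = memChannel

# Channel: buffers events in memory between source and sink
agent.channels.memChannel.type = memory

# Sink: writes events to HDFS
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /flume/events
agent.sinks.hdfsSink.channel = memChannel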

What is Flume event?
A Flume event is the basic unit of data: a payload of bytes together with an optional set of string headers. An external source, such as a web server, sends events to the Flume source, and Flume has built-in functionality to understand the source format; for example, an Avro client sends events to an Avro source in the Avro format.
Each log entry is treated as an event. Each event has a header section and a value (body) section: the headers carry metadata as key/value pairs, and the body carries the payload associated with them.
What are Flume Core components?
Source, Channels and Sink are core components in Apache Flume.
When a Flume source receives an event from an external source, it stores the event in one or more channels.
A Flume channel temporarily stores the event and keeps it until it is consumed by the Flume sink; it acts as the repository within the agent.
A Flume sink removes the event from the channel and puts it into an external repository such as HDFS, or forwards it to the next Flume agent.

What are the core components of Flume?
The core components of Flume are –
Event- The single log entry or unit of data that is transported.
Source- This is the component through which data enters Flume workflows.
Sink-It is responsible for transporting data to the desired destination.
Channel- The conduit between the Source and the Sink that buffers events.
Agent- Any JVM process that runs Flume.
Client- The component that transmits the event to the source operating within the agent.

Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow: an event is removed from a channel only after it has been successfully stored in the next channel or in the terminal repository.

How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the new HBase IPC introduced in HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it makes non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume event is converted into HBase Puts or Increments. The serializer implements HBaseEventSerializer and is instantiated when the sink starts. For every event, the sink calls the serializer's initialize method, and the serializer then translates the Flume event into the HBase Puts and Increments to be sent to the HBase cluster.
Working of the AsyncHBaseSink-
AsyncHBaseSink uses a serializer that implements AsyncHBaseEventSerializer. Its initialize method is called only once, when the sink starts. The sink then invokes the setEvent method followed by the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the serializer's cleanUp method is called.
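As a hedged configuration sketch (the agent, sink, and channel names, table, and column family below are placeholder assumptions), the sink section of an agent file might look like this:

agent.sinks = hbaseSink
# Synchronous HBase sink
agent.sinks.hbaseSink.type = hbase
agent.sinks.hbaseSink.table = flume_events
agent.sinks.hbaseSink.columnFamily = cf
agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbaseSink.channel = fileChannel
# For the asynchronous variant, swap the type and serializer:
# agent.sinks.hbaseSink.type = asynchbase
# agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer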

What are the different channel types in Flume?
There are three different built-in channel types available in Flume.
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
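A hedged sketch of declaring each channel type (the channel names, capacity, and directories are placeholder assumptions):

# In-memory channel (fast, but events are lost if the agent dies)
agent.channels.memChannel.type = memory
agent.channels.memChannel.capacity = 10000

# File channel (durable; events are persisted to disk)
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/flume/data

# JDBC channel (events stored in an embedded Derby database)
agent.channels.jdbcChannel.type = jdbc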

Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable of the three channels (JDBC, FILE, and MEMORY), because events are persisted to disk and removed only after they are successfully delivered to the sink.

Explain about the replication and multiplexing selectors in Flume.
Channel selectors are used to handle multiple channels. Based on a Flume header value, an event can be written to a single channel or to multiple channels. If no channel selector is specified for the source, the replicating selector is used by default; it writes the same event to every channel in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.
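A hedged sketch of a multiplexing selector (the source name, header name, mapping values, and channel names are placeholder assumptions):

agent.sources.webSource.selector.type = multiplexing
agent.sources.webSource.selector.header = datacenter
# Events whose "datacenter" header is US go to hdfsChannel, EU to hbaseChannel
agent.sources.webSource.selector.mapping.US = hdfsChannel
agent.sources.webSource.selector.mapping.EU = hbaseChannel
agent.sources.webSource.selector.default = hdfsChannel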

How multi-hop agent can be setup in Flume?
The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume: the Avro sink of one agent sends events to the Avro source of the next agent.
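A hedged two-agent sketch (the agent names, channel names, hostname, and port are placeholder assumptions):

# Agent 1: forwards events over Avro RPC to the next hop
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector.example.com
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = memChannel

# Agent 2: receives events from agent1 on the same port
agent2.sources.avroSource.type = avro
agent2.sources.avroSource.bind = 0.0.0.0
agent2.sources.avroSource.port = 4545
agent2.sources.avroSource.channels = memChannel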

Does Apache Flume provide support for third party plug-ins?
Yes. Apache Flume has a plug-in-based architecture, so third-party plug-ins can be used to load data from external sources and transfer it to external destinations.

What is the difference between FileSink and FileRollSink?
The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system.

What are the complicated steps in Flume configuration?
Flume processes streaming data, so once it is started there is no stop or end to the process; data flows asynchronously from the source to HDFS via the agent. First of all, the agent must know its individual components and how they are connected in order to load data, so the configuration file is the trigger for loading streaming data. For example, consumerKey, consumerSecret, accessToken, and accessTokenSecret are the key properties needed to download data from Twitter.
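A hedged sketch of the Twitter credentials in an agent file, assuming the bundled TwitterSource class (the source and channel names and credential values are placeholders):

agent.sources.twitterSrc.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitterSrc.consumerKey = YOUR_CONSUMER_KEY
agent.sources.twitterSrc.consumerSecret = YOUR_CONSUMER_SECRET
agent.sources.twitterSrc.accessToken = YOUR_ACCESS_TOKEN
agent.sources.twitterSrc.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
agent.sources.twitterSrc.channels = memChannel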

What are the important steps in the configuration?
The configuration file is the heart of an Apache Flume agent.
Every Source must have at least one channel.
Every Sink must have exactly one channel.
Every component must have a specific type.

What are interceptors?
Interceptors are used to filter or modify events as they pass from the source to the channel. They can drop unnecessary events or keep only targeted log records, and depending on requirements you can chain any number of interceptors.
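A hedged sketch of chaining two interceptors on a source (the source name and regex are placeholder assumptions):

agent.sources.webSource.interceptors = ts filter
# Adds a timestamp header to every event
agent.sources.webSource.interceptors.ts.type = timestamp
# Keeps only events whose body matches the regex
agent.sources.webSource.interceptors.filter.type = regex_filter
agent.sources.webSource.interceptors.filter.regex = ERROR
agent.sources.webSource.interceptors.filter.excludeEvents = false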

What are Channel selectors?
Channel selectors control how events are separated and allocated to particular channels. The default is the replicating channel selector, which replicates the data into multiple/all channels.
The multiplexing channel selector is used to separate and route the data based on the event's header information; based on the sink's destination, the event flows into the particular channel feeding that sink.
For example: one sink is connected to Hadoop, another to S3, and another to HBase; in that case, the multiplexing channel selector can separate the events and route each one to the appropriate sink.

What are sink processors?
A sink processor is a mechanism by which you can group sinks to provide fail-over and load balancing.
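A hedged sketch of a fail-over sink group (the group and sink names and priorities are placeholder assumptions):

agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = hdfsSink1 hdfsSink2
# Fail-over: the higher-priority sink is used until it fails
agent.sinkgroups.sg1.processor.type = failover
agent.sinkgroups.sg1.processor.priority.hdfsSink1 = 10
agent.sinkgroups.sg1.processor.priority.hdfsSink2 = 5
# For load balancing instead, set processor.type = load_balance
# and processor.selector = round_robin (or random)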