Cloudera vs. Hortonworks vs. MapR - Hadoop Distribution Comparison
Choosing the right Hadoop distribution for your enterprise is an important decision, whether you have been using Hadoop for a while or you are new to the framework. The choice of a particular commercial Hadoop distribution is critical because an organization spends a significant amount of money on hardware and Hadoop solutions. Choosing the right Hadoop distribution for your business needs leads to faster data-driven solutions and helps your organization attract the best people in the industry. The idea of this blog post is to explore and compare the Hadoop distributions, Cloudera vs. Hortonworks vs. MapR - based on cost, technical details, and ease of maintenance and deployment.
Cloudera vs. Hortonworks vs. MapR
Hadoop is an open source project, and several vendors have stepped in to develop their own distributions on top of the Hadoop framework to make it enterprise ready. The beauty of Hadoop distributions lies in the fact that they can be personalized with different feature sets to meet the requirements of different classes of users. Hadoop distributions pull together all the enhancement projects present in the Apache repository and present them as a unified product, so that organizations don’t have to spend time assembling these elements into a single functional component.
Different Classes of Users who require Hadoop-
- Professionals who are learning Hadoop might need a temporary Hadoop deployment.
- Organizations that want to adopt big data solutions to pace up with the massive growth of data from disparate sources.
- Hadoop Developers, whose job roles require them to build new tools for the Hadoop ecosystem.
Learn Hadoop to become a Microsoft Certified Big Data Engineer.
Layers of Innovation offered by Commercial Hadoop Vendors
Hadoop vendors have added new functionalities by improving the code base and bundling it with user-friendly management tools, technical support and continuous updates. The most recognized Hadoop distributions available in the market are Cloudera, MapR and Hortonworks. All these Hadoop distributions are compatible with Apache Hadoop, but the question is - what distinguishes them from each other?
Core Distribution
All three big players - Cloudera, MapR and Hortonworks - use the core Hadoop framework and bundle it for enterprise use. The core distribution from each of these vendors is offered with a support and subscription service model.
Enterprise Reliability and Integration
Commercial vendor MapR offers a robust distribution package that includes features like real-time data streaming, built-in connectors to existing systems, data protection, and enterprise-quality engineering.
Management Capabilities
Cloudera and MapR offer additional management software as a part of the commercial distribution so that Hadoop Administrators can configure, monitor and tune their hadoop clusters.
Cloudera vs. Hortonworks vs. MapR- Similarities and Differences Unleashed
Hadoop Distribution | Advantages | Disadvantages
Cloudera Distribution for Hadoop (CDH) | CDH has a user-friendly interface with many features and useful tools like Cloudera Impala. | CDH is comparatively slower than the MapR Hadoop distribution.
MapR Hadoop Distribution | It is one of the fastest Hadoop distributions, with multi-node direct access NFS. | MapR does not have as good an interface console as Cloudera.
Hortonworks Data Platform (HDP) | It is the only Hadoop distribution that supports the Windows platform. | The Ambari management interface on HDP is a basic one and does not have many rich features.
Learn Hadoop to solve the biggest big data problems for top tech companies!
Similarities -
- All three - Cloudera, Hortonworks and MapR - are focused on Hadoop, and their entire revenue comes from offering enterprise-ready Hadoop distributions.
- Cloudera, MapR and Hortonworks are all mid-sized companies, with their premium paying customers increasing over time and with partnership ventures across different industries.
- All three vendors provide downloadable free versions of their distributions, but MapR and Cloudera also provide additional premium Hadoop distributions to their paying customers.
- They have established support communities to help users with the problems they face, and also provide demonstrations if required.
- All three Hadoop distributions have stood the test of time, ensuring stability and security to meet business needs.
That’s where the similarities end. Let’s move on to understand the differences by understanding the features of each Hadoop distribution in detail.
Cloudera Distribution for Hadoop (CDH)
Cloudera is the best-known player and market leader in the Hadoop space, and it released the first commercial Hadoop distribution. With more than 350 customers and with active contribution of code to the Hadoop ecosystem, it tops the list when it comes to building innovative tools. The management console, Cloudera Manager, is easy to use and implement, with a rich user interface displaying all the information in an organized and clean way. The proprietary Cloudera Management suite automates the installation process and also renders various other enhanced services to users - displaying the count of nodes in real time, reducing the deployment time, etc.
Cloudera offers consulting services to bridge the gap between - what the community provides and what organizations need to integrate Hadoop technology in their data management strategy. Groupon uses CDH for its hadoop services.
Unique Features Supported by Cloudera Distribution for Hadoop
- The ability to add new services to a running Hadoop cluster.
- CDH supports multi cluster management.
- CDH provides Node Templates i.e. it allows creation of groups of nodes in a Hadoop cluster with varying configuration so that the users don’t have to use the same configuration throughout the Hadoop cluster.
- Hortonworks and Cloudera both depend on HDFS and go with the DataNode and NameNode architecture for splitting up where the data processing is done and metadata is saved.
MapR Hadoop Distribution
MapR hadoop distribution works on the concept that a market driven entity is meant to support market needs faster. Leading companies like Cisco, Ancestry.com, Boeing, Google Cloud Platform and Amazon EMR use MapR Hadoop Distribution for their Hadoop services. Unlike Cloudera and Hortonworks, MapR Hadoop Distribution has a more distributed approach for storing metadata on the processing nodes because it depends on a different file system known as MapR File System (MapRFS) and does not have a NameNode architecture. MapR hadoop distribution does not rely on the Linux File system.
Unique Features Supported by MapR Hadoop Distribution
- It is the only Hadoop distribution that includes Pig, Hive and Sqoop without any Java dependencies - since it relies on MapRFS.
- MapR is the most production ready Hadoop distribution with enhancements that make it more user friendly, faster and dependable.
- Provides multi-node direct access NFS, so that users of the distribution can mount the MapR file system over NFS, allowing applications to access hadoop data in a traditional way.
- MapR Hadoop Distribution provides complete data protection, ease of use and no single points of failure.
- MapR is considered to be one of the fastest hadoop distributions.
Why should you choose the MapR Hadoop distribution?
Though MapR is still at number 3 in terms of number of installations, it is one of the easiest and fastest hadoop distributions when compared to the others. If you are looking for an innovative approach with lots of free learning material, then the MapR Hadoop distribution is the way to go.
Hortonworks Data Platform (HDP)
Hortonworks, founded by Yahoo engineers, provides a ‘service only’ distribution model for Hadoop. Hortonworks is different from the other hadoop distributions, as it is an open enterprise data platform available free for use. Hortonworks hadoop distribution –HDP can easily be downloaded and integrated for use in various applications.
Ebay, Samsung Electronics, Bloomberg and Spotify use HDP. Hortonworks was the first vendor to provide a production ready Hadoop distribution based on Hadoop 2.0. Though CDH had Hadoop 2.0 features in its earlier versions, all of its components were not considered production ready. HDP is the only hadoop distribution that supports windows platform. Users can deploy a windows based hadoop cluster on Azure through HDInsight service.
Unique Features Supported by Hortonworks Hadoop Distribution –HDP
- HDP makes Hive faster through its new Stinger project.
- HDP avoids vendor lock-in by committing to the mainline Apache Hadoop code base rather than a forked version of Hadoop.
- Focused on enhancing the usability of the Hadoop platform.
Choose the Right Hadoop Distribution to make Big Data Meaningful for your Organization
With a clear distinction in strategy and features between the three big vendors in the Hadoop market - there is no clear winner in sight. Organizations have to choose the kind of Hadoop Distribution depending on the level of sophistication they require. Some of the important questions you would want to get answered before deciding on a particular Hadoop distribution are -
- Will the chosen Hadoop distribution help the general administrators work with Hadoop effectively?
- Does the chosen Hadoop distribution provide ease of data access to hadoop developers and business analysts?
- Does the Hadoop distribution support your organization’s data protection policies?
- Does the Hadoop distribution fit into your environment?
- Does the Hadoop distribution package everything together that Hadoop has to offer?
- Does your organization need a big data solution that can make a quick impact on the overall profitability of the business, or do you want to retain the flexibility of open source Hadoop to mitigate the risk of vendor lock-in?
- How significant are - system dependability, technical support and expanded functionality for your organization?
The MapR distribution is the way to go if the product itself matters most; if open source is your preference, then the Hortonworks Hadoop distribution is for you. If your business requirements fit somewhere in between, then opting for the Cloudera Distribution for Hadoop might be a good decision.
Choosing a Hadoop distribution completely depends on the obstacles an organization faces in implementing Hadoop in the enterprise. The right choice of Hadoop distribution will help organizations connect Hadoop to different data analysis platforms with flexibility, reliability and visibility. Each Hadoop distribution has its own pros and cons. When choosing a Hadoop distribution for business needs, it is imperative to weigh the additional value offered by each distribution against its risk and cost, so that the distribution proves beneficial for your enterprise needs.
Q1) What is big data?
Big Data is a really large amount of data that exceeds the processing capacity of conventional database systems and requires a special parallel processing mechanism. The data is too big and grows rapidly. This data can be either structured or unstructured. To retrieve meaningful information from this data, we must choose an alternative way to process it.
Characteristics of Big Data:
Data that has very large volume, comes from variety of sources and formats and flows into an organization with a great velocity is normally referred to as Big Data.
Q2) What is Hadoop?
Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using a simple and fault-tolerant programming model. It is designed to scale up from a very few machines to thousands, with each machine providing local computation and storage. The Hadoop software library itself is designed to detect and handle failures at the application layer.
Hadoop is written in Java by the Apache Software Foundation. It processes data in a very reliable and fault-tolerant manner.
Core components of Hadoop:
HDFS (Storage) + MapReduce/YARN (Processing)
Q3) What are the sources generating big data?
Employees, users and machines:
- Employees: Historically, employees of organizations generated data.
- Users: Then a shift occurred where users started generating data. For example, email, social media, photos, videos, audio and e-Commerce.
- Machines: Smart phones, intelligent kitchen appliances, CCTV cameras, smart meters, global satellites, and traffic flow sensors.
Q4) Why do we need a new framework for handling big data?
Most of the traditional data was organized neatly in relational databases. Data sets now are so large and complex that they are beyond the capabilities of traditional storage and processing systems.
The following challenges demand cost-effective and innovative forms of handling big data at scale:
Lots of data
Organizations are increasingly required to store more and more data to survive in today’s highly competitive environment. The sheer volume of the data demands lower storage costs as compared to the expensive commercial relational database options.
Complex nature of data
Relational data model has great properties for structured data but many modern systems don’t fit well in row-column format. Data is now generated by diverse sources in various formats like multimedia, images, text, real-time feeds, and sensor streams. Usually for storage, the data is transformed, aggregated to fit into the structured format resulting in the loss of the original raw data.
New analysis techniques
Previously, simple analysis (like average, sum) would prove sufficient to predict customer behavior. But now complex analysis needs to be performed to gain an insightful understanding of the data collected. For example, prediction models for effective micro-segmentation need to analyse the customer's purchase history, browsing behavior, and likes and reviews on social media websites. These advanced analytic techniques need a framework to run on.
Hadoop to the rescue: a framework that provides low-cost storage and complex analytic processing capabilities.
Q5) Why do we need the Hadoop framework, shouldn’t DFS be able to handle large volumes of data already?
Yes, it is true that when the datasets cannot fit in a single physical machine, a Distributed File System (DFS) partitions the data and stores and manages it across different machines. But DFS lacks the following features, for which we need the Hadoop framework:
Fault tolerant:
When a lot of machines are involved chances of data loss increases. So, automatic fault tolerance and failure recovery become a prime concern.
Move computation to data:
If huge amounts of data are moved from storage to the computation machines, then the speed depends on network bandwidth.
Q6) What is the difference between traditional RDBMS and Hadoop?
RDBMS | Hadoop
Schema on write | Schema on read
Scale-up approach | Scale-out approach
Relational tables | Key-value format
Structured queries | Functional programming
Online transactions | Batch processing
Q7) What is HDFS?
The Hadoop Distributed File System (HDFS) is one of the core components of the Hadoop framework. It is a distributed file system for Hadoop. It runs on top of an existing file system (ext2, ext3, etc.).
Goals: automatic recovery from failures; move computation rather than data.
HDFS features:
1. Supports storage of very large datasets
2. Write once, read many access model
3. Streaming data access
4. Replication using commodity hardware
Q8) What is difference between regular file system and HDFS?
Regular File Systems | HDFS
Small block size of data (like 512 bytes) | Large block size (of the order of 64 MB)
Multiple disk seeks for large files | Reads data sequentially after a single seek
Q9) What is HDFS not meant for?
HDFS is not good at:
1. Applications that require low latency access to data (in terms of milliseconds)
2. Lots of small files
3. Multiple writers and file modifications
Q10) What is the HDFS block size and what did you choose in your project?
By default, the HDFS block size is 64 MB. It can be set to higher values such as 128 MB or 256 MB. 128 MB is an acceptable industry standard.
Q11) What is the default replication factor?
Default replication factor is 3.
Q12) What are the different hdfs dfs shell commands to perform a copy operation?
$ hadoop fs -copyToLocal
$ hadoop fs -copyFromLocal
$ hadoop fs -put
Q13) What are the problems with Hadoop 1.0?
1. NameNode: No Horizontal Scalability and No High Availability
2. JobTracker: Overburdened
3. MRv1: It can only understand Map and Reduce tasks
Q14) What comes in Hadoop 2.0 and MapReduce V2 (YARN)?
NameNode: High Availability (HA) and Federation
JobTracker: split into cluster resource management and per-application management (ResourceManager and ApplicationMaster)
Q15) What are the different types of schedulers and which type of scheduler did you use?
Capacity Scheduler
It is designed to run Hadoop applications as a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
Fair Scheduler
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time.
Q16) What are the steps involved in decommissioning (removing) nodes in the Hadoop cluster?
1. Update the network addresses in the dfs.exclude and mapred.exclude files
2. Run $ hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes
3. Check the Web UI; it will show “Decommissioning in Progress”
4. Remove the nodes from the include file and then run the refreshNodes commands of step 2 again
5. Remove the nodes from the slaves file
Q17) What are the steps involved in commissioning (adding) nodes in the Hadoop cluster?
1. Update the network addresses in the dfs.include and mapred.include files
2. Run $ hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes
3. Update the slaves file
4. Start the DataNode and NodeManager on the added node
Q18) How to keep HDFS cluster balanced?
Balancer is a tool that tries to balance the block data distribution across DataNodes, up to a certain threshold, by copying blocks across the cluster.
Q19) What is distcp?
1. distcp is the program that comes with Hadoop for copying large amounts of data to and from Hadoop file systems in parallel.
2. It is implemented as a MapReduce job where the copying is done through maps that run in parallel across the cluster.
3. There are no reducers.
Q20) What are the daemons of HDFS?
1. NameNode
2. DataNode
3. Secondary NameNode
Q21) Command to format the NameNode?
$ hdfs namenode -format
Q22) What are the functions of NameNode?
The NameNode is mainly responsible for:
Namespace
- Maintains metadata about the data
Block Management
- Processes block reports and maintains the location of blocks
- Supports block-related operations
- Manages replica placement
Q23) What is HDFS Federation?
- HDFS Federation allows scaling the name service horizontally; it uses multiple independent NameNodes for different namespaces.
- All the NameNodes use the DataNodes as common storage for blocks.
- Each DataNode registers with all the NameNodes in the cluster.
- DataNodes send periodic heartbeats and block reports and handle commands from the NameNodes.
Q24) What is HDFS High Availability?
1. In an HDFS High Availability (HA) cluster, two separate machines are configured as NameNodes.
2. One of the NameNodes is in an Active state; the other is in a Standby state.
3. The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.
4. They share the same storage and all DataNodes connect to both NameNodes.
Q25) How client application interacts with the NameNode?
1. Client applications interact with the NameNode using the Hadoop HDFS API when they have to locate/add/copy/move/delete a file.
2. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data resides.
3. The client can talk directly to a DataNode after the NameNode has given the location of the data.
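The sketch below is a minimal, hedged illustration of this flow using the Hadoop FileSystem API; the path /user/cloudera/input.txt is only a placeholder. Opening and reading the file triggers exactly the interaction described above: the NameNode supplies block locations, and the bytes are then streamed from the DataNodes.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client handle; talks to the NameNode
        Path file = new Path("/user/cloudera/input.txt");  // placeholder path
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(reader.readLine());         // the data itself is streamed from DataNodes
        }
    }
}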
Q26) What is a DataNode?
1. A DataNode stores data in HDFS; it is a slave node.
2. On startup, a DataNode connects to the NameNode.
3. DataNode instances can talk to each other, mostly during replication.
Q27) What is rack-aware replica placement policy?
1. Rack-awareness is used to take a node’s physical location into account while scheduling tasks and allocating storage.
2. The default replication factor is 3 for data blocks on HDFS.
3. The first two copies are stored on DataNodes located on the same rack, while the third copy is stored on a different rack.
Q28) What is the main purpose of HDFS fsck command?
fsck is a utility to check the health of the file system and to find missing files, over-replicated, under-replicated and corrupted blocks.
Command for finding the blocks for a file:
$ hadoop fsck <path> -files -blocks -racks
Q29) What is the purpose of DataNode block scanner?
1. The block scanner runs on every DataNode and periodically verifies all the blocks stored on the DataNode.
2. If bad blocks are detected, they are fixed before any client reads them.
Q30) What is the purpose of dfsadmin tool?
1. It is used to find information about the state of HDFS
2. It performs administrative tasks on HDFS
3. It is invoked by the hadoop dfsadmin command as the superuser
Q31) What is the command for printing the topology?
$ hdfs dfsadmin -printTopology
It displays a tree of racks and the DataNodes attached to them, as viewed by the NameNode.
Q32) What is RAID?
RAID is a way of combining multiple disk drives into a single entity to improve performance and/or reliability. There are a variety of different levels in RAID.
For example, in RAID level 1, a copy of the same data on two disks increases the read performance by reading alternately from each disk in the mirror.
Q33) Does Hadoop requires RAID?
1. DataNode storage does not use RAID, as redundancy can be achieved by replication between the nodes.
2. RAID is recommended for the NameNode’s disks.
Q34) What are the site-specific configuration files in Hadoop?
1. conf/core-site.xml
2. conf/hdfs-site.xml
3. conf/yarn-site.xml
4. conf/mapred-site.xml
5. conf/hadoop-env.sh
6. conf/yarn-env.sh
Q35) What is MapReduce?
MapReduce is a programming model for processing distributed datasets on clusters of computers.
MapReduce features:
1. Distributed programming complexity is hidden
2. Built-in fault tolerance
3. The programming model is language independent
4. Parallelization and distribution are automatic
5. Enables data-local processing
Q36) What is the fundamental idea behind YARN?
In YARN (Yet Another Resource Negotiator), the JobTracker responsibility is split into:
1. Resource management
2. Job scheduling/monitoring, with separate daemons for each.
YARN supports additional processing models and implements a more flexible execution engine.
Q37) What does the MapReduce framework consist of?
ResourceManager (RM)
1. Global resource scheduler
2. One master RM
NodeManager (NM)
1. One slave NM per cluster node
Container
1. The RM creates containers upon request by the AM
2. An application runs in one or more containers
ApplicationMaster (AM)
1. One AM per application
2. Runs in a container
Q38) What are different daemons in YARN?
1. ResourceManager: the global resource manager.
2. NodeManager: one per data node; it manages and monitors usage of the containers (resources in terms of memory and CPU).
3. ApplicationMaster: one per application; tasks are started by the NodeManager.
Q39) What are the two main components of ResourceManager?
Scheduler
It allocates resources (containers) to the various running applications; container elements include memory, CPU, disk, etc.
ApplicationManager
It accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster container on failure.
Q40) What is the function of NodeManager?
The NodeManager is the resource manager for the node (per machine) and is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.
Q41) What is the function of ApplicationMaster?
The ApplicationMaster is per application and has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.
Q42) What are the minimum configuration requirements for a MapReduce application?
The job configuration requires:
1. the input location
2. the output location
3. the map() function
4. the reduce() function, and
5. the job parameters.
Q43) What are the steps to submit a Hadoop job?
Steps involved in Hadoop job submission:
1. The Hadoop job client submits the job jar/executable and configuration to the ResourceManager.
2. The ResourceManager then distributes the software/configuration to the slaves.
3. The ResourceManager then schedules tasks and monitors them.
4. Finally, job status and diagnostic information are provided to the client.
Q44) How does the MapReduce framework view its input internally?
It views the input as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
Q45) Assuming default configurations, how is a file of size 1 GB (uncompressed) stored in HDFS?
The default block size is 64 MB, so a 1 GB file will be stored as 16 blocks. A MapReduce job will create 16 input splits, each of which will be processed by a separate map task, i.e. 16 mappers.
Q46) What are Hadoop Writables?
Hadoop Writables allow Hadoop to read and write data in a serialized form for transmission as compact binary files. This helps in straightforward random access and higher performance. Hadoop provides built-in classes that implement Writable: Text, IntWritable, LongWritable, FloatWritable, and BooleanWritable.
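As a small, hedged illustration of the serialized form described above (not part of the original article), the snippet below round-trips a Text and an IntWritable through a byte stream using only the standard write()/readFields() methods of the Writable interface.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new Text("hadoop").write(out);       // serialize a Text value
        new IntWritable(42).write(out);      // serialize an IntWritable value
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text word = new Text();
        IntWritable count = new IntWritable();
        word.readFields(in);                 // deserialize in the same order as written
        count.readFields(in);
        System.out.println(word + " -> " + count.get());   // prints: hadoop -> 42
    }
}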
Q47) Why comparison of types is important for MapReduce?
A comparison is important because in the sorting phase the keys are compared with one another. For comparison, the WritableComparable interface is implemented.
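For illustration only, here is a hypothetical composite key implementing WritableComparable; the class name and fields are assumptions, not taken from the article. Its compareTo() method is what the sort phase uses to order map output keys.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;
    public YearMonthKey() { }                      // required no-arg constructor for deserialization
    public YearMonthKey(int year, int month) { this.year = year; this.month = month; }
    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }
    @Override public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }
    @Override public int compareTo(YearMonthKey other) {  // drives sorting of map output keys
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(month, other.month);
    }
    @Override public int hashCode() { return year * 31 + month; }  // used by HashPartitioner
    @Override public boolean equals(Object o) {
        return o instanceof YearMonthKey
                && ((YearMonthKey) o).year == year
                && ((YearMonthKey) o).month == month;
    }
}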
Q48) What is the purpose of RawComparator interface?
RawComparator allows implementors to compare records read from a stream without deserializing them into objects, so it is optimized, as there is no overhead of object creation.
Q49) What is a NullWritable?
It is a special type of Writable that has a zero-length serialization. In MapReduce, a key or a value can be declared as NullWritable if we don’t need that position; it acts as a constant empty value.
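A minimal sketch of the idea, assuming a mapper that only needs to emit keys: the value type is declared as NullWritable, so nothing is serialized for the value position. The class name is illustrative.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class KeysOnlyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // NullWritable.get() is a singleton; nothing is serialized for the value position
        context.write(value, NullWritable.get());
    }
}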
Q50) What is Avro Serialization System?
Avro is a language-neutral data serialization system. It has data formats that work with different languages. Avro data is described using a language-independent schema (usually written in JSON). Avro data files support compression and are splittable.
Avro provides AvroMapper and AvroReducer to run MapReduce programs.
Q51) Explain use cases where the SequenceFile class can be a good fit?
When the data is of binary type, SequenceFile provides a persistent structure for binary key-value pairs. SequenceFiles also work well as containers for smaller files, as HDFS and MapReduce are optimized for large files.
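The following is a hedged sketch of writing binary key-value pairs to a SequenceFile using the option-based createWriter API available in Hadoop 2; the output path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/cloudera/demo.seq");   // placeholder output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("hadoop"), new IntWritable(1));  // one binary key-value record
        }
    }
}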
Q52) What is MapFile?
A MapFile is an indexed SequenceFile and it is used for look-ups by key.
Q53) What is the core of the job in MapReduce framework?
The core of a job:
Mapper interface: map method
Reducer interface: reduce method
Q54) What are the steps involved in MapReduce framework?
1. Firstly, the mapper maps input key/value pairs to a set of intermediate key/value pairs.
2. Maps are the individual tasks that transform input records into intermediate records.
3. The transformed intermediate records do not need to be of the same type as the input records.
4. A given input pair may map to zero or many output pairs.
5. The Hadoop MapReduce framework creates one map task for each InputSplit generated by the InputFormat for the job.
6. It then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task.
7. All intermediate values associated with a given output key are grouped and passed to the Reducers.
Q55) Where is the Mapper Output stored?
The mapper output is stored on the local file system of each individual mapper node. The intermediate data is cleaned up after the Hadoop job completes.
Q56) What is a partitioner and how can the user control which key will go to which reducer?
The Partitioner controls the partitioning of the keys of the intermediate map outputs. By default, a hash function on the key is used to decide the partition; the default partitioner is HashPartitioner.
A custom partitioner is implemented to control which keys go to which Reducer.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class SamplePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Illustrative body (the original left this empty): hash the key into [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
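A hypothetical driver excerpt showing how such a partitioner would be wired into a job is given below; the class name PartitionerWiring and the reducer count are illustrative assumptions, not part of the original article.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class PartitionerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "partitioner-demo");
        job.setJarByClass(PartitionerWiring.class);
        job.setPartitionerClass(SamplePartitioner.class); // custom routing of keys to reducers
        job.setNumReduceTasks(4);                         // getPartition() returns a value in [0, 4)
        // mapper, reducer and input/output settings would follow, as in the WordCount driver further below
    }
}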
Q57) What are combiners and its purpose?
1. Combiners are used to increase the efficiency of a MapReduce program. They can be used to aggregate intermediate map output locally on individual mapper nodes.
2. Combiners can help reduce the amount of data that needs to be transferred across to the reducers.
3. The reducer code can be used as a combiner if the operation performed is commutative and associative.
4. Hadoop may or may not execute a combiner.
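As a hedged sketch, point 3 above can be put into practice by registering the reducer class as the combiner on the Job; the example assumes the WordMapper/WordReducer classes from the word count program later in this post, whose sum operation is commutative and associative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(WordReducer.class); // local aggregation on the map side
        job.setReducerClass(WordReducer.class);  // final aggregation on the reduce side
        // Note: Hadoop may run the combiner zero, one or more times per map task.
    }
}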
Q58) How are the number of partitions and reducers related?
The total number of partitions is the same as the number of reduce tasks for the job.
Q59) What is IdentityMapper?
IdentityMapper maps inputs directly to outputs. IdentityMapper.class is used as the default value when JobConf.setMapperClass is not set.
Q60) What is IdentityReducer?
In IdentityReducer no reduction is performed; all input values are written directly to the output. IdentityReducer.class is used as the default value when JobConf.setReducerClass is not set.
Q61) What is the reducer and its phases?
The Reducer reduces a set of intermediate values which share a key to a smaller set of values. The framework then calls reduce().
Syntax: the reduce(WritableComparable, Iterable, Context) method is called for each <key, (collection of values)> pair in the grouped inputs.
Reducer has three primary phases:
1. Shuffle
2. Sort
3. Reduce
Q62) How do you set the number of reducers?
The number of reducers for the job is set by the user via:
1. Job.setNumReduceTasks(int)
2. -D mapreduce.job.reduces
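A minimal sketch of the programmatic option, with an illustrative reducer count of 10 (the equivalent command-line form is -D mapreduce.job.reduces=10):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(10); // same effect as passing -D mapreduce.job.reduces=10
    }
}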
Q63) Detail description of the Reducer phases?
Shuffle:
The sorted output of the mappers becomes the input to the reducers. The framework fetches the relevant partition of the output of all the mappers.
Sort:
The framework groups Reducer inputs by keys. The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.
Secondary Sort:
If the grouping of intermediate keys is required to be different from the grouping of keys before reduction, use Job.setSortComparatorClass(Class).
Reduce:
The reduce(WritableComparable, Iterable, Context) method is called for each <key, (collection of values)> pair in the grouped inputs.
The output of the reduce task is typically written using Context.write(WritableComparable, Writable).
Q64) Can there be no Reducer?
Yes, the number of reducers can be zero if no reduction of values is required.
Q65) What can be optimum value for Reducer?
The number of reducers can be:
0.95 or 1.75 multiplied by (<number of nodes> * <number of maximum containers per node>)
Increasing the number of reducers:
1. Increases the framework overhead
2. Increases load balancing
3. Lowers the cost of failures
Q66) What is a Counter and what is its purpose?
The Counter is a facility for MapReduce applications to report their statistics. Counters can be used to track job progress in a very easy and flexible manner. They are defined by the MapReduce framework or by applications. Each Counter can be of any Enum type. Applications can define counters of type Enum and update them via counters.incrCounter in the map and/or reduce methods.
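A hedged sketch of an application-defined Enum counter is shown below; it uses the newer Context-based getCounter() call (the counters.incrCounter form mentioned above belongs to the older JobConf API), and the class and counter names are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Hypothetical counter names for illustration
    enum RecordQuality { GOOD, MALFORMED }
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            context.getCounter(RecordQuality.MALFORMED).increment(1); // track bad input lines
            return; // skip empty lines
        }
        context.getCounter(RecordQuality.GOOD).increment(1);
        context.write(value, new LongWritable(1));
    }
}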
Q67) Define different types of Counters?
Built-in Counters:
- MapReduce task counters
- Job counters
Custom Java Counters:
- MapReduce allows users to specify their own counters (using Java enums) for performing their own counting operations.
Q68) Why are Counter values shared by all map and reduce tasks across the MapReduce framework?
Counters are global, so they are shared across the MapReduce framework and aggregated at the end of the job across all the tasks.
Q69) Explain speculative execution.
1. Speculative execution is a way of dealing with individual machines’ performance. As there are lots of machines in the cluster, some machines can have low performance, which affects the performance of the whole job.
2. Speculative execution in Hadoop can run multiple copies of the same map or reduce task on different task tracker nodes, and the results from the first node to finish are used.
Q70) What is DistributedCache and its purpose?
DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications. It distributes application-specific, large, read-only files efficiently. The user needs to use DistributedCache to distribute and symlink the script file.
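A hedged example of the same idea using the Job-level cache API in Hadoop 2 (where the DistributedCache class itself is deprecated); the HDFS path and symlink name are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class CacheWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");
        // "#lookup" creates a symlink named "lookup" in each task's working directory
        job.addCacheFile(new URI("/user/cloudera/reference/lookup.txt#lookup"));
        // Tasks can then open the local file "lookup", typically in Mapper.setup().
    }
}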
Q71) What is the Job interface in MapReduce framework?
Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. Some basic parameters that are configured, for example:
1. Job.setNumReduceTasks(int)
2. Configuration.set(JobContext.NUM_MAPS, int)
3. Mapper
4. Combiner (if any)
5. Partitioner
6. Reducer
7. InputFormat
8. OutputFormat implementations
9. setMapSpeculativeExecution(boolean) / setReduceSpeculativeExecution(boolean)
10. Maximum number of attempts per task (setMaxMapAttempts(int) / setMaxReduceAttempts(int)), etc.
11. DistributedCache for large amounts of (read-only) data.
Q72) What is the default value of map and reduce max attempts?
The framework will try to execute a map task or reduce task by default 4 times before giving up on it.
Q73) Explain InputFormat?
InputFormat describes the input specification for a MapReduce job. The MapReduce framework depends on the InputFormat of the job to:
1. Check the input specification of the job.
2. Split the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation used to extract input records from the logical InputSplit for processing by the Mapper.
Default: TextInputFormat
Q74) What is InputSplit and RecordReader?
An InputSplit specifies the data to be processed by an individual Mapper. In general, an InputSplit presents a byte-oriented view of the input.
Default: FileSplit
A RecordReader reads <key, value> pairs from an InputSplit, processes them and presents a record-oriented view.
Q75) Explain the Job OutputFormat?
OutputFormat describes the details of the output for a MapReduce job. The MapReduce framework depends on the OutputFormat of the job to:
1. Check the job output specification.
2. Provide the RecordWriter implementation used to write the output files of the job as <key, value> pairs.
Default: TextOutputFormat
Q76) What is the option in Hadoop to skip bad records?
Hadoop provides an option where a certain set of bad input records can be skipped when processing map inputs. This feature can be controlled through the SkipBadRecords class.
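A minimal sketch of turning this feature on through the SkipBadRecords class (from the older org.apache.hadoop.mapred package); the thresholds chosen here are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;
public class SkippingConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);    // tolerate at most 1 bad record per skip range
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2); // start skipping after 2 failed task attempts
    }
}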
Q77) Different ways of debugging a job in MapReduce?
1. Add a debug statement to log to standard error, along with a message to update the task’s status message. The Web UI makes it easier to view.
2. Create a custom counter; it gives valuable information for dealing with the problem dataset.
3. Task page and task detail page.
4. Hadoop logs.
5. MRUnit testing.
PROGRAM 1: Counting the number of words in an input file
Introduction
This section describes how to get the word count of a sample input file.
Software Versions
The software versions used are:
VirtualBox: 4.3.20
CDH 5.3: Default MapReduce Version
hadoop-core-2.5.0
hadoop-yarn-common-2.5.0
Steps
1. Create the input file
Create the input.txt file with sample text.
$ vi input.txt
Thanks Lord Krishna for helping us write this book
Hare Krishna Hare Krishna Krishna Krishna Hare Hare
Hare Rama Hare Rama Rama Rama Hare Hare
2. Move the input file into HDFS
Use the -put or -copyFromLocal command to move the file into HDFS
$ hadoop fs -put input.txt
3. Code for the MapReduce program
Java files:
WordCountProgram.java // Driver Program
WordMapper.java // Mapper Program
WordReducer.java // Reducer Program
————————————————–
WordCountProgram.java File: Driver Program
————————————————–
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountProgram extends Configured implements Tool{
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcountprogram");
job.setJarByClass(getClass());
// Configure output and input source
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordMapper.class);
job.setReducerClass(WordReducer.class);
// Configure output
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCountProgram(), args);
System.exit(exitCode);
}
}
————————————————–
WordMapper.java File: Mapper Program
————————————————–
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable count = new IntWritable(1);
private final Text nameText = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString(), " ");
while (tokenizer.hasMoreTokens()) {
nameText.set(tokenizer.nextToken());
context.write(nameText, count);
}
}
}
———————————————–
WordReducer.java file: Reducer Program
————————————————–
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text t, Iterable<IntWritable> counts, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum += count.get();
}
context.write(t, new IntWritable(sum));
}
}
4. Run the MapReduce program
Create the jar of the Code in Step 3 and use the following command to run the MapReduce program
$ hadoop jar WordCount.jar WordCountProgram input.txt output1
Here,
WordCount.jar: Name of the exported jar containing all the classes.
WordCountProgram: Driver Program having the entire configuration
input.txt: Input file
output1: Output folder where the output file will be stored
5. View the Output
View the output in the output1 folder
$ hadoop fs -cat /user/cloudera/output1/part-r-00000
Hare 8
Krishna 5
Lord 1
Rama 4
Thanks 1
book 1
for 1
helping 1
this 1
us 1
write 1
Java files:
WordCountProgram.java // Driver Program
WordMapper.java // Mapper Program
WordReducer.java // Reducer Program
————————————————–
WordCountProgram.java File: Driver Program
————————————————–
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountProgram extends Configured implements Tool{
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, “wordcountprogram”);
job.setJarByClass(getClass());
// Configure output and input source
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordMapper.class);
job.setReducerClass(WordReducer.class);
// Configure output
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCountProgram(), args);
System.exit(exitCode);
}
}
————————————————–
WordMapper.java File: Mapper Program
————————————————–
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper {
private final static IntWritable count = new IntWritable(1);
private final Text nameText = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString(),” “);
while (tokenizer.hasMoreTokens()) {
nameText.set(tokenizer.nextToken());
context.write(nameText, count);
}
}
}
———————————————–
WordReducer.java file: Reducer Program
————————————————–
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordReducer extends Reducer {
@Override
protected void reduce(Text t, Iterable counts, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum += count.get();
}
context.write(t, new IntWritable(sum));
}
}
Q78) What problem does Apache Flume solve?
Scenario:
1. Several services running on different servers produce a huge number of logs. These logs need to be accumulated, stored and analyzed together.
2. Hadoop has emerged as a cost-effective and scalable framework for storing and analyzing big data.
Problem:
1. How can these logs be collected, aggregated and stored in a place where Hadoop can process them?
2. A reliable, scalable, extensible and manageable solution is required.
Q79) What is Apache Flume?
Apache Flume is a distributed data collection service that gets flows of data (like logs) from the systems that generate them and aggregates them into a centralized data store where they can be processed together.
Goals: reliability, recoverability and scalability
Flume features:
1. Guaranteed data delivery
2. Gathers high-volume data streams in real time
3. Streams data from multiple sources into Hadoop for analysis
4. Scales horizontally
Q80) How is Flume-NG different from Flume 0.9?
Flume 0.9:
Centralized configuration of the agents, handled by ZooKeeper.
Input data and writing data are handled by the same thread.
Flume 1.X (Flume-NG):
No centralized configuration; a simple on-disk configuration file is used instead.
Separate threads, called runners, handle input data and writing data.
Q90) What is the problem with HDFS and streaming data (like logs)?
1. In a regular filesystem, when you open a file and write data, the data exists on disk even before the file is closed.
2. In HDFS, the file exists only as a zero-length directory entry until it is closed. This implies that if data is written to a file for an extended period without closing it, you may be left with an empty file if the client loses its network connection.
3. Closing files frequently to create smaller files is not a good approach, as many small files lead to poor efficiency in HDFS.
Q91) What are the core components of Flume?
Flume architecture:
Flume Agent:
1. An agent is a daemon (a physical Java virtual machine) running Flume.
2. It receives and stores the data until it is written to the next destination.
3. Flume sources, channels and sinks run inside an agent.
Source:
1. A source receives data from an application that is producing data.
2. A source writes events to one or more channels.
3. Sources either poll for data or wait for data to be delivered to them.
4. Examples: log4j, Avro, syslog, etc.
Sink:
1. A sink removes events from the agent and delivers them to the destination.
2. The destination can be another agent or HDFS, HBase, Solr, etc.
3. Examples: console, HDFS, HBase, etc.
Channel:
1. A channel holds events passing from a source to a sink.
2. A source ingests events into the channel while a sink removes them.
3. A sink gets events from one channel only.
4. Examples: memory, file, JDBC, etc.
Q92) Explain a common use case for Flume.
Common use case: receiving web logs from several sources into HDFS.
Web server logs → Apache Flume → HDFS (Storage) → Pig/Hive (ETL) → HBase (Database) → Reporting (BI Tools)
1. Logs are generated by several log servers and saved on local hard disks; they need to be pushed into HDFS using the Flume framework.
2. Flume agents running on the log servers collect the logs and push them into HDFS.
3. Data analytics tools like Pig or Hive then process this data.
4. The analyzed data is stored in structured format in HBase or another database.
5. Business intelligence tools then generate reports on this data.
Q93) What are Flume events?
Flume events:
1. An event is the basic payload of data transported by Flume (typically a single log entry).
2. It has zero or more headers and a body.
Event headers are key-value pairs that are used to make routing decisions or to carry other structured information, such as:
1. Timestamp of the event
2. Hostname of the server where the event originated
Event body:
The event body is an array of bytes that contains the actual payload.
Q94) Can we change the body of a Flume event?
Yes, the body of a Flume event can be changed by editing the event with interceptors.
Q95) What are interceptors?
An interceptor is a point in your data flow where you can inspect and alter Flume events. After the source creates an event, zero or more interceptors can be chained together before the event is delivered to the sink.
Q96) What are channel selectors?
Channel selectors are responsible for how an event moves from a source to one or more channels.
Types of channel selectors:
1. Replicating channel selector: the default channel selector, which puts a copy of the event into each channel.
2. Multiplexing channel selector: routes events into different channels depending on header information and/or interceptor logic (see the configuration sketch below).
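A minimal sketch of a multiplexing selector configuration, assuming an agent a1 with a source s1 writing to channels c1 and c2, routed on a header named state (the header name and mapping values are assumptions for illustration):
# Route each event on the value of its "state" header
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = state
a1.sources.s1.selector.mapping.CA = c1
a1.sources.s1.selector.mapping.NY = c2
a1.sources.s1.selector.default = c1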
Q97) What are sink processors?
A sink processor is a mechanism for failover and load balancing of events across multiple sinks from a channel.
Q98) How to configure an agent?
1. An agent is configured using a simple Java properties file of key/value pairs.
2. This configuration file is passed as an argument to the agent on startup.
3. You can configure multiple agents in a single configuration file, so an agent identifier (called a name) must also be passed.
4. Each agent's configuration starts with:
agent.sources=
agent.channels=
agent.sinks=
5. Each source, channel and sink also has a distinct name within the context of that agent.
Q99) Explain the "Hello world" example in Flume.
In the following example, the source listens on a socket for network clients to connect and send event data. Those events are written to an in-memory channel and then fed to a logger (log4j) sink to be printed to the console.
Configuration file for one agent (called a1) that has a source named s1, a channel named c1 and a sink named k1:
# netcatAgent.conf: Logs the netcat events to console
# Name of the components on this agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1
# Configure the source
a1.sources.s1.type=netcat
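The configuration above stops after the source type; a minimal sketch of how the remaining source, channel, sink and binding properties might look (the bind address and port are assumptions):
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 44444
a1.sources.s1.channels = c1
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
The agent can then be started with something like:
$ flume-ng agent -n a1 -c conf -f netcatAgent.conf -Dflume.root.logger=INFO,console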
Q100) What is Hive?
Hive is a Hadoop-based system for querying and analyzing large volumes of structured data stored on HDFS. In other words, Hive is a query engine built on top of Hadoop that compiles queries into MapReduce jobs and runs them on the cluster.
Q101) In which scenarios is Hive a good fit?
1. Data warehousing applications, where relatively static data is analyzed.
2. Fast response time is not a requirement.
3. The data is not changing rapidly.
4. An abstraction over the underlying MapReduce programs is wanted.
5. A SQL-like interface is preferred.
Q102) What are the limitations of Hive?
Hive does not provide:
1. Record-level operations like INSERT, UPDATE or DELETE.
2. Low-latency queries.
3. Transactions.
Q103) What are the differences between Hive and an RDBMS?
Hive:
- Schema on read
- Batch processing jobs
- Data stored on HDFS
- Processed using MapReduce
RDBMS:
- Schema on write
- Real-time jobs
- Data stored in the database's internal structures
- Processed by the database engine
Q104) What are the components of the Hive architecture?
- Hive Driver
- Metastore
- Hive CLI/HUE/HWI
Q105) What is the purpose of the Hive Driver?
The Hive Driver is responsible for compiling, optimizing and then executing HiveQL.
Q106) What is a Metastore and what does it store?
1. It is a database; by default an embedded Derby database is used.
2. It holds metadata about table definitions, columns, data types and partitioning information.
3. It can also be stored in MySQL, Oracle and other relational databases.
Q107) What is the purpose of storing the metadata?
People want to read the dataset with a particular schema in mind.
For example, a BA and the CFO of a company look at the data with a particular schema: the BA may be interested in the IP addresses and timings of the clicks in a weblog, while the CFO may be interested in whether the clicks were direct clicks on the website or came from paid Google ads.
Underneath, it is the same dataset that is accessed, and this schema is used again and again, so it makes sense to store the schema in an RDBMS.
Q108) List the various options available with the Hive command.
Syntax:
$ ./hive --service serviceName
where serviceName can be one of:
cli
help
hiveserver
hwi
jar
lineage
metastore
rcfile
Q109) Explain the different services that can be invoked using the Hive command.
cli
- The default service; used to define tables, run queries, etc.
hiveserver
- A daemon that listens for Thrift connections from other processes.
hwi
- A simple web interface for running queries.
jar
- An extension of the hadoop jar command.
metastore
- An external Hive metastore service to support multiple clients.
rcfile
- A tool for printing the contents of an RCFile.
Q110) Can you execute Hadoop dfs commands from the Hive CLI? How?
Hadoop dfs commands can be run from within the Hive CLI by dropping the hadoop keyword from the command and adding a semicolon at the end.
For example:
Hadoop dfs command:
hadoop dfs -ls /
From within Hive:
hive> dfs -ls /;
Q111) How to give multiline comments in Hive scripts?
Hive does not support multiline comments. Every comment line should start with the string --
For example:
-- This is the first line of comment
-- This is the second line of comment
Q112) What is the reason for creating a new metastore_db whenever a Hive query is run from a different directory?
Embedded mode:
Whenever Hive runs in embedded mode, it checks whether the metastore exists. If the metastore does not exist, it creates a local metastore in the current working directory.
Property and default value:
javax.jdo.option.ConnectionURL = "jdbc:derby:;databaseName=metastore_db;create=true"
Q113) When Hive is run in embedded mode, how can the metastore be shared between multiple users?
It cannot. For sharing, use a standalone database (like MySQL or PostgreSQL) for the metastore.
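As an illustration, a shared MySQL metastore is typically configured through properties like the following in hive-site.xml (the host name, database name and credentials here are assumptions):
javax.jdo.option.ConnectionURL = jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = hiveuser
javax.jdo.option.ConnectionPassword = hivepassword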
Q114) How can an application connect to Hive run as a server?
Thrift client: Hive commands can be called from programming languages like Java, PHP, Python, Ruby and C++.
JDBC driver: a Type 4 (pure Java) JDBC driver.
ODBC driver: uses the ODBC protocol.
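A minimal sketch of the JDBC route, assuming a HiveServer2 instance on localhost:10000 and the org.apache.hive.jdbc.HiveDriver driver class (for the older hiveserver the driver class and URL prefix differ):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class HiveJdbcClient {
public static void main(String[] args) throws Exception {
// Load the HiveServer2 JDBC driver (assumed to be on the classpath)
Class.forName("org.apache.hive.jdbc.HiveDriver");
// Host, port and database are assumptions; 10000 is the usual default port
Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
// List the tables in the default database
ResultSet rs = stmt.executeQuery("SHOW TABLES");
while (rs.next()) {
System.out.println(rs.getString(1));
}
con.close();
}
}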
Q115) List the primitive data types in Hive.
Primitive data types include:
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
Q116) What problem does Apache Pig solve?
Scenario:
1. The MapReduce paradigm presented by Hadoop is low level and rigid, so development can be challenging.
2. Jobs are (mainly) written in Java, where the developer needs to think in terms of map and reduce.
Problem:
1. Many common operations like filters, projections and joins require custom code.
2. Not everyone is a Java expert.
3. MapReduce has a long development cycle.
Q117) What is Apache Pig?
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, together with infrastructure for evaluating these programs.
Goals: ease of programming, improved code readability, flexibility, extensibility
Pig features:
Ease of programming:
- Generates MapReduce programs automatically
- Fewer lines of code
Flexible:
- Metadata is optional
Extensible:
- Easily extensible via UDFs
Pig resides on the client machine.
Q118) In which scenarios is MapReduce a better fit than Pig?
Some problems are harder to express in Pig, for example:
1. Complex grouping or joins
2. Combining lots of datasets
3. Replicated joins
4. Complex cross products
In such cases, Pig's MAPREDUCE relational operator can be used, which allows plugging in a Java MapReduce job.
Q119) In which scenarios is Pig a better fit than MapReduce?
Pig provides common data operations (joins, filters, group by, order by, union) and nested data types (tuples, bags and maps), which are missing from MapReduce.
Q120) Where not to use Pig?
1. Completely unstructured data, for example images, audio and video.
2. When more power to optimize the code is required.
3. Retrieving a single record from a very large dataset.
Q121) What can be fed to Pig?
We can input structured, semi-structured or unstructured data to Pig, for example CSVs, TSVs, delimited data and logs.
Q122) What are the components of the Apache Pig platform?
Pig Engine:
The parser and optimizer, which produces sequences of MapReduce programs.
Grunt:
Pig's interactive shell. It allows users to enter Pig Latin interactively and to interact with HDFS.
Pig Latin:
A high-level, easy-to-understand dataflow language that provides ease of programming, extensibility and optimization.
Q123) What are the execution modes in Pig?
Pig has two execution modes:
Local mode
- No Hadoop/HDFS installation is required
- All processing takes place in a single local JVM
- Used only for quick prototyping and debugging of Pig Latin scripts
- Started with: pig -x local
MapReduce mode (default)
- Parses, checks and optimizes the script locally
- Plans execution as one or more MapReduce jobs
- Submits the jobs to Hadoop and monitors their progress
- Started with: pig or pig -x mapreduce
Q124) What are the different running modes for Pig?
Pig has two running modes:
Interactive mode: Pig commands run one at a time in the Grunt shell.
Batch mode: commands are placed in a Pig script file.
Q125) What are the different ways to develop Pig Latin scripts?
Plugins are available that offer features such as syntax/error highlighting and auto-completion.
Eclipse plugins:
1. PigEditor
2. PigPen
3. Pig-Eclipse
Vim, Emacs and TextMate plugins are also available.
Q126) What are the data types in Pig?
Scalar types: int, long, float, double, chararray, bytearray, boolean (since release 0.10.0)
Complex types: map, tuple, bag
Q127) Which type in Pig is not required to fit in memory?
1. Bag is the type not required to fit in memory, as it can be quite large.
2. Pig can spill bags to disk when necessary.
Q128) What is a Map in Pig?
A map is a chararray-to-data-element mapping, where the data element can be of any Pig data type.
It can also be described as a set of key-value pairs, where keys are chararrays and values can be any Pig data type.
For example: ['student'#'Mahi', 'Rank'#1]
Q129) What is a Tuple in Pig? (~ a row in an RDBMS table)
A tuple is an ordered set of fields; the fields can be of any data type.
For example: ('Mahi', 1, 95.5F)
Q130) HDFS and MapReduce features across Hadoop versions:
Feature | 1.x | 0.22 | 2.x
Secure authentication | Yes | No | Yes
Old configuration names | Yes | Deprecated | Deprecated
New configuration names | No | Yes | Yes
Old MapReduce API | Yes | Yes | Yes
New MapReduce API | Yes | Yes | Yes
MapReduce 1 runtime | Yes | Yes | No
MapReduce 2 runtime (YARN) | No | No | Yes
HDFS federation | No | No | Yes
HDFS high availability | No | No | Yes
Q131) What is cluster rebalancing?
- The HDFS architecture is compatible with data rebalancing schemes.
- A scheme might automatically move data from one datanode to another.
- If there is a sudden rise in demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
Q132) What is data integrity?
1. It is possible for a block of data fetched from a datanode to arrive corrupted, due to faults in the storage device, network faults or buggy software.
2. The HDFS software therefore implements checksum checking on the contents of HDFS files.
3. When a client retrieves file contents, it verifies that the data received matches the checksum stored in the associated checksum file.
4. If not, the client can opt to retrieve that block from another datanode that holds a replica of the block.
Q133) What is the Hadoop File System shell?
1. The Hadoop File System shell (hadoop fs) lets the user move between the local Linux environment and the HDFS environment.
2. It does not support a -vi command, because HDFS is write-once.
3. It does support the -touchz command (create a zero-length file).
4. It operates on HDFS directories, not local directories.
5. Files cannot be created or updated in place on top of HDFS; they are created and updated locally and then put into HDFS.
6. The Hadoop file system does not support hard links or soft links.
7. The Hadoop File System shell does not implement user quotas.
8. Errors are sent to stderr and output is sent to stdout.
Display detailed help for a command:
hadoop fs -help
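For illustration, a few of the commands mentioned above (the file names and paths are assumptions):
$ hadoop fs -touchz /user/cloudera/empty.txt      # create a zero-length file in HDFS
$ hadoop fs -put report.txt /user/cloudera/       # copy a local file into HDFS
$ hadoop fs -ls /user/cloudera                    # list the HDFS directory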
Q134) User command: archive
- Hadoop stores small files inefficiently: each file occupies a block and the namenode has to keep the metadata for every file in memory, so a large number of small files eats up namenode memory and wastes it.
- To avoid this problem we use Hadoop archives (or har files; .har is the extension for all archive files).
- Creating an archive runs a MapReduce job over the input directory, and a Hadoop archive can in turn be used as input for MapReduce programs.
1. Hadoop archives are special format archives.
2. A Hadoop archive maps to a file system directory.
3. A Hadoop archive always has a .har extension.
4. A Hadoop archive directory contains metadata.
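For illustration, an archive might be created and then listed like this (the directory names are assumptions):
$ hadoop archive -archiveName logs.har -p /user/cloudera logs /user/cloudera/archives
$ hadoop fs -ls har:///user/cloudera/archives/logs.har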
Q135) What is serialization?
1. Serialization transforms objects into bytes (and deserialization turns bytes back into objects).
2. Hadoop utilizes serialization for RPC, i.e. for transmitting data across the network.
3. Hadoop employs its own serialization format, Writable.
4. Comparison of types is crucial.
5. Hadoop provides a RawComparator, which avoids deserialization during comparison.
6. External serialization frameworks such as Avro are also supported.
Q136) What is the datanode block scanner?
Every datanode runs a block scanner, which periodically verifies all the blocks stored on the datanode. This allows bad blocks to be detected and fixed before they are read by clients.
It maintains a list of blocks to verify and scans them one by one for checksum errors.
The block scanner report can be viewed by visiting http://datanode:50075/blockScannerReport.
Q137) How does HBase store data?
HBase is a column-oriented data store.
Column-oriented:
1. The reason for storing values on a per-column basis is the assumption that, for specific queries, not all of the values are needed.
2. This reduces I/O.
3. In a column-oriented database the data is grouped by column, and consecutive column values are stored in contiguous disk locations. This is quite different from the conventional approach of traditional databases, which store all the rows contiguously.
Physical storage:
1. StoreFile: one store file per store, for each region of the table.
2. Block: blocks exist within a store file, within a store, for each region of the table.
3. The HLog (write-ahead log) is used for recovery.
A region server:
- Sends heartbeats (load information) to the master
- Handles write requests
- Handles read requests
- Flushes
- Performs compaction
- Manages region splits
Q138) What is Hadoop Streaming?
A utility that enables MapReduce code to be written in any language: C, Perl, Python, C++, Bash, etc. Common examples include a Python mapper and an AWK reducer.
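For illustration, a streaming job might be launched like this, assuming a mapper.py and reducer.py in the current directory (the jar path varies by distribution and version, and the HDFS paths are assumptions):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/cloudera/input.txt \
    -output /user/cloudera/streamout \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py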
Q139) How does HBase differ from other NoSQL stores?
- Favours consistency over availability (but availability is good in practice)
- Great Hadoop integration (very efficient bulk loads, MapReduce analysis)
- Ordered range partitions (not hash partitioning)
- Automatically shards/scales (just run more servers)
- Sparse column storage (not simple key-value)
Q140) What is the HBase client?
The HBase client finds the HRegionServers that serve the specific row range of interest. On instantiation, the client exchanges information with the HBase Master to locate the ROOT region. Once the ROOT region is located, the client communicates with its region server and scans it to locate the META region, which contains the location of the user region that holds the desired row range.
Q141) Why HBase?
- Runs on our own infrastructure, with no usage limits
- Data model supports semi-structured data
- Time-series data is kept ordered
- Scaling is built in (just add more servers)
- But extra indexing is DIY
- Very active developer community
- Established, mature project (in relative terms)
- Matches our own toolset (Java/Linux based)
Q142) What is ZooKeeper (in the context of HBase)?
- Handles master election and server availability
- Cluster management: assignment and transaction state management
- Clients contact ZooKeeper to bootstrap the connection to the HBase cluster
- Stores region key ranges and region server addresses
- Guarantees consistency of this data across clients