This post covers the basic concepts of Big Data and the Apache Hadoop framework. Understanding Big Data is essential in today's IT industry, since Big Data technologies are becoming increasingly popular and powerful.
Big Data:
Big Data refers to storing and processing very large data sets. The data can be:
- Structured data (e.g., RDBMS tables)
- Semi-structured data (e.g., XML files)
- Unstructured data (e.g., flat files, free text)
Apache Hadoop:
Apache Hadoop is an open-source Big Data framework for distributed storage and distributed processing of very large data sets on clusters of computers. Apache Hadoop follows a master-slave, shared-nothing architecture.
The Apache Hadoop core contains:
- HDFS (Hadoop Distributed File System) for distributed storage
- MapReduce (MR) for distributed processing
- YARN for distributed scheduling and resource management
Apache Hadoop was developed at Yahoo, based on Google's white papers on MapReduce and the Google File System.
The design principles of Apache Hadoop are:
- Solve the computation problem on large data sets by using large numbers of commodity machines rather than specialized hardware (e.g., Yahoo runs Hadoop on more than 45,000 machines).
- Automatic parallelization and distribution.
- Automatic recovery and fault tolerance.
- A clean and simple programming model through MapReduce.
Some examples of Big Data scale:
- Yahoo uses a 45,000-node Hadoop cluster.
- Facebook has one of the largest Hadoop clusters on the planet, at over 100 PB.
- Twitter generates 400 million tweets a day.
| Apache Hadoop | Traditional RDBMS |
| --- | --- |
| Schema on read | Schema on write |
| For structured, semi-structured & unstructured data | Only for structured data |
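The "schema on read" idea in the table above can be sketched in a few lines of Python. This is a hypothetical illustration, not Hadoop API code: the file stores raw delimited text, and a schema (invented here as a list of `(name, type)` pairs) is applied only when the data is read, so different jobs can interpret the same bytes differently.

```python
# Hypothetical sketch of schema on read: store raw text, apply structure at read time.
raw_line = "101,alice,2500"

def read_with_schema(line, schema):
    """Apply a (name, type) schema to a raw comma-delimited line at read time."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

# A schema-on-write system (RDBMS) would have enforced this structure at insert time;
# here it is chosen by the reader, per query.
schema = [("id", int), ("name", str), ("salary", int)]
record = read_with_schema(raw_line, schema)
print(record)  # {'id': 101, 'name': 'alice', 'salary': 2500}
```

The same raw line could be read with a different schema (for example, treating every field as a string) without rewriting the stored data, which is the practical benefit of schema on read.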
Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the distributed storage area where data is stored for Hadoop processing. HDFS is a master-slave architecture, where one master device controls one or more slave devices.
- The master contains the Name-Node, Secondary-Name-Node and Job-Tracker.
- The slaves contain the Data-Node and Task-Tracker.
Name-Node
- Controls all Data-Nodes
- Coordinates file system operations such as file creation and deletion
- Maintains the file system meta-information
- Maintains a memory map of the entire cluster
- Manages block mapping, which knows where every block of data lives
- Monitors the health of the Data-Nodes
- The most important node, since it is a Single Point Of Failure (SPOF): if the Name-Node goes down, the cluster goes down
Secondary-Name-Node
- Takes snapshots (backups) of the Name-Node metadata for restoring
- It is not a failover (standby) for the Name-Node
- Provides a metadata backup for rebuilding a failed Name-Node
- Does not provide high availability
Data-Node
- Stores the actual data; each block is replicated to 3 nodes by default
- Responsible for block operations
- Serves client read/write requests for blocks, coordinating with the Name-Node
- Sends heartbeats with a block report to the Name-Node (by default every 3 seconds)
Job-Tracker
- The Job-Tracker is the controller for all Task-Trackers
- Master-slave flow:
- The Job-Client submits a job to the Job-Tracker
- The Job-Tracker talks to the Name-Node, creates an execution plan and submits work to the Task-Trackers
- Each Task-Tracker reports progress via heartbeats, manages its task phases and updates its status
- The default HDFS block size is 64 MB
- HDFS properties:
- Large data
- N-times replication
- Failure is the norm rather than the exception
- Fault tolerance
- HDFS Features
- Rack awareness
- Reliable Storage
- High Throughput
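The defaults mentioned above (a 64 MB block size and 3-way replication) imply some simple storage arithmetic. As a rough sketch, a file is split into ceiling(size / block size) blocks, and every block is stored on three Data-Nodes, so the raw storage is about three times the logical file size (the final partial block only occupies its actual length on disk):

```python
import math

BLOCK_SIZE_MB = 64   # default HDFS block size, as stated above
REPLICATION = 3      # default replication factor, as stated above

def hdfs_footprint(file_size_mb):
    """Return (block count, total raw storage in MB) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is replicated, so raw storage is replication x logical size.
    return blocks, file_size_mb * REPLICATION

print(hdfs_footprint(200))  # (4, 600): 3 full blocks + 1 partial, 600 MB raw storage
```

So a 200 MB file becomes 4 blocks (3 full 64 MB blocks plus one 8 MB block) and consumes roughly 600 MB of cluster storage with default replication.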
Map-Reduce:
- MapReduce is a programming paradigm.
- It is an execution engine that uses Mappers and Reducers.
- In practice, these are user-defined pieces of code for processing large data sets.
- How MapReduce works:
- Consider input data containing "Deer Bear River Car Car River Deer Car Bear".
- Split phase (hidden phase): splits the input into a number of input splits.
- Map phase: transforms each input split into key-value pairs according to user-defined mapper code.
- Shuffle & sort (hidden phase): moves the map output to the reducers and sorts it by key.
- Reduce phase: aggregates the values for each key according to user-defined reducer code.
- Final result: the aggregated key-value pairs are written out as the final result.
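The phases above can be sketched as a small in-memory word count in Python. This is an illustrative simulation of the MapReduce flow, not Hadoop API code: `mapper`, `reducer` and `map_reduce` are invented names, and the shuffle & sort phase is simulated with an ordinary sort plus `groupby`.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce phase: aggregate all values seen for one key."""
    return (word, sum(counts))

def map_reduce(splits):
    # Map phase over every input split.
    mapped = [pair for split in splits for pair in mapper(split)]
    # Shuffle & sort: order the map output by key so equal keys are adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(word, [count for _, count in pairs])
            for word, pairs in groupby(mapped, key=itemgetter(0))]

# The input from the walkthrough above, divided into three splits.
result = map_reduce(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print(result)  # [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

On a real cluster the same mapper and reducer logic runs in parallel across machines, with the framework handling the split, shuffle and sort phases automatically.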
Hadoop Technology Stack:
Hadoop is a collection of frameworks; the following are the popular frameworks in the Hadoop stack.
- For Data Access
- PIG: High-level data-flow scripting language and execution framework
- HIVE: Data-warehouse infrastructure that allows SQL-like queries
- For Data Storage
- HBASE: Bigtable-like structured storage system, millions of columns and billions of rows
- Cassandra: Scalable multi-master NoSQL database with no single point of failure
- For Interaction, Visualization, Execution & Development
- HCatalog: Table and metadata management
- Lucene: Indexing and search library with wildcard support
- Crunch: Used for MapReduce pipelining with shuffle and sort
- For Data Serialization
- Avro: Data serialization system
- Thrift: Language-neutral serialization and RPC framework
- For Data Intelligence
- Mahout: Machine Learning & Data Mining tool mainly used for Business Intelligence
- For Data Integration
- Sqoop: Import and export data between RDBMS and Hadoop
- Flume: Log data collection system
- Chukwa: Data collection system
- For Management & Monitoring
- Ambari: Web based tool for monitoring and managing
- Zookeeper: High-performance coordination tool
- Oozie: Schedule and workflow tool
A typical Hadoop ecosystem may contain HDFS (Hadoop Distributed File System), HIVE, PIG, HBase, Zookeeper, Sqoop, Flume and Oozie, but the exact mix depends entirely on individual requirements and experience.