Hadoop

This post is about the basic concepts of Big Data and the Apache Hadoop framework. Knowing about Big Data is essential in today's IT industry, since Big Data technologies are becoming ever more popular and powerful.

Big Data:
Big Data is, in essence, the storage and processing of very large data sets. Data can be
  • Structured data (ex: RDBMS tables)
  • Semi-structured data (ex: XML files)
  • Unstructured data (ex: flat files)
Big Data technologies address the challenges that overwhelm traditional databases: Volume, Velocity, Variety, and complex data sets. By one popular estimate, 95% of today's data was created in the last three years, so the data landscape has changed dramatically. Companies such as Amazon, Google, Facebook, Twitter, and Yahoo have realized the power of Big Data and are doing R&D on it.

Apache Hadoop:

Apache Hadoop is an open-source Big Data framework for distributed storage and distributed processing of very large data sets on clusters of computers. Apache Hadoop follows a master-slave, shared-nothing architecture.

The Apache Hadoop core contains
  • HDFS (Hadoop Distributed File System) for distributed storage.
  • MapReduce (MR) for distributed processing.
  • YARN for distributed resource management and scheduling.

Apache Hadoop was developed at Yahoo, based on Google's MapReduce and Google File System (GFS) papers.

The design principles of Apache Hadoop are
  • Solve the computation problem of large data sets by using a large number of commodity machines instead of specialized hardware. For example:
    • Yahoo runs Hadoop clusters of around 45,000 nodes.
    • Facebook has one of the largest Hadoop clusters on the planet, at roughly 100 PB.
    • Twitter generates about 400 million tweets a day.
  • Automatic parallelization and distribution.
  • Automatic recovery and fault tolerance.
  • A clean and simple programming model via MapReduce.
Comparison of Apache Hadoop and a traditional RDBMS:
  • Apache Hadoop: schema on read; handles structured, semi-structured, and unstructured data.
  • Traditional RDBMS: schema on write; handles only structured data.

Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the distributed storage layer where data is stored for Hadoop processing. HDFS is a master-slave architecture, where one master node controls one or more slave nodes (a minimal Java client sketch appears at the end of this section).
  • The master runs the Name-Node, Secondary-Name-Node, and Job-Tracker.
  • Each slave runs a Data-Node and a Task-Tracker.
Name-Node
  • Controls all Data-Nodes.
  • Coordinates file-system operations such as file creation and deletion.
  • Maintains the file-system metadata.
  • Maintains a memory map of the entire cluster.
  • Manages the block mapping, which records which blocks, on which Data-Nodes, make up each file.
  • Monitors the health of the Data-Nodes.
  • The most important node, since it is a Single Point Of Failure (SPOF): if the Name-Node goes down, the cluster goes down.
Secondary Name-Node
  • Periodically snapshots (checkpoints) the Name-Node metadata so it can be used for restoring.
  • It is not a failover for the Name-Node.
  • Its metadata backup is only useful for rebuilding a failed Name-Node.
  • It does not provide high availability.
Data-Node
  • Stores the actual data; each block is replicated to 3 nodes by default.
  • Responsible for block operations.
  • Serves client read/write requests directly; the Name-Node is consulted only for metadata.
  • Sends heartbeats with block reports to the Name-Node (every 3 seconds by default).
  • Job-Tracker: the Job-Tracker is the controller for all Task-Trackers.
    • Master-slave flow:
      • The Job-Client submits a job to the Job-Tracker.
      • The Job-Tracker talks to the Name-Node, creates an execution plan, and submits work to the Task-Trackers.
      • Each Task-Tracker reports progress via heartbeats, manages task phases, and updates its state.
  • The default HDFS block size is 64 MB.
  • HDFS Properties
    • Large data sets
    • N-times replication (3 by default)
    • Failure is the norm rather than the exception
    • Fault tolerance
  • HDFS Features
    • Rack awareness
    • Reliable storage
    • High throughput
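To make the division of labor concrete, here is a minimal client sketch using the Hadoop FileSystem Java API. The NameNode address and file path are hypothetical placeholders, not values from this post; on a real cluster the metadata calls below go to the Name-Node, while the file bytes stream directly to the Data-Nodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/sample.txt");

      // Creating a file registers metadata with the Name-Node;
      // the actual bytes are written to Data-Nodes (3 replicas by default).
      try (FSDataOutputStream out = fs.create(file)) {
        out.writeBytes("Deer Bear River Car Car River Deer Car Bear\n");
      }

      // Metadata queries such as block size and replication are answered
      // by the Name-Node without touching any Data-Node.
      FileStatus status = fs.getFileStatus(file);
      System.out.println("Block size:  " + status.getBlockSize());
      System.out.println("Replication: " + status.getReplication());
    }
  }
}
```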
Map-Reduce:
  • MapReduce is a programming paradigm.
  • It is an execution engine that uses mappers and reducers.
  • Mappers and reducers are user-written code for processing large data sets.
  • How MapReduce works (the word-count sketch below implements these phases):
    • Consider input data containing "Deer Bear River Car Car River Deer Car Bear".
    • Split phase (hidden phase): splits the input into a number of input splits.
    • Map phase: transforms each input split into key-value pairs (maps), according to the user-defined mapper code.
    • Shuffle & sort (hidden phase): moves the map output to the reducers and sorts it by key.
    • Reduce phase: aggregates the map output per key, according to the user-defined reducer code.
    • Final result: the reducer output is written to output files; for the sample input this is the count of each word (Bear 2, Car 3, Deer 2, River 2).
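As an illustration, here is a minimal word-count job in the style of the classic Hadoop MapReduce tutorial. The class names and paths are illustrative; for the sample input above it would emit Bear 2, Car 3, Deer 2, River 2.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g. ("Deer", 1)
      }
    }
  }

  // Reduce phase: after shuffle & sort, sum the 1s for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // e.g. ("Car", 3)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job could be submitted with something like hadoop jar wordcount.jar WordCount /input /output, where the input and output paths are HDFS directories.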
Hadoop Technology Stack:
Hadoop is complemented by a collection of frameworks; the following are the popular members of the Hadoop stack.
  • For Data Access
    • PIG: High-level data-flow scripting language and execution framework.
    • HIVE: Data-warehouse infrastructure that allows SQL-like queries (see the JDBC sketch after this list).
  • For Data Storage
    • HBASE: Bigtable-like structured storage system, supporting millions of columns and billions of rows.
    • Cassandra: Scalable multi-master NoSQL database with no single point of failure.
  • For Interaction, Visualization, Execution & Development
    • HCatalog: Table and metadata management layer.
    • Lucene: Indexing and search library with wildcard support.
    • Crunch: Library for composing MapReduce pipelines, including shuffle and sort.
  • For Data Serialization
    • Avro: Data serialization system
    • Thrift: Language-neutral framework for cross-language services and data serialization.
  • For Data Intelligence
    • Mahout: Machine Learning & Data Mining tool mainly used for Business Intelligence
  • For Data Integration
    • Sqoop: Import and export data between RDBMS and Hadoop
    • Flume: Log-data collection system.
    • Chukwa: Data collection system
  • For Management & Monitoring 
    • Ambari: Web-based tool for monitoring and managing Hadoop clusters.
    • Zookeeper: High-performance coordination service.
    • Oozie: Scheduling and workflow tool.
A typical Hadoop ecosystem may contain HDFS (Hadoop Distributed File System), HIVE, PIG, HBase, Zookeeper, Sqoop, Flume, and Oozie, but the exact mix depends entirely on individual requirements and experience.
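To illustrate the SQL-like access that HIVE provides, here is a minimal sketch using the standard Hive JDBC driver. The HiveServer2 endpoint, credentials, and the words table are hypothetical placeholders, and the hive-jdbc dependency is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Registers the Hive JDBC driver (auto-registered in recent versions).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical HiveServer2 endpoint; replace host/port/database as needed.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement();
         // Hypothetical table "words" with a single string column "word".
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
      while (rs.next()) {
        // Hive compiles the query into distributed jobs behind the scenes.
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```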
