This post covers the basic concepts of Big Data and the Apache Hadoop framework. Understanding Big Data is essential in today's IT industry, since Big Data technologies are becoming increasingly popular and powerful.
Big Data:
Big Data refers to storing and processing very large data sets. The data can be:
- Structured data (e.g., RDBMS tables)
- Semi-structured data (e.g., XML files)
- Unstructured data (e.g., flat files, free text)
Apache Hadoop:
Apache Hadoop is an open-source Big Data framework for distributed storage and distributed processing of very large data sets on clusters of computers. Apache Hadoop follows a master-slave, shared-nothing architecture.
The Apache Hadoop core contains:
- HDFS (Hadoop Distributed File System) for distributed storage
- MapReduce (MR) for distributed processing
- YARN for distributed scheduling and resource management
Apache Hadoop was developed at Yahoo, based on Google's white papers on MapReduce and the Google File System.
The design principles of Apache Hadoop are:
- Solve the computation problem on large data sets by using large numbers of commodity machines rather than specialized hardware (e.g., Yahoo runs Hadoop on more than 45,000 machines).
- Automatic parallelization and distribution.
- Automatic recovery and fault tolerance.
- A clean and simple programming model through MapReduce.
Some examples of Big Data scale:
- Yahoo uses a 45,000-node Hadoop cluster.
- Facebook has one of the largest Hadoop clusters on the planet, at over 100 PB.
- Twitter generates 400 million tweets a day.
| Apache Hadoop | Traditional RDBMS |
| --- | --- |
| Schema on read | Schema on write |
| For structured, semi-structured & unstructured data | Only for structured data |
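The "schema on read" idea in the table above can be sketched in a few lines of Python. This is a hypothetical illustration, not Hadoop API code: the file stores raw delimited text, and a schema (invented here as a list of `(name, type)` pairs) is applied only when the data is read, so different jobs can interpret the same bytes differently.

```python
# Hypothetical sketch of schema on read: store raw text, apply structure at read time.
raw_line = "101,alice,2500"

def read_with_schema(line, schema):
    """Apply a (name, type) schema to a raw comma-delimited line at read time."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

# A schema-on-write system (RDBMS) would have enforced this structure at insert time;
# here it is chosen by the reader, per query.
schema = [("id", int), ("name", str), ("salary", int)]
record = read_with_schema(raw_line, schema)
print(record)  # {'id': 101, 'name': 'alice', 'salary': 2500}
```

The same raw line could be read with a different schema (for example, treating every field as a string) without rewriting the stored data, which is the practical benefit of schema on read.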
Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is the distributed storage area where data is stored for Hadoop processing. HDFS is a master-slave architecture, where one master device controls one or more slave devices.
- The master contains the Name-Node, Secondary-Name-Node and Job-Tracker.
- The slaves contain the Data-Node and Task-Tracker.
Name-Node
- Controls all Data-Nodes
- Coordinates file system operations such as file creation and deletion
- Maintains the file system meta-information
- Maintains a memory map of the entire cluster
- Manages block mapping, which knows where every block of data lives
- Monitors the health of the Data-Nodes
- The most important node, since it is a Single Point Of Failure (SPOF): if the Name-Node goes down, the cluster goes down
Secondary-Name-Node
- Takes snapshots (backups) of the Name-Node metadata for restoring
- It is not a failover (standby) for the Name-Node
- Provides a metadata backup for rebuilding a failed Name-Node
- Does not provide high availability
Data-Node
- Stores the actual data; each block is replicated to 3 nodes by default
- Responsible for block operations
- Serves client read/write requests for blocks, coordinating with the Name-Node
- Sends heartbeats with a block report to the Name-Node (by default every 3 seconds)
Job-Tracker
- The Job-Tracker is the controller for all Task-Trackers
- Master-slave flow:
- The Job-Client submits a job to the Job-Tracker
- The Job-Tracker talks to the Name-Node, creates an execution plan and submits work to the Task-Trackers
- Each Task-Tracker reports progress via heartbeats, manages its task phases and updates its status
- The default HDFS block size is 64 MB
- HDFS properties:
- Large data
- N-times replication
- Failure is the norm rather than the exception
- Fault tolerance
- HDFS Features
- Rack awareness
- Reliable Storage
- High Throughput
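The defaults mentioned above (a 64 MB block size and 3-way replication) imply some simple storage arithmetic. As a rough sketch, a file is split into ceiling(size / block size) blocks, and every block is stored on three Data-Nodes, so the raw storage is about three times the logical file size (the final partial block only occupies its actual length on disk):

```python
import math

BLOCK_SIZE_MB = 64   # default HDFS block size, as stated above
REPLICATION = 3      # default replication factor, as stated above

def hdfs_footprint(file_size_mb):
    """Return (block count, total raw storage in MB) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is replicated, so raw storage is replication x logical size.
    return blocks, file_size_mb * REPLICATION

print(hdfs_footprint(200))  # (4, 600): 3 full blocks + 1 partial, 600 MB raw storage
```

So a 200 MB file becomes 4 blocks (3 full 64 MB blocks plus one 8 MB block) and consumes roughly 600 MB of cluster storage with default replication.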
Map-Reduce:
- MapReduce is a programming paradigm.
- It is an execution engine that uses Mappers and Reducers.
- In practice, these are user-defined pieces of code for processing large data sets.
- How MapReduce works:
- Consider input data containing "Deer Bear River Car Car River Deer Car Bear".
- Split phase (hidden phase): splits the input into a number of input splits.
- Map phase: transforms each input split into key-value pairs according to user-defined mapper code.
- Shuffle & sort (hidden phase): moves the map output to the reducers and sorts it by key.
- Reduce phase: aggregates the values for each key according to user-defined reducer code.
- Final result: the aggregated key-value pairs are written out as the final result.
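The phases above can be sketched as a small in-memory word count in Python. This is an illustrative simulation of the MapReduce flow, not Hadoop API code: `mapper`, `reducer` and `map_reduce` are invented names, and the shuffle & sort phase is simulated with an ordinary sort plus `groupby`.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Reduce phase: aggregate all values seen for one key."""
    return (word, sum(counts))

def map_reduce(splits):
    # Map phase over every input split.
    mapped = [pair for split in splits for pair in mapper(split)]
    # Shuffle & sort: order the map output by key so equal keys are adjacent.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(word, [count for _, count in pairs])
            for word, pairs in groupby(mapped, key=itemgetter(0))]

# The input from the walkthrough above, divided into three splits.
result = map_reduce(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print(result)  # [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

On a real cluster the same mapper and reducer logic runs in parallel across machines, with the framework handling the split, shuffle and sort phases automatically.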
Hadoop Technology Stack:
Hadoop is a collection of frameworks; the following are the popular frameworks in the Hadoop stack.
- For Data Access
- PIG: High-level data-flow scripting language and execution framework
- HIVE: Data-warehouse infrastructure that allows SQL-like queries
- For Data Storage
- HBASE: Bigtable-like structured storage system, millions of columns and billions of rows
- Cassandra: Scalable multi-master NoSQL database with no single point of failure
- For Interaction, Visualization, Execution & Development
- HCatalog: Table and metadata management
- Lucene: Indexing and search library with wildcard support
- Crunch: Used for MapReduce pipelining with shuffle and sort
- For Data Serialization
- Avro: Data serialization system
- Thrift: Language-neutral serialization and RPC framework
- For Data Intelligence
- Mahout: Machine Learning & Data Mining tool mainly used for Business Intelligence
- For Data Integration
- Sqoop: Import and export data between RDBMS and Hadoop
- Flume: Log data collection system
- Chukwa: Data collection system
- For Management & Monitoring
- Ambari: Web based tool for monitoring and managing
- Zookeeper: High-performance coordination tool
- Oozie: Schedule and workflow tool
A typical Hadoop ecosystem may contain HDFS (Hadoop Distributed File System), HIVE, PIG, HBase, Zookeeper, Sqoop, Flume and Oozie, but the exact mix depends entirely on individual requirements and experience.