This post covers the basic concepts of Big Data and the Apache Hadoop framework. Understanding Big Data is essential in today's IT industry, since Big Data technologies are becoming more popular and powerful every day.

Big Data:
Big Data refers to storing and processing very large data sets. Data can be
  • Structured data (e.g. RDBMS tables)
  • Semi-structured data (e.g. XML files)
  • Unstructured data (e.g. flat files)
Big Data overcomes the challenges that traditional databases face with Volume, Velocity, Variety and complex data sets. Roughly 95% of today's data is said to have been created in just the last three years, so the state of data has changed dramatically. Companies such as Amazon, Google, Facebook, Twitter and Yahoo have realized the power of Big Data and are investing heavily in R&D around it.

Apache Hadoop:

Apache Hadoop is an open-source Big Data framework for distributed storage and distributed processing of very large data sets on clusters of computers. Apache Hadoop follows a master-slave, shared-nothing architecture.

The Apache Hadoop core contains
  • HDFS (Hadoop Distributed File System) for distributed storage
  • MapReduce (MR) for distributed processing
  • YARN (Yet Another Resource Negotiator) for distributed resource management and job scheduling

Apache Hadoop was originally developed at Yahoo, based on Google's MapReduce and Google File System white papers.

The design principles of Apache Hadoop are
  • To solve the computation problem of large data sets, Hadoop uses large numbers of commodity machines instead of specialized hardware. For example:
    • Yahoo runs Hadoop clusters of around 45,000 nodes
    • Facebook runs one of the largest Hadoop clusters on the planet, holding over 100 PB of data
    • Twitter generates around 400 million tweets a day that need to be processed
  • Automatic parallelization and distribution of work
  • Automatic recovery and fault tolerance
  • A clean and simple programming model via MapReduce
Comparison of Apache Hadoop and a Traditional RDBMS

  • Apache Hadoop: schema on read; handles structured, semi-structured and unstructured data
  • Traditional RDBMS: schema on write; handles only structured data

Hadoop Distributed File System (HDFS):
Hadoop Distributed File System (HDFS) is the distributed storage layer where data is stored for Hadoop processing. HDFS follows a master-slave architecture in which one master node controls one or more slave nodes.
  • The master runs the Name-Node, Secondary-Name-Node and Job-Tracker daemons.
  • Each slave runs a Data-Node and a Task-Tracker daemon.
Name-Node
  • Controls all Data-Nodes
  • Coordinates file system operations such as file creation and deletion
  • Maintains the file system meta-information (namespace)
  • Maintains an in-memory map of the entire cluster
  • Manages the block mapping, i.e. which blocks of which files live on which Data-Nodes
  • Monitors the health of the Data-Nodes
  • The most critical node: it is a Single Point Of Failure (SPOF), i.e. if the Name-Node goes down, the cluster goes down
Secondary Name-Node
  • Periodically snapshots (checkpoints) the Name-Node metadata so it can be restored
  • It is not a failover node for the Name-Node
  • Its metadata backup is used when the Name-Node has to be rebuilt
  • Does not provide high availability
Data-Node
  • Stores the actual data blocks, which are replicated to 3 nodes by default
  • Responsible for block operations (read, write, replication)
  • Serves client read/write requests directly; the Name-Node only supplies the metadata
  • Sends heartbeats to the Name-Node (every 3 seconds by default) along with block reports
  • Job-Tracker: the Job-Tracker is the controller for all Task-Trackers
    • Master-slave flow
      • The Job-Client submits a job to the Job-Tracker
      • The Job-Tracker talks to the Name-Node, creates an execution plan and submits work to the Task-Trackers
      • Each Task-Tracker reports progress via heartbeats, manages its task phases and updates its state
  • The default HDFS block size is 64 MB
  • HDFS Properties
    • Designed for large data sets
    • N-times replication of each block
    • Failure is treated as normal rather than as an exception
    • Fault tolerance
  • HDFS Features
    • Rack awareness
    • Reliable Storage
    • High Throughput
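
To make the HDFS pieces above concrete, here is a minimal sketch of a client talking to HDFS through the standard org.apache.hadoop.fs.FileSystem Java API. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for illustration only, not part of any particular cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; use your cluster's fs.defaultFS in practice.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file. The client streams the bytes to Data-Nodes;
        // the Name-Node only records the metadata and the block mapping.
        Path file = new Path("/demo/words.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Deer Bear River Car Car River Deer Car Bear\n");
        }

        // Ask the Name-Node for the block mapping of the file.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}

Note that the file contents go to the Data-Nodes directly; the Name-Node is involved only for the metadata and the block locations printed at the end.
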
Map-Reduce:
  • MapReduce is a programming paradigm.
  • It is the execution engine, which uses Mappers and Reducers.
  • In practice these are the pieces of code that process the large data sets.
  • How MapReduce works (word-count example):
    • Consider input data containing "Deer Bear River Car Car River Deer Car Bear".
    • Split phase (hidden phase): splits the input into a number of input splits.
    • Map phase: transforms each input split into key-value pairs ("maps") according to user-defined mapper code, e.g. ("Deer", 1), ("Bear", 1), ...
    • Shuffle & sort (hidden phase): moves the map output to the reducers and sorts it by key.
    • Reduce phase: aggregates the values of each key according to user-defined reducer code.
    • Final result: the reducer outputs are written out as the final result, here the count of each word (see the sketch after this list).
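
The walkthrough above is essentially the classic word-count job. Below is a minimal sketch written against the standard org.apache.hadoop.mapreduce API (much like the stock Hadoop tutorial example); the class names are illustrative and the input/output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // e.g. ("Deer", 1), ("Bear", 1), ...
            }
        }
    }

    // Reduce phase: after shuffle & sort, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // e.g. ("Car", 3)
        }
    }

    // Driver: the Job-Client submits this job to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the sample input "Deer Bear River Car Car River Deer Car Bear", this job would produce Bear 2, Car 3, Deer 2 and River 2.
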
Hadoop Technology Stack:
Hadoop is a collection of frameworks; the following are the popular frameworks in the Hadoop stack.
  • For Data Access
    • PIG: High-level data-flow scripting language and execution framework
    • HIVE: Data-warehouse infrastructure that allows SQL-like queries (a small JDBC sketch follows this list)
  • For Data Storage
    • HBASE: Bigtable-like structured storage system, scaling to millions of columns and billions of rows
    • Cassandra: Scalable multi-master NoSQL database with no single point of failure
  • For Interaction, Visualization, Execution & Development
    • HCatalog: Table and metadata management layer
    • Lucene: Text indexing and search library with wildcard query support
    • Crunch: Java library for writing, testing and running MapReduce pipelines
  • For Data Serialization
    • Avro: Data serialization system
    • Thrift: Language-neutral serialization and RPC framework
  • For Data Intelligence
    • Mahout: Machine Learning & Data Mining tool mainly used for Business Intelligence
  • For Data Integration
    • Sqoop: Import and export data between RDBMS and Hadoop
    • Flume: Log data collection system
    • Chukwa: Data collection system
  • For Management & Monitoring 
    • Ambari: Web-based tool for managing and monitoring Hadoop clusters
    • Zookeeper: High-performance coordination service
    • Oozie: Workflow scheduling tool
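
As a small example of the SQL-like access Hive offers (see the HIVE item above), here is a minimal sketch that queries Hive through its JDBC driver. The HiveServer2 URL, user name and the words table are assumptions for illustration; a real deployment would use its own endpoint and schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 endpoint; adjust host, port, database and credentials.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = con.createStatement()) {
            // HiveQL looks like SQL but is compiled into MapReduce jobs under the hood.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
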
The typical Hadoop ecosystem may contain HDFS (Hadoop Distributed File System), Hive, Pig, HBase, Zookeeper, Sqoop, Flume and Oozie, but the exact combination depends entirely on individual requirements and experience.