Hadoop in Layman's Terms
Let's say you have a file that contains the names of all the people who live in your apartment complex. You want to see how many people have the same name as you, so you write a program that reads the file and outputs that count. Now, let's say you want to know how many people in your city have the same name as you. The data is too big to fit on a single computer, so you get some 20 computers and connect them to form a cluster, and you install Hadoop on the cluster. You start writing the names of the people of your city into a file in the Hadoop Distributed File System (HDFS), which breaks your data into small chunks and spreads them across your 20 computers. You keep appending to the file until you have written all the names in your city. Then you submit the program you previously wrote for a single computer to Hadoop. Hadoop takes your program, asks your 20 computers to run it in parallel on their own chunks of the data, and then aggregates the results from the 20 computers to give you the answer.
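For reference, the single-machine version of that name-counting program might look like this minimal sketch (the file name names.txt and the target name are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Counts how many lines of names.txt match a given name, on a single machine.
public class NameCount {
    public static void main(String[] args) throws IOException {
        String target = "Alice"; // the name you are looking for (assumed for illustration)
        try (Stream<String> lines = Files.lines(Paths.get("names.txt"))) {
            long count = lines.filter(line -> line.trim().equals(target)).count();
            System.out.println(count + " people share the name " + target);
        }
    }
}
```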
Some terms:
node: a small computer capable of storing data and doing processing on that data
cluster: a collection of many nodes
Hadoop provides two core functionalities:
0. Distributed, fault-tolerant data storage (HDFS)
1. Batch processing of stored data (MapReduce)
# HDFS
HDFS is a Unix-like distributed file system. It splits large files into blocks and stores them across different nodes. By default, HDFS stores each block three times on three separate nodes so that the data survives a node failure.
## Services
0. data nodes (many): this service runs on the nodes that store data. Data nodes send heartbeats and block reports to the master name node. Clients connect to data nodes to read/write data.
1. master name node: stores metadata about which data block is stored on which data node, and guides clients to write/read data to/from the appropriate data nodes.
2. checkpoint node: creates checkpoints of the name node's metadata. This is not a hot backup for the master name node.
## How HDFS works
Write
0. client connects to the name node and asks which data nodes to write data to
1. name node gives data node addresses to the client
2. client connects to the data nodes and writes the data
3. data nodes replicate the data to other data nodes, guided by the name node
Read
0. client connects to the name node and asks which data nodes to read from
1. name node gives data node addresses
2. client connects to the data nodes and reads the data
3. in case of a data node failure, the client reads the data block from another replica, guided by the name node (a client-side sketch of these steps follows)
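Below is a minimal client-side sketch of the write and read flow using the HDFS Java FileSystem API. The path /user/demo/names.txt and its contents are assumptions; the FileSystem client performs the name node/data node conversation described above on your behalf.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client entry point; talks to the name node
        Path path = new Path("/user/demo/names.txt");  // assumed path for illustration

        // Write: the client asks the name node for data nodes, then streams blocks to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Alice\nBob\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the name node returns block locations; the client reads from the data nodes.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);
        }
        fs.close();
    }
}
```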
# MapReduce
Clients want to process data stored in HDFS. A client submits a "Job" to MapReduce, which runs it across different nodes in the cluster.
## Services
0. Job tracker: master service that monitors jobs. A job is run as many tasks distributed across several nodes. The job tracker retries failed task attempts and schedules incoming jobs from different clients.
1. Task tracker: runs on the same physical machine as a data node. The tasks executed across the cluster together accomplish the job that a client submits; a task has one or more attempts. The task tracker sends heartbeats and task status to the job tracker, and each task runs in its own JVM on the data node.
## How MapReduce works
0. Client submits a job to the job tracker (a minimal driver sketch follows these steps)
1. Job tracker assigns tasks to task trackers that are close to the data blocks
2. Task trackers execute the tasks and write the results to HDFS with replication
3. If a task tracker fails, the job tracker assigns its tasks to another task tracker
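As a concrete sketch of this flow from the client's side, a minimal word-count driver using the org.apache.hadoop.mapreduce API might look like this; WordCountMapper and WordCountReducer are hypothetical classes sketched in the map and reduce phase sections below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // optional local reduction
        job.setReducerClass(WordCountReducer.class);   // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```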
# YARN
An abstract framework for distributed processing; MapReduce is one concrete YARN application. YARN splits the duties of the job tracker into an application master and a resource manager. The task tracker's role is played by a node manager, which hosts "containers" in which a map or reduce task can be executed. The number of containers on a node is configurable.
## Map Reduce
Data processing is done in two phases.
0. Map phase: apply the map function to input key/value pairs to generate intermediate key/value pairs, then group the intermediate pairs by key. Each group contains one key and one or more values.
Components used during the map phase (a minimal Mapper sketch follows this list):
a. InputFormat: describes how the input is split and read (e.g., TextInputFormat reads a file line by line).
b. RecordReader: created by the InputFormat; turns each record (e.g., a line) into an input key/value pair.
c. Mapper: contains the map function that is applied to input key/value pairs and produces intermediate key/value pairs.
d. Combiner: performs a local reduction on intermediate key/value pairs before they leave the mapper's node.
e. Partitioner: decides which partition (and hence which reducer) each intermediate key/value pair goes to.
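A minimal sketch of such a Mapper for word counting (assuming the org.apache.hadoop.mapreduce API and the hypothetical class name WordCountMapper used in the driver above):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself (from TextInputFormat).
// Intermediate output: (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit an intermediate key/value pair
        }
    }
}
```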
1. Reduce phase: apply the reduce function to the grouped intermediate key/value pairs.
Components of the reduce phase (a minimal Reducer sketch follows this list):
a. Shuffle: fetches the intermediate data belonging to this reducer's partition from the map outputs.
b. Sort: sorts the data of a single partition by key.
c. Reducer: given an intermediate key and the set of values for that key, performs the reduce operation and produces output key/value pairs.
d. RecordWriter: writes one output key/value pair at a time.
e. OutputFormat: creates the RecordWriter and defines where and how the output records are written.
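A matching minimal Reducer sketch (the hypothetical WordCountReducer from the driver above; it also serves as the Combiner there, since summing counts is associative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one word and all its counts, sums them, and writes (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total); // output key/value pair, written via the RecordWriter
    }
}
```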
Component used by both phases:
Writable interface: specifies how a value is serialized to and deserialized from files and the network. Integer data, for example, is written and read as IntWritable. A minimal custom Writable sketch follows.
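To show the contract, here is a minimal sketch of a hypothetical custom Writable (a PersonCount type, not part of the original post):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type holding a name and a count.
public class PersonCount implements Writable {
    private String name;
    private int count;

    public PersonCount() {}  // no-arg constructor required by the framework
    public PersonCount(String name, int count) { this.name = name; this.count = count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);   // serialize fields in a fixed order
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();  // deserialize in the same order
        count = in.readInt();
    }
}
```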
Misc Notes
0. A MapReduce job can contain only one mapper and only one reducer, so a job such as a word counter can be built directly; to create a pipeline of MR jobs, a framework such as Crunch can be used.
1. You can append data to an HDFS file, but there is no way to modify the existing content of a file stored in HDFS.
2. Hadoop Streaming: run shell, Python, etc. scripts as map and reduce tasks. Example:
hadoop jar hadoop-streaming.jar -input input -output outputdir \
    -mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc
References:
0. Hadoop, just the basics (slides)
1. Hadoop, just the basics (YouTube video)