MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
The whole process of Map reduce goes through four phases of execution:
- Splitting
- Mapping
- Shuffling
- Reducing
For example, take following data as input for the Map Reduce.
Easy exam notes
Notes are good
Notes are bad
The final output of the MapReduce task is:
Words | Counts |
---|---|
Easy | 1 |
Exam | 1 |
Notes | 3 |
Are | 2 |
Good | 1 |
Bad | 1 |
The process of Map Reduce is:
- Splitting: Input divided into fixed-size pieces called input splits.
- Mapping: In this phase data in each split is passed to a mapping function to produce output values.
- Shuffling: The same words are clubed together along with their respective frequency.
- Reducing: This phase summarizes the complete dataset.