MapReduce
MapReduce is a programming model for processing large data sets across many machines. The runtime takes care of:
- partitioning the input data
- scheduling the program’s execution across a set of machines
- handling machine failures
- managing the required inter-machine communication
model
- map: input key/value pairs $\to$ intermediate key/value pairs
- reduce: merges all intermediate values that share the same key (word-count sketch below)
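
A minimal word-count sketch of the two functions, with a sequential driver standing in for the distributed runtime (Python for illustration; the paper's interface is C++):

```python
from collections import defaultdict

def map_fn(name, contents):
    # map: (input key, value) -> intermediate (key, value) pairs
    for word in contents.split():
        yield word, 1

def reduce_fn(key, values):
    # reduce: combine all values that share a key
    return sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # sequential driver standing in for the distributed runtime
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

print(map_reduce([("doc1", "the quick the")], map_fn, reduce_fn))
# -> {'quick': 1, 'the': 2}
```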
implementation
- a reduce worker sorts the intermediate pairs by key, so all occurrences of the same key are grouped together (sketched below)
- the master keeps the state of every task (idle, in-progress, completed) and the identity of the worker running it
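
A sketch of the reduce-side sort-then-group step, with my own naming:

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(pairs, reduce_fn):
    # sort intermediate pairs by key, then group, so each distinct key
    # becomes exactly one call to the user's reduce function
    pairs.sort(key=itemgetter(0))
    return [(k, reduce_fn(k, [v for _, v in grp]))
            for k, grp in groupby(pairs, key=itemgetter(0))]

print(reduce_phase([("b", 1), ("a", 1), ("b", 2)], lambda k, vs: sum(vs)))
# -> [('a', 1), ('b', 3)]
```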
Fault Tolerance
worker failure
- the master pings every worker periodically; a worker that stops responding is marked as failed
- completed map tasks on a failed worker are re-executed, because their output lives on that machine's local disk; completed reduce tasks are not, because their output is already in the global file system (see the sketch below)
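
A sketch of the master's bookkeeping on a worker failure, with hypothetical Task/state names:

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str            # "map" or "reduce"
    state: str = "idle"  # idle / in-progress / completed
    worker: str | None = None

def handle_worker_failure(failed: str, tasks: list[Task]) -> None:
    for t in tasks:
        if t.worker != failed:
            continue
        # in-progress tasks of either kind go back to idle, and so do
        # *completed* map tasks: their output sat on the failed worker's
        # local disk. Completed reduce tasks keep their state, since
        # their output is already in the global file system.
        if t.state == "in-progress" or (t.state == "completed" and t.kind == "map"):
            t.state, t.worker = "idle", None
```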
master failure
the master writes periodic checkpoints of its state; if it dies, a new master can start from the last checkpoint (the paper notes their implementation actually just aborts the job, since a single master is unlikely to fail)
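
A minimal checkpoint/recover sketch, assuming the task table is JSON-serializable:

```python
import json, os, tempfile

def checkpoint(task_table: dict, path: str = "master.ckpt") -> None:
    # write-then-rename so a crash never leaves a half-written checkpoint
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(task_table, f)
    os.replace(tmp, path)

def recover(path: str = "master.ckpt") -> dict:
    # a new master resumes from the last checkpointed task table
    with open(path) as f:
        return json.load(f)
```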
semantics in the presence of failures
- a map task produces $R$ result files (one per reduce partition); a reduce task produces one output file
- deterministic map/reduce operators give output equivalent to a sequential execution; non-deterministic operators give weaker semantics, since different reduce tasks may see output from different re-executions of the same map task
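
The $R$ files per map task come from routing each intermediate key to one of $R$ partitions, by default with hash(key) mod $R$. A sketch with a stable hash:

```python
import zlib

def partition(key: str, R: int) -> int:
    # every occurrence of a key lands in the same reduce partition;
    # a stable hash (CRC32 here) matters across processes, where
    # Python's builtin hash() is salted per run
    return zlib.crc32(key.encode()) % R
```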
locality
- the master tries to schedule a map task on (or near) a machine that holds a replica of the task's input block, so most input is read from local disk instead of over the network
task granularity
- the number of map tasks $M$ and reduce tasks $R$ should both be much larger than the number of workers, for dynamic load balancing and fast recovery from failures
- choose $M$ so that each map task gets roughly 16 MB to 64 MB of input, and $R$ as a small multiple of the worker count (the paper's example: $M = 200{,}000$, $R = 5{,}000$ on 2,000 workers)
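
A hypothetical helper (mine, not the paper's) that turns these rules of thumb into numbers:

```python
def choose_granularity(input_bytes: int, workers: int,
                       split_mb: int = 64, reduce_multiple: int = 5):
    # M: one map task per ~split_mb of input; R: a small multiple of workers
    M = max(workers, input_bytes // (split_mb * 2**20))
    R = reduce_multiple * workers
    return M, R

print(choose_granularity(2**40, 2000))  # 1 TB on 2,000 workers -> (16384, 10000)
```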
Backup tasks
- near the end of the job, the master schedules backup executions of the remaining in-progress tasks; a task is marked completed when either the primary or the backup finishes, which blunts the effect of stragglers
refinement
combiner
- partially merges intermediate data before it is written out and shipped to reduce workers
- executed on the same worker that ran the map task; typically the same code as the reduce function, but its output goes to an intermediate file rather than the final output
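
A word-count combiner sketch: collapsing repeated keys on the map worker cuts network traffic:

```python
from collections import Counter

def combine(pairs):
    # partial merging on the map worker: collapse repeated keys locally
    # (e.g. many ("the", 1) pairs) before anything crosses the network
    merged = Counter()
    for key, value in pairs:
        merged[key] += value
    return list(merged.items())

print(combine([("the", 1), ("quick", 1), ("the", 1)]))
# -> [('the', 2), ('quick', 1)]
```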