Batch Processing from the Ground Up: Part III

Sajeed Syed Bakht
4 min read · Oct 26, 2020

In the previous article, we dived into MapReduce. MapReduce took the data processing world by storm by providing fast, distributed data processing jobs that rely on a distributed filesystem for their input and output. In this article, we will look at some weaknesses of MapReduce and how the ecosystem has moved beyond it.

Materialization of Intermediate State

Every MapReduce job is independent of every other job. A MapReduce job reads its input from a directory in HDFS and writes its output to another directory in HDFS. This can be useful if you want the output of a job to be published widely within an organization. For example, the output of a fraud analytics job might be consumed as input by many other teams’ batch processing jobs.

However, sometimes the output of one job is only used as the input to another job that is maintained by the same team. In this case, the files in the directory are simply intermediate state: a means of passing data from one job to the next. This process of writing intermediate state out to files is called materialization.
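To make this concrete, here is a toy Python sketch (not real Hadoop code) of two chained jobs that communicate only through files on disk, the way MapReduce jobs communicate through HDFS directories. All paths, field names, and the “fraud” rule are made up.

```python
import json
from pathlib import Path

# Toy sketch: two "jobs" that only communicate through files on disk,
# the way chained MapReduce jobs communicate through HDFS directories.

def fraud_analytics_job(input_dir: str, output_dir: str) -> None:
    """First job: read raw events, write flagged transactions to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with (out / "part-00000").open("w") as sink:
        for part in sorted(Path(input_dir).glob("part-*")):
            with part.open() as src:
                for line in src:
                    event = json.loads(line)
                    if event.get("amount", 0) > 10_000:   # toy "fraud" rule
                        sink.write(json.dumps(event) + "\n")

def reporting_job(input_dir: str, output_dir: str) -> None:
    """Second job: consume the *materialized* output of the first job."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    flagged = sum(1 for part in Path(input_dir).glob("part-*")
                    for _ in part.open())
    (out / "part-00000").write_text(f"flagged_transactions\t{flagged}\n")

# The only link between the two jobs is the name of the intermediate directory.
fraud_analytics_job("raw_events/", "fraud_output/")
reporting_job("fraud_output/", "daily_report/")
```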

We looked previously at how Unix pipes differ from MapReduce. Pipes don’t fully materialize intermediate state; instead, they stream the output of one process into the next incrementally, via a small in-memory buffer. Let’s look at some downsides of MapReduce’s materialization compared to Unix pipes (a small sketch of the streaming style follows the list).

  1. A MapReduce job can only start when all tasks in the preceding jobs have completed, whereas a process in a Unix pipeline starts working as soon as data streams into the pipe. Having to wait for every preceding task slows down the execution of the workflow as a whole.
  2. Mappers are often redundant: they just read back the file written by a reducer and prepare it for the next round of partitioning and sorting. If the reducer output were partitioned and sorted in the same way as mapper output, reducers could be chained together directly without interposing mapper stages.
  3. Storing intermediate state in a distributed filesystem means the data is replicated across several nodes in the network, which is often overkill for such temporary data.
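To make the streaming style concrete, here is a minimal Python sketch that uses generators as a stand-in for Unix pipes. The log file name and field layout are made up; the point is only that each stage pulls one record at a time from the previous stage, so nothing is materialized between stages.

```python
from collections import Counter

# Generator pipeline: a rough stand-in for something like
# `cat access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -5`.
# Each stage pulls records one at a time from the stage before it, so
# downstream work starts as soon as the first record is available.

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def extract_urls(lines):
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield fields[6]   # hypothetical: the request-path field of a log line

def top_urls(urls, n=5):
    return Counter(urls).most_common(n)

print(top_urls(extract_urls(read_lines("access.log"))))
```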

Dataflow engines

Execution engines such as Spark and Flink were created to overcome these problems with MapReduce. The main thing Spark and Flink have in common is that they handle an entire workflow as one job, rather than breaking it up into independent sub-jobs.

Because these engines explicitly model the flow of data through several processing stages, they are known as dataflow engines. Let’s look at how MapReduce is similar to dataflow engines and how they differ.

Commonalities between MapReduce and dataflow engines:

  • Work by calling a user-defined function to process one record at a time on a single thread
  • Parallelize work by partitioning inputs
  • Copy the output of one function over the network to become the input to another function

Differences between MapReduce and dataflow engines:

  • Dataflow engines do not have a strict notion of alternating between mappers and reducers. Instead, the processing functions are simply called operators, and they can be chained together in more flexible ways (see the sketch below).
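As an illustration, here is a minimal PySpark sketch of a single job that chains several operators with no forced map/reduce alternation. It assumes a working Spark installation; the HDFS paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()
sc = spark.sparkContext

# One job, several chained operators: there is no rule that every "map" must be
# followed by a "reduce" -- the engine plans the whole dataflow at once.
top_paths = (sc.textFile("hdfs:///logs/access.log")      # hypothetical input path
               .flatMap(lambda line: line.split())       # operator 1
               .filter(lambda w: w.startswith("/"))      # operator 2
               .map(lambda w: (w, 1))                    # operator 3
               .reduceByKey(lambda a, b: a + b)          # operator 4 (shuffle)
               .map(lambda kv: (kv[1], kv[0]))           # operator 5: another map
               .sortByKey(ascending=False))              # operator 6: sort only here

top_paths.saveAsTextFile("hdfs:///logs/top_paths")       # hypothetical output path
```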

Now, how do we connect one operator to another? That is, how does the output of one operator become the input of another? There are a few options:

  • Repartition and sort records by key
  • Take the output and repartition it in the same way, but skip the sorting. Sorting is not always needed, for example in partitioned hash joins.
  • For broadcast hash joins, the same output of one operator can be sent to all partitions of the join operator (see the sketch below).
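For instance, Spark’s DataFrame API lets you hint that one side of a join is small enough to be broadcast to every partition. The following sketch uses hypothetical tables and paths; broadcast() is the hint in pyspark.sql.functions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
transactions = spark.read.parquet("hdfs:///warehouse/transactions")
countries = spark.read.parquet("hdfs:///warehouse/countries")   # small table

# The broadcast() hint tells the planner to send the small table to every
# partition of the join operator instead of repartitioning and sorting both sides.
joined = transactions.join(broadcast(countries), on="country_code", how="left")

joined.write.parquet("hdfs:///warehouse/transactions_enriched")
```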

Now, what are the advantages of this style of processing compared to MapReduce?

  • Expensive work such as sorting need only be performed where it is actually required, whereas MapReduce always sorts between every map and reduce stage.
  • No unnecessary map tasks, since the work of a mapper can often be incorporated into the preceding reducer task.
  • Since the workflow is explicitly defined, the scheduler can optimize where data should go. For example, it can place a task that consumes some data on the same machine as the task that produces it, so the data does not have to go over the network.
  • Intermediate state can be kept in memory or written to local disk instead of being written to HDFS (see the sketch after this list).
  • Operators can start executing as soon as input streams in, instead of waiting for the previous task to complete.
  • Existing Java Virtual Machine (JVM) processes can be reused to run new operators, reducing startup overhead compared to MapReduce, which launches a new JVM for each task.
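As an example of keeping intermediate state off HDFS, Spark lets you pin a dataset in memory (spilling to local disk if needed) with persist(), so that several downstream operators can reuse it. A sketch with hypothetical paths and a made-up record layout:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()
sc = spark.sparkContext

events = sc.textFile("hdfs:///logs/events")             # hypothetical input path
parsed = events.map(lambda line: line.split("\t"))      # made-up tab-separated records

# Keep the intermediate state in memory, spilling to local disk if needed,
# instead of materializing it in the distributed filesystem.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

# Two downstream operators reuse the same cached intermediate state.
error_count = parsed.filter(lambda fields: fields[0] == "ERROR").count()
user_count = parsed.map(lambda fields: fields[1]).distinct().count()
```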

Let’s look at how these different approaches to materializing intermediate state affect fault tolerance.

Fault Tolerance

MapReduce writes its intermediate state to a distributed filesystem, which makes it durable. Dataflow engines, however, keep intermediate state in memory or on local disk, so it is lost when a machine goes down. Dataflow engines tolerate this by recomputing the lost intermediate values, which requires tracking how the data was originally computed. Spark uses the resilient distributed dataset (RDD) abstraction to track this ancestry (lineage) of the data.
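You can actually inspect the lineage Spark records for an RDD. A small sketch, again with a hypothetical input path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext

word_counts = (sc.textFile("hdfs:///logs/access.log")   # hypothetical input path
                 .flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

# toDebugString() describes the chain of transformations (the lineage) that
# Spark would replay to recompute lost partitions after a machine failure.
lineage = word_counts.toDebugString()        # returns a byte string in PySpark
print(lineage.decode("utf-8") if lineage else "")
```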

However, recomputing data is not always the answer, since some computations are nondeterministic: for example, iterating over a hash table whose ordering is not defined, or using random numbers. In such cases, the answer can be to explicitly write the chosen intermediate data out to HDFS. The advantage over MapReduce is that not all intermediate data is written to HDFS, only the data the developer deems necessary.
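Spark exposes this idea as checkpointing: you point the context at a reliable directory (typically on HDFS) and mark only the RDDs you want materialized. A sketch with hypothetical paths:

```python
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()
sc = spark.sparkContext

# Checkpointed data is written to reliable storage (here HDFS), so it survives
# machine failures and never has to be recomputed from lineage.
sc.setCheckpointDir("hdfs:///checkpoints")               # hypothetical path

events = sc.textFile("hdfs:///logs/events")              # hypothetical input path

# A nondeterministic transformation: recomputing it after a failure could give
# different results, so we materialize it explicitly instead of relying on lineage.
sampled = events.filter(lambda _: random.random() < 0.01)
sampled.checkpoint()
sampled.count()   # an action forces the checkpoint to actually be written
```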

This concludes our series on batch processing. Thank you for taking the time to read it. Stay tuned for the next blog post!
