Batch Processing from the Ground Up: Part I

Sajeed Syed Bakht
6 min read · Oct 9, 2020

Batch Processing Systems: A system that takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while to complete, so it is assumed that the user is not waiting for the job to finish right away; a job can take anywhere from minutes to days. Usually, batch jobs are scheduled to run periodically. A batch processing system is usually measured by throughput: the time it takes to process a dataset of a certain size.

As we shall see throughout this series of articles, batch processing is essential to having a maintainable, scalable, and reliable data system. We'll look at the history of batch processing, where it is now, and where it might go in the future. We'll cover famous algorithms such as MapReduce, how to optimize MapReduce jobs, and finally the flaws of MapReduce and how the data ecosystem has built tools such as Apache Spark to overcome those flaws.

Before we jump into MapReduce, we'll cover Unix, because the lessons learned from Unix carry over to large-scale distributed data systems.

Batch Processing with Unix Tools

Starting from a very simple example, let’s say we have a web server that appends a line to a log file after every web request.

For example, one line may look like this (broken onto multiple lines for visual clarity):

216.58.210.78 — — [27/Feb/2015:17:55:11 +0000] “GET /css/typography.css HTTP/1.1” 200 3377 “http://martin.kleppmann.com/” “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36”

This request follows the log format defined below:

$remote_addr — $remote_user [$time_local] “$request” $status $body_bytes_sent “$http_referer” “$http_user_agent”

$remote_addr maps to 216.58.210.78; $remote_user is a dash, showing that the user is not authenticated; $time_local is set to [27/Feb/2015:17:55:11 +0000]; and so on. In plain words: on February 27th, 2015, at 17:55 UTC, the server received a request for the file /css/typography.css. The response status was 200, indicating success, and the response body was 3377 bytes. The HTTP referer was http://martin.kleppmann.com/.

Let's say we want to find the 5 most requested web pages on our site. How would we do this in Unix?

  1. Read the log file.
  2. Split each line on whitespace and take only the 7th field, which is the requested URL (/css/typography.css in our example line).
  3. Alphabetically sort the list of URLs, so that repeated requests for the same URL end up next to each other.
  4. The uniq command filters out repeated lines by checking whether two adjacent lines are the same; the -c flag prefixes each distinct URL with a count of how many times it occurred.
  5. Sort again by that count, numerically (-n), in reverse order (-r) so that the highest count comes first.
  6. Take the first 5 lines of the output and discard the rest. Chained together with pipes, these steps form the single command sketched below.
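
A minimal sketch of that pipeline, assuming an nginx-style access log at the hypothetical path /var/log/nginx/access.log (the exact path, and which field holds the URL, depend on your server's log format):

```bash
cat /var/log/nginx/access.log |   # 1. read the log file
  awk '{print $7}' |              # 2. print only the 7th whitespace-separated field: the requested URL
  sort |                          # 3. sort the URLs alphabetically, so duplicates become adjacent
  uniq -c |                       # 4. collapse adjacent duplicates, prefixing each URL with a count
  sort -r -n |                    # 5. sort numerically (-n) by that count, highest first (-r)
  head -n 5                       # 6. keep only the first five lines
```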

The output could look like the following.
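
For instance, with purely hypothetical counts and paths, just to illustrate the shape of the result (a count followed by a URL on each line):

```
   1234 /index.html
    876 /css/typography.css
    405 /images/logo.png
    211 /about.html
     98 /contact.html
```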

This chain of commands might look cryptic, but it is extremely powerful: it can process gigabytes of log files in a matter of seconds. To see why the Unix approach is more appropriate under certain circumstances, let's compare it with a program that builds up a hash table in memory, written in Ruby.

A minimal sketch of such a Ruby script, returning the top 5 requested web pages (the log path and the field index are assumptions carried over from the Unix example above):
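
```ruby
counts = Hash.new(0)                      # hash table, defaulting each URL's count to 0

# Hypothetical log path, matching the Unix example above.
File.open('/var/log/nginx/access.log') do |file|
  file.each_line do |line|
    url = line.split[6]                   # 7th whitespace-separated field: the requested URL
    counts[url] += 1                      # increment that URL's counter
  end
end

# Sort by count in descending order and print the top 5 entries.
counts.sort_by { |_url, count| -count }.first(5).each do |url, count|
  puts "#{count} #{url}"
end
```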

This code reads the log line by line and uses a hash table to count the occurrences of each URL. Finally, the hash table is sorted by count, the top 5 entries are taken, and the result is printed out.

This code may be more readable, but there is a big difference between the two approaches.

Sorting vs in-memory aggregation

As shown above, the Ruby script keeps an in-memory hash table in which each URL is mapped to the number of times it has been seen in the log file. The Unix pipeline, by contrast, keeps no counters in memory; it relies on sorting a list of lines in which multiple occurrences of the same URL are simply repeated next to each other. For example:

nba.com

nba.com

wnba.com

wnba.com

wnba.com

Now, which approach is better? It depends on how many distinct URLs there are. For small to medium-sized websites, all unique URLs (and their counts) might fit in, say, 1 GB of memory. If the working set is that small, the hash-table approach works fine. The working set is the amount of memory to which a job needs random access.

However, if the working set exceeds the available memory, then the sorting approach has the advantage, since it can make efficient use of disks. It's the same principle we saw with SSTables in https://medium.com/@sajeedsyedbakht/databases-from-the-ground-up-part-ii-3e93906c7497: chunks of data are sorted in memory and written out to disk as segment files, and then multiple sorted segments are merged, much like in merge sort.
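
GNU sort already does this spilling to disk automatically when its input doesn't fit in memory. The merge step can also be seen in isolation with the -m flag, which merges inputs that are already sorted (the file names below are hypothetical):

```bash
# Each sort spills to temporary files on disk if its input is larger than memory.
sort part1.log > part1.sorted
sort part2.log > part2.sorted

# -m merges already-sorted inputs without re-sorting them,
# much like the merge step of merge sort.
sort -m part1.sorted part2.sorted > merged.sorted
```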

The UNIX philosophy

Doug McIlroy, the inventor of Unix pipes, described his approach as follows: “We should have some ways of connecting programs like a garden hose: screw in another segment when it becomes necessary to massage the data in another way. This is the way of I/O also.” The analogy of pipes stuck, and the approach is now known as the Unix philosophy. It is often summarized by the following four principles, which seem eerily similar to today's DevOps, Agile, and CI/CD philosophies:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicating old programs by adding new features.
  2. Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
  3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
  4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.

This approach of building small components that each do one thing well, and connecting the output of one as the input to another, turns those small components into powerful data processing jobs. Let's look at the second point in greater detail: expecting the output of every program to become the input to another program is the driver behind Unix's uniform interface.

Uniform interface

If we want any program's output to be usable as any other program's input, all programs must share the same input/output interface. In Unix, that interface is a file: an ordered sequence of bytes. Because the interface is so simple, many different things can be represented through it: an actual file on the filesystem, a socket on a TCP connection, a device driver, a pipe between processes, and so on. Since they all share the same interface, they can easily be plugged into one another as inputs and outputs. Not many pieces of software interoperate the way Unix programs do; for example, it's far more involved to pipe your email contents and your shopping history through an analysis program and output the results to a social media website.

Separation of logic and wiring

Another interesting aspect of Unix is the separation between the processing logic and the wiring of inputs and outputs. A program doesn't care where its input comes from or where its output goes; it just assumes data will arrive and processes it. The sort command, for example, doesn't concern itself with those details: it reads from standard input (or a file) and writes to standard output, and the shell decides how those streams are connected. There are limits to what you can do with stdin and stdout (it gets trickier when a program needs multiple inputs or outputs), but for a single stream in and a single stream out the model is remarkably flexible.
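
For instance, the exact same sort program can be wired to different sources and destinations purely by the shell (the file names here are hypothetical):

```bash
sort < names.txt                       # input redirected from a file, output to the terminal
cat names.txt | sort > sorted.txt      # input piped from another program, output redirected to a file
printf 'b\na\nc\n' | sort | head -n 1  # both input and output wired to other programs
```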

Easy to experiment

Input data is treated as immutable: the commands read it but never modify it, so if a job fails we can simply retry it on the same input without worrying about the input data having changed. Another benefit is that pipelines can be developed in stages: the output of any stage can be written to a file and inspected, and if a later part of the pipeline fails or needs tweaking, it isn't necessary to restart the whole pipeline; we can restart from that intermediate result.
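
For instance, we can write the output of the expensive early stages to an intermediate file and iterate on the cheap later stages without re-reading the whole log (file names are hypothetical):

```bash
# Stage 1: the expensive part; read and aggregate the whole log once.
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c > url_counts.txt

# Stage 2: cheap to re-run while experimenting; tweak and repeat as needed.
sort -r -n url_counts.txt | head -n 5
```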

This Unix philosophy will help us generalize to distributed systems. The biggest limitation of Unix tools is that they run on a single computer. In the next article, we'll introduce MapReduce and distributed filesystems, which took the batch processing world by storm for their ability to process data in a distributed manner.
