In the previous article, we dived into MapReduce. MapReduce took the data processing world by storm by providing fast, distributed data processing that relied on a distributed filesystem to store the input and output of jobs. In this article, we will look at some weaknesses of MapReduce and how the ecosystem has moved beyond it.

Materialization of Intermediate State

Every MapReduce job is independent of every other job. A MapReduce job takes a directory in HDFS as input and writes its output to another directory in HDFS. This can be useful if you want the output of a job to be published widely…


We last left off on Unix pipelines and how the Unix philosophy can help us scale up batch processing on a distributed network. We then introduced MapReduce as a viable solution for a distributed network. Let’s dive into MapReduce.

Like a single Unix job, a MapReduce job takes one or more inputs and produces one or more outputs. Another similarity is that a MapReduce job does not modify its input and has no side effects other than producing the output.
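The shape of such a job can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Hadoop runs a job: the function names (`map_words`, `reduce_counts`, `run_job`) and the word-count task are assumptions made for the example, and a real job would read from and write to a distributed filesystem.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reducer: combine all values collected for one key."""
    return word, sum(counts)


def run_job(records):
    """Run map, group by key, then reduce. The input is only read, never changed."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_words(record):
            groups[key].append(value)
    # Sorting by key mimics the sorted order in which reducers see their keys.
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


# Hypothetical input records; a tuple makes it obvious the input is not modified.
input_records = ("the quick fox", "the lazy dog")
word_counts = run_job(input_records)
print(word_counts)
```

Because the job returns a new result and leaves `input_records` untouched, it can be rerun safely, which is the same property that makes Unix tools easy to chain.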

MapReduce reads and writes files on a distributed filesystem. This distributed filesystem could be a storage service such as…


Batch Processing Systems: A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while to complete, anywhere from a few minutes to several days, so it is assumed that the user is not waiting for the job to finish. Batch jobs are usually scheduled to run periodically. A batch processing system is usually measured by throughput, the time it takes to process a dataset of a certain size.

As we shall see throughout this series of articles, batch processing is essential to…


We have shown how to think about data retrieval systems. We have covered LSM-Trees, B-Trees, how to think of segment files, and how memory and disk are used to store data. We have now built up the vocabulary needed to reason about OLAP, OLTP, Data Warehouses, and Data Cubes.

Transaction processing vs Analytical processing

Let’s first look at two scenarios of accessing a database and how they differ from each other. First, suppose data is stored in a relational table and we can run SQL-like queries. A patient comes in for a COVID-19 test. The patient tells the hospital their necessary identification such as…


We last left off at hash indexes. We explored how hash indexes speed up the retrieval of data by keeping the offset of each key within a hash table. We also explored two pitfalls they suffer from.

  1. Our hash table must fit in memory. If the number of unique keys grows too large, reading from the hash table becomes expensive.
  2. Range queries are not efficient. Since each key is looked up independently of every other key, it is hard to find all keys from Sajeed00001 to Sajeed00004. We would have to do an individual lookup on each key.
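The second pitfall can be made concrete with a small sketch. The index contents and byte offsets below are made up for illustration, and the sorted-key comparison is an assumption added to show the contrast, not a description of any particular storage engine.

```python
import bisect

# Hypothetical hash index: key -> byte offset of the record in a data file.
hash_index = {
    "Sajeed00003": 64,
    "Sajeed00001": 0,
    "Sajeed00004": 96,
    "Sajeed00002": 32,
    "Sajeed00010": 128,
}


def range_via_hash(index, candidate_keys):
    """One independent lookup per candidate key; keys that are absent are skipped."""
    return [(key, index[key]) for key in candidate_keys if key in index]


def range_via_sorted(sorted_keys, index, low, high):
    """When keys are kept in sorted order, a range is one contiguous slice."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return [(key, index[key]) for key in sorted_keys[start:end]]


# With only a hash table, we must enumerate every key that might be in the range.
candidates = [f"Sajeed{n:05d}" for n in range(1, 5)]
hash_result = range_via_hash(hash_index, candidates)

# With sorted keys, two binary searches locate the whole range.
sorted_keys = sorted(hash_index)
sorted_result = range_via_sorted(sorted_keys, hash_index, "Sajeed00001", "Sajeed00004")
```

Both calls return the same four entries, but the hash version only works because every possible key in the range could be generated in advance. For keys without such a predictable pattern, the only alternative is to scan the entire table.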


In the growing world of data lingo, you might have heard of Online Analytical Processing (OLAP), Online Transaction Processing (OLTP), and Data Cubes. What exactly do these terms mean? Before we dive in, these topics require a step back. First, what is the fundamental goal of a database? A database should provide a way of efficiently storing and retrieving data. In this series of blog posts, we’ll look at different ways to store and retrieve data.

Now, why is this important for you as the developer/architect? What's the point of knowing the inner workings of a database? There’s probably very…


The way we structure data inherently affects how we think and reason about the problem. For example, in a declarative language like SQL, we don’t think about the nitty-gritty of how a Group By works. However, if our data were stored in something like a Python list, we would have to reason about the problem in a much more involved way. In practice, data in an application is stored in layers. One layer could use one data model, and a different layer could store the same data in another data model.
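To illustrate the contrast, here is a minimal sketch of a Group By written by hand over a Python list. The `employees` rows, the field names, and the `group_sum` helper are all hypothetical; the SQL in the comment is only a rough equivalent.

```python
from collections import defaultdict

# Hypothetical rows. In SQL, the same result would be roughly:
#   SELECT department, SUM(salary) FROM employees GROUP BY department;
employees = [
    {"name": "Ana", "department": "data", "salary": 100},
    {"name": "Bo", "department": "web", "salary": 80},
    {"name": "Cy", "department": "data", "salary": 120},
]


def group_sum(rows, key_field, value_field):
    """Group rows by key_field and sum value_field within each group."""
    totals = defaultdict(int)
    for row in rows:
        # We decide how to iterate, how to bucket, and how to aggregate.
        totals[row[key_field]] += row[value_field]
    return dict(totals)


salary_by_department = group_sum(employees, "department", "salary")
print(salary_by_department)
```

In the SQL version we state only what we want. In the list version we also spell out how to compute it, which is the extra reasoning the paragraph above refers to.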

There are many ways to structure…


Data is at the forefront of many applications. From complex machine learning algorithms to social media apps to government websites, the way data is modeled and processed is essential. For example, these applications need to do the following:

  1. Store data so they can find it again (databases)
  2. Remember the results of commonly used expensive operations (caches)
  3. Send messages to another process (stream processing)
  4. Occasionally crunch large amounts of accumulated data (batch processing)

There are many ways to achieve these goals using a seemingly limitless number of product offerings and options. But how do we choose the best option? …

Sajeed Syed Bakht
