Databases from the ground up Part I

In the growing world of data lingo, you might have heard Online Analytical Processing(OLAP), Online Transaction Processing(OLTP), and Data Cubes. Now, what exactly do these terms mean? Before we dive into this, these topics require a step back. Firstly, what is the fundamental goal of a database? Namely, a database should have a way of efficiently storing and retrieving data. In these series of blog posts, we’ll look at different ways to store and retrieve data.

Now, why is this important for you as the developer/architect? What's the point of knowing the inner working of a database? There’s probably very little reason for a developer to implement a storage and retrieval system from scratch. However, in order to optimize your database based on your needs, one should know how they work on a lower level and understand what the trade-offs are.

What are the common types of queries needed to run your application? They could be, get all the blog views by a given user. Or it could be; “I want to find most visited blogs on your website where the average user has stayed on for more than 30 seconds.” Different storage and retrieval systems are particularly built to optimize these different types of queries.

Let’s imagine our database has two columns. One is the ID uniquely identifying the record. The second is contents related to that ID. How would we implement such a database in practice? From before, a database has two goals. Inserting data into the database, and then retrieving that from the database. Let’s write some pseudo-code to implement the simplest database. Let’s assume our database containing all our data is in one file. For simplicity, let’s assume it’s a CSV file.

Def database_set(id,contents):

#concatenate row to the end of the file

open database file

Add (id, contents) to end of the database file

This function accepts an id and it’s associated contents to our database. It simply adds a line to the end of the file. This is similar to adding an element to the end of an array. The time complexity for adding an element to the end of an array is O(1). For e.g. our database file could look like this.

Sajeed, 24

Foie Gras, 27

Sajeed, 27

Note that Sajeed appears twice in this database file. Let’s look at how we retrieve our files.

Def database_get(id):

Loop through rows of the database and return the last contents where the last id is set

For e.g. if we want to get the contents associated with the id “Sajeed” — then we would look at the last occurrence of Sajeed in our database file and return that. So if we do a call such as database_get(Sajeed). It would return 27.

Now, what is the time complexity here? Writing to the database is very fast. We just add our id with its associated contents to the end of the file. O(1). However, the database_get is slow since we have to loop through every file of the database and return the contents of the last occurrence of id. The cost of lookup is O(n).

How do databases lower this lookup cost? Usually, it associated with an index, indicating where in the file this id occurs. This index is an additional structure placed on the original data. The consequence of this is that a tradeoff occurs as our writes to the database become slower since we have to update the index every time data is written. Let’s look at one way to speed up lookups in detail.

Hash Indexes

To continue from our example below, we could use an inbuilt memory structure such as a Hash Index. A hash index would be a data structure that tells the file where the offset is for that given key. A hash index is similar to a “dictionary” data type. Whenever a record is added, the hash map will update the byte-offset for the given key.

In summary, we don’t need to loop through every line in the file. We go directly to the location of the id by using the byte-offset. This is similar to accessing an index in an array. For e.g. if we want to access the 5th index of an array, we can go directly go there(array[5]) with a complexity of O(1).

Another consideration is that since we are continuously appending to a file — there would be a point where our file would run out of space. We could have it so that once our file reaches a certain size, we start writing to a new file. And then, we could use a process of compaction. Once our files reach a certain size, we throw out all duplicate keys. Since we are only appending to a file, the last item is simply the most recent update for that given key. All duplicate keys that come before the last occurrence are thrown out. Let’s analyze the following figure.

This picture illustrates performing compaction and segments together. Each file segment has 12 records. Data file segment 2 is written into after Data file segment 1 has reached the specified size. We have 4 different keys, namely yawn, scratch, mew, and purr. Let’s take a step back. We know that if a key occurs in both data file segments then we would look at the most recent data file segment for the key. That’s the case with keys such as purr, meow, and scratch. To find the key yawn, we would look at the data file segment 1. It’s also important to remember, each data file segment has its own hash index so searching through the segment files is efficient.

There are some important considerations when implementing a hash index, such as “How do we delete keys?”, that we won’t explore in this blog. (Hint: tombstone!)

Let’s look at some of the limitations of our hash index.

  1. Our hash table must fit in memory, if the number of unique keys exponentially grows then reading from these hash tables would become expensive.
  2. Range Queries are not efficient. Since each key is unique from one another then it would hard perhaps finding all keys that are from Sajeed00001 to Sajeed00004. We would have to do an individual lookup on each key.

In our next blog post, we will look into indexing structures that don’t suffer from these limitations.