Databases from the ground up Part II

Sajeed Syed Bakht
Oct 1, 2020

We last left off at hash indexes. We explored how hash indexes speed up the retrieval of data by keeping the offset of each key in an in-memory hash table. We also explored two pitfalls they suffer from.

  1. Our hash table must fit in memory. If the number of unique keys grows large, maintaining and reading from the hash table becomes expensive.
  2. Range queries are not efficient. Since the keys are not stored in any sorted order, finding all keys from Sajeed00001 to Sajeed00004 requires an individual lookup on each key.

Let’s look at another indexing structure that does not suffer from these limitations.

SSTables and LSM-Trees

In hash indexes, recall that when the same key occurs several times in a segment file, the value that appears later takes precedence over the earlier ones. Other than that, the order of the key-value pairs doesn’t matter.

Let’s make one important change to our segment file: it will now keep the key-value pairs sorted by key. This format is called a Sorted String Table, or SSTable for short. The compaction process already ensures that old values are thrown out, so each key appears only once within a merged file. SSTables have some important advantages over our previous segment files with hash indexes. Let’s look at three main ones.

  1. Merging segments is simpler and more efficient, using an approach similar to mergesort: we look at all the input files side by side, compare the first key in each, copy the lowest key in sort order to an output file, and repeat. A code sketch follows the worked example below.

For example:

Let’s just do a couple of rounds so you can get the feel. Suppose “handbag:8786” is the first key in the first file, “handcuffs:2729” is the first key in the second file, and “handful:44662” is the first key in the third file. The key handbag is the lowest in sort order (it appears first alphabetically), so that key-value pair is written into the new segment file first. Next, we compare the second key in the first file with the first keys in the other two files. Handcuffs comes before both, so “handcuffs:2729” gets written into the new segment file. Lastly, we compare handful, which appears in several files. We keep the value from the most recent segment file, so “handful:44662” gets written into the new segment file and the other handful entries are thrown out. We don’t have to worry about those keys showing up again, since each input file is sorted. This process continues until the input files are exhausted, producing a segment file where every key occurs once, in sorted order.
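To make the merge concrete, here is a minimal sketch in Python. It assumes each segment is an in-memory list of (key, value) pairs already sorted by key, and that the segments are ordered oldest to newest; the older handful:40308 value is made up for illustration:

```python
import heapq

def merge_segments(segments):
    """Merge sorted segments into one sorted, deduplicated segment.

    `segments` is ordered oldest to newest; when the same key appears
    in several segments, the value from the newest one wins.
    """
    merged = {}
    # Tag each pair with its segment's age so heapq.merge can walk all
    # segments in key order, mergesort-style, with ties broken by age.
    tagged = [
        [(key, age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    for key, age, value in heapq.merge(*tagged):
        merged[key] = value  # a newer value overwrites an older one
    return list(merged.items())

old = [("handbag", "8786"), ("handful", "40308")]
new = [("handcuffs", "2729"), ("handful", "44662")]
print(merge_segments([old, new]))
# [('handbag', '8786'), ('handcuffs', '2729'), ('handful', '44662')]
```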

2. The second advantage is that finding a key no longer requires keeping an index of every key in memory. For example, say we want to find the value of handiwork without knowing its offset. If we know the offsets of handbag and handsome, then handiwork must appear somewhere between them. We can jump to the handbag offset and scan forward until we reach handiwork; if we reach handsome first, we know the key does not exist. So while we still need some in-memory index, it can be sparse: we don’t need an entry for every key, as sketched below.
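Here is a minimal sketch of that sparse-index lookup, with made-up keys and offsets standing in for real file positions:

```python
import bisect

# A sparse in-memory index: one entry every few kilobytes of the
# segment file, not one entry per key. (Offsets are made up.)
sparse_index = [("handbag", 0), ("handsome", 4096), ("hangout", 8192)]
keys = [k for k, _ in sparse_index]

def find_scan_start(key):
    """Return the file offset at which a sequential scan for `key` begins."""
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before every indexed key
    return sparse_index[i][1]

print(find_scan_start("handiwork"))  # 0 -> scan forward from the handbag block
```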

3. Because read requests already scan over a range of keys, we can compress groups of records. We group the records into a block and compress it before writing it to disk; our sparse index then points to the start of each compressed block. This saves disk space and reduces I/O bandwidth use.
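A minimal sketch of block compression using Python’s built-in zlib (the records are repeated just so the block is large enough for compression to pay off):

```python
import zlib

# A run of adjacent records grouped into one block before writing.
records = [("handbag", "8786"), ("handcuffs", "2729"), ("handful", "44662")] * 100
block = "".join(f"{key}:{value}\n" for key, value in records).encode()

compressed = zlib.compress(block)
print(len(block), "->", len(compressed))  # far fewer bytes hit the disk

# A read locates the block via the sparse index, decompresses the whole
# block, then scans it for the wanted key.
assert zlib.decompress(compressed) == block
```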

One point of confusion is how the data gets sorted in the first place, since writes can arrive in any order. Many tree data structures can help us here, such as an AVL tree or a red-black tree, which keep keys in sorted order no matter the order of insertion.

We can make our storage engine work as follows.

  1. First, store the write in a tree data structure that keeps keys in order. This tree is sometimes called a memtable (an in-memory table).
  2. When the memtable grows larger than some threshold, write it out to disk as a Sorted String Table (SSTable).

So whenever we receive a read, we first look for the key in our memtable and then in our SSTables, newest first. We also occasionally run the compaction and merging process shown before.

Also, we can keep a separate log on disk to which every write is immediately appended. In the event of a crash, our memtable is lost; this log lets us rebuild it. Every time we write the memtable out to an SSTable, the log is discarded.
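Putting the pieces together, here is a toy version of this write path. It assumes the third-party sortedcontainers package for the ordered memtable; the TinyLSM class and wal.log filename are made up for illustration:

```python
from sortedcontainers import SortedDict  # third-party ordered map (pip install sortedcontainers)

class TinyLSM:
    """A toy storage engine: WAL -> memtable -> flush to an SSTable."""

    def __init__(self, flush_threshold=4):
        self.memtable = SortedDict()   # stays sorted no matter the write order
        self.segments = []             # flushed SSTables, newest last
        self.wal = open("wal.log", "a")
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.write(f"{key}\t{value}\n")  # append to the crash-recovery log first
        self.wal.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The memtable is already in key order, so dumping it yields an SSTable.
        self.segments.append(list(self.memtable.items()))
        self.memtable.clear()
        self.wal.close()
        self.wal = open("wal.log", "w")  # discard the log; its data is durable now

    def get(self, key):
        if key in self.memtable:                 # check the memtable first...
            return self.memtable[key]
        for segment in reversed(self.segments):  # ...then SSTables, newest first
            for k, v in segment:
                if k == key:
                    return v
        return None
```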

(Figure: an AVL tree that keeps months in alphabetical order)

Storage engines built on this idea of constantly merging and compacting sorted files are often called LSM (Log-Structured Merge) storage engines.

Current technologies that use these ideas: storage engines such as Cassandra and HBase use something similar to SSTables. Lucene, the full-text search indexing engine behind Elasticsearch, also uses similar ideas.

Let’s look back at those two pitfalls of hash indexes.

  1. Our hash table must fit in memory; reading becomes expensive as the number of unique keys grows.
  2. Range queries are not efficient.

We see that, firstly, our in-memory index now uses far less space, since we only keep a sparse subset of the keys. Also, range queries are much faster, since our SSTables are sorted by key.

While LSM storage engines are increasing in popularity, there is another indexing structure that is still the most prominent. Enter the world of B-Trees.

B-Trees:

The B-Tree was introduced in 1970 and quickly grew to prominence. It powers almost all relational databases, and even some nonrelational ones.

A B-Tree keeps key-value pairs sorted by key, like an SSTable. However, a B-Tree breaks the database down into fixed-size blocks, also called pages, and reads or writes one page at a time. This differs from LSM engines, which write variable-sized segment files. Each page can be identified by an address, which allows pages to refer to other pages. It’s also important to note that these pages are stored on disk instead of in memory; we’ll look at the consequences of this later on. Let’s look at a figure.

This B-Tree follows regular tree semantics: we have a root node at the top and leaf nodes at the bottom. To read a particular value, we start at the root and follow references downward until we reach the leaf containing the value we are looking for.

The number of references to child pages within one page is called the branching factor; the tree in the diagram above has a branching factor of 6.
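To make the descent concrete, here is a minimal sketch that searches a toy in-memory B-Tree (a real engine would load pages from disk by their addresses):

```python
import bisect

class Page:
    """A toy B-Tree page: sorted keys plus either child pages or values."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children   # None for a leaf page
        self.values = values       # only leaf pages hold values here

def search(page, key):
    """Start at the root and follow references down to a leaf."""
    while page.children is not None:
        i = bisect.bisect_right(page.keys, key)  # child whose range covers key
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None  # reached the leaf without finding the key

leaf1 = Page(keys=[100, 200], values=["alice", "bob"])
leaf2 = Page(keys=[300, 334], values=["carol", "dan"])
root = Page(keys=[300], children=[leaf1, leaf2])
print(search(root, 334))  # dan
```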

If we want to update the value for an existing key, we search for the key in our B-Tree and change the value once we reach its leaf. If we want to add a new key, we find the page whose range covers the key and add it to that page. If there isn’t enough space, the page is split into two half-full pages, and the parent page is updated to reflect the new ranges.

As we can see from the example above, adding key 334 splits the page below the root, and the parent page is updated to account for the change.

As in tree terminology generally, there is the concept of a balanced tree. A balanced B-Tree with n keys has a depth of O(log n). This lets it store an enormous amount of data: for example, a four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
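A quick back-of-the-envelope check of that figure:

```python
branching_factor = 500
page_size = 4 * 1024                    # 4 KB per page
leaf_pages = branching_factor ** 4      # pages reachable in four hops from the root
print(leaf_pages * page_size / 10**12)  # 256.0 -> roughly 256 TB
```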

There are many optimizations that make B-Trees reliable, such as a write-ahead log (WAL): an append-only log on disk used to restore the B-Tree after a crash. We won’t cover all of the optimizations in this blog post. Instead, we will focus now on comparing B-Trees and LSM-Trees.

Comparing B-Trees and LSM-Trees:

Generally, LSM-Trees are faster for writes while B-Trees are faster for reads. Reads are considered slower on LSM-Trees because a lookup may have to go through the memtable and then several sorted segment files. However, like many rules in tech, this one is not hard and fast. You should still test your application and check whether there truly is a performance difference.

Let’s look at why LSM-Trees are faster for writing. A B-Tree must write every piece of data at least twice: once to the WAL and again to the tree itself. The effect of one logical write resulting in multiple writes to disk is known as write amplification. This is a particular concern for SSDs, which can only overwrite blocks a limited number of times before they start to fail.

This can become a bottleneck when your application writes to the database heavily. Write amplification has a direct cost here, since every logical write consumes a multiple of its size in disk bandwidth. In conclusion, LSM-Trees have lower write amplification than B-Trees because they rewrite data to disk fewer times.

One large downside of LSM-Trees is the compaction process, which can interfere with reads and writes to the storage engine: disks have limited resources, so a read or write may need to wait for the disk to finish compacting. Another advantage of B-Trees is that each key exists in exactly one place in the index, whereas an LSM-Tree may hold occurrences of the same key in several segments. This is attractive for applications that want strong transactional semantics. For example, locking a key is easier in a B-Tree, since the key occurs in only one place; we can attach locks directly to the tree.

Secondary indexes?

The key-value situation shown above can be thought of as a primary key within a relational database: a unique id that identifies one row, document, or vertex in a graph database. It’s also very common to have secondary indexes, which are important for efficiently performing joins. For example, a user_id column can be indexed to find all rows belonging to a user. The key difference is that secondary index values are not unique, since many rows in a database can share the same user_id.

id | user_id | attendance | day
 1 |       1 |          0 |   1
 2 |       1 |          1 |   2

This can be solved in two ways: by making each value in the index a list of matching row identifiers, or by making each key unique by appending a row identifier to it. Both options are sketched below.
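Here is a minimal sketch of both options, using the toy attendance rows above:

```python
from collections import defaultdict

rows = [
    {"id": 1, "user_id": 1, "attendance": 0, "day": 1},
    {"id": 2, "user_id": 1, "attendance": 1, "day": 2},
]

# Option 1: each index entry maps to a list of matching row ids.
by_user = defaultdict(list)
for row in rows:
    by_user[row["user_id"]].append(row["id"])
print(by_user[1])            # [1, 2]

# Option 2: make each key unique by appending the row id to it.
unique_keys = {(row["user_id"], row["id"]): row["id"] for row in rows}
print(sorted(unique_keys))   # [(1, 1), (1, 2)] -> a range scan finds user 1's rows
```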

Now, let’s look at different ways to store these indexes. When we find a key, we have two options. One is to store a reference pointing to where the value lives in a file; this may incur some read cost, since we don’t store the value with the key. The other is to store the indexed row directly with the key, known as a clustered index. We can also mix the two, storing some columns with the key and pointing to a file for the rest, known as a covering index or an index with included columns. This can speed up reads but slow down writes, since we are duplicating the data.

Searching for multiple values simultaneously

Let’s think about how to reason about location. A location can be thought of as a point, located by a longitude and a latitude. Say I want to find all COVID-19 testing centers within a certain range of longitude and a certain range of latitude. Since there are two range conditions to satisfy at once, a standard B-Tree or LSM-Tree index cannot handle this efficiently. One option is to translate longitude and latitude into a single value using a space-filling curve; that value can then be stored in a B-Tree or LSM-Tree for efficient retrieval, as sketched below. Or we could keep longitude and latitude separate and use another indexing structure called an R-Tree, which we might cover in future blogs (let me know!). Other examples of searching multiple values simultaneously include finding images within a certain RGB range or finding a weather range on a certain date.
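One well-known space-filling curve is the Z-order (Morton) curve, which interleaves the bits of the two coordinates. Here is a minimal sketch; the fixed-grid scaling of latitude and longitude is a crude assumption for illustration:

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) encoding: interleave the bits of x and y so points
    that are close in 2-D tend to be close in the 1-D sort order."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

def z_value(lat, lon, bits=16):
    """Map lat/lon onto a fixed integer grid, then interleave."""
    x = int((lon + 180) / 360 * ((1 << bits) - 1))
    y = int((lat + 90) / 180 * ((1 << bits) - 1))
    return interleave_bits(x, y, bits)

print(z_value(43.65, -79.38))  # one sortable key standing in for a 2-D point
```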

Fuzzy Searches and Indexes

The preceding examples are an all-or-nothing affair: either the key exists or it doesn’t. What they don’t account for is finding keys that are similar. For example, longitude and latitude aren’t always exact; I could stand at the same place on two different occasions and receive slightly different coordinates, even though I haven’t moved. Or I might misspell a word and want to find the correct spelling for it. We’ll discuss how to solve this problem in future blog posts, but it’s important to be aware of it. Let’s finally cap off our discussion with the move to keep a whole dataset in memory.

Moving away from Disk and into Memory

Disks have two major advantages: their contents aren’t lost when the power is turned off, and they have a lower cost per gigabyte than RAM. However, RAM is becoming cheaper, and many databases are small enough to fit in memory. We can also distribute the data across multiple machines to accommodate more of it.

It’s important to note that we can still use the disk, for example for a write-ahead log, while all our reads are served from memory. This greatly improves performance, because we avoid the overhead of encoding in-memory data structures into a form that can be written to disk.

Also, in-memory databases can offer data structures that are difficult to implement efficiently on disk, such as the priority queues and sets provided by databases like Redis.

Hopefully, this discussion has made you think about memory, disks, data structures, and what considerations you should make when choosing a storage engine. Stay tuned for the last blog in this series where we cover OLAP, OLTP, and Data Cubes!
