Databases from the ground up: Part III

Sajeed Syed Bakht
8 min read · Oct 5, 2020

We have shown how to think about data retrieval systems. We have covered LSM-Trees, B-Trees, segment files, and how memory and disk are used to store data. We have now built up the vocabulary needed to reason about OLAP, OLTP, Data Warehouses, and Data Cubes.

Transaction processing vs Analytical processing

Let’s first look at two scenarios of accessing a database and how they differ from each other. In both, assume the data is stored in a relational table that we can query with SQL. A patient comes in for a COVID-19 test. The patient gives the hospital the necessary identification, such as their health card number. The hospital looks the patient up in the system and then updates a column noting that this patient has been tested. The access pattern in this case touches only a small number of keys: the record for that patient. The hospital waits for the test to be completed, and then updates the patient’s record again to note that the patient has, fortunately, tested negative for COVID-19. This access pattern again touches a small number of records and is updated interactively by hospital staff. It is known as online transaction processing (OLTP).
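A minimal sketch of what such an interactive update might look like in SQL (the patients table and its column names are hypothetical, chosen only for illustration):

-- Look up a single patient by key (a small, indexed read)
SELECT patient_id, first_name, last_name
FROM patients
WHERE health_card_number = '1234-567-890';

-- Record the result for that one patient (a small write)
UPDATE patients
SET tested = TRUE, test_result = 'negative'
WHERE health_card_number = '1234-567-890';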

Now consider a second scenario. Say the administration wants to know, “What percentage of tests came back positive?” This access pattern requires a scan over many records in the database. Queries like this also tend to read only a few columns: here, whether the patient has been tested and, if so, whether the test was positive.
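Sticking with the same hypothetical table, the analytic version might look like the sketch below; note that it scans every row but touches only two columns:

-- Scan many rows, but only two columns, to compute one aggregate
SELECT 100.0 * SUM(CASE WHEN test_result = 'positive' THEN 1 ELSE 0 END)
       / COUNT(*) AS percent_positive
FROM patients
WHERE tested = TRUE;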

Queries like the one in scenario two help the hospital make better decisions; in the private sector this is known as business intelligence. This access pattern is known as online analytic processing (OLAP). Since OLAP and OLTP have very different access patterns, it is wise to handle them separately. Usually a business will have many OLTP databases, but a single data warehouse used for analytic purposes.

Data Warehousing

A business may have many OLTP systems running. These systems are expected to be highly available and to process events and transactions with low latency. In our preceding example, there may be separate systems for patients, doctors, staff, inventory tracking, and supplier management. Each of these systems is complex enough that a separate team manages it, and the teams can act largely autonomously from one another. These systems are also critical to running the hospital, so it makes sense to keep them highly available and low-latency.

Database administrators would not want anyone running large analytic queries against these OLTP systems, since such queries are expensive and require a large pass over the database. These expensive analytical queries can slow down the OLTP system’s main job of processing tests (or, more generally, transactions). Hence the need for a data warehouse.

A data warehouse is a separate database where analytics teams can run as many queries as they want without affecting the mission-critical OLTP systems. The warehouse contains a read-only copy of the data from the OLTP systems. Data is extracted from those systems (either as a continuous stream or as periodic bulk dumps), transformed into an analysis-friendly schema, and then loaded into the warehouse. This process is known as Extract-Transform-Load (ETL).
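As a rough sketch of the “load” step, assume extracted rows have landed in a hypothetical staging_tests table and that the warehouse tables (fact_tests, dim_patient, dim_hospital) are names chosen for illustration:

-- Nightly batch: transform rows from the staging table and
-- load them into the warehouse's fact table
INSERT INTO fact_tests (test_id, patient_sk, hospital_sk, date_key, result)
SELECT s.test_id,
       p.patient_sk,
       h.hospital_sk,
       CAST(s.tested_at AS DATE),
       s.result
FROM staging_tests s
JOIN dim_patient  p ON p.health_card_number = s.health_card_number
JOIN dim_hospital h ON h.hospital_code = s.hospital_code;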

So now, we don’t have to worry about overburdening the OLTP systems with expensive analytical queries. The benefits of this choice don’t end there. Now, we can also optimize our data warehouse to appropriately handle analytical access patterns.

Star and snowflake schemas

Data warehouses usually follow a star schema. Now, what exactly is a star schema? It is a way of organizing your data: in the middle sits a fact table, which records events, and the fact table holds references to dimension tables, which describe those facts. For example, we could have a fact table with one row per test performed; each test is a “fact”. The fact table could include test_id, patient_id, and hospital_id. The patient_id would reference a patient_dimension table with columns describing the patient, such as patient_id, first name, last name, and age. The hospital_id would reference a hospital_dimension table describing the hospital, with columns such as hospital_id, hospital_name, and address.

A snowflake schema is a variation of the star schema in which dimension tables can themselves reference further tables. For example, the hospital_dimension table could have a hospital_type_id column referencing a hospital_type_dimension table, with columns such as specialty, private/public, and funding type. To fully understand this, take a look at the star schema below and map our hospital example onto it, as the diagram does with sales.
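A minimal sketch of this schema in SQL, using the hypothetical table and column names from the prose:

-- Dimension tables: descriptive attributes
CREATE TABLE patient_dimension (
    patient_id  INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    age         INTEGER
);

CREATE TABLE hospital_type_dimension (
    hospital_type_id INTEGER PRIMARY KEY,
    specialty        TEXT,
    ownership        TEXT,   -- e.g. 'private' or 'public'
    funding_type     TEXT
);

CREATE TABLE hospital_dimension (
    hospital_id      INTEGER PRIMARY KEY,
    hospital_name    TEXT,
    address          TEXT,
    -- this extra level of reference is what makes it a snowflake schema
    hospital_type_id INTEGER REFERENCES hospital_type_dimension
);

-- Fact table: one row per event (per test)
CREATE TABLE test_facts (
    test_id     INTEGER PRIMARY KEY,
    patient_id  INTEGER REFERENCES patient_dimension,
    hospital_id INTEGER REFERENCES hospital_dimension,
    test_date   DATE,
    result      TEXT
);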

Column-Oriented Storage

We looked at how data is stored together in the second part of this series. Recall that data was stored by key, with the value being all of the row data associated with that key. For example, consider the diagram below.

Looking above, we have the key 42 and beside it the name and attractions columns. Now think about how analytical queries would behave here. They usually only need to look at a couple of columns, not all of the row data. We could add indexes for certain columns, but we would still have to load entire rows into memory and then filter them to process the query. This row-oriented layout proves expensive for analytics.

Now imagine we want to count the number of positive tests in Ontario. We truly need only two columns: the province, and a column indicating whether the test was positive or negative. But because fact tables are so wide (often hundreds of columns), a row-oriented store would force us to load all of that into memory. Wouldn’t it be nice if we didn’t have to load the columns we don’t need? Enter the column-oriented storage paradigm. Instead of keeping all of a row’s data together, it stores all the values of each column together. To execute the query above, we merely need to load the two files that contain province and positive/negative. Consider the figure below in the context of sales data.

This storage layout relies on every column file keeping its values in the same row order. For example, if I want to reconstruct row 23, I look at the 23rd entry in each column file.
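Back to the Ontario query: in SQL terms it is unchanged; what differs is what the storage engine reads. Assuming a hypothetical fact_tests table, a column store serving this query would open only the two column files it mentions:

-- A column store reads only the 'province' and 'result' column files
-- for this query, skipping the hundreds of other columns in each row
SELECT COUNT(*)
FROM fact_tests
WHERE province = 'Ontario'
  AND result = 'positive';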

Column Sort Order

In a column store, it usually does not matter in which order the rows are stored; it’s easiest to keep them in insertion order. However, we can choose to impose an order, similar to SSTables in Part II.

Note that it doesn’t make sense to sort each column independently by its own values, since we would then lose the ability to reconstruct rows, such as fetching all the columns belonging to row 23. We know the kth item in column 1 corresponds to the kth item in column 2, so we must sort entire rows at a time: if we swap the kth and (k-1)th items in column 1, we must also swap the kth and (k-1)th items in column 2. How, then, should we choose a sort order? Usually one has an understanding of the common queries and sorts to serve them. For example, we could sort by the date_key column if our queries usually target a certain range of dates, and then impose a second sort key to break ties between rows with the same date. Sorted columns also lend themselves to column compression, a way of shrinking the data that will be covered in a future post.
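Syntax for declaring a sort order varies by engine; as one real example, Amazon Redshift lets you declare a compound sort key when creating the table (the fact table itself is still our hypothetical one):

-- Rows are kept sorted by date_key first, then hospital_sk,
-- so range scans over dates touch a contiguous region
CREATE TABLE fact_tests (
    test_id     BIGINT,
    date_key    INTEGER,
    hospital_sk INTEGER,
    result      VARCHAR(16)
)
COMPOUND SORTKEY (date_key, hospital_sk);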

Another interesting consequence: since data in a warehouse is usually partitioned and replicated anyway, it can be useful to keep the redundant copies sorted in several different ways. A query can then be routed to whichever copy is sorted most favorably for it.

Data Cubes and Materialized Views

Another aspect of data warehouses is materialized aggregates. Many common queries aggregate data using functions such as SUM, MAX, MIN, and AVG. If the same aggregates show up in many different queries, it can prove useful to cache them.

One way of creating such a cache is a materialized view, which stores an actual copy of the query’s results. One thing to note is that whenever the underlying data changes, the materialized view must be updated as well. The database can do this automatically, but it is an expensive operation that can slow the database down, which is why OLTP systems tend to stay away from materialized views. In a read-heavy data warehouse, however, the cost makes sense. A special case of a materialized view is the data cube. Consider the diagram below.
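A sketch in PostgreSQL-style SQL (the fact_sales table and its columns follow the sales example below; exact syntax varies by database):

-- Cache the aggregate once; readers hit the precomputed copy
CREATE MATERIALIZED VIEW sales_by_date_product AS
SELECT date_key,
       product_sk,
       SUM(net_price) AS total_sales
FROM fact_sales
GROUP BY date_key, product_sk;

-- When the underlying fact table changes, the copy must be rebuilt
REFRESH MATERIALIZED VIEW sales_by_date_product;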

We can see that a two-dimensional data cube contains aggregates along each axis. Each cell holds the sum of net sales for a given combination of date and product, computed from our fact table. At the end of each row and each column we have total aggregations; for example, one product_sk has total sales of 14967.09. To generalize: each cell contains an aggregate of some attribute for a combination of dimension values, and applying the same aggregate along a row or column yields a summary reduced by one dimension. In principle we can keep adding dimensions, such as stores, so that each cell holds the sales for some combination of product, date, and store. That cell would answer, for example, the query below.

SELECT SUM(net_price)
FROM fact_sales
WHERE date_key = ?
  AND product_sk = ?
  AND store_sk = ?

The advantage of a data cube is that such queries become very fast, since the answers have been precomputed. For example, if we want to know the total sales per store, we can just read the totals along the appropriate dimension instead of scanning what could be millions of rows in the fact table.
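Many SQL databases can compute exactly this structure with the standard CUBE grouping, which produces the aggregate for every combination of the listed dimensions, including the row and column totals (a NULL marks a dimension that has been totaled away). A sketch against our hypothetical fact_sales table:

-- One row per (date, product) cell, plus per-date totals,
-- per-product totals, and the grand total
SELECT date_key, product_sk, SUM(net_price) AS total_sales
FROM fact_sales
GROUP BY CUBE (date_key, product_sk);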

The biggest disadvantage is that we lose the flexibility of querying the raw data. For example, there is no way to compute the total sales from items that were ethically made, because that attribute was not considered when populating the cells of the data cube. Most data warehouses therefore keep as much raw data as possible and use data cubes or materialized views only as an optimization for certain queries.

Thank you for making it to the end of the post. I hope you’ve come away with a better understanding of OLTP, OLAP, and data warehouses!
