A dive into Data Models; how should you structure your data?

The way we structure data inherently affects how we think and reason about the problem. For example, in a declarative language like SQL, we don’t think about the nitty-gritty of how a Group By works. However, if our data was stored in something like a Python List, then we would have to reason and think about the problem in a much more involved way. In practice, data in an application is stored in layers. One layer could consist of one data model, and a different layer could store our data in another data model.

There are many ways to structure your data, and each approach relies on some assumptions on the data itself. Two very popular ways to structure databases are through a relational model, and a document-based model.

Relational Model vs Document Based Model

A relational model is where data is modeled in relation. Think of tables in SQL and your favorite Relational Database Management System(RDBMS). The goal was SQL was to hide implementation details behind a clean interface. The relational model grew very successful, however, this model has been challenged. Enter, NoSQL. NoSQL is a group of data models that are driven by the following forces(not limited to).

  1. A need for higher scalability than relational data models, particularly for high write throughput
  2. Specialized query functions that are not well supported by relational models.
  3. Escaping the restrictiveness of relational models

Let’s dive into one criticism of RDBMS, object-relational mismatches.

Object-relational Mismatch:

Do you remember when you first started programming? It would usually start in an object-oriented language like Python or Java. You would play around with lists, hashmaps, binary trees, and a host of different data structures. But imagine then translating this data into a relational table. There’s seems to be an awkward mismatch.

For example, let’s consider a gamer. This game has a first name, last name, and list of video games played by that gamer. How would we structure this data in SQL format? Note there is a one to many relationships within our data. A user can play multiple games. There are various ways to encode this. One way to splot our data into multiple tables. A users table, and games played table.

Our user's table would have the following column(we’ll insert some data below)

User_Id, First_Name,Last_Name,

5, Sajeed, Bakht

And our Games played table would have the following column

games_played_id,game_name,hours_played,user_id

1, NBA2k20, 150, 5

2, Super Smash Brothers 4, 200,5

(There is a foreign key of user_id existing between the two tables)

Let’s compare this approach with a Document Based Model. We’ll structure our data in a JSON format.

For e.g.

{“user_id”: 5, “first_name”: “Sajeed”, “last_name”: “Bakht”, “country_id

"games_played": [

{"game_name": "NBA2k20", “hours_played”: 150},

{"game_name": "Super Smash Brothers 4", "hours_played": 200 }]}

Many developers feel that this way of storing data is a lot less awkward than translating the data into an SQL format.

The JSON format also has a better locality than SQL formats. Let’s say our data is in an SQL format. If you want to fetch a list of games played by the user, then you would have to do an expensive join between tables. However, in the JSON format, all relevant information is in one place. All it takes is one query.

However, there are some problems with this approach. What if a user plays many games? It could be possible that it grows larger than the maximum document size(In MongoDB it’s 16MB). This example may seem small enough to work but it could be much harder. Take for instance blogs where there could be thousands of comments. However, document-based databased like MongoDB have ways of solving the problem. Namely, there are other ways to structure our document model, namely linking and bucketing, that each involves their own tradeoffs. We will touch on this in a future blog.

Which one should you choose?

There are some crucial points and objectives to consider when choosing between a document or a relational style. One objective to consider is which data model leads to a simpler application code?

  1. If your data resembles has many one to many relations and few many to many relationships, choose the document model.
  2. If your application has a large amount of many to many relationships then choose a relational model.

Note these rules are not black and white, we’ll look deeper into why many to many relationships seem to work better for relational models.

Let’s go back to how we structure data within NoSQL.

{“user_id”: 5, “first_name”: “Sajeed”, “last_name”: “Bakht”, “country_id

“games_played”: [

{“game_name”: “NBA2K20”, “hours_played”: 150},

{“game_name”: “Super Smash Brothers 4”, “hours_played”: 200 }]}

In this example, let’s try to figure out how many users play “NBA 2K20?”. If our data as set out in a relational manner, then this would be a count(*), grouped by game_name. However, in this example, since our data is in document style, we would have to push this query to application code so we would enforce our own custom logic into it. Even if this works initially, changes can happen to the database that affects how we reason about our document model. For e.g, game_name is currently a string representing the game. However, what if we wanted our games to be references to another entity. This could help contain metadata about the game, perhaps the genre of the game. Now, if we want to query per se, “How many action games does user id 8 play”? This query would require a join between the two entities. Trying to join tables together in a Document Model is usually done again, through the application code. This join could be emulated by making multiple queries to the database. This brings a fundamental change to how we reason about the data. In a relational approach, this could all be easily done through normalization of tables and the queries are done in the database. However, in a document style database, this approach would be pushed on to the application code instead.

To reiterate, document data models' main benefits are its schema flexibility, a better performance due to locality, and a lack of object-relational mismatch. The relational model in turn offers better support for joins.

We’ve yet to touch on what it means for a schema to be “flexible”. In a document data model, there is no schema enforced. Arbitrary values and keys can be added and appended to a document. For example, let’s say we want don’t want to store first name and last name but instead full name which is a concatenation between first name and last name. In a document data model, this proves to be easy as new documents can be introduced with a full name value instead of a first and last name. And whenever we are dealing with an old record we can easily have our code deal with it. For e.g.

If(user && !user.full_name){

user.full_name = user.first_name + empty_space + user.last_name

}

However, in the relational approach, this proves to be more involved. We would need an ALTER TABLE command that creates and sets a user.full_name column. Altering tables requires looping through every row of our table which could prove to be a performance inhibitor. Imagine, how this would be affected once our table has millions of rows.

Before we leave off it is important to note that there are more consequences and differences between a document and a relational model. This includes fault tolerance, concurrency, and the implications on distributed systems(what if two tables required for a join are in two different servers?).

Thank you for making it this far. Much more content coming out soon. Feel free to add me on LinkedIn! https://www.linkedin.com/in/sajeed-bakht-72696055/