Reliability, Scalability, and Maintainability Are All a Data System Needs

Sajeed Syed Bakht
8 min read · Sep 23, 2020

Data has become the driving force behind many applications. From complex machine learning algorithms to social media apps to government websites, the way data is modeled and processed is essential. For example, these applications need to do the following:

  1. Store data, so they can find it again (databases)
  2. Remember the results of commonly used expensive operations (caches)
  3. Send messages to another process (stream processing)
  4. Occasionally crunch large amounts of accumulated data (batch processing)
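Point 2 above, remembering the results of expensive operations, is easy to demonstrate in a few lines. Here is a minimal Python sketch: the recursive Fibonacci function stands in for any costly computation (a database query, an external API call), and `functools.lru_cache` plays the role of the cache.

```python
from functools import lru_cache

# The "expensive operation" here is a recursive Fibonacci, standing in
# for anything costly: a DB query, an API call, a heavy computation.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # fast, because intermediate results are remembered
```

Without the cache, the same subproblems would be recomputed exponentially many times; with it, each result is computed once and remembered for the next caller.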

There are many ways to achieve these goals, using a seemingly limitless number of product offerings and options. But how do we choose the best one? That question will be answered throughout the following blogs, but first let’s understand what exactly we are talking about when we refer to a Data System.

Because our application has such wide and specific requirements, a single tool can’t meet them all. A data system usually consists of many different tools working together. This could be a message queue that reads data from an API and stores it in a database and a cache before the data is sent through another message queue into a data warehouse for analytics. When we combine all these tools, we want to ensure some guarantees.

  1. Reliability. Our system should continue working “correctly” in the face of adversity.
  2. Scalability. Our system should be able to handle growth in data volume and traffic.
  3. Maintainability. Our data system can be easily maintained and adapted.

Let’s now circle back and unpack these seemingly vague terms.

Reliability:

A reliable system can be understood as one that keeps working correctly, at an acceptable level of performance, even when things go wrong. For example, if I want to “tweet” on Twitter, I expect the tweet I wrote to be published on my page in a timely manner, and it shouldn’t happen that someone else’s tweet ends up on my timeline. The things that can go wrong, called faults, vary in nature. Our system should at least have a game plan for dealing with the following: hardware faults, software errors, and human errors.

Hardware Faults:

Hardware faults happen all the time: RAM failing, power outages, hard disks dying. Our first response is usually to add backup servers and clusters in case one fails. Let’s say you spin up ten VMs (virtual machines) on Amazon Web Services. What if we need to update them, but that update requires downtime? One possible answer is a rolling upgrade, so that only one VM is down at a time. In general, we should have a clear and concise game plan for dealing with these faults.
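The rolling-upgrade idea can be sketched as a toy simulation (the VM records and their fields here are invented for illustration). The point is the invariant: at most one machine is ever out of service, so the rest of the fleet keeps serving traffic throughout the upgrade.

```python
# Ten toy "VMs", all running version 1 and in service.
vms = [{"id": i, "version": 1, "in_service": True} for i in range(10)]

def rolling_upgrade(vms, new_version):
    """Upgrade one VM at a time so the fleet never loses more than one node."""
    for vm in vms:
        vm["in_service"] = False     # take a single VM out of rotation
        vm["version"] = new_version  # apply the update (downtime for this VM only)
        vm["in_service"] = True      # health-checked and back in rotation
        # every other VM was serving traffic during this whole step

rolling_upgrade(vms, 2)
print(all(vm["version"] == 2 and vm["in_service"] for vm in vms))  # True
```

A real rolling upgrade would also health-check the upgraded node before returning it to rotation and abort the rollout on failure; this sketch only shows the one-at-a-time loop.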

Software Errors:

Software errors can be even more dangerous than hardware faults. Perhaps a developer pushes buggy code that causes one component to fail, which in turn creates a domino effect that crashes the whole system. An example was the Y2K problem, where systems stored years as two digits and so could not represent the year 2000. Such software faults can go undetected for a long time. Many software errors come down to assumptions about our current application and environment: assumptions that usually hold true but are occasionally violated. Developers should constantly check and understand what explicit or implicit assumptions they are making about the data system.

Human Errors:

Humans are unreliable. It is very possible for them to make mistakes when configuring or using the application, and these mistakes can lead to application failure or to deleting files that were never meant to be deleted. The following steps help ensure that it doesn’t happen:

  1. Minimize opportunities for these mistakes. Encourage good use of the application and discourage bad use.
  2. Have a “rollback” ability to undo bad changes.
  3. Set up monitoring and performance tools to capture errors as they occur.
  4. Test thoroughly at all levels to catch errors: unit tests, integration tests, manual tests.

That’s it for reliability; we’ll revisit this topic in more depth in the following blogs if you’re still thirsting for information. Let’s touch on scalability now and unpack this buzzy word.

Scalability:

Let’s assume our data system is now reliable. It can cope when a VM goes down, makes the correct assumptions about its environment, and protects against critical human errors. It’s all rainbows and sunshine, right? Well, what if your application starts to face more demand? Let’s say we work for the Canadian government, and our job is to build an application to process Employment Insurance claims. A calamity like COVID-19 occurs, and we shoot up from processing 2,000 claims a week to 100,000 claims a week. Can our system handle it? This is scalability in a nutshell: how does our system cope with increased load? Let’s circle back and unpack what load is. Our system can grow in different ways. We can jump from 2,000 claims to 100,000 claims overnight, or over a matter of two months. Perhaps certain types of claims spike overnight while others increase slowly. Or perhaps certain read requests on the site outpace claim (write) requests.

What is Load?

Load can be described in terms of load parameters, numbers that quantify the amount of activity on a data system. This could be the number of requests per minute on our application, the ratio of read to write operations on the database, the number of users currently on the site, or perhaps the hit rate on our cache. It is also important to understand which percentile you care about: for example, how is the average claim processed versus the 99th-percentile claim?

For example, let’s look at an insurance agent who is responsible for accepting and rejecting claims from parents who live in the city of Barrie, Ontario. The agent has a home timeline to view a list of the claims, with the ability to accept and reject them. As the database engineers, we have two possibilities for the agent’s timeline.

  1. Possibility One: a simple SQL statement (let’s assume two tables, a ProcessRequest table and a Users table)

Our ProcessRequest Table has the following columns

Process_ID, Time_Of_Process, User_ID, Processed

Note that User_ID is a foreign key in this table, and our Processed column is a boolean (true or false) indicating whether the claim has been processed. Next, let’s move to our Users table.

Our Users Table has the following columns. Note that Parent is a boolean that indicates whether they are a parent or not.

User_ID, City, Name, Parent.

Our SQL statement will thus join the two tables on User_ID:

SELECT Process_ID, ProcessRequest.User_ID
FROM ProcessRequest JOIN Users ON ProcessRequest.User_ID = Users.User_ID
WHERE Processed = FALSE AND City = 'Barrie' AND Parent = TRUE;
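Here is a runnable sketch of this query using SQLite, with sample rows invented for illustration. SQLite stores booleans as the integers 0 and 1, so the query compares against those.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (User_ID INTEGER PRIMARY KEY, City TEXT, Name TEXT, Parent BOOLEAN);
CREATE TABLE ProcessRequest (Process_ID INTEGER PRIMARY KEY, Time_Of_Process TEXT,
                             User_ID INTEGER REFERENCES Users(User_ID), Processed BOOLEAN);
-- Ava: a parent in Barrie; Ben: in Barrie but not a parent; Cam: a parent in Toronto.
INSERT INTO Users VALUES (1, 'Barrie', 'Ava', 1), (2, 'Barrie', 'Ben', 0), (3, 'Toronto', 'Cam', 1);
INSERT INTO ProcessRequest VALUES (10, '2020-09-01', 1, 0), (11, '2020-09-02', 2, 0), (12, '2020-09-03', 3, 0);
""")

rows = conn.execute("""
    SELECT Process_ID, ProcessRequest.User_ID
    FROM ProcessRequest JOIN Users ON ProcessRequest.User_ID = Users.User_ID
    WHERE Processed = 0 AND City = 'Barrie' AND Parent = 1
""").fetchall()
print(rows)  # [(10, 1)] — only Ava's unprocessed claim matches both filters
```

Only the claim from the Barrie parent survives the WHERE clause; the non-parent and the Toronto resident are filtered out.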

2. Possibility Two: we maintain a cache of the timeline, similar to a mailbox. When a parent from Barrie makes a claim, it gets sent to our cache, and when our insurance agent processes the claim, it gets removed from the cache.

There is a tradeoff between the two options. The insurance agent’s timeline loads much faster in the second scenario, since it doesn’t have to do an expensive join across the whole database, but there is more work to do whenever a user makes a claim: instead of just hitting the database, the claim also has to be added to a cache. Now, say that instead of one insurance agent we have three. One agent is responsible for all claims in Barrie, another for all claims from parents, and the third for parents who live in Barrie. When a parent from Barrie makes a claim, it has to hit all three timelines (caches), and when an insurance agent resolves that claim, it has to be removed from all three timelines.
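This fan-out pattern can be sketched with plain dictionaries (the agent names and claim fields are invented for illustration): each agent’s timeline is a cache, a new claim is written to every matching timeline, and resolving it removes it everywhere.

```python
from collections import defaultdict

# Each agent has a predicate describing which claims they handle.
agents = {
    "barrie":         lambda c: c["city"] == "Barrie",
    "parents":        lambda c: c["parent"],
    "barrie_parents": lambda c: c["city"] == "Barrie" and c["parent"],
}
timelines = defaultdict(list)  # agent name -> cached claims

def submit_claim(claim):
    # write-time work: push the claim onto every matching timeline
    for agent, matches in agents.items():
        if matches(claim):
            timelines[agent].append(claim)

def resolve_claim(claim):
    # resolving must remove the claim from every timeline that holds it
    for agent in agents:
        if claim in timelines[agent]:
            timelines[agent].remove(claim)

claim = {"id": 1, "city": "Barrie", "parent": True}
submit_claim(claim)
print(sum(claim in t for t in timelines.values()))  # 3 — it hit all three caches
resolve_claim(claim)
```

Reads become a cheap list lookup, but every write now costs one cache update per matching agent, which is exactly the tradeoff described above.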

Performance:

So we have described our load, in terms of our agent’s timelines and process requests. Now how do we discuss the performance of our system? How does our system handle an increase in load?

When we increase our load parameters and keep our resources the same, how is performance affected?

When we increase load parameters, how much should we increase our resources to keep the same performance?

In batch processing, performance can be thought of as throughput, but in online systems we usually care about response time: the time between the client sending a request and receiving a response. In our insurance example, how long does it take from someone submitting a claim until the claim lands on the insurance agent’s timeline?

It is important to note here that response time is not a single value but a distribution of values. The same request can take different amounts of time due to random factors, and some requests take longer because they intrinsically involve more data. It is best to understand response time in terms of percentiles. For example, a response time of 2 s may put a request at the 99th percentile; that is, this request is slower than 99 out of 100 requests. A fanatic user who makes about 1,000 requests a day will therefore occasionally see some very poor performance, which may leave a bad taste. It may be a requirement for our company that 99th-percentile response times stay below a certain threshold. However, there is a trade-off: optimizing for this percentile can be expensive, and the benefits of reducing tail response times have diminishing returns.

These percentiles help us negotiate service level objectives (SLOs) and service level agreements (SLAs), contracts that commit to a certain level of performance. For example, an SLA may state that the service must have a median response time under 200 ms, a 99th-percentile response time under 1 s, and be up 99.9% of the time.

Approaches for Achieving Scalability:

When it comes to scalability, two popular terms arise. Vertical Scaling and Horizontal Scaling.

Vertical Scaling is when we make our existing machines more powerful.

Horizontal Scaling is distributing your load over multiple machines.

There is another trade-off when choosing between vertical and horizontal scaling. With vertical scaling, buying ever more powerful machines becomes increasingly expensive. But the answer is not necessarily to go to the opposite extreme and buy many cheap machines. Most systems call for a pragmatic mix of horizontal and vertical scaling, for example, distributing the load over a handful of powerful machines.

Also, when should you scale horizontally? Some systems are elastic: they detect a load increase and automatically provision more machines. Many cloud providers offer this, and it can be very helpful in the case of an unexpected spike in users. Other times it may be more useful to control manually when to scale out; the manual approach leads to fewer operational surprises.
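A toy version of the elastic decision (the per-machine capacity and the floor/ceiling bounds are invented numbers): size the fleet to the observed load, but never below a minimum or above a maximum.

```python
import math

def machines_needed(requests_per_sec, capacity_per_machine=500,
                    min_machines=2, max_machines=20):
    """Scale the fleet to the load, clamped between a floor and a ceiling."""
    wanted = math.ceil(requests_per_sec / capacity_per_machine)
    return max(min_machines, min(wanted, max_machines))

print(machines_needed(100))    # quiet period: stays at the floor of 2
print(machines_needed(4200))   # spike: scales out to 9 machines
print(machines_needed(50000))  # extreme load: capped at 20
```

Real autoscalers add smoothing and cooldown periods so the fleet doesn’t thrash up and down on every brief spike; this sketch shows only the core sizing rule.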

There is no clear-cut, one-size-fits-all design for architecture. For example, a data system that receives 3,000 requests per second of 1 KB each is very different from a system that receives 3 requests per minute of 1 GB each.

In general, a successful architecture makes the right assumptions about which cases occur often and which occur rarely.

That’s enough for scalability, let’s now talk about our last topic, Maintainability.

Maintainability:

Our final goal for a data system is to make sure it can be properly maintained and updated. Data is always evolving: new data points need to be collected, data formats change, and new data tools become available. First of all, our system should be simple.

The code should be easily understood and worked on by different teams, and we should avoid making it overly complex. Constant refactoring and code reviews help here. Complex logic should also be abstracted away, so teams can be productive without climbing a steep learning curve every time they enter the codebase.

An interesting point is understanding how our data system evolves. Perhaps, a new data point needs to be collected. How would we adapt our old system to account for this? What about the old code that is unaware of this new data format? These will be discussed further in future blogs.

I hope this blog helped you think critically about the goals of data systems. If you want to look more into data systems, I highly recommend reading “Designing Data-Intensive Applications” by Martin Kleppmann, as well as “Grokking the System Design Interview”. Stay tuned as we dive deeper into the concepts!
