Many applications today still rely on traditional SQL databases like MySQL, MariaDb or PostgreSQL for data storage and data processing. With the growing amount of data and new workloads that are made with this database systems, we often find ourselves in situations where we need to think about scaling such systems.
When it comes to scaling, we might need to think about:
- data storage, if we store more and more data and it becomes expensive or slow working with them
- fast INSERTs and UPDATES for write-heavy workloads
- making SELECT queries faster because of their complexity or because they need to query huge amounts of data
- concurrency if we have many clients interacting with the database
In this article, I will present some basic ideas and starting points on scaling traditional SQL databases.
Update the database
Newer versions of MySQL, PostgreSQL and other traditional SQL databases typically come with performance improvements. Even if the newer database system is not faster as a direct replacement, there might be new features available that we can take advantage of. Keeping our database system up to date expands our options on what we can do with our data and gives us the best out of the box performance. It is a basic enabler for scaling.
The most intuitive idea of scaling is to use better hardware to scale vertically (on one database server). When it comes to hardware we typically look at CPUs, disk and RAM:
- The number of CPUs affects how many queries the database can run and therefore how many clients can be served
- The size of RAM gives us space for indexes, temporary tables and caches the database can store in a fast memory which makes our system faster due to minimizing IO access to disk
- Disk IO speed highly affects the query times, especially for full scan reads (when indexes cannot be used) or for write-heavy systems
- Disk size allows us to store more data
Leverage application cache
If the application’s code is in our control we can cache data in a memory store like Redis or Memcached to avoid querying the database. We don’t have to just cache database reads, but we can also use such systems as buffers for writes, e.g. when we are collecting analytical data and a possible delay is not a problem.
Use efficient data types
When it comes to data, it all starts with data types, in other words how individual pieces of information are physically stored in memory. Choosing appropriate data types is always a balance between efficiency and functionality, affecting the required memory, query performance and number of operations we can do with the specific data type.
There are two basic ways to think about optimal data types:
- logical: how to store general information like a telephone number or an IP address; for instance we can store color as a string, enum or as a collection of RGB integers and we can store IP address as a string or as bytes
- physical: what specific data types we choose for strings, integers or time since we typically have multiple options; e.g. date and time can be typically stored as a timestamp (faster but with limited date range and functionality) or as a more complex datetime which is more developer friendly and flexible, allowing to store timezone etc.
Using an appropriate type for a database column is especially important when we want to index the column, use it in the WHERE clause, use it for joins or when we have a lot of data.
Data normalization and denormalization
We are often trained to normalize data for relational databases in order to reduce data redundancy and improve data integrity. While this is generally useful, we might want to reconsider it for some of our data when it comes to scaling.
Generally speaking the simpler we make our data to be retrieved or saved, the more of it we can have.
A lot of SELECT queries can be optimized by having the data already in the requested form. This can be handy for analyzing large amounts of data. For instance, instead of aggregating data every time, we can aggregate them beforehand. Of course this isn’t always possible, e.g. we cannot aggregate other average aggregates because we would lose precision.
Leverage materialized views
Materialized views are continuously updated data that are updated on write operations or on scheduled time and can be seen as an extension of the previous point. Unlike classic views, materialized views are physically stored and don’t need to be computed when needed. This is useful for situations where a SELECT query would take a long time to produce the result.
As an example, look at PipelineDB extension for PostgreSQL that produces aggregates on real-time data, allowing us to keep aggregated statistics on a write-heavy systems.
Use proper indexes
Using the right indexes on tables can be a huge performance changer. Typically, we should index the columns by which we want to query the data or which we want to use to perform joins. However, adding indexes can also reduce performance for write-heavy workloads, since indexes have to be updated on every INSERT and UPDATE. Also it doesn’t make sense to use indexes on some data types. So it is all about finding the right balance.
Leverage the execution plan for query optimization
For optimizing reads we need to know how the database query planner will execute our queries. We can use EXPLAIN statement to obtain the execution plan that the SQL server will use for our query.
Here we should look if the database plans to use any of the existing indexes on the tables or plans to make a full table scan. It can give us hints whether we should change the structure of our data, add indexes or rewrite the query in a different way. Please note however that full table scan can be also faster or necessary for some queries so it is not a black and white situation.
Choose correct transaction isolation level
The most common relational databases use multi-version concurrency control and make locking very granular (instead of locking the whole table, like MyISAM storage engine does). However, we still need to tell the database how exactly should the concurrency control behave and we can do so by setting appropriate transaction isolation level. There are 4 basic isolation levels in SQL:1992 standard and they will greatly affect how our database system behaves.
Bulk INSERTs and UPDATEs
Writing or updating individual rows in a table is not efficient, but databases like MySQL or PostgreSQL have ways to insert or modify data in bulk. We should leverage this every time we can.
Compress data for storage
Many times we don’t need to query the data by all the columns in a table. If this is the case, we can further optimize the storage size by storing some of the columns in a compressed form. This is especially useful for strings or some types of binary data.
Some databases or their extensions might be already compressing data under the hood so it always makes sense to look first how the data are stored and whether adding compression on the application layer would bring the desired benefit.
Make ALTER TABLEs work
When the amount of data or the workload grows we will experience slower and slower ALTER TABLEs to the point where some of them might not finish in a reasonable time. The first important thing to know is that there are typically two ways of changing the structure of the database tables. One is INPLACE, modifying the original table and the other one is COPY, creating a new table and moving the data afterwards. Databases typically try to modify tables INPLACE if possible, but it might make sense sometimes to explicitly ask for the COPY operation.
When adding a new column, it is better to add it at the end of the table, since inserting a column at a specific place is slow.
When changing existing column it might be a good idea to create a new column first, copy the data and remove the column afterwards.
If the write operations on the table make it difficult to perform ALTER TABLE, it makes sense to stop clients from writing into the table first. This is not always possible, but e.g. collection of analytical data can be postponed using an ingestion buffer in front of the database or we can temporarily pause our ETL jobs working with the table.
The last resort is to simply not do any ALTER TABLEs on large and write-heavy tables, opting for creating a new table to store the new data instead.
Manage concurrent connections
Every database server comes with some basic configuration on the maximum number of concurrent connections. When we need more, we typically need to reconfigure this value. Just increasing the value is not enough though – we need to make sure that our system can actually run well with our desired number of connections.
Add read replicas
Traditional SQL databases can’t typically scale horizontally for write operations by adding more servers, but we can still add other machines in the form of read-only replicas. The way this works is that all write operations are done on the main server and propagated to other machines using Write Ahead Log. All the replicas will therefore apply the same operations in the same order to the underlying storage as the main server. This ensures that the data will be in sync.
Replicas can be then used for scaling the number of read queries and number of connected clients if they only need read operations.
Partitioning allows us to distribute one single table across the filesystem and store individual partitions based on some specific rules. When such rules are chosen well, SQL queries don’t have to query all partitions but can limit their search only to a subset of them. This will cut the time the query will run since not the whole database table needs to be looked into.
Use specialized extensions
When we need to store and work with geospatial, time series and other specialized data, we can sometimes use a database extension like TimescaleDb or PostGIS to make the data processing and storage more efficient.
Sharding is partitioning on steroids and allows us to store parts of the database table (shards) on different servers. Distributed databases are often built with the concept of shards, but traditional SQL servers cannot automatically shard data, since they only work with one main server and replicas. So what can we do?
We can investigate whether some clustering solution like MySQL Cluster for MySQL or Citus for PostgreSQL would solve our scaling problem. This solutions are built to provide “horizontally scalable SQL database” but they are not without limitations. For instance, MySQL Cluster doesn’t work with classic InnoDb engine, so we cannot just write applications like we used to.
Another option is to shard data manually on the application layer. This brings additional complexity to the application where we need to manage multiple connections to different servers and query the databases appropriately, but it is a solution to consider.
Don’t store everything in one table
If the amount of data is our primary issue and our tables are becoming too large, we should think about whether we can put this data elsewhere.
If we do need the data occasionally and we don’t need any UNIQUE indexes across the whole table, we can simply divide the data into multiple tables. For instance, we can separate data from different customers or store older data in archive tables. This way we can keep our tables smaller and more performant.
In some cases we might even decide that it is okay not to store some data in our primary database and either simply delete them or move them to another storage system.
Process data outside of the SQL database
Sometimes relational SQL databases are just not enough. For instance, we might want to create data workflows that are unfit for such systems or we might not be able to make changes to the schema or change how existing programs interact with the database. Maybe we reached our limits in scaling the current data processing and we need more. So what now?
In many cases, the best approach is to move the data to another system and process it there. Fortunately, we can utilize Write Ahead Log (WAL) that stores all changes (INSERTs, UPDATEs, ALTER TABLEs) that the database should perform. We can programmatically read this log and stream it into another system. For instance, we can stream data into Kafka using existing database connectors and allow other applications to read it. Or we can send this data directly to a specialized database, e.g. to an analytical database Clickhouse or to a data lake like Hadoop and process them there. We can also move the data via other means: taking a copy of the database, extracting data with SQL, etc.
If we only want to scale data computation without using a different database system, process data from multiple data sources at once or simply want to organize our data processing better, we can use distributed cluster-computing software such as Apache Spark.
Be aware of the limitations of managed SQL databases
Many people pay for managed SQL databases from cloud providers like AWS or Azure so that they don’t have to worry about maintaining them. However, when we need to scale our traditional SQL store, many scaling options will be actually unavailable to us.
This can include limited ability to change the database configuration, limited hardware options to choose from, limited number of replicas, higher latency, Write Ahead Log might be inaccessible and we might not be able to update the database version to the current, modern version we want. We might not be able to install extensions that would improve performance of our system. Also debugging performance issues on such remote system might be difficult or impossible, since we might not have proper access to the operating system.
So before buying into the dream of never having to worry about a SQL store, we should investigate what the options are and if the managed database offer is actually sufficient.
The presented options are just possible starting points to give us some areas to think about. I didn’t want to go into extra details as scaling SQL databases is not a straightforward operation. We always need to think about our system and the options we have, making every solution unique.