To extract data from a database, you can use SQL. It is fairly quick to learn and incredibly powerful to use. In addition, the language is almost universal: SQL is used in many applications to access, structure and manage data. However, what we often see, both with first-time users and with more experienced analysts and developers, is that the performance of SQL operations leaves much to be desired. In this article we describe a few common causes of poor performance and explain how best to address them.
What is the problem?
As mentioned, SQL is incredibly useful for accessing, editing and structuring data. Perhaps its most important tasks are the structured storage of data and determining how data is retrieved when a SQL query runs. As long as a SQL query returns a result, the user seems to have nothing to worry about.
However, when that retrieval plan is executed, data is physically read from disk, a relatively expensive and slow operation. Obviously, this must be done efficiently. Even for more experienced users, there is no reason to stop reading now: after all, processing millions of records in a data warehouse is of a different order than updating a single record. Our experience is that there are gains to be made in many places.
Do I have a problem?
The moment the run time of a SQL statement is undesirably long, we speak of a performance problem. For performance problems, it is helpful to start with the approach the software takes to retrieving or editing the data. Depending on your software, this is called the execution plan or explain plan. What I definitely look at is whether the optimizer's row estimates are accurate. If you see many more records than expected, a suboptimal approach may have been chosen. This could be due to the structure of the SQL query, the joins between tables, the keys used, or something else. Some programs even give you hints for improving performance. Below we provide five examples and tips for performance improvement based on our daily practice.
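As an illustration, this is roughly how you request an execution plan in Oracle (other databases offer similar commands, such as EXPLAIN in PostgreSQL and MySQL); the sales table here is hypothetical:

    -- Ask Oracle to generate the execution plan for a query
    EXPLAIN PLAN FOR
    SELECT customer_id, SUM(amount)
    FROM sales
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY customer_id;

    -- Display the plan, including the optimizer's row estimates
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Compare the estimated row counts in the plan with what you know about the data; large deviations often point to stale statistics or a suboptimal approach.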
SORT operations
What makes SQL so powerful is its ability to sort and to create groups. For example, the ability to quickly aggregate data over multiple records and group the results is under the hood of many dashboards. Sorting has several manifestations, with GROUP BY and DISTINCT being the most familiar. The reason to be vigilant here is that the amount of work (for the database) grows faster than the number of records: sorting scales superlinearly, on the order of n log n rather than linearly. This means that as the database grows, managing performance and resource costs can become problematic over time. One solution is to increase processing power or memory, easy to do these days if you work in a cloud environment. But this only shifts the problem to other resources and does not address it at its core.
To illustrate that growth, consider the following example. Take all the clubs from a deck of cards and put them in order A, K, …, 3, 2 as quickly as possible. Label this activity as one unit of work. Then shuffle the complete deck and do the same for each suit. Four times as many cards, yet more than four times the work: the cards first had to be separated by suit. You can imagine how much more work, and how much more complex, it becomes if you try this with two or more decks of cards.
So assess whether you really need the GROUP BY or DISTINCT. A tip I got a long time ago concerns the columns and expressions in the GROUP BY clause. If I want to select all kinds of details, and I also select the primary key (PK) of those details, I only need to include the PK in the GROUP BY and I can still show the details with a MIN() function. After all, within the same PK, all the details are identical. The advantage of this approach is that the (internal) workspace of the SQL interpreter becomes smaller, resulting in performance gains. Below is an example; it sounds more complicated than it is.
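A minimal sketch of that trick, assuming a hypothetical customers table with primary key customer_id, joined to an orders table:

    -- Grouping by every selected column: large sort key, large workspace
    SELECT c.customer_id, c.name, c.city, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, c.city;

    -- Grouping by the PK only: within one customer_id the details are
    -- identical, so MIN() returns the same values with a smaller sort key
    SELECT c.customer_id, MIN(c.name) AS name, MIN(c.city) AS city,
           SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id;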
Indexes or Partitioning
An index is a rather efficient search structure based on key ordering (one column, or a combination of multiple columns). Looking up a record by its key takes only a small number of steps and leads to a reference to the internal, physical address of the record. Ideal when a very small set of records needs to be looked up.
Data warehouse environments often involve large(r) data sets. If the processing goes through an index, the index is typically consulted in multiple steps per record, and the overhead becomes very large. In Oracle, the tipping point lies around 2% or lower: an index pays off when less than roughly 2% of all records in the table are selected. For that reason, I am certainly not in favor of forcing large-scale queries through indexes.
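For completeness, a small sketch of the selective case where an index does pay off; the customers table and its email column are hypothetical:

    -- One-time setup: an index on the lookup key
    CREATE INDEX ix_customers_email ON customers (email);

    -- Highly selective lookup: far below the ~2% tipping point,
    -- so the index is the right access path
    SELECT customer_id, name
    FROM customers
    WHERE email = 'jan@example.com';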
Ideally, I use table partitioning: dividing the overall table based on prior knowledge, an attribute or a predicate, such as year of entry or a product group. Put all the red marbles in bin 1 and all the green ones in bin 2, and you know exactly where to look when you need a green marble. I don't have to look at the other bins, which is efficient. This deliberate skipping of parts of a table is also called pruning. Preferably, partition on a property that often appears in joins. If two tables are partitioned on the same property, the join (and the rest of the query) can also be done in pieces (partition-wise joins). This can counteract the superlinear resource requirement mentioned with GROUP BY. Note that pruning and partition-wise joins can hardly be used, if at all, when "non-operational" properties are used for partitioning, such as a technical date or a hash.
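As a sketch, a range-partitioned table in Oracle syntax (the sales table is again hypothetical); a filter on the partitioning key lets the optimizer prune:

    CREATE TABLE sales (
      sale_id    NUMBER,
      sale_date  DATE,
      amount     NUMBER
    )
    PARTITION BY RANGE (sale_date) (
      PARTITION p2022 VALUES LESS THAN (DATE '2023-01-01'),
      PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
      PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01')
    );

    -- Only partition p2024 is read; the other two are pruned
    SELECT SUM(amount)
    FROM sales
    WHERE sale_date >= DATE '2024-01-01';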
Joins
It may sound trivial, but make sure that when joining tables (and views, etc.) the full join condition is used. This means specifying the join type, the tables and the keys. This is perhaps the best example of controlling the total number of records (the cardinality). To keep close attention on join conditions, I favor ANSI notation: the JOIN is specified in the FROM clause, rather than a table list with the join conditions in the WHERE clause. Necessary tools here are the definitions of primary keys and unique keys, and a good data model in which the relationships between entities or tables are clear.
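To make the difference concrete, here is a sketch using the same hypothetical customers and orders tables; first the old-style table list, then the ANSI notation:

    -- Old style: join conditions hidden in the WHERE clause,
    -- easy to forget one and produce a Cartesian product
    SELECT o.order_id, c.name
    FROM orders o, customers c
    WHERE o.customer_id = c.customer_id;

    -- ANSI notation: join type, tables and keys are explicit
    SELECT o.order_id, c.name
    FROM orders o
    INNER JOIN customers c
      ON c.customer_id = o.customer_id;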
Distrust generated queries
With performance problems, don't focus exclusively on the query itself; look at the environment as well. I have worked on several assignments with a so-called ETL tool with a client-server architecture. Notable on all of these assignments was the requirement to "not write your own SQL"; the tool "writes" the SQL itself. In other words, get all the data to the tool, which loads it into working memory and does its thing. Often this is not efficient.
A recent example is an application matching two sets of records, each twenty million records in size. For each record from one set, the corresponding record from the other set must be found. However, the generated query chooses to sort all records, an expensive and time-intensive approach. By writing your own query and using the existing indexes and partitioning, this can be done much more effectively.
Testing
If you are making improvements and want the performance gains to be measurable, pay close attention. With great regularity, (potential) performance improvements are tested by comparing run times of the old and the new code. However, the outcomes can be heavily influenced by the situations below:
- Other processes may claim resources from the servers and network, so that the measurements are not easily comparable;
- Temporary storage of data in memory (caching) may eliminate many reads from the physical disk (the physical IO), making a repeated run look faster than it really is.
In general, I value the differences between explain plans more than this kind of timing test on an otherwise idle system. When there is really no other way, I try to start all SQL versions at the same time, so that the load on the machine is more or less equal.
Parallel execution of SQL
If there are no more performance gains to be made in the SQL query itself, one option remains: executing the query in parallel. The advantage of executing SQL in parallel is reduced lead time. The option has a limited shelf life, though; it works until other resources become the bottleneck, such as a write buffer or memory. Parallel execution also has disadvantages: the short-term load on the server is higher, since more coordination and communication is required.
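In Oracle, parallel execution can be requested with a hint, for example with a degree of parallelism of 4 (the sales table is hypothetical):

    -- Run the scan and aggregation with four parallel processes
    SELECT /*+ PARALLEL(s, 4) */ SUM(amount)
    FROM sales s;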
To briefly summarize the tips for performance gains:
- Limit the use of GROUP BY, DISTINCT or ORDER BY.
- Use ANSI joins, do not join unnecessary tables and use inner joins as much as possible.
- Write out your join conditions completely in the FROM clause and keep the total number of records (cardinality) under control.
- Use indexes where they work well; for larger sets, prefer partitioned tables (preferably multiple tables partitioned on the same properties).
- Distrust generated queries; they are often built with limited knowledge of the data model and are suboptimal.
- Consider the use of parallel execution of SQL queries; it puts a greater short-term load on the environment but delivers results faster.