Most programming languages are designed for professional developers with knowledge of algorithms and data structures. SQL is different.
SQL is used by analysts, data scientists, product managers, designers and many others. These professionals have access to databases, but they don't always have the intuition and understanding to write efficient queries.
In an effort to get my team to write better SQL, I went over reports written by non-developers and over code reviews, and gathered the common mistakes and missed optimization opportunities I found in SQL.
Table of Contents
- Be Careful When Dividing Integers
- Guard Against "division by zero" Errors
- Know the Difference Between UNION and UNION ALL
- Be Careful When Counting Nullable Columns
- Be Aware of Timezones
- Avoid Transformations on Indexed Fields
- Use BETWEEN Only For Inclusive Ranges
- Add "Faux" Predicates
- Inline CTE*
- Fetch Only What You Need!
- Reference Column Position in GROUP BY and ORDER BY
- Format Your Query
- Conclusion
Be Careful When Dividing Integers
In PostgreSQL, dividing an integer by an integer results in an integer:
db=# (
SELECT tax / price AS tax_ratio
FROM sale
);
tax_ratio
----------
0
To get the expected result of the division, you need to cast one of the values to float:
db=# (
SELECT tax / price::float AS tax_ratio
FROM sale
);
tax_ratio
----------
0.17
Failing to recognize this pitfall might lead to horribly incorrect results.
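If you prefer an exact result over floating point, casting either side to numeric works as well. A minimal sketch, assuming the same sale table with integer tax and price columns:
SELECT tax / price::numeric AS tax_ratio
FROM sale;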
Guard Against "division by zero" Errors
Zero division is a notorious error in production:
db=# SELECT 1 / 0;
ERROR: division by zero
Division by zero is a logic error and shouldn't just be "worked around", but fixed so that you don't have a zero divisor in the first place. However, there are situations where a zero denominator is possible. One easy way to protect against division by zero errors in such cases is to make the entire expression null by setting the denominator to null when it equals zero:
db=# SELECT 1 / NULLIF(0, 0);
?column?
----------
-
The function NULLIF returns null if the first argument equals the second argument; in this case, when the denominator equals zero.
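For example, if the price column could ever be zero (say, a hypothetical free sale), the tax_ratio query from the previous section would fail. A minimal sketch guarding it with NULLIF, assuming the same sale table:
SELECT tax / NULLIF(price, 0)::float AS tax_ratio
FROM sale;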
When dividing any number by NULL, the result is NULL. To force some value, you can wrap the entire expression with COALESCE and provide a fallback value:
db=# SELECT COALESCE(1 / NULLIF(0, 0), 1);
?column?
----------
1
The function COALESCE is very useful. It accepts any number of arguments, and returns the first value which is not null.
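To illustrate, a quick sketch where only the last argument is not null:
db=# SELECT COALESCE(null, null, 1);
?column?
----------
1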
Know the Difference Between UNION and UNION ALL
A classic entry level interview question for developers and DBAs is "what is the difference between UNION and UNION ALL".
UNION ALL concatenates the results of two or more queries. UNION does the same, but it also eliminates duplicate rows:
db=# (
SELECT created_by_id FROM sale
UNION
SELECT created_by_id FROM past_sale
);
QUERY PLAN
----------------------------------------------------------------------------------
Unique (cost=2654611.00..2723233.86 rows=13724572 width=4)
-> Sort (cost=2654611.00..2688922.43 rows=13724572 width=4)
Sort Key: sale.created_by_id
-> Append (cost=0.00..652261.30 rows=13724572 width=4)
-> Seq Scan on sale (cost=0.00..442374.57 rows=13570157 width=4)
-> Seq Scan on past_sale (cost=0.00..4018.15 rows=154415 width=4)
You can see in the execution plan that after appending the two queries, the database sorted the results and eliminated duplicate rows.
If you don't need to eliminate duplicate rows, it's best to use UNION ALL:
db=# (
SELECT created_by_id FROM sale
UNION ALL
SELECT created_by_id FROM past_sale
);
QUERY PLAN
----------------------------------------------------------------------------------
Append (cost=0.00..515015.58 rows=13724572 width=4)
-> Seq Scan on sale (cost=0.00..442374.57 rows=13570157 width=4)
-> Seq Scan on past_sale (cost=0.00..4018.15 rows=154415 width=4)
The execution plan is much simpler. The results are appended and a sort is not necessary.
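To see the difference on actual rows rather than execution plans, here is a small sketch using literal values:
SELECT x FROM (VALUES (1), (1), (2)) AS t(x)
UNION
SELECT 2;
-- returns 1, 2 (duplicates eliminated)

SELECT x FROM (VALUES (1), (1), (2)) AS t(x)
UNION ALL
SELECT 2;
-- returns 1, 1, 2, 2 (in some order)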
Be Careful When Counting Nullable Columns
When using aggregate functions such as COUNT, it's important to understand how they handle null values.
For example, take the following table:
db=# \pset null NULL
Null display is "NULL".
db=# WITH tb AS (
SELECT 1 AS id
UNION ALL
SELECT null AS id
)
SELECT *
FROM tb;
id
------
1
NULL
The id column contains a null value. Counting the id column:
db=# WITH tb AS (
SELECT 1 AS id
UNION ALL
SELECT null AS id
)
SELECT COUNT(id)
FROM tb;
count
-------
1
There are two rows in the table, but COUNT returned 1. This is because null values are ignored by COUNT.
To count rows, use COUNT(*):
db=# WITH tb AS (
SELECT 1 AS id
UNION ALL
SELECT null AS id
)
SELECT COUNT(*)
FROM tb;
count
-------
2
This behavior can also be useful. For example, if a field called modified contains null when a row was not changed, you can calculate the percentage of changed rows like this:
db=# (
SELECT COUNT(modified) / COUNT(*)::float AS modified_pct
FROM sale
);
modified_pct
---------------
0.98
Other aggregate functions, such as SUM, also ignore null values. To demonstrate, let's SUM a field that contains only null values:
db=# WITH tb AS (
SELECT null AS id
UNION ALL
SELECT null AS id
)
SELECT SUM(id::int)
FROM tb;
sum
-------
NULL
These are all documented behaviors, so just be aware!
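If you want a zero instead of null when all the values are null, the COALESCE trick from the division section works here as well:
db=# WITH tb AS (
SELECT null AS id
UNION ALL
SELECT null AS id
)
SELECT COALESCE(SUM(id::int), 0)
FROM tb;
coalesce
----------
0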
Be Aware of Timezones
Timezones are always a source of confusion and pitfalls. PostgreSQL does a fair job with timezones, but you still have to pay attention to a few things.
A common mistake I see countless times is truncating timestamps without specifying the time zone. Say we want to find out how many sales were made each day:
SELECT created_at::date, COUNT(*)
FROM sale
GROUP BY 1
Without explicitly setting the time zone, you might get different results, depending on the time zone set by the client application:
db=# SELECT now()::date;
now
------------
2019-11-08
db=# SET TIME ZONE 'australia/perth';
SET
db=# SELECT now()::date;
now
------------
2019-11-09
If you are not sure what time zone you are working with, you might be doing it wrong.
When truncating a timestamp, convert to the desired time zone first:
SELECT (created_at AT TIME ZONE 'asia/tel_aviv')::date, COUNT(*)
FROM sale
GROUP BY 1;
Setting the time zone is usually the responsibility of the client application. For example, to get the time zone used by psql:
db=# SHOW timezone;
TimeZone
----------
Israel
(1 row)
db=# SELECT now();
now
-------------------------------
2019-11-09 11:41:45.233529+02
(1 row)
To set the time zone in psql:
db=# SET timezone TO 'UTC';
SET
db=# SELECT now();
now
-------------------------------
2019-11-09 09:41:55.904474+00
(1 row)
Another important thing to keep in mind is that the time zone of your server can be different from the time zone of your local machine, so queries you run on your local machine might yield different results in production. To avoid mistakes, always explicitly set a time zone.
Timezones in PostgreSQL
To get a complete list of time zone names in PostgreSQL, query the view pg_timezone_names.
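For example, a quick sketch of looking up a specific zone:
SELECT name, abbrev, utc_offset
FROM pg_timezone_names
WHERE name ILIKE '%jerusalem%';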
Avoid Transformations on Indexed Fields
Using functions on an indexed field might prevent the database from using the index on the field:
db=# (
SELECT *
FROM sale
WHERE created AT TIME ZONE 'asia/tel_aviv' > '2019-10-01'
);
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on sale (cost=0.00..510225.35 rows=4523386 width=276)
Filter: timezone('asia/tel_aviv', created) > '2019-10-01 00:00:00'::timestamp without time zone
The field created is indexed, but because we transformed it with timezone, the index was not used.
One way to utilize the index in this case is to apply the transformation on the right-hand side instead:
db=# (
SELECT *
FROM sale
WHERE created > '2019-10-01' AT TIME ZONE 'asia/tel_aviv'
);
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using sale_created_ix on sale (cost=0.43..4.51 rows=1 width=276)
Index Cond: (created > '2019-10-01 00:00:00'::timestamp with time zone)
Another common use-case involving dates is filtering a specific period:
db=# (
SELECT *
FROM sale
WHERE created + INTERVAL '1 day' > '2019-10-01'
);
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on sale (cost=0.00..510225.35 rows=4523386 width=276)
Filter: ((created + '1 day'::interval) > '2019-10-01 00:00:00+03'::timestamp with time zone)
Like before, the interval arithmetic on the field created prevented the database from utilizing the index. To make the database use the index, apply the transformation on the right-hand side instead of on the field:
db=# (
SELECT *
FROM sale
WHERE created > '2019-10-01'::date - INTERVAL '1 day'
);
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using sale_created_ix on sale (cost=0.43..4.51 rows=1 width=276)
Index Cond: (created > '2019-10-01 00:00:00'::timestamp without time zone)
Use BETWEEN Only For Inclusive Ranges
A common mistake I see is filtering a date range using BETWEEN:
SELECT *
FROM sales
WHERE created BETWEEN '2019-01-01' AND '2020-01-01';
Intuitively, you might think this query is fetching all the sales in 2019, but in fact, it's fetching all the sales made in 2019 and on the first day of 2020. BETWEEN is inclusive, so the query above is equivalent to this query:
SELECT *
FROM sales
WHERE created >= '2019-01-01'
AND created <= '2020-01-01';
To filter results in 2019 you can either write this:
SELECT *
FROM sales
WHERE created BETWEEN '2019-01-01' AND '2019-12-31';
Or better yet:
SELECT *
FROM sales
WHERE created >= '2019-01-01'
AND created < '2020-01-01';
Using BETWEEN incorrectly might produce overlapping results, for example, counting sales twice in two different periods.
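To make the overlap concrete, here is a sketch that splits 2019 into two halves with BETWEEN; a sale created exactly at midnight on 2019-07-01 is counted in both halves (the FILTER clause is just a compact way to compute both counts in one pass):
SELECT
COUNT(*) FILTER (WHERE created BETWEEN '2019-01-01' AND '2019-07-01') AS first_half,
COUNT(*) FILTER (WHERE created BETWEEN '2019-07-01' AND '2020-01-01') AS second_half
FROM sales;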
Add "Faux" Predicates
One of the most important components in database implementations, and usually the one that makes one database better than the other, is the query optimizer.
The query optimizer looks at your SQL and generates an execution plan. The execution plan describes how the database is going to access the data necessary to satisfy the query. For example, the optimizer decides whether to use a specific index or not, in what order to execute a join, which table to filter first, and so on.
To generate a good execution plan, the optimizer utilizes metadata and statistics it has on your data. For example, if you apply a filter on a column with a unique constraint, the optimizer knows it can expect exactly one row for each value. In this case, it might conclude that it makes more sense to use an index rather than scan the entire table.
In some circumstances, you have knowledge of your data that the optimizer does not have, or cannot have. You might be able to improve the performance of a query by providing additional information to the optimizer, using what I like to call a "Faux Predicate".
Take this query for example:
db=# (
SELECT *
FROM sale
WHERE modified < '2019-01-01 asia/tel_aviv'
);
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on sale (cost=0.00..510225.35 rows=1357 width=276)
Filter: (modified < '2019-01-01 00:00:00+02'::timestamp with time zone)
The query fetches sales that were modified before 2019. There is no index on this field, so the optimizer generates an execution plan to scan the entire table.
Let's say you have another field in this table with the time the sale was created. Since it's not possible for a sale to be modified before it was created, adding a similar condition on the created field won't change the result of the query. However, the optimizer might use this information to generate a better execution plan:
db=# (
SELECT *
FROM sale
WHERE modified < '2019-01-01 asia/tel_aviv'
AND created < '2019-01-01 asia/tel_aviv'
);
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using sale_created_ix on sale (cost=0.44..4.52 rows=1 width=276)
Index Cond: (created < '2019-01-01 00:00:00+02'::timestamp with time zone)
Filter: (modified < '2019-01-01 00:00:00+02'::timestamp with time zone)
After we added the "Faux Predicate", the optimizer decided to use the index on the created field, and the query got much faster! Note that the predicate on the modified field is still being evaluated, but it is now applied to far fewer rows.
A "Faux Predicate" should not change the result of the query. It should only be used to provide more information to the optimizer that can improve the query performance. Keep in mind that the database has to evaluate all the predicates, so adding too many might make a query slower.
Inline CTE*
Before PostgreSQL 12, Common Table Expressions (aka CTEs) were always materialized. This changed in PostgreSQL 12, where a CTE that is referenced only once and has no side effects is no longer materialized and is treated like a sub-query.
In versions prior to PostgreSQL 12, when CTEs are used incorrectly they can cause increased memory usage and degraded performance:
db=# WITH cte AS (
SELECT *
FROM sale
)
SELECT *
FROM cte
WHERE created_by_id = 1;
QUERY PLAN
---------------------------------------------------------------------------
CTE Scan on cte (cost=442906.19..748632.12 rows=67939 width=1148)
Filter: (created_by_id = 1)
CTE cte
-> Seq Scan on sale (cost=0.00..442906.19 rows=1999999999 width=276)
(4 rows)
The overall cost of the execution plan seems very high. This is because the database first materialized the result of the common table expression, and only then applied the predicate. The database was unable to utilize the index on created_by_id, and the query ended up not being very efficient.
For better performance, inline the CTE (or upgrade to PostgreSQL 12 😉):
db=# (
SELECT *
FROM (
SELECT *
FROM sale
) AS inlined
WHERE created_by_id = 1
);
QUERY PLAN
-------------------------------------------------------------------------------------
Index Scan using sale_created_by_ix on sale (cost=0.43..714.70 rows=277 width=276)
Index Cond: (created_by_id = 1)
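Starting with PostgreSQL 12, you can also keep the CTE and explicitly tell the planner not to materialize it, using the NOT MATERIALIZED keyword:
WITH cte AS NOT MATERIALIZED (
SELECT *
FROM sale
)
SELECT *
FROM cte
WHERE created_by_id = 1;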
For more about CTEs in PostgreSQL and how they affect a query's execution plan, check out Be Careful With CTE in PostgreSQL.
Fetch Only What You Need!
Databases are really good at storing and retrieving data. Other applications, not so much. If you fetch data into Excel, SAS, R, Pandas, or any other reporting tool, it's best to fetch only what you need.
For example, you sometimes want to get a sense of the data and you might do this:
db=# SELECT * FROM sale;
QUERY PLAN
----------------------------------------------------------------------------------
Seq Scan on sale (cost=0.00..442374.57 rows=13570157 width=276)
This query will fetch the entire table, and will most likely cause unnecessary load if you only need a few rows.
Some client applications will automatically fetch the data in pages or limit the result set, but to be on the safe side, it's best to set LIMIT yourself:
db=# (
SELECT *
FROM sale
LIMIT 10
);
QUERY PLAN
----------------------------------------------------------------------------------
Limit (cost=0.00..0.33 rows=10 width=276)
-> Seq Scan on sale (cost=0.00..442374.57 rows=13570157 width=276)
Another common case where unnecessary data is fetched from the database is when a user who is less familiar with SQL pulls data into another tool such as Excel or Pandas, only to immediately apply some filter or aggregation to it. This can usually be solved by a bit of training.
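For example, instead of pulling the entire table into Pandas just to group it there, push the aggregation down into the query itself. A sketch, assuming all we need is the number of sales per creator:
SELECT created_by_id, COUNT(*) AS sales
FROM sale
GROUP BY created_by_id;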
Reference Column Position in GROUP BY and ORDER BY
A nice feature in PostgreSQL is that columns can be referenced in GROUP BY and ORDER BY by their position in the SELECT clause:
SELECT
first_name || ' ' || last_name AS full_name,
COUNT(*) as sales
FROM
sale
GROUP BY
1
ORDER BY
2 DESC
The GROUP BY clause references the first expression in the SELECT clause, full_name. The ORDER BY clause references the second expression, the sales count. By referencing the position we avoided repeating the expressions. Aside from saving a few keystrokes, if an expression changes in the future we only need to edit it in one place.
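For comparison, the same query without positional references repeats the full expression in both clauses:
SELECT
first_name || ' ' || last_name AS full_name,
COUNT(*) AS sales
FROM
sale
GROUP BY
first_name || ' ' || last_name
ORDER BY
COUNT(*) DESC;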
I realize this tip can be controversial, because the column position in the SELECT clause has no significance and might itself change when the query is edited. However, I found that it improves productivity when writing ad-hoc queries.
Format Your Query
Readability counts. Pick whatever style you and your team feel most comfortable with, and stick with it.
When I got started years ago, I wrote queries like this:
SELECT col1, col2, COUNT(col3)
FROM t1
JOIN t2 ON t1.pk = t2.fk
WHERE col1 = col2
AND col3 > col4
GROUP BY col1, col2
HAVING COUNT(col3) > 1
I started this way because this is roughly the format Oracle used in their documentation.
Over the years, I encountered many different styles. For example:
SELECT col1, col2, COUNT(col3)
  FROM t1
  JOIN t2 ON t1.pk = t2.fk
 WHERE col1 = col2
   AND col3 > col4
 GROUP BY col1, col2
HAVING COUNT(col3) > 1
I can't think of a reason anyone would write like this; it's exhausting to format manually (but it does look good...).
Nowadays, my team and I use this format:
SELECT
col1,
col2,
COUNT(col3)
FROM
t1
JOIN t2 ON ta.pk = t2.fk
WHERE
col1 = col2
AND col3 > col4
GROUP BY
col1,
col2
HAVING
COUNT(col3) > 1
It's readable, it's flexible, and most importantly, it plays very nicely with git diff, which makes code reviews easier.
Conclusion
Applying the tips above in our day-to-day work helps us sustain a healthy database with very little waste. We found that educating developers and non-developers about how to write better SQL can go a long way.
If you have any more SQL tips I might have missed, let me know and I'll be happy to add them here.
UPDATES
- 2019-11-22: Fixed the examples in the "Faux Predicate" section after several keen eyed readers noticed it was backwards.