When I started my career in development, my first job was as a DBA. Back then, before AWS RDS, Azure, Google Cloud and the rest of the cloud services, there were two types of DBAs:
The Infrastructure DBA was in charge of setting up the database, configuring the storage and taking care of backups and replication. After setting up the database, the infrastructure DBA would pop up from time to time and do some "instance tuning", things like sizing caches.
The Application DBA got a clean database from the infrastructure DBA, and was in charge of schema design: creating tables, indexes, constraints, and tuning SQL. The application DBA was also the one who implemented ETL processes and data migrations. In teams that used stored procedures, the application DBA would maintain those as well.
Application DBAs were usually part of the development team. They would possess deep domain knowledge so normally they would work on just one or two projects. Infrastructure DBAs would usually be part of some IT team, and would work on many projects simultaneously.
I'm an Application DBA
I never had any desire to fiddle with backups or tune storage (I'm sure it's fascinating!). To this day I like to say I'm a DBA who knows how to develop applications, and not a developer who knows his way around the database.
In this article I share some non-trivial tips about database development I gathered along the way.
Table of Contents
- Update Only What Needs Updating
- Disable Constraints and Indexes During Bulk Loads
- Use UNLOGGED Tables for Intermediate Data
- Implement Complete Processes Using WITH and RETURNING
- Avoid Indexes on Columns With Low Selectivity
- Use Partial Indexes
- Always Load Sorted Data
- Index Columns With High Correlation Using BRIN
- Make Indexes "Invisible"
- Don't Schedule Long Running Processes at Round Hours
- Conclusion
Update Only What Needs Updating
UPDATE is a relatively expensive operation. To speed up an UPDATE command, it's best to make sure you only update what needs updating.
Take, for example, this query that normalizes an email column:
db=# UPDATE users SET email = lower(email);
UPDATE 1010000
Time: 1583.935 ms (00:01.584)
Looks innocent, right? The query updated the emails of 1,010,000 users. But did all of those rows really need to be updated?
db=# UPDATE users SET email = lower(email)
db-# WHERE email != lower(email);
UPDATE 10000
Time: 299.470 ms
Only 10,000 rows actually needed to be updated. By reducing the number of affected rows, the execution time went down from 1.5 seconds to less than 300ms. Updating fewer rows also saves the database maintenance work later on.
This type of large update is very common in data migration scripts. So the next time you write a migration script, make sure to only update what needs updating.
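A small variation on this pattern, sketched here against the same users table, is to use IS DISTINCT FROM instead of != so the intent around NULL values is explicit:
-- Only touch rows where the normalized value actually differs.
-- Rows with a NULL email are skipped either way, but IS DISTINCT FROM
-- states that explicitly instead of relying on != evaluating to NULL.
UPDATE users
SET email = lower(email)
WHERE email IS DISTINCT FROM lower(email);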
Disable Constraints and Indexes During Bulk Loads
Constraints are an important part of relational databases: they keep the data consistent and reliable. Their benefits come at a cost though, and it's most noticeable when loading or updating a lot of rows.
To demonstrate, set up a small schema for a store:
DROP TABLE IF EXISTS product CASCADE;
CREATE TABLE product (
id serial PRIMARY KEY,
name TEXT NOT NULL,
price INT NOT NULL
);
INSERT INTO product (name, price)
SELECT random()::text, (random() * 1000)::int
FROM generate_series(0, 10000);
DROP TABLE IF EXISTS customer CASCADE;
CREATE TABLE customer (
id serial PRIMARY KEY,
name TEXT NOT NULL
);
INSERT INTO customer (name)
SELECT random()::text
FROM generate_series(0, 100000);
DROP TABLE IF EXISTS sale;
CREATE TABLE sale (
id serial PRIMARY KEY,
created timestamptz NOT NULL,
product_id int NOT NULL,
customer_id int NOT NULL
);
The schema defines different types of constraints such as "not null" and unique constraints.
To set a baseline, start by adding foreign keys and an index to the sale table, and then load some data into it:
db=# ALTER TABLE sale ADD CONSTRAINT sale_product_fk
db-# FOREIGN KEY (product_id) REFERENCES product(id);
ALTER TABLE
Time: 18.413 ms
db=# ALTER TABLE sale ADD CONSTRAINT sale_customer_fk
db-# FOREIGN KEY (customer_id) REFERENCES customer(id);
ALTER TABLE
Time: 5.464 ms
db=# CREATE INDEX sale_created_ix ON sale(created);
CREATE INDEX
Time: 12.605 ms
db=# INSERT INTO SALE (created, product_id, customer_id)
db-# SELECT
db-# now() - interval '1 hour' * random() * 1000,
db-# (random() * 10000)::int + 1,
db-# (random() * 100000)::int + 1
db-# FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 15410.234 ms (00:15.410)
After defining constraints and indexes, loading a million rows to the table took ~15.4s.
Next, try to load the data into the table first, and only then add constraints and indexes:
db=# INSERT INTO SALE (created, product_id, customer_id)
db-# SELECT
db-# now() - interval '1 hour' * random() * 1000,
db-# (random() * 10000)::int + 1,
db-# (random() * 100000)::int + 1
db-# FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 2277.824 ms (00:02.278)
db=# ALTER TABLE sale ADD CONSTRAINT sale_product_fk
db-# FOREIGN KEY (product_id) REFERENCES product(id);
ALTER TABLE
Time: 169.193 ms
db=# ALTER TABLE sale ADD CONSTRAINT sale_customer_fk
db-# FOREIGN KEY (customer_id) REFERENCES customer(id);
ALTER TABLE
Time: 185.633 ms
db=# CREATE INDEX sale_created_ix ON sale(created);
CREATE INDEX
Time: 484.244 ms
Loading data into a table without indexes and constraints was much faster, 2.27s compared to 15.4s before. Creating the indexes and constraints after the data was loaded into the table took a bit longer, but overall the entire process was much faster, 3.1s compared to 15.4s.
Unfortunately, for indexes PostgreSQL does not provide an easy way of doing this other than dropping and re-creating the indexes. In other databases such as Oracle, you can disable and enable indexes without having to re-create them.
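If you do need to rebuild indexes around a bulk load in a script, the pattern is roughly the following sketch, using the sale table and index from above; the index definition can be looked up beforehand in the pg_indexes view:
-- Capture the definition so the index can be re-created exactly as it was.
SELECT indexdef FROM pg_indexes WHERE indexname = 'sale_created_ix';

-- Drop the index before the bulk load...
DROP INDEX sale_created_ix;

-- ...load the data (INSERT INTO sale ... / COPY ...), then re-create it.
CREATE INDEX sale_created_ix ON sale(created);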
Use UNLOGGED Tables for Intermediate Data
When you modify data in PostgreSQL, the changes are written to the write-ahead log (WAL). The WAL is used to maintain data integrity, to roll the database forward during recovery, and for replication.
Writing to the WAL is often needed, but there are some circumstances where you might be willing to give up some of its uses to make things a bit faster. One example is intermediate tables.
Intermediate tables are disposable tables that store temporary data used to implement some process. For example, a very common pattern in ETL processes is to load data from a CSV file into an intermediate table, clean the data, and then load it into the target table. In this use-case, the intermediate table is disposable and there is no use for it in backups or replicas.
Intermediate tables that don't need to be restored in case of disaster, and are not needed in replicas, can be set as UNLOGGED:
CREATE UNLOGGED TABLE staging_table ( /* table definition */ );
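To illustrate the pattern, here is a rough sketch of the ETL flow described above; the staging table, the column names and the CSV path are made up for the example:
-- Disposable staging table: changes are not written to the WAL,
-- the table is not replicated, and it is truncated after a crash.
CREATE UNLOGGED TABLE staging_sale (
    created     text,
    product_id  text,
    customer_id text
);

-- Load the raw CSV (placeholder path) into the staging table.
\copy staging_sale FROM '/path/to/sales.csv' WITH (FORMAT csv, HEADER true)

-- Clean and cast the data, then load it into the target table.
INSERT INTO sale (created, product_id, customer_id)
SELECT created::timestamptz, product_id::int, customer_id::int
FROM staging_sale;

DROP TABLE staging_sale;
Note that since PostgreSQL 9.5 an existing table can also be switched with ALTER TABLE ... SET UNLOGGED or SET LOGGED, though the switch itself rewrites the table.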
BEWARE: Before using UNLOGGED, make sure you understand its full implications.
Implement Complete Processes Using WITH and RETURNING
Say you have a users table, and you find that you have some duplicates in the table:
Table setup
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT UNIQUE
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT,
CONSTRAINT orders_user_fk
FOREIGN KEY (user_id)
REFERENCES USERS(id)
);
INSERT INTO users (email) VALUES
('foo@bar.baz'),
('me@hakibenita.com'),
('ME@hakibenita.com');
INSERT INTO orders (user_id) VALUES
(1),
(1),
(2),
(3),
(3);
db=# SELECT u.id, u.email, o.id as order_id
FROM orders o JOIN users u ON o.user_id = u.id;
id | email | order_id
----+-------------------+----------
1 | foo@bar.baz | 1
1 | foo@bar.baz | 2
2 | me@hakibenita.com | 3
3 | ME@hakibenita.com | 4
3 | ME@hakibenita.com | 5
The user haki benita registered twice, once with the email ME@hakibenita.com and again with me@hakibenita.com. Because we didn't normalize the emails when we inserted them into the table, we now have to deal with duplication.
To consolidate the duplicate users, we want to:
- Identify duplicate users by lower case email
- Update orders to reference one of the duplicate users
- Remove the duplicate users from the users table
One way to consolidate duplicate users is to use an intermediate table:
db=# CREATE UNLOGGED TABLE duplicate_users AS
db-# SELECT
db-# lower(email) AS normalized_email,
db-# min(id) AS convert_to_user,
db-# array_remove(ARRAY_AGG(id), min(id)) as convert_from_users
db-# FROM
db-# users
db-# GROUP BY
db-# normalized_email
db-# HAVING
db-# count(*) > 1;
CREATE TABLE
db=# SELECT * FROM duplicate_users;
normalized_email | convert_to_user | convert_from_users
-------------------+-----------------+--------------------
me@hakibenita.com | 2 | {3}
The intermediate table holds a mapping of duplicate users. For each user that appears more than once with the same normalized email address, we define the user with the min ID as the user we convert all duplicates to. The other users are kept in an array column, and all the references to these users will be updated.
Using the intermediate table, we update references of duplicate users in the orders table:
db=# UPDATE
db-# orders o
db-# SET
db-# user_id = du.convert_to_user
db-# FROM
db-# duplicate_users du
db-# WHERE
db-# o.user_id = ANY(du.convert_from_users);
UPDATE 2
Now that there are no more references, we can safely delete the duplicate users from the users table:
db=# DELETE FROM
db-# users
db-# WHERE
db-# id IN (
db(# SELECT unnest(convert_from_users)
db(# FROM duplicate_users
db(# );
DELETE 1
Notice that we used the function unnest to "transpose" the array, that is, turn each array element into a row.
This is the result:
db=# SELECT u.id, u.email, o.id as order_id
db-# FROM orders o JOIN users u ON o.user_id = u.id;
id | email | order_id
----+-------------------+----------
1 | foo@bar.baz | 1
1 | foo@bar.baz | 2
2 | me@hakibenita.com | 3
2 | me@hakibenita.com | 4
2 | me@hakibenita.com | 5
Nice, all occurrences of user 3 (ME@hakibenita.com) are converted to user 2 (me@hakibenita.com).
We can also verify that the duplicate users were deleted from the users table:
db=# SELECT * FROM users;
id | email
----+-------------------
1 | foo@bar.baz
2 | me@hakibenita.com
Now we can get rid of the intermediate table:
db=# DROP TABLE duplicate_users;
DROP TABLE
This is fine, but very long and needs cleaning up! Is there a better way?
Using Common Table Expressions (CTE)
Using Common Table Expressions, also known as the WITH clause, we can perform the entire process with just one SQL statement:
WITH duplicate_users AS (
SELECT
min(id) AS convert_to_user,
array_remove(ARRAY_AGG(id), min(id)) as convert_from_users
FROM
users
GROUP BY
lower(email)
HAVING
count(*) > 1
),
update_orders_of_duplicate_users AS (
UPDATE
orders o
SET
user_id = du.convert_to_user
FROM
duplicate_users du
WHERE
o.user_id = ANY(du.convert_from_users)
)
DELETE FROM
users
WHERE
id IN (
SELECT
unnest(convert_from_users)
FROM
duplicate_users
);
Instead of creating the intermediate table, we create a common table expression and reuse it multiple times.
Returning Results From CTE
A nice feature of executing DML inside a WITH clause is that you can return data from it using the RETURNING keyword. For example, let's report the number of updated and deleted rows:
WITH duplicate_users AS (
SELECT
min(id) AS convert_to_user,
array_remove(ARRAY_AGG(id), min(id)) as convert_from_users
FROM
users
GROUP BY
lower(email)
HAVING
count(*) > 1
),
update_orders_of_duplicate_users AS (
UPDATE
orders o
SET
user_id = du.convert_to_user
FROM
duplicate_users du
WHERE
o.user_id = ANY(du.convert_from_users)
RETURNING o.id
),
delete_duplicate_user AS (
DELETE FROM
users
WHERE
id IN (
SELECT unnest(convert_from_users)
FROM duplicate_users
)
RETURNING id
)
SELECT
(SELECT count(*) FROM update_orders_of_duplicate_users) AS orders_updated,
(SELECT count(*) FROM delete_duplicate_user) AS users_deleted
;
This is the result:
orders_updated | users_deleted
----------------+---------------
2 | 1
The main appeal of this approach is that the entire process is executed in a single command, so no need to manage a transaction or worry about cleaning up the intermediate table if the process fails.
CAUTION: A reader on Reddit pointed me to a possibly unpredictable behavior of executing DML in common table expressions:
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable
This means you cannot rely on the order in which independent sub-statements are executed. It seems that when there is a dependency between sub-statements, like in the example above, you can rely on a dependent sub-statement to execute before it is used.
Avoid Indexes on Columns With Low Selectivity
Say you have a registration process where users sign up with an email address. To activate the account, they have to verify their email. Your table can look like this:
db=# CREATE TABLE users (
db-# id serial,
db-# username text,
db-# activated boolean
db-#);
CREATE TABLE
Most of your users are good citizens, they sign up with a valid email and immediately activate the account. Let's populate the table with user data, where roughly 90% of the users are activated:
db=# INSERT INTO users (username, activated)
db-# SELECT
db-# md5(random()::text) AS username,
db-# random() < 0.9 AS activated
db-# FROM
db-# generate_series(1, 1000000);
INSERT 0 1000000
db=# SELECT activated, count(*) FROM users GROUP BY activated;
activated | count
-----------+--------
f | 102567
t | 897433
db=# VACUUM ANALYZE users;
VACUUM
To query for activated and unactivated users, you might be tempted to create an index on the column activated:
db=# CREATE INDEX users_activated_ix ON users(activated);
CREATE INDEX
When you try to query unactivated users, the database is using the index:
db=# EXPLAIN SELECT * FROM users WHERE NOT activated;
QUERY PLAN
--------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=1923.32..11282.99 rows=102567 width=38)
Filter: (NOT activated)
-> Bitmap Index Scan on users_activated_ix (cost=0.00..1897.68 rows=102567 width=0)
Index Cond: (activated = false)
The database estimated that the filter would result in 102,567 rows, which is roughly 10% of the table. This is consistent with the data we populated, so the database has a good sense of the data.
However, when you try to query for activated users you find that the database decided not to use the index:
db=# EXPLAIN SELECT * FROM users WHERE activated;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on users (cost=0.00..18334.00 rows=897433 width=38)
Filter: activated
Many developers are often baffled when the database is not using an index. One way of explaining why an index is not always the best choice is this: if you had to read the entire table, would you use the index?
The answer is probably no, because why would you? Reading from disk is expensive and you want to read as little as possible. If for example, a table is 10MB and the index is 1MB, to read the entire table you would have to read 10MB from disk. To read the table using the index you would have to read 11MB from disk. This is wasteful.
With this understanding, let's have a look at the statistics PostgreSQL gathers on the table:
db=# SELECT attname, n_distinct, most_common_vals, most_common_freqs
db-# FROM pg_stats
db-# WHERE tablename = 'users' AND attname='activated';
------------------+------------------------
attname | activated
n_distinct | 2
most_common_vals | {t,f}
most_common_freqs | {0.89743334,0.10256667}
When PostgreSQL analyzed the table, it found that the column activated has two distinct values. The value t in the most_common_vals column corresponds to the frequency 0.89743334 in the column most_common_freqs, and the value f corresponds to the frequency 0.10256667. This means that after analyzing the table, the database estimates that 89.74% of the table are activated users, and the remaining 10.26% are unactivated users.
With these stats, PostgreSQL decided it's best to scan the entire table if it expects 90% of the rows to satisfy the condition. The threshold after which the database may decide to use or not to use the index depends on many factors, and there is no rule of thumb you can use.
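If you are curious what the rejected plan would have cost, you can temporarily discourage sequential scans in your session and compare. This is purely a diagnostic trick, not something to use in application code:
-- Make sequential scans look prohibitively expensive to the planner,
-- so it picks the index even for this low-selectivity filter.
SET enable_seqscan = off;
EXPLAIN SELECT * FROM users WHERE activated;

-- Restore the default for the rest of the session.
RESET enable_seqscan;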
Use Partial Indexes
In the previous section we created an index on a boolean column where ~90% of the values were true (activated users). When we tried to query for activated users, the database did not use the index; when we queried for unactivated users, it did.
This brings us to the next question: if the database is not going to use the index to filter activated users, why should we index them in the first place?
Before we answer this question, let's look at how much the full index on the activated column weighs:
db=# \di+ users_activated_ix
Schema | Name | Type | Owner | Table | Size
--------+--------------------+-------+-------+-------+------
public | users_activated_ix | index | haki | users | 21 MB
The index is 21MB. Just for reference, the users table is 65MB. This means the index weighs ~32% of the size of the table. We also know that ~90% of the index is likely not going to be used.
In PostgreSQL, there is a way to create an index on only a part of the table, using what's called a partial index:
db=# CREATE INDEX users_unactivated_partial_ix ON users(id)
db-# WHERE not activated;
CREATE INDEX
Using a WHERE clause, we restrict the rows covered by the index. Let's first make sure it works:
db=# EXPLAIN SELECT * FROM users WHERE not activated;
QUERY PLAN
------------------------------------------------------------------------------------------------
Index Scan using users_unactivated_partial_ix on users (cost=0.29..3493.60 rows=102567 width=38)
Amazing, the database was smart enough to understand that the predicate we used in the query can be satisfied by the partial index.
There is another benefit to using partial indexes:
db=# \di+ users_unactivated_partial_ix
List of relations
Schema | Name | Type | Owner | Table | Size
--------+------------------------------+-------+-------+-------+---------
public | users_unactivated_partial_ix | index | haki | users | 2216 kB
The partial index weighs only 2.2MB, while the full index on the column weighed 21MB. The partial index is roughly 10% the size of the full index, which matches the ratio of unactivated users in the table.
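Booleans are not the only use case. A very common pattern, sketched here with a hypothetical tasks table, is to index only the small subset of rows a queue-like query actually scans:
-- Hypothetical table: only rows that are not done yet are ever fetched
-- by the worker query, so there is no point indexing the finished majority.
CREATE INDEX tasks_pending_ix ON tasks(created_at)
WHERE done_at IS NULL;

-- A query such as this can use the partial index, because its predicate
-- implies the index predicate:
-- SELECT * FROM tasks WHERE done_at IS NULL ORDER BY created_at LIMIT 10;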
Always Load Sorted Data
This is one of the things I comment most about in code reviews. It's not as intuitive as the other tips and it can have a huge impact on performance.
Say you have a large sales fact table:
db=# CREATE TABLE sale_fact (id serial, username text, sold_at date);
CREATE TABLE
Every night, during some ETL process, you load data into the table:
db=# INSERT INTO sale_fact (username, sold_at)
db-# SELECT
db-# md5(random()::text) AS username,
db-# '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db-# FROM
db-# generate_series(1, 100000);
INSERT 0 100000
db=# VACUUM ANALYZE sale_fact;
VACUUM
To fake a loading process we used random data. We inserted 100K rows with random usernames and sale dates spanning from 2020-01-01 to two years forward.
The table is used mostly to produce summary sales reports. Most reports filter by date to get the sales in a specific period. To speed up range scans, you create an index on sold_at:
db=# CREATE INDEX sale_fact_sold_at_ix ON sale_fact(sold_at);
CREATE INDEX
Let's look at the execution plan of a query to fetch all sales made in July 2020:
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
-----------------------------------------------------------------------------------------------
Bitmap Heap Scan on sale_fact (cost=108.30..1107.69 rows=4293 width=41)
Recheck Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Heap Blocks: exact=927
-> Bitmap Index Scan on sale_fact_sold_at_ix (cost=0.00..107.22 rows=4293 width=0)
Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Planning Time: 0.191 ms
Execution Time: 5.906 ms
After executing the query several times to warm up the cache, the timing settled at ~6ms.
Bitmap Scan
Looking at the execution plan, we can see that the database used a bitmap scan. A bitmap scan works in two stages:
- Bitmap Index Scan: Go through the entire index sale_fact_sold_at_ix and map all the table pages that contain relevant rows.
- Bitmap Heap Scan: Read the pages that contain relevant rows, and find the rows inside these pages that satisfy the condition.
Pages can contain multiple rows. The first step uses the index to find pages. The second step checks for rows inside these pages, hence the "Recheck Cond" operation in the execution plan.
At this point many DBAs and developers will call it a day and move on to the next query. BUT, there's a way to make this query better.
Index Scan
To make things better, we'll make a small change in how we load the data.
db=# TRUNCATE sale_fact;
TRUNCATE TABLE
db=# INSERT INTO sale_fact (username, sold_at)
db-# SELECT
db-# md5(random()::text) AS username,
db-# '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db-# FROM
db-# generate_series(1, 100000)
db-# ORDER BY sold_at;
INSERT 0 100000
db=# VACUUM ANALYZE sale_fact;
VACUUM
This time, we loaded the data sorted by sold_at.
Let's see what the execution plan for the exact same query looks like now:
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
---------------------------------------------------------------------------------------------
Index Scan using sale_fact_sold_at_ix on sale_fact (cost=0.29..184.73 rows=4272 width=41)
Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Planning Time: 0.145 ms
Execution Time: 2.294 ms
After running the query several times, the timing settled at around 2.3ms. Compared to the previous query that took ~6ms, this is a consistent saving of ~60%.
Another thing we can see right away, is that the database did not use a bitmap scan this time, but a "regular" index scan. Why is that?
Correlation
When the database is analyzing a table it collects all sort of statistics. One of those statistics is correlation:
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk.
As the official documentation explains, the correlation measures how "sorted" values of a specific column are on disk.
When the correlation is 1, or close to 1, it means the rows are stored on disk in roughly the same order as the values of the column. This is actually very common. For example, auto-incrementing IDs will usually have a correlation close to 1. Date and timestamp columns that keep track of when rows were created will also usually have a correlation close to 1.
When the correlation is -1, the rows are stored on disk in reverse order relative to the values of the column.
When the correlation is close to 0, it means the values in the column have little or no correlation to how the rows of the table are stored.
Going back to our sale_fact table, when we loaded the data into the table without sorting it first, these were the correlations:
db=# SELECT tablename, attname, correlation
db-# FROM pg_stats
db-# WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+----------+--------------
sale_fact | id | 1
sale_fact | username | -0.005344716
sale_fact | sold_at | -0.011389783
The auto-generated column id has a correlation of 1. The sold_at column has a very low correlation: consecutive values are scattered across the entire table.
When we loaded sorted data into the table, these were the correlations calculated by the database:
tablename | attname | correlation
-----------+----------+----------------
sale_fact | id | 1
sale_fact | username | -0.00041992788
sale_fact | sold_at | 1
The correlation for sold_at is now 1.
So why did the database use a bitmap scan when the correlation was low, and an index scan when the correlation was close to 1?
- When the correlation was 1, the database estimated that rows in the requested range are likely to be in consecutive pages. In this case, an index scan is likely to read very few pages.
- When the correlation was close to 0, the database estimated that rows in the requested range are likely to be scattered across the entire table. In this case, it makes sense to use a bitmap scan to map the table pages in which rows exist, and only then fetch them and apply the condition.
The next time you load data into a table, think about how the data is going to be queried, and make sure you sort it in a way that the indexes used for range scans can benefit from.
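The same idea applies when the data comes from a staging table rather than generate_series: sort at the final INSERT. A sketch, assuming a hypothetical staging table called staging_sale_fact with the same columns:
-- Insert in index order so sold_at stays highly correlated with
-- the physical order of the rows in sale_fact.
INSERT INTO sale_fact (username, sold_at)
SELECT username, sold_at
FROM staging_sale_fact
ORDER BY sold_at;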
CLUSTER Command
Another way of "sorting a table on disk" by a specific index is to use the CLUSTER command.
For example:
db=# TRUNCATE sale_fact;
TRUNCATE TABLE
-- Insert rows without sorting
db=# INSERT INTO sale_fact (username, sold_at)
db-# SELECT
db-# md5(random()::text) AS username,
db-# '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db-# FROM
db-# generate_series(1, 100000);
INSERT 0 100000
db=# ANALYZE sale_fact;
ANALYZE
db=# SELECT tablename, attname, correlation
db-# FROM pg_stats
db-# WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+-----------+----------------
sale_fact | sold_at | -5.9702674e-05
sale_fact | id | 1
sale_fact | username | 0.010033822
We loaded data into the table in random order, and as a result the correlation of sold_at is close to zero.
To "rearrange" the table by sold_at, we use the CLUSTER command to sort the table on disk according to the index sale_fact_sold_at_ix:
db=# CLUSTER sale_fact USING sale_fact_sold_at_ix;
CLUSTER
db=# ANALYZE sale_fact;
ANALYZE
db=# SELECT tablename, attname, correlation
db-# FROM pg_stats
db-# WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+----------+--------------
sale_fact | sold_at | 1
sale_fact | id | -0.002239401
sale_fact | username | 0.013389298
After the table was clustered, we can see that the correlation for sold_at is 1.
Some things to note about the CLUSTER command:
- Clustering the table by a specific column may affect the correlation of other columns. See, for example, the correlation of the column id after we clustered the table by sold_at.
- CLUSTER is a heavy, blocking operation, so make sure you don't execute it on a live table.
For these two reasons, it's best to insert the data sorted and not rely on CLUSTER.
Index Columns With High Correlation Using BRIN
When talking about indexes, most developers will think about B-Tree indexes. But, PostgreSQL provides other types of indexes such as BRIN:
BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table
BRIN stands for Block Range Index. According to the documentation, a BRIN index works best for columns with high correlation. As we've already seen in previous sections, some fields such as auto incrementing IDs and timestamps are naturally correlated with the physical structure of the table, hence they are good candidates for a BRIN index.
Under some circumstances, a BRIN index can provide a better "value for money" in terms of size and performance compared to a similar B-Tree index.
A BRIN index works by keeping the range of values within a number of adjacent pages in the table. Say we have these values in a column, each in its own table page:
1, 2, 3, 4, 5, 6, 7, 8, 9
A BRIN index works on ranges of adjacent pages in the table. If the number of adjacent pages is set to 3, the index will divide the table into the following ranges:
[1,2,3], [4,5,6], [7,8,9]
For each range, the BRIN index keeps the minimum and maximum value:
[1–3], [4–6], [7–9]
Using the index above, try to search for the value 5:
- [1–3] - Definitely not here
- [4–6] - Might be here
- [7–9] - Definitely not here
Using the BRIN index we managed to limit our search to blocks 4–6.
Let's take another example, this time the values in the column will have a correlation close to zero, meaning they are not sorted:
[2,9,5], [1,4,7], [3,8,6]
Indexing 3 adjacent blocks produces the following ranges:
[2–9], [1–7], [3–8]
Let's try to search for the value 5:
- [2–9] - Might be here
- [1–7] - Might be here
- [3–8] - Might be here
In this case the index is not limiting the search at all, hence it is useless.
Understanding pages_per_range
The number of adjacent pages is determined by the parameter pages_per_range. The number of pages per range affects the size and accuracy of the BRIN index:
- A large pages_per_range will produce a small and less accurate index
- A small pages_per_range will produce a bigger and more accurate index
The default pages_per_range is 128.
To demonstrate, let's create a BRIN index on ranges of 2 adjacent pages and search for the value 5:
- [1–2] - Definitely not here
- [3–4] - Definitely not here
- [5–6] - Might be here
- [7–8] - Definitely not here
- [9] - Definitely not here
Using the index with 2 pages per range we were able to limit the search to blocks 5 and 6. When the range was 3 pages, the index limited the search to blocks 4,5 and 6.
Another difference between the two indexes is that when the range was 3 we only had to keep 3 ranges. When the range was 2 we had to keep 5 ranges so the index was bigger.
Creating a BRIN Index
Using the sale_fact table from before, let's create a BRIN index on the column sold_at:
db=# CREATE INDEX sale_fact_sold_at_bix ON sale_fact
db-# USING BRIN(sold_at) WITH (pages_per_range = 128);
CREATE INDEX
This creates a BRIN index with the default pages_per_range = 128.
Let's try to query for a range of sale dates:
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
--------------------------------------------------------------------------------------------
Bitmap Heap Scan on sale_fact (cost=13.11..1135.61 rows=4319 width=41)
Recheck Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Rows Removed by Index Recheck: 23130
Heap Blocks: lossy=256
-> Bitmap Index Scan on sale_fact_sold_at_bix (cost=0.00..12.03 rows=12500 width=0)
Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Execution Time: 8.877 ms
The database used our BRIN index to get a range of sale dates, but that's not the interesting part...
Optimizing pages_per_range
According to the execution plan, the database removed 23,130 rows from the pages it found using the index. This may indicate that the range we set for the index is too large for this particular query. Let's try to create an index with fewer pages per range:
db=# CREATE INDEX sale_fact_sold_at_bix64 ON sale_fact
db-# USING BRIN(sold_at) WITH (pages_per_range = 64);
CREATE INDEX
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on sale_fact (cost=13.10..1048.10 rows=4319 width=41)
Recheck Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Rows Removed by Index Recheck: 9434
Heap Blocks: lossy=128
-> Bitmap Index Scan on sale_fact_sold_at_bix64 (cost=0.00..12.02 rows=6667 width=0)
Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
Execution Time: 5.491 ms
With 64 pages per range the database removed fewer rows from the pages it found using the index: only 9,434 were removed, compared with 23,130 when the range was 128 pages. This means the database had to do less IO and the query was slightly faster, ~5.5ms compared to ~8.9ms.
Testing the index with different values for pages_per_range produced the following results:
pages_per_range | Rows Removed by Index Recheck
-----------------+-------------------------------
128 | 23,130
64 | 9,434
8 | 874
4 | 446
2 | 446
We can see that as we decrease pages_per_range, the index becomes more accurate and fewer rows are removed from the pages found using the index.
Note that we optimized the index for a very specific query. This is fine for demonstration purposes, but in real life it's best to use values that meet the needs of most queries.
Evaluating Index Size
Another big selling point for BRIN indexes is their size. In previous sections we created a B-Tree index on the sold_at field, and its size was 2224 kB. The size of a BRIN index with pages_per_range=128 is only 48 kB, 46 times smaller than the B-Tree index.
Schema | Name | Type | Owner | Table | Size
--------+-----------------------+-------+-------+-----------+-------
public | sale_fact_sold_at_bix | index | haki | sale_fact | 48 kB
public | sale_fact_sold_at_ix | index | haki | sale_fact | 2224 kB
The size of a BRIN index is also affected by pages_per_range. For example, a BRIN index with pages_per_range=2 weighs 56 kB, which is only slightly bigger than 48 kB.
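If you prefer a query over \di+, the catalog can give you the same numbers; pg_relation_size and pg_size_pretty are built-in functions:
-- Compare the sizes of all indexes on sale_fact.
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'sale_fact';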
Make Indexes "Invisible"
PostgreSQL has a nice feature called transactional DDL. After years of using Oracle, I got used to DDL commands such as CREATE, DROP and ALTER ending a transaction. However, in PostgreSQL you can perform DDL commands inside a transaction, and changes will take effect only when the transaction is committed.
As I recently discovered, using transactional DDL you can make indexes invisible! This comes in handy when you want to see what an execution plan looks like without some index.
For example, in the sale_fact table from the previous section we created an index on sold_at. The execution plan for fetching sales made in July looked like this:
db=# EXPLAIN
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
--------------------------------------------------------------------------------------------
Index Scan using sale_fact_sold_at_ix on sale_fact (cost=0.42..182.80 rows=4319 width=41)
Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
To see what the execution plan would be if the index sale_fact_sold_at_ix did not exist, we can drop the index inside a transaction and immediately roll back:
db=# BEGIN;
BEGIN
db=# DROP INDEX sale_fact_sold_at_ix;
DROP INDEX
db=# EXPLAIN
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';
QUERY PLAN
---------------------------------------------------------------------------------
Seq Scan on sale_fact (cost=0.00..2435.00 rows=4319 width=41)
Filter: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
db=# ROLLBACK;
ROLLBACK
We first start a transaction using BEGIN. Then we drop the index and generate an execution plan. Notice that the execution plan now uses a full table scan, as if the index does not exist. At this point the transaction is still in progress, so the index is not actually dropped. To finish the transaction without dropping the index, we roll it back using the ROLLBACK command.
Now, make sure the index still exists:
db=# \di+ sale_fact_sold_at_ix
List of relations
Schema | Name | Type | Owner | Table | Size
--------+----------------------+-------+-------+-----------+---------
public | sale_fact_sold_at_ix | index | haki | sale_fact | 2224 kB
Other databases that don't support transactional DDL provide other ways to achieve the same goal. For example, Oracle lets you mark an index as invisible, which causes the optimizer to ignore it.
CAUTION: Dropping an index inside a transaction will lock out concurrent selects, inserts, updates, and deletes on the table while the transaction is active. Use with caution in test environments, and avoid on production databases.
Don't Schedule Long Running Processes at Round Hours
It's a known fact among investors that weird things can happen when a stock's price reaches a nice round number such as $10, $100 or $1,000. As the following article explains:
[...] asset's price may have a difficult time moving beyond a round number, such as $50 or $100 per share. Most inexperienced traders tend to buy or sell assets when the price is at a whole number because they are more likely to feel that a stock is fairly valued at such levels.
Developers in this sense are not all that different from investors. When they need to schedule a long running process, they will usually schedule it at a round hour.
This tendency to schedule tasks at round hours can cause some unusual loads during these times. So, if you need to schedule some long running process, you have a better chance of finding a system at rest if you schedule at another time.
Another good idea is to apply a random delay to the task's schedule, so it doesn't run at the same time every time. This way, even if another task is scheduled to run at the same time, it won't be a big problem. If you use systemd timer units to schedule your tasks, you can use the RandomizedDelaySec option for this.
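For example, a timer unit for a nightly ETL job might look roughly like this; the schedule and delay values are just placeholders:
[Timer]
# Fire at 03:00, but add a random delay of up to one hour so this job
# does not start at exactly the same moment as everything else.
OnCalendar=*-*-* 03:00:00
RandomizedDelaySec=1h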
Conclusion
This article covers some trivial and non-trivial tips from my own experience. Some of these tips are easy to implement, and some require a deeper understanding of how the database works. Databases are the backbone of most modern systems, so taking some time to understand how they work is a good investment for any developer!
This article was reviewed by the great team at pgMustard