Choosing to use a block range index (BRIN) to query a field with high correlation is a no-brainer for the optimizer. The small size of the index and the field's correlation make BRIN an ideal choice. However, a recent event taught us that correlation can be misleading. Under some easily reproducible circumstances, a BRIN index can result in significantly slower execution, even when the indexed field has very high correlation.
In this article I describe how using a BRIN index in presumably "ideal circumstances" can result in degraded performance and suggest a recent new feature of PostgreSQL as a remedy.
How BRIN Index Works
A block range index works by keeping the minimum and maximum value for ranges of adjacent table pages. To best understand how a BRIN index works, let's build one.
Imagine you have the following values, each in a single table block:
Physical Ordering | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Value | A | B | C | D | E | F | G | H | I |
Logical Ordering | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Every group of 3 adjacent pages is a range:
Physical Ordering | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Value | A | B | C | D | E | F | G | H | I |
Logical Ordering | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
For each range, keep only the minimum and the maximum value:
Range | Minmax |
---|---|
1-3 | A – C |
4-6 | D – F |
7-9 | G – I |
This is a single minmax block range index.
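You can reproduce this "BRIN by hand" with a short query. This is only an illustrative sketch - the blocks CTE below stands in for real table pages - but it shows the idea: split the blocks into ranges of 3 and keep only the minimum and maximum value of each range:

-- Toy example: one row per "block", grouped into ranges of 3 blocks,
-- keeping only the min and max value of each range.
WITH blocks(block_no, val) AS (
    VALUES (1, 'A'::text), (2, 'B'), (3, 'C'),
           (4, 'D'), (5, 'E'), (6, 'F'),
           (7, 'G'), (8, 'H'), (9, 'I')
)
SELECT
    ((block_no - 1) / 3) + 1 AS range_no,
    min(val) AS range_min,
    max(val) AS range_max
FROM blocks
GROUP BY 1
ORDER BY 1;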
Let's use the index to find the value E:
Range | Minmax | Is E in range? |
---|---|---|
1-3 | A – C | 💀 Definitely not here |
4-6 | D – F | 💩 Might be here |
7-9 | G – I | 💀 Definitely not here |
Using the BRIN index we can eliminate blocks 1-3 and 7-9, and narrow the search to just blocks 4-6 in the table. That's pretty good.
Using the BRIN index the database still needs to read all the table blocks matched by the index and search for the value. PostgreSQL calls this "index recheck". In this case, the database would read blocks 4, 5 and 6 and search for rows with the value E.
A BRIN index can't say a value is definitely in a range. It can only say a value is definitely not in the range, or that it might be in the range. Indexes that produce inconclusive results, or false positives, are called "lossy indexes". Lossy indexes often sacrifice accuracy for space, and as we'll see later, a BRIN index is very small compared to the alternatives. Other types of lossy indexes in PostgreSQL include GIN and GIST.
In the index we've just built, the blocks are perfectly ordered by the value they held. To illustrate why this is important for BRIN indexes, consider another example where the blocks are not perfectly ordered:
Physical Ordering | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Value | B | I | E | A | D | G | C | H | F |
Logical Ordering | 2 | 9 | 5 | 1 | 4 | 7 | 3 | 8 | 6 |
Once again we keep just the minimum and maximum for each range of 3 adjacent pages, and look for the value E:
Range | Minmax | Is E in range? |
---|---|---|
1-3 | B – I | 💩 Might be here |
4-6 | A – G | 💩 Might be here |
7-9 | C – H | 💩 Might be here |
We're unable to eliminate any of the table blocks. The index is useless in this case! In fact, using this index the database would have to read the entire index and the entire table. That's bad.
Correlation
This is how the PostgreSQL documentation defines correlation:
Statistical correlation between physical row ordering and logical ordering of the column values.
Correlation is the coefficient between the logical order of a value and its physical position in the table. In an append-only table, for example, an auto-incrementing key would have a perfect correlation of 1. A random value, however, should have very poor correlation, close to 0.
The database keeps track of correlation when it analyzes values in table columns. The result is stored in pg_stats.correlation and the optimizer factors it in when deciding whether to use an index, and which one.
Consider the following time series:
db=# SET time zone UTC;
SET
db=# SELECT setseed(0.4050);
setseed
─────────
(1 row)
db=# CREATE TABLE t (
id int,
happened_at timestamptz,
padding text
);
CREATE TABLE
The table t contains an identity field id, a timestamp and some data. To get the ball rolling, we populate the table with 1M rows:
db=# INSERT INTO t (
id,
happened_at,
padding
)
SELECT
n,
'2023-01-01 UTC'::timestamptz + (interval '1 second' * n),
generate_random_string(1000)
FROM
generate_series(1, 1000000) as n;
INSERT 0 1000000
We add some padding to the table to make benchmarks more realistic. If you're curious about this approach, read about the surprising impact of medium-size texts on PostgreSQL performance.
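The generate_random_string helper used in the INSERT above is not built into PostgreSQL; it comes from the linked article. If you want to follow along, a minimal sketch such as the following (create it before running the INSERT) should do - it simply produces a random string of uppercase letters of the requested length:

-- Minimal stand-in for the padding helper: a random string of
-- uppercase letters of the requested length.
CREATE FUNCTION generate_random_string(len int) RETURNS text AS $$
    SELECT string_agg(chr(65 + floor(random() * 26)::int), '')
    FROM generate_series(1, len);
$$ LANGUAGE sql;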
db=# SELECT count(*), max(happened_at) FROM t;
count │ max
─────────┼────────────────────────
1000000 │ 2023-01-12 13:46:40+00
The table t now contains 1M rows, with the timestamp field happened_at going up until 2023-01-12 13:46:40+00.
You may have noticed that we created the table with incrementing timestamps. We did that to make sure we get perfect correlation:
db=# ANALYZE t;
ANALYZE
db=# SELECT correlation
FROM pg_stats
WHERE tablename = 't'
AND attname = 'happened_at';
correlation
─────────────
1
The correlation of the field happened_at is 1. This means we have perfect correlation between the physical ordering and the logical ordering of the values of the column happened_at in the table.
To establish a baseline, execute a query to find values that happened in a duration of one minute with no indexes on the table:
db=# EXPLAIN (ANALYZE)
SELECT * FROM t WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=1000.00..150108.10 rows=1 width=1016) (actual time=1330.979..1334.093 rows=61 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on t (cost=0.00..149108.00 rows=1 width=1016) (actual time=1308.808..1308.828 rows=20 loops=3)
Filter: ((happened_at >= '2023-01-12 15:45:00+02'::timestamp with time zone)
AND (happened_at <= '2023-01-12 15:46:00+02'::timestamp with time zone))
Rows Removed by Filter: 333313
Planning Time: 0.193 ms
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.750 ms, Inlining 0.000 ms, Optimization 0.855 ms, Emission 10.032 ms, Total 12.637 ms
Execution Time: 1334.676 ms
With no indexes, a cold cache and two parallel workers, the query completed in ~1.3 seconds. After executing the query several times the timing settles at ~200ms.
BRIN Index with Perfect Correlation
Another way of thinking about correlation is that when searching for a range of values, it's very likely that rows with approximate values will be in the same block, or an adjacent block. This feature is extremely important for BRIN indexes, as the documentation emphasises:
BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table
The happened_at column has perfect correlation so it's an ideal candidate for a BRIN index:
db=# CREATE INDEX t_happened_at_brin_minmax ON t
USING brin(happened_at) WITH (pages_per_range=10);
CREATE INDEX
We create a BRIN index on the field happened_at. We did not explicitly provide an opclass, so the default minmax is used. We also override the default pages_per_range and set it to 10. This means the BRIN index will keep the minimum and maximum values of the field happened_at for every range of 10 adjacent pages in the table.
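To get a feel for how many range entries that produces, you can divide the number of table pages by pages_per_range. A quick sketch:

-- Rough estimate of the number of BRIN range entries:
-- table pages divided by pages_per_range (10 in our case).
SELECT
    pg_relation_size('t') / current_setting('block_size')::int AS table_pages,
    ceil((pg_relation_size('t') / current_setting('block_size')::int) / 10.0) AS estimated_ranges;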
Being a lossy index, one of the main benefits of BRIN indexes is their small size:
db=# \di+ t_happened_at_brin_minmax
List of relations
─[ RECORD 1 ]─┬──────────────────────────
Schema │ public
Name │ t_happened_at_brin_minmax
Type │ index
Owner │ haki
Table │ t
Persistence │ permanent
Access method │ brin
Size │ 520 kB
Description │
The size of the index is just 520 kB.
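For perspective, you can put that number next to the size of the table itself with something along these lines (the exact figures will vary):

-- Compare the BRIN index size to the table it covers.
SELECT
    pg_size_pretty(pg_relation_size('t')) AS table_size,
    pg_size_pretty(pg_relation_size('t_happened_at_brin_minmax')) AS index_size;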
To see the BRIN index in action, execute the same query to find values that happened in a duration of one minute:
db=# EXPLAIN (ANALYZE)
SELECT * FROM t WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on t (cost=218.00..494.40 rows=1 width=1016)
(actual time=8.400..8.464 rows=61 loops=1)
Recheck Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Rows Removed by Index Recheck: 59
Heap Blocks: lossy=18
-> Bitmap Index Scan on t_happened_at_brin_minmax (cost=0.00..218.00 rows=70 width=0)
(actual time=8.371..8.371 rows=180 loops=1)
Index Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Planning Time: 0.568 ms
Execution Time: 8.514 ms
The database used the BRIN index and returned results pretty quickly, in only 8.5 ms. The "Bitmap Index Scan" on the BRIN index produced a list of 18 blocks that may contain relevant values. The database then read those 18 table blocks and rechecked the condition against the rows in these blocks. 59 rows did not match the condition and were "Removed by Index Recheck", leaving 61 rows in the final result.
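You can check how many heap blocks the matching rows actually occupy straight from the table, using the block-number part of each row's ctid. A small sketch (the point cast is just a convenient way to split the ctid into block number and offset):

-- How many distinct heap blocks hold the rows that actually match?
SELECT count(DISTINCT (ctid::text::point)[0]) AS matching_heap_blocks
FROM t
WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';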
To better understand how a BRIN index works, we can use the pageinspect extension to view the contents of the actual index blocks:
db=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
To find the index blocks, we start by looking at the index metapage info:
db=# SELECT * FROM brin_metapage_info(get_raw_page('t_happened_at_brin_minmax', 0));
magic │ version │ pagesperrange │ lastrevmappage
────────────┼─────────┼───────────────┼────────────────
0xA8109CFA │ 1 │ 10 │ 11
The index blocks start after the lastrevmappage. To view the contents of the first index block, we inspect the next block (11 + 1 = 12):
db=# SELECT blknum, value
FROM brin_page_items(get_raw_page('t_happened_at_brin_minmax', 12), 't_happened_at_brin_minmax')
ORDER BY 1 LIMIT 3;
blknum │ value
────────┼────────────────────────────────────────────────────
2910 │ {2023-01-01 05:39:31+00 .. 2023-01-01 05:40:40+00}
2920 │ {2023-01-01 05:40:41+00 .. 2023-01-01 05:41:50+00}
2930 │ {2023-01-01 05:41:51+00 .. 2023-01-01 05:43:00+00}
Let's break down the first row:
- The range consists of 10 table pages (from 2910 to 2920). This is the value we provided for pages_per_range when we created the index.
- The range contains values from 2023-01-01 05:39:31+00 to 2023-01-01 05:40:40+00, that's 70 seconds. The range should contain ~70 rows. Let's verify:
db=# SELECT count(*) FROM t
WHERE happened_at BETWEEN '2023-01-01 05:39:31+00'::timestamptz AND '2023-01-01 05:40:40+00'::timestamptz;
count
───────
70
BRIN Index with Outliers
So far we've used a BRIN index on a field with perfect correlation - this is the ideal situation for a BRIN index. Next, we create a similar table and introduce some outliers - extreme values that don't follow the natural correlation of the field:
db=# CREATE TABLE t_outliers AS
SELECT
id,
CASE
WHEN id % 70 = 0
THEN happened_at + INTERVAL '1 year' -- <-- Outlier
ELSE happened_at
END,
padding
FROM
t
ORDER BY
id;
SELECT 1000000
The new table is created from the previous table t. To introduce outliers, for every 70th row we add a year to the value of happened_at:
db=# SELECT id, happened_at FROM t_outliers ORDER BY id OFFSET 65 LIMIT 10;
id │ happened_at
────┼────────────────────────
66 │ 2023-01-01 02:01:06+02
67 │ 2023-01-01 02:01:07+02
68 │ 2023-01-01 02:01:08+02
69 │ 2023-01-01 02:01:09+02
70 │ 2024-01-01 02:01:10+02 -- <--- outlier
71 │ 2023-01-01 02:01:11+02
72 │ 2023-01-01 02:01:12+02
73 │ 2023-01-01 02:01:13+02
74 │ 2023-01-01 02:01:14+02
75 │ 2023-01-01 02:01:15+02
(10 rows)
Without the outliers we had perfect correlation. So next, let's analyze the table and see what the correlation of the field happened_at is with the outliers:
db=# ANALYZE t_outliers;
ANALYZE
db=# SELECT correlation
FROM pg_stats
WHERE tablename = 't_outliers'
AND attname = 'happened_at';
correlation
─────────────
0.97344303
The correlation is no longer perfect, but it's still very high - 0.97. Under normal circumstances this level of correlation is considered high, so a BRIN index looks like a good candidate:
db=# CREATE INDEX t_outliers_happened_at_brin_minmax ON t_outliers
USING brin(happened_at) WITH (pages_per_range=10);
CREATE INDEX
With the BRIN index in place, execute the same query as before:
db=# EXPLAIN (ANALYZE)
SELECT * FROM t_outliers
WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';
QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on t_outliers (cost=218.00..502.23 rows=1 width=1016)
(actual time=3257.316..3257.372 rows=60 loops=1)
Recheck Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Rows Removed by Index Recheck: 999940
Heap Blocks: lossy=142858
-> Bitmap Index Scan on t_outliers_happened_at_brin_minmax (cost=0.00..218.00 rows=72 width=0)
(actual time=28.085..28.085 rows=1428580 loops=1)
Index Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Planning Time: 0.785 ms
Execution Time: 3257.412 ms
That's a big difference! This query completed in more than 3 seconds - that's ~300x slower than the 8.514 ms it took to execute the same query on the table without the outliers, and ~3x slower than the full table scan.
To investigate why this query took so much longer than the previous one, we need to take a closer look at the execution plan. The "Rows Removed by Index Recheck" gives us a hint - the database filtered out 999940 rows from the table blocks matched by the BRIN index. Remember, the table contains only 1M rows, so this query effectively scanned the entire table and the index. This is about as bad as it gets!
To get a better sense of what happened here, let's examine the contents of the BRIN index using pageinspect:
db=# SELECT * FROM brin_metapage_info(get_raw_page('t_outliers_happened_at_brin_minmax', 0));
magic │ version │ pagesperrange │ lastrevmappage
────────────┼─────────┼───────────────┼────────────────
0xA8109CFA │ 1 │ 10 │ 11
(1 row)
db=# SELECT blknum, value
FROM brin_page_items(get_raw_page('t_outliers_happened_at_brin_minmax', 12), 't_outliers_happened_at_brin_minmax')
ORDER BY 1 LIMIT 3;
blknum │ value
────────┼────────────────────────────────────────────────────
2910 │ {2023-01-01 05:39:31+00 .. 2024-01-01 05:40:40+00}
2920 │ {2023-01-01 05:40:41+00 .. 2024-01-01 05:41:50+00}
2930 │ {2023-01-01 05:41:51+00 .. 2024-01-01 05:43:00+00}
(3 rows)
Notice that with the outliers, each index entry covers a significantly bigger range of values. Without the outliers, each range contained values in a ~70 second time span. With the outliers, each range spans an entire year! When the database used the index to identify pages that may satisfy the query, a lot of ranges matched. This caused the index to produce a lot of false positives and as a result, the database had to read, and then sift out, a lot of irrelevant rows.
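You can also see this without pageinspect, directly from the heap. A rough sketch: group the rows by the 10-page range their block falls into (again using the block-number part of ctid) and measure the span of happened_at each range covers - a range that happens to contain an outlier spans roughly a year:

-- Time span of happened_at covered by each group of 10 heap pages.
SELECT
    ((ctid::text::point)[0]::int / 10) AS range_no,
    max(happened_at) - min(happened_at) AS span
FROM t_outliers
GROUP BY 1
ORDER BY 1
LIMIT 3;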
Multi-Minmax BRIN Index with Outliers
The default operator class for BRIN indexes is minmax. Using this operator class, each BRIN index entry contains a single minimum and maximum value for each range of adjacent pages in the table. With perfect correlation this is just fine, but with outliers, the single minmax range can cause the index to return a lot of false positives.
What if instead of keeping a single minimum and maximum value for each range, the database kept multiple minimum and maximum values? This should make the BRIN index more resilient to outliers. In PostgreSQL 14, a new set of operator classes, *_minmax_multi_ops, was added to BRIN that does exactly that.
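If you're not sure which multi-minmax operator class matches your column type, the available ones can be listed from the catalogs. A quick sketch (PostgreSQL 14 or later):

-- List the multi-minmax operator classes available for BRIN.
SELECT opcname
FROM pg_opclass c
JOIN pg_am a ON a.oid = c.opcmethod
WHERE a.amname = 'brin'
  AND c.opcname LIKE '%minmax_multi%'
ORDER BY opcname;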
Let's re-create the BRIN index on the table with outliers, but this time use the new multi-minmax operator class:
db=# DROP INDEX t_outliers_happened_at_brin_minmax;
DROP INDEX
db=# CREATE INDEX t_outliers_happened_at_brin_multi_minmax ON t_outliers
USING brin(happened_at timestamptz_minmax_multi_ops)
WITH (pages_per_range=10);
CREATE INDEX
To create a multi-minmax BRIN index we explicitly provide a specific operator class. The type of the column happened_at is timestamptz, so we use the corresponding timestamptz_minmax_multi_ops operator class.
The multi-minmax index holds more information than the single minmax index, so it should be slightly bigger in size:
db=# \di+ t_outliers_happened_at_brin_multi_minmax
List of relations
─[ RECORD 1 ]─┬─────────────────────────────────────────
Schema │ public
Name │ t_outliers_happened_at_brin_multi_minmax
Type │ index
Owner │ haki
Table │ t_outliers
Persistence │ permanent
Access method │ brin
Size │ 1304 kB
Description │
Indeed, the multi-minmax index is 1304 kB - bigger than the single minmax index that weighs 520 kB.
To check if we get our money's worth for the multi-minmax index, execute the query on the table with the outliers:
db=# EXPLAIN (ANALYZE)
SELECT * FROM t_outliers WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';
QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bitmap Heap Scan on t_outliers (cost=1290.00..1574.23 rows=1 width=1016)
(actual time=29.518..29.566 rows=60 loops=1)
Recheck Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Rows Removed by Index Recheck: 60
Heap Blocks: lossy=18
-> Bitmap Index Scan on t_outliers_happened_at_brin_multi_minmax (cost=0.00..1290.00 rows=72 width=0)
(actual time=29.491..29.492 rows=180 loops=1)
Index Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Planning Time: 5.807 ms
Execution Time: 29.620 ms
The database used the multi-minmax index and the query completed in 29ms - a big improvement compared to more than 3s with the single range minmax, and not much worse than the query on the table with no outliers which took 8ms.
Notice the number of "Rows Removed by Index Recheck" - it is once again very small, only 60 rows. This means the index did a good job at pointing the database to the most relevant pages in the table. In other words, the index produced very few false positives.
To better understand how the multi-minmax BRIN index achieved this improvement, inspect the contents of the first index block:
db=# SELECT blknum, value
FROM brin_page_items(get_raw_page('t_outliers_happened_at_brin_multi_minmax', 12), 't_outliers_happened_at_brin_multi_minmax')
ORDER BY 1 LIMIT 3;
─[ RECORD 1 ]────────────────────────────────────────────────────────────
blknum │ 4500
value │ {{
nranges: 2
nvalues: 14
maxvalues: 32
ranges: {
"2023-01-01 08:45:01+00 ... 2023-01-01 08:45:02+00",
"2023-01-01 08:45:16+00 ... 2023-01-01 08:46:09+00"}
values: {
"2023-01-01 08:45:03+00",
"2023-01-01 08:45:04+00",
"2023-01-01 08:45:05+00",
"2023-01-01 08:45:06+00",
"2023-01-01 08:45:07+00",
"2023-01-01 08:45:08+00",
"2023-01-01 08:45:09+00",
"2023-01-01 08:45:10+00",
"2023-01-01 08:45:11+00",
"2023-01-01 08:45:12+00",
"2023-01-01 08:45:13+00",
"2023-01-01 08:45:14+00",
"2023-01-01 08:45:15+00",
"2024-01-01 08:46:10+00"}}}
The contents of the multi-minmax BRIN index now contain both ranges and values. A value is essentially a single-value range. For example, the value 2023-01-01 08:45:15+00 is equivalent to the range 2023-01-01 08:45:15+00 .. 2023-01-01 08:45:15+00. For optimization reasons, single-value ranges are kept as single values and not as ranges.
Potential outliers will usually end up in the list of values. Notice that the outlier value 2024-01-01 08:46:10+00 is not in any of the ranges. This is how the multi-minmax BRIN index prevents outliers from affecting the index.
How Multi-minmax BRIN Index Works
In the previous example you may have noticed that the multi-minmax index entry contained a lot of values, some of which are consecutive:
db=# SELECT blknum, value
FROM brin_page_items(get_raw_page('t_outliers_happened_at_brin_multi_minmax', 12), 't_outliers_happened_at_brin_multi_minmax')
ORDER BY 1 LIMIT 3;
─[ RECORD 1 ]────────────────────────────────────────────────────────────
blknum │ 4500
value │ {{
nranges: 2
nvalues: 14
maxvalues: 32
ranges: {
"2023-01-01 08:45:01+00 ... 2023-01-01 08:45:02+00",
"2023-01-01 08:45:16+00 ... 2023-01-01 08:46:09+00"}
values: {
"2023-01-01 08:45:03+00",
"2023-01-01 08:45:04+00",
"2023-01-01 08:45:05+00",
"2023-01-01 08:45:06+00",
"2023-01-01 08:45:07+00",
"2023-01-01 08:45:08+00",
"2023-01-01 08:45:09+00",
"2023-01-01 08:45:10+00",
"2023-01-01 08:45:11+00",
"2023-01-01 08:45:12+00",
"2023-01-01 08:45:13+00",
"2023-01-01 08:45:14+00",
"2023-01-01 08:45:15+00",
"2024-01-01 08:46:10+00"}}}
These values could potentially be combined into their own range of 2023-01-01 08:45:03+00 ... 2023-01-01 08:45:15+00, or even added to the existing range 2023-01-01 08:45:01+00 ... 2023-01-01 08:45:02+00. The reason they aren't is the multi-minmax algorithm.
A multi-minmax BRIN index defines values_per_range - the maximum number of values per range of adjacent pages (the default is 32, which is why the index entries we inspected earlier show maxvalues: 32). A minmax range takes up 2 values. When adding new values to a page range, the algorithm works as follows:
- Does the range of adjacent pages contain more than values_per_range values?
  - No -> continue
  - Yes -> group values into minmax ranges to reduce the number of values to values_per_range / 2
Consider the following simulation using a table with a single integer column n:
CREATE TABLE t_brin_minmax (n int);
CREATE INDEX t_brin_minmax_index ON t_brin_minmax
USING brin(n int4_minmax_multi_ops(values_per_range=8))
WITH (pages_per_range=2);
We create a table with a single integer field and a multi-minmax BRIN index with 2 pages per range, and a maximum of 8 values per range.
To visualize the multi-minmax BRIN algorithm we add values from 1 to 10 and for each value we print the contents of the index page:
DO $$
DECLARE
n integer;
page_items text := '';
BEGIN
TRUNCATE t_brin_minmax;
FOREACH n IN ARRAY array[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] LOOP
INSERT INTO t_brin_minmax(n) VALUES (n);
SELECT value INTO page_items
FROM brin_page_items(get_raw_page('t_brin_minmax_index', 2), 't_brin_minmax_index');
RAISE NOTICE 'inserted=% --> %', n, page_items;
END LOOP;
END $$;
NOTICE: inserted=1 --> {{nranges: 0 nvalues: 1 maxvalues: 8 values: {1}}}
NOTICE: inserted=2 --> {{nranges: 0 nvalues: 2 maxvalues: 8 values: {1,2}}}
NOTICE: inserted=3 --> {{nranges: 0 nvalues: 3 maxvalues: 8 values: {1,2,3}}}
NOTICE: inserted=4 --> {{nranges: 0 nvalues: 4 maxvalues: 8 values: {1,2,3,4}}}
NOTICE: inserted=5 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 5"} values: {1,2,3}}}
NOTICE: inserted=6 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 6"} values: {1,2,3}}}
NOTICE: inserted=7 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 7"} values: {1,2,3}}}
NOTICE: inserted=8 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 8"} values: {1,2,3}}}
NOTICE: inserted=9 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 9"} values: {1,2,3}}}
NOTICE: inserted=10 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 10"} values: {1,2,3}}}
Values 1, 2, 3, 4 are added to the index one after the other, but they are not grouped into a range. Only when the value 5 is added does the number of values exceed values_per_range / 2 = 4, and the database starts grouping values to reduce the number of entries back to 4. At this point, the database joins 5 to 4 to create the range 4..5, and the rest of the values are kept as single values. As we add more values to the index, the range 4..5 is expanded until it reaches 4..10. We don't exceed 4 values again, so the values 1, 2 and 3 remain as single values.
The order of insertion is significant. If we insert values in a different order the resulting index will be different as well:
DO $$
DECLARE
n integer;
page_items text := '';
BEGIN
TRUNCATE t_brin_minmax;
FOREACH n IN ARRAY array[5, 1, 10, 4, 2, 8, 3, 7, 9, 6] LOOP
INSERT INTO t_brin_minmax(n) VALUES (n);
SELECT value INTO page_items
FROM brin_page_items(get_raw_page('t_brin_minmax_index', 2), 't_brin_minmax_index');
RAISE NOTICE 'inserted=% --> %', n, page_items;
END LOOP;
END $$;
NOTICE: inserted=5 --> {{nranges: 0 nvalues: 1 maxvalues: 8 values: {5}}}
NOTICE: inserted=1 --> {{nranges: 0 nvalues: 2 maxvalues: 8 values: {1,5}}}
NOTICE: inserted=10 --> {{nranges: 0 nvalues: 3 maxvalues: 8 values: {1,5,10}}}
NOTICE: inserted=4 --> {{nranges: 0 nvalues: 4 maxvalues: 8 values: {1,4,5,10}}}
NOTICE: inserted=2 --> {{nranges: 1 nvalues: 3 maxvalues: 8 ranges: {"4 ... 5"} values: {1,2,10}}}
NOTICE: inserted=8 --> {{nranges: 2 nvalues: 2 maxvalues: 8 ranges: {"1 ... 2","4 ... 5"} values: {8,10}}}
NOTICE: inserted=3 --> {{nranges: 2 nvalues: 2 maxvalues: 8 ranges: {"1 ... 2","3 ... 5"} values: {8,10}}}
NOTICE: inserted=7 --> {{nranges: 3 nvalues: 1 maxvalues: 8 ranges: {"1 ... 2","3 ... 5","7 ... 8"} values: {10}}}
NOTICE: inserted=9 --> {{nranges: 4 nvalues: 0 maxvalues: 8 ranges: {"1 ... 2","3 ... 5","7 ... 8","9 ... 10"}}}
NOTICE: inserted=6 --> {{nranges: 3 nvalues: 1 maxvalues: 8 ranges: {"1 ... 2","3 ... 5","7 ... 10"} values: {6}}}
The index is now different. Instead of 1 minmax range and 3 values, the index now contains 3 minmax ranges and 1 value.
Results Summary
As mentioned before, one of the main benefits of lossy indexes such as BRIN over other types of indexes, such as B-Tree or hash indexes, is their size. To get a fair comparison between the index types, consider a similar B-Tree index on happened_at:
db=# CREATE INDEX t_happened_at_btree ON t(happened_at);
CREATE INDEX
db=# \di+ t_happened_at_btree
List of relations
─[ RECORD 1 ]─┬────────────────────
Schema │ public
Name │ t_happened_at_btree
Type │ index
Owner │ haki
Table │ t
Persistence │ permanent
Access method │ btree
Size │ 21 MB
Description │
db=# EXPLAIN (ANALYZE)
SELECT * FROM t WHERE happened_at BETWEEN '2023-01-12 13:45:00 UTC' AND '2023-01-12 13:46:00 UTC';
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────
Index Scan using t_happened_at_btree on t (cost=0.42..18.75 rows=66 width=1016)
(actual time=0.036..0.109 rows=61 loops=1)
Index Cond: ((happened_at >= '2023-01-12 13:45:00+00'::timestamp with time zone)
AND (happened_at <= '2023-01-12 13:46:00+00'::timestamp with time zone))
Planning Time: 0.244 ms
Execution Time: 0.145 ms
(4 rows)
The B-Tree index is much faster, but also much bigger!
Here is a recap of the timing and size of the indexes we used:
Scenario | Index | Timing | Index Size |
---|---|---|---|
Perfect correlation | No index | 1334 ms | - |
Perfect correlation | BRIN minmax | 8.52 ms | 520 kB |
Perfect correlation | B-Tree | 0.145 ms | 21 MB |
Outliers | BRIN minmax | 3257.412 ms | 520 kB |
Outliers | BRIN multi-minmax | 29.620 ms | 1304 kB |
The single minmax BRIN indexes are the same size with and without the outliers. The multi-minmax BRIN index is larger because it stores more information. This may be a small price to pay if your data contains outliers.
The Back Story
The example in this article is contrived - it was engineered to inject an outlier into every range so that the single-range BRIN index would perform as poorly as possible. However, this article was motivated by a similar scenario we reached naturally in our systems.
We have a table of payment processes. Payment processes are usually short lived objects - in less than a minute a normal payment process is either authorized or aborted. However, once in a while a user may abandon a payment process in the middle, leaving the object in a non-terminal state. To overcome that, we've set up a scheduled task to identify these processes and mark them as expired.
When we create a new payment process we keep track of the creation date. It is used by analytical and scheduled queries. The creation date column had a high correlation (~0.97) so we figured a BRIN index would speed things up. At first, queries ran fast, but over time they became slower and slower, to the point where they were constantly timing out. We were puzzled - we had a field with high correlation, yet the BRIN index performed very poorly. We couldn't figure out why. Only when we inspected the contents of the index using pageinspect did we realize what the problem was.
An update command in PostgreSQL is equivalent to a delete and an insert. However, when you delete a row the space does not immediately become available for new data, so the updated tuple is written to the next available space in the table (because some of the updated columns are indexed, HOT updates don't help). Previously used space can be reused only after vacuum is executed. Vacuum is not executed all the time; it is triggered automatically only after a certain number of modifications to the table. The actual frequency depends on the storage settings of the table.
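If you suspect this is happening to one of your tables, the statistics views give a quick hint about how much the table is being churned and when autovacuum last cleaned it up. A sketch - payment_process here is a hypothetical table name standing in for ours:

-- How many updates and dead tuples the table has accumulated,
-- and when it was last vacuumed (payment_process is a placeholder name).
SELECT n_tup_upd, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'payment_process';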
We eventually realized that our expire task shuffled the table, moving rows with recent creation date to old blocks, introducing outliers in the BRIN ranges.
After upgrading our PostgreSQL installation we switched the single minmax BRIN index to a multi-minmax BRIN index and the problem was solved. The old BRIN index was 1.6MB and by the time we got to benchmark it, the query timed out (i.e. more than 5 minutes). After we switched to multi-minmax, the size of the index was 12MB and the benchmark query finished in 5s - more than acceptable for what we used it for.