I recently had to produce reports on a table containing events on users' account balances. A user can deposit to and withdraw from their account, and support personnel can set the account's credit, which is the maximum amount the user can overdraw.
The table looked roughly like this:
db=# SELECT * FROM event;
id | account | type | happened_at | data
----+---------+------------+------------------------+-----------------------------------
1 | 1 | created | 2019-08-01 15:14:13+03 | {"credit": 0, "delta_balance": 0}
2 | 1 | deposited | 2019-08-01 15:15:15+03 | {"delta_balance": 100}
3 | 1 | withdraw | 2019-08-02 09:35:33+03 | {"delta_balance": -50}
4 | 1 | credit_set | 2019-08-03 16:14:12+03 | {"credit": 50}
5 | 1 | withdraw | 2019-08-03 14:45:44+03 | {"delta_balance": -30}
6 | 1 | credit_set | 2019-08-03 16:14:12+03 | {"credit": 100}
7 | 1 | withdraw | 2019-08-03 16:15:09+03 | {"delta_balance": -50}
(7 rows)
The table and data setup:
CREATE TABLE event (
id serial primary key,
account int,
type text,
happened_at timestamp with time zone,
data jsonb
);
SET datestyle = 'ISO, YMD';
INSERT INTO event (
id, account, type, happened_at, data
)
VALUES
(1, 1, 'created', '2019-08-01 15:14:13', '{"delta_balance": 0, "credit": 0}'),
(2, 1, 'deposited', '2019-08-01 15:15:15', '{"delta_balance": 100}'),
(3, 1, 'withdraw', '2019-08-02 09:35:33', '{"delta_balance": -50}'),
(4, 1, 'credit_set', '2019-08-03 16:14:12', '{"credit": 50}'),
(5, 1, 'withdraw', '2019-08-03 14:45:44', '{"delta_balance": -30}'),
(6, 1, 'credit_set', '2019-08-03 16:14:12', '{"credit": 100}'),
(7, 1, 'withdraw', '2019-08-03 16:15:09', '{"delta_balance": -50}')
;
To get the current balance of an account, we sum the changes in delta_balance:
db=# SELECT
account,
SUM((data->'delta_balance')::int) AS balance
FROM
event
GROUP BY
account;
account | balance
---------+---------
1 | -30
The data field contains information specific to each type of event. To extract the value of delta_balance from the data column, we use the arrow operator (->) provided by PostgreSQL.
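For example, -> extracts a jsonb value that we can then cast to int (a quick standalone illustration; the sibling operator ->> returns text instead):
db=# SELECT ('{"delta_balance": -50}'::jsonb -> 'delta_balance')::int AS delta;
 delta
-------
   -50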
The result of the query shows that the current balance of account 1 is -30. This means the account is in overdraft. To check if this is within the allowed range, we need to compare it to the credit set for this account. The credit for account 1 was initially set to 0 when the account was created. The credit was then adjusted twice, and is currently set to 100.
To get the current state of an account, we need its aggregated balance and the latest credit that was set for it.
The Problem
In Oracle there is a function called last that we can use to get the last credit_set event. A query using last might look like this:
-- Oracle
SELECT
account,
MAX(CASE WHEN type = 'credit_set' THEN data ELSE null END)
KEEP (DENSE_RANK LAST ORDER BY id) AS credit
FROM
event
GROUP BY
account;
PostgreSQL also has a LAST_VALUE analytic function. However, analytic functions cannot be used in a group by the way aggregate functions can:
db=# SELECT
account,
LAST_VALUE(data) OVER (
PARTITION BY account ORDER BY id
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS credit
FROM
event
WHERE
type = 'credit_set'
GROUP BY
account;
ERROR: column "event.data" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: LAST_VALUE(data) OVER (
The error tells us that the data field used in the analytic function must be used in the group by. This is not really what we want. To use PostgreSQL's LAST_VALUE function, we need to remove the group by:
db=# SELECT
account,
LAST_VALUE(data) OVER (
PARTITION BY account ORDER BY id
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS credit
FROM
event
WHERE
type = 'credit_set';
account | credit
---------+-----------------
1 | {"credit": 100}
1 | {"credit": 100}
These are not exactly the results we need. Analytic, or window, functions operate on a set of rows, not within a group by.
PostgreSQL doesn't have a built-in function to obtain the first or last value in a group using group by.
To get the last value in a group by, we need to get creative!
Old Fashioned SQL
The plain SQL solution is to divide and conquer. We already have a query to get the current balance of an account. If we write another query to get the credit for each account, we can join the two together and get the complete state of an account.
To get the last event for each account in PostgreSQL we can use DISTINCT ON:
db=# SELECT DISTINCT ON (account)
account,
data->'credit' AS credit
FROM
event
WHERE
type = 'credit_set'
ORDER BY
account,
id DESC;
account | credit
---------+--------
1 | 100
Great! Using DISTINCT ON we got the last credit set for each account.
DISTINCT ON
I've written about the many faces of DISTINCT in PostgreSQL.
The next step is to join the two queries, and get the complete account state:
db=# SELECT
a.account,
a.balance,
b.credit
FROM
(
SELECT
account,
SUM((data->'delta_balance')::int) AS balance
FROM
event
GROUP BY
account
) AS a
JOIN
(
SELECT DISTINCT ON (account)
account,
data->'credit' AS credit
FROM
event
WHERE
type = 'credit_set'
ORDER BY
account,
id DESC
) AS b
ON a.account = b.account;
account | balance | credit
---------+---------+--------
1 | -30 | 100
We got the expected result.
Before we move on, let's take a glance at the execution plan:
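The plan is the output of prefixing the query with EXPLAIN (exact costs and row estimates will vary with your data and statistics):
db=# EXPLAIN
SELECT
    a.account,
    a.balance,
    b.credit
FROM
    (
        SELECT
            account,
            SUM((data->'delta_balance')::int) AS balance
        FROM
            event
        GROUP BY
            account
    ) AS a
JOIN
    (
        SELECT DISTINCT ON (account)
            account,
            data->'credit' AS credit
        FROM
            event
        WHERE
            type = 'credit_set'
        ORDER BY
            account,
            id DESC
    ) AS b
ON a.account = b.account;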
QUERY PLAN
---------------------------------------------------------------------------------------------
Hash Join (cost=44.53..49.07 rows=4 width=44)
Hash Cond: (event.account = b.account)
-> HashAggregate (cost=25.00..27.00 rows=200 width=12)
Group Key: event.account
-> Seq Scan on event (cost=0.00..17.50 rows=750 width=36)
-> Hash (cost=19.49..19.49 rows=4 width=36)
-> Subquery Scan on b (cost=19.43..19.49 rows=4 width=36)
-> Unique (cost=19.43..19.45 rows=4 width=40)
-> Sort (cost=19.43..19.44 rows=4 width=40)
Sort Key: event_1.account, event_1.id DESC
-> Seq Scan on event event_1 (cost=0.00..19.39 rows=4 width=40)
Filter: (type = 'credit_set'::text)
The event table is being scanned twice, once for each subquery. The DISTINCT ON subquery also requires a sort by account and id. The two subqueries are then joined using a hash join.
PostgreSQL is unable to combine the two subqueries into a single scan of the table. If the event table is very large, performing two full table scans, a sort, and a hash join might become slow and consume a lot of memory.
common table expression
It's tempting to use a common table expression (CTE) to make the query more readable. But CTEs are currently optimization fences, and using one here will most definitely prevent PostgreSQL from performing any optimization that involves both subqueries.
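For comparison, this is a sketch of what the CTE version would look like:
-- More readable, but each CTE may be planned in isolation (an optimization fence).
WITH balance AS (
    SELECT
        account,
        SUM((data->'delta_balance')::int) AS balance
    FROM
        event
    GROUP BY
        account
),
credit AS (
    SELECT DISTINCT ON (account)
        account,
        data->'credit' AS credit
    FROM
        event
    WHERE
        type = 'credit_set'
    ORDER BY
        account,
        id DESC
)
SELECT
    balance.account,
    balance.balance,
    credit.credit
FROM
    balance
    JOIN credit ON balance.account = credit.account;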
The Array Trick
Using good ol' SQL got us the answer, but it took two passes on the table. We can do better with the following trick:
db=# SELECT
account,
SUM((data->'delta_balance')::int) AS balance,
(MAX(ARRAY[id, (data->'credit')::int]) FILTER (WHERE type = 'credit_set'))[2] AS credit
FROM
event
GROUP BY
account;
account | balance | credit
---------+---------+--------
1 | -30 | 100
This is so much simpler than the previous query, so let's break it down.
How PostgreSQL Compares Arrays
To understand what exactly is going on here, we first need to understand how PostgreSQL compares arrays:
Array comparisons compare the array contents element-by-element, using the default B-tree comparison function for the element data type.
When comparing arrays, PostgreSQL will go element by element and compare the values according to their type. To demonstrate:
db=# SELECT greatest(ARRAY[1, 200], ARRAY[2, 100]);
greatest
----------
{2,100}
The first element of the second array (2) is larger than the first element of the first array (1), so the second array is the greatest.
db=# SELECT greatest(ARRAY[1, 200], ARRAY[1, 201]);
greatest
----------
{1,201}
(1 row)
The first elements of both arrays are equal (1), so PostgreSQL moves on to the next element. In this case, the second element of the second array (201) is the greatest.
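The same element-by-element rule applies to the regular comparison operators:
db=# SELECT ARRAY[1, 200] < ARRAY[2, 100] AS less_than;
 less_than
-----------
 t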
Max by Key...
Using this feature of PostgreSQL, we construct an array where the first element is the value to sort by, and the second element is the value we want to keep. In our case, we want to get the credit by the max id:
MAX(ARRAY[id, (data->'credit')::int])
Not all events set credit, so we need to restrict the result to credit_set events:
MAX(ARRAY[id, (data->'credit')::int]) FILTER (WHERE type = 'credit_set')
The result of this expression is an array:
db=# SELECT
account,
MAX(ARRAY[id, (data->'credit')::int]) FILTER (WHERE type = 'credit_set')
FROM
event
GROUP BY
account;
account | max
---------+---------
1 | {6,100}
We only want the second element, the value of credit:
(MAX(ARRAY[id, (data->'credit')::int]) FILTER (WHERE type = 'credit_set'))[2]
And this is it! This way we can get the last credit set for each account.
Next, let's examine the execution plan:
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=32.50..34.50 rows=200 width=16)
Group Key: account
-> Seq Scan on event (cost=0.00..17.50 rows=750 width=72)
Simple plan for a simple query!
The main benefit of this approach is that it only needs one pass of the table and no sort.
Caveats
This approach is very useful, but it has some restrictions:
caveat
All elements must be of the same type.
PostgreSQL does not support arrays with different types of elements. As an example, if we wanted to get the last credit set by date, and not by id:
db=# SELECT
account,
SUM((data->'delta_balance')::int) AS balance,
(MAX(
ARRAY[happened_at, (data->'credit')::int]
) FILTER (WHERE type = 'credit_set'))[2] AS credit
FROM
event
GROUP BY
account;
ERROR: ARRAY types timestamp with time zone and integer cannot be matched
LINE 4: (MAX(ARRAY[happened_at, (data->'credit')::int]) FILTER...
PostgreSQL tells us that an array cannot contain both timestamps and integers.
We can overcome this restriction in some cases by casting one of the elements:
db=# SELECT
account,
SUM((data->'delta_balance')::int) AS balance,
(MAX(
ARRAY[EXTRACT('EPOCH' FROM happened_at), (data->'credit')::int]
) FILTER (WHERE type = 'credit_set'))[2] AS credit
FROM
event
GROUP BY
account;
account | balance | credit
---------+---------+--------
1 | -30 | 100
We converted the timestamp to its epoch value, the number of seconds since 1970. Once both elements can be resolved to a common numeric type, we can use the array trick.
caveat
Query might consume a little more memory.
This one is a bit of a stretch, but as we demonstrated in the past, large group and sort keys consume more memory in joins and sorts. Using the array trick, the group key is an array, which is a bit larger than the plain fields we usually sort by.
Also, the array trick can be used to "piggyback" more than one value:
(MAX(ARRAY[
id,
(data->'credit')::int,
EXTRACT('EPOCH' FROM happened_at)
]) FILTER (WHERE type = 'credit_set'))[2:]
This expression returns both the last credit set and the time at which it was set. The entire array is used for sorting, so the more values we put in the array, the larger the group key gets.
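Putting it together, a sketch of a full query (the column alias credit_and_epoch is made up for illustration):
SELECT
    account,
    SUM((data->'delta_balance')::int) AS balance,
    (MAX(ARRAY[
        id,
        (data->'credit')::int,
        EXTRACT('EPOCH' FROM happened_at)
    ]) FILTER (WHERE type = 'credit_set'))[2:] AS credit_and_epoch
FROM
    event
GROUP BY
    account;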
Conclusion
The array trick is very useful, and can significantly simplify complicated queries and improve performance. We use it to produce reports on time series data, and to generate read models from event tables.
Call out to my readers
I'm pretty sure that using a user-defined aggregate function in PostgreSQL, it should be possible to create a function with the signature MAX_BY(key, value). I haven't had time to dig deep into custom aggregate functions, but if any of my readers do, please share your implementation and I'll be happy to post it here.
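Here is a rough, untested sketch of what such an aggregate might look like (all names are made up, and NULL keys are not handled carefully):
-- State is a {key, value} pair; replace it whenever a larger key arrives.
CREATE FUNCTION max_by_sfunc(state int[], key int, value int)
RETURNS int[] AS $$
    SELECT CASE
        WHEN state IS NULL OR key > state[1] THEN ARRAY[key, value]
        ELSE state
    END;
$$ LANGUAGE SQL IMMUTABLE;

-- Return only the value, dropping the key we ordered by.
CREATE FUNCTION max_by_ffunc(state int[])
RETURNS int AS $$
    SELECT state[2];
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE max_by(key int, value int) (
    SFUNC = max_by_sfunc,
    STYPE = int[],
    FINALFUNC = max_by_ffunc
);

-- Usage sketch:
-- MAX_BY(id, (data->'credit')::int) FILTER (WHERE type = 'credit_set')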
UPDATED: Aug 17, 2019
Comments From Readers
In the few days following the publication of this article, I received several suggestions and comments from readers. This is a summary of the comments I received, and my thoughts on them.
One commenter on Reddit suggested using ARRAY_AGG:
db=# SELECT
account,
((ARRAY_AGG(data ORDER BY happened_at DESC)
FILTER (WHERE type = 'credit_set'))[1] -> 'credit')::int AS credit
FROM
event
GROUP BY account;
account | credit
---------+--------
1 | 50
This approach obviously works, and it doesn't require the key and the value to be of the same type, which is a big limitation of the array trick.
The downside to this approach is that it requires a sort which might become expensive with very large data sets:
QUERY PLAN
------------------------------------------------------------------
GroupAggregate (cost=1.17..1.36 rows=7 width=8)
Group Key: account
-> Sort (cost=1.17..1.19 rows=7 width=76)
Sort Key: account
-> Seq Scan on event (cost=0.00..1.07 rows=7 width=76)
Another commenter on Reddit suggested using window functions in combination with DISTINCT ON. This is the original suggestion:
SELECT DISTINCT ON (account)
account,
FIRST_VALUE((data->'credit')::int) OVER w,
LAST_VALUE((data->'credit')::int) OVER w,
SUM((data->'credit')::int) OVER w
FROM
event
WHERE
type = 'credit_set'
WINDOW w AS (
PARTITION BY account
ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
The query uses both DISTINCT ON and window functions. It works by calculating the aggregates using window functions over the entire set (all rows of the account), and then fetching the first or last row using DISTINCT ON.
To make the window functions behave like a group by and calculate the aggregates over the entire set, the frame is defined as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, meaning "for the entire partition".
To avoid repeating the window for every aggregate, the query uses a WINDOW clause to define a named window that can be used multiple times in the query.
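For example, a minimal sketch of a named window shared by two window functions:
SELECT
    id,
    SUM((data->'delta_balance')::int) OVER w AS running_balance,
    COUNT(*) OVER w AS events_so_far
FROM
    event
WINDOW w AS (PARTITION BY account ORDER BY id);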
This query, however, doesn't really work, because the where clause restricts it to events of type credit_set. To get the complete status of the account, we also need to aggregate the balance from all events.
To actually make this approach work, we need to make the following adjustments:
db=# SELECT DISTINCT ON (account)
account,
LAST_VALUE((data->'credit')::int) OVER (
PARTITION BY account
ORDER BY (
CASE
WHEN type = 'credit_set' THEN happened_at
ELSE null
END
) ASC NULLS FIRST
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as credit,
SUM(COALESCE((data->'delta_balance')::int, 0)) OVER (PARTITION BY account) AS balance
FROM
event;
account | credit | balance
---------+--------+---------
1 | 100 | -30
What changes did we make?
- We had to ditch the where clause so all events are processed.
- We had to do some "creative sorting" to get the last credit_set event.
- We removed the named window because it was no longer reused.
The plan has also gotten more complicated:
Unique (cost=1.19..1.55 rows=7 width=24)
-> WindowAgg (cost=1.19..1.54 rows=7 width=24)
-> WindowAgg (cost=1.19..1.38 rows=7 width=48)
-> Sort (cost=1.19..1.20 rows=7 width=44)
Sort Key: account, (CASE WHEN (type = 'credit_set'::text)
THEN happened_at ELSE NULL::timestamp with time zone
END) NULLS FIRST
-> Seq Scan on event (cost=0.00..1.09 rows=7 width=44)
A sort, two window aggregates, and a unique node. The bottom line, in my opinion, is that this approach is harder to maintain and it yields a significantly more complicated plan. I wouldn't use it in this case. However, like the previous approach, it is not restricted by the types of the key and value.
In response to my tweet, a reader pointed me to an old wiki page with an implementation of two custom aggregate functions, FIRST and LAST.
After creating the custom aggregates in the database as instructed in the wiki page, the query can look like this:
SELECT
account,
SUM(COALESCE((data->>'delta_balance')::int, 0)) AS balance,
LAST((data->>'credit')::int) FILTER (WHERE type = 'credit_set') AS credit
FROM
event
GROUP BY
account;
The main issue I found with this approach is that the order seems to be arbitrary. First and last are only meaningful in the context of some order, and I couldn't find a way to provide a field to sort by, so I consider this approach flawed for this use case.
As I suspected, this use case is ideal for custom aggregates and extensions, and indeed, another reader on Reddit pointed me to the extension "first_last". The API is roughly similar to the custom aggregate above, but it also offers a way to sort the results, so the first and last values are not arbitrary.
I did not install the extension, but the query should look something like this:
SELECT
account,
SUM(COALESCE((data->>'delta_balance')::int, 0)) AS balance,
LAST((data->>'credit')::int, 1 ORDER BY happened_at)
FILTER (WHERE type = 'credit_set') AS credit
FROM
event
GROUP BY
account;