How to Use Grouping Sets in Django

How We Cut Response Time In Half Using Advanced SQL In Django


I recently had the pleasure of optimizing an old dashboard. The solution we came up with required some advanced SQL that Django does not support out of the box. In this article I present the solution, how we got to it, and a word of caution.

advanced SQL

This article covers advanced topics in SQL aggregation. If you need to perform GROUP BY in Django ORM check out Understand Group by in Django with SQL.

The Dashboard

The dashboard is of a sales model. It includes a simple table with metrics grouped by merchants and their devices, and a summary line.

The code to produce the table looks roughly like this:

from django.db.models import Sum, Count, Avg, Q

metrics =  {
    'total': Count('id'),
    'avg_charged_amount': Avg('charged_amount'),
    'unique_users': Count('user', distinct=True),
}

results = (
    Sale.objects
    .values('merchant', 'device')
    .annotate(**metrics)
)

The code to produce the summary line uses the same metrics, and looks likes this:

summary =  Sale.objects.annotate(**metrics)

Our admin gets a nice dashboard that looks roughly like this:

A summary line in Django Admin page
A summary line in Django Admin page

see also

On how to create the dashboard above see How to Turn Django Admin Into a Lightweight Dashboard

The Problem

The dashboard was working well for about three years. We got good response times and accurate information. However, as data piled up, performance has degraded to a point where the page became unusable.

To analyze the problem, we inspected the SQL, and timed it. The query to produce the table looks like this:

SELECT
    merchant,
    device,
    COUNT(id) AS total,
    AVG(charged_amount) AS avg_charged_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM
    sale
GROUP BY
    merchant,
    device

At worst, this query took about 30s to complete.

The next query executed by the dashboard was used to produce the summary line:

SELECT
    COUNT(id) AS total,
    AVG(charged_amount) AS avg_charged_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM
    sale

This query took roughly the same time, ~30s. Together, at their worst, the two queries took more than a minute to complete.

Aggregate in Memory

Both queries process the exact same data, the only difference is the GROUP BY key. The first query produces results at the merchant and device level, the second produces the same aggregates for the entire dataset.

The first thought that came to mind was to calculate the summary by aggregating the results in-memory.

The first metric, total, is easy to calculate:

summary_total = sum(r['total'] for r in results)

The second metric is the average charged amount. We can't just sum the average for each device and merchant, we need additional information.

To calculate the average charged amount for all merchants and devices, we need to divide the total charged amount by the number of sales. We already have the number of sales, so we need to add a metric for the total charged amount:

metrics =  {
    'total': Count('id'),
    'avg_charged_amount': Avg('charged_amount'),
    'total_charged_amount': Sum('charged_amount'),
    'unique_users': Count('user', distinct=True),
}

Now that we have both total and total_charged_amount, we can compute avg_charged_amount:

summary_total = sum(r['total'] for r in results)
summary_total_charged_amount = sum(r['total_charged_amount'] for r in results)
summary_avg_charged_amount = summary_total_charged_amount / summary_total

We have one metric left, unique_users. This metric counts the unique number of users that visited each device at each merchant. The same user can visit several devices at different merchants. If we sum unique_users we won't get the correct metric for the entire set.

It's impossible to accurately compute distinct values from aggregated results, so the solution must be in the database.


Aggregating in the Database

Most SQL implementations provide several useful functions to aggregate data at different levels.

Database Support

Throughout this article I use PostgreSQL. Similar functions are available in Oracle, MySQL and MSSQL. As far as I know, SQLite has no support for the functions I'm about to use.

Let's start with some data:

CREATE TABLE sale (
    id              integer,
    merchant        varchar,
    device          varchar,
    user_id         integer,
    charged_amount  integer,
    sold_at         timestamp
);

INSERT INTO sale
    (id, merchant,  device, user_id, charged_amount, sold_at)
VALUES
    (1,  'Walmart', 'D1',   1,       54,             NOW() + INTERVAL '1 days'),
    (2,  'Walmart', 'D1',   1,       100,            NOW() + INTERVAL '2 days'),
    (3,  'Walmart', 'D2',   1,       22,             NOW() + INTERVAL '1 days'),
    (4,  'Walmart', 'D2',   3,       14,             NOW() + INTERVAL '3 days'),
    (5,  'Walmart', 'D2',   4,       15,             NOW() + INTERVAL '4 days'),
    (6,  'Walmart', 'D3',   2,       29,             NOW() + INTERVAL '1 days'),
    (7,  'Walmart', 'D3',   2,       45,             NOW() + INTERVAL '3 days'),
    (8,  'Costco',  'D1',   1,       47,             NOW() + INTERVAL '5 days'),
    (9,  'Costco',  'D1',   4,       223,            NOW() + INTERVAL '1 days'),
    (10, 'Costco',  'D1',   5,       67,             NOW() + INTERVAL '2 days'),
    (11, 'Costco',  'D4',   1,       25,             NOW() + INTERVAL '5 days'),
    (12, 'Costco',  'D5',   3,       125,            NOW() + INTERVAL '4 days'),
    (13, 'Costco',  'D5',   4,       150,            NOW() + INTERVAL '1 days');

The query we used in our dashboard produces the following results:

db=# SELECT
    merchant,
    device,
    COUNT(id) AS total,
    AVG(charged_amount) AS avg_charged_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM
    sale
GROUP BY
    merchant,
    device;

 merchant | device | total |  avg_charged_amount  | unique_users
----------+--------+-------+----------------------+--------------
 Costco   | D1     |     3 | 112.3333333333333333 |            3
 Costco   | D4     |     1 |  25.0000000000000000 |            1
 Costco   | D5     |     2 | 137.5000000000000000 |            2
 Walmart  | D1     |     2 |  77.0000000000000000 |            1
 Walmart  | D2     |     3 |  17.0000000000000000 |            3
 Walmart  | D3     |     2 |  37.0000000000000000 |            1
(6 rows)

The query to produce the summary line:

db=# SELECT
    COUNT(id) AS total,
    AVG(charged_amount) AS avg_charged_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM
    sale;

 total | avg_charged_amount  | unique_users
-------+---------------------+--------------
    13 | 70.4615384615384615 |            5
(1 row)

Rollup

The first special GROUP BY expression is ROLLUP. As the name suggest, ROLLUP aggregate at the lowest level and up:

db=# SELECT
    device,
    merchant,
    COUNT(*)
FROM
    sale
GROUP BY
    ROLLUP(device, merchant);

 device | merchant | count
--------+----------+-------
        |          |    13
 D3     | Walmart  |     2
 D5     | Costco   |     2
 D1     | Walmart  |     2
 D4     | Costco   |     1
 D1     | Costco   |     3
 D2     | Walmart  |     3
 D2     |          |     3
 D4     |          |     1
 D1     |          |     5
 D5     |          |     2
 D3     |          |     2
(12 rows)

We grouped by two fields, device and merchant, and we got three groups of aggregation:

  1. () all
  2. (device, merchant)
  3. (device)

ROLLUP aggregates "up", so the order of the fields is significant. Let's flip the order of the fields:

db=# SELECT
    device,
    merchant,
    count(*)
FROM
    sale
GROUP BY
    ROLLUP(merchant, device);

 device | merchant | count
--------+----------+-------
        |          |    13
 D4     | Costco   |     1
 D3     | Walmart  |     2
 D1     | Walmart  |     2
 D1     | Costco   |     3
 D5     | Costco   |     2
 D2     | Walmart  |     3
        | Costco   |     6
        | Walmart  |     7

This time we got the following groups:

  1. () all
  2. (merchant, device)
  3. (merchant)

Cube

The next group by expression is most likely borrowed from OLAP, which often mention cubes. The CUBE expression aggregates all possible combinations:

db=# SELECT
    device,
    merchant,
    count(*)
FROM
    sale
GROUP BY
    CUBE(merchant, device);

 device | merchant | count
--------+----------+-------
        |          |    13
 D4     | Costco   |     1
 D3     | Walmart  |     2
 D1     | Walmart  |     2
 D1     | Costco   |     3
 D5     | Costco   |     2
 D2     | Walmart  |     3
        | Costco   |     6
        | Walmart  |     7
 D2     |          |     3
 D4     |          |     1
 D1     |          |     5
 D5     |          |     2
 D3     |          |     2
(14 rows)

The results contain the following groups:

  1. () all
  2. (device, merchant)
  3. (merchant)
  4. (device)

Grouping Sets

Grouping sets allows us to provide the exact groups of aggregation we want. For example, to recreate the results of the ROLLUP above, we can provide the following grouping sets:

db=# SELECT
    device,
    merchant,
    count(*)
FROM
    sale
GROUP BY
    GROUPING SETS (
        (merchant),
        (merchant, device),
        ()
    )

 device | merchant | count
--------+----------+-------
        |          |    13
 D4     | Costco   |     1
 D3     | Walmart  |     2
 D1     | Walmart  |     2
 D1     | Costco   |     3
 D5     | Costco   |     2
 D2     | Walmart  |     3
        | Costco   |     6
        | Walmart  |     7
(9 rows)

Each list of fields inside parentheses in the GROUPING SETS is a group in the result.

Both CUBE and ROLLUP can be implemented using GROUPING SETS. The following table shows the equivalent GROUPING SETS expression for both ROLLUP and CUBE, on two fields a and b:

expression equivalent grouping sets
ROLLUP(a, b) GROUPING SETS ((a, b), (a), ())
CUBE(a, b) GROUPING SETS ((a, b), (a), (b) ())

In our original query we had metrics at the merchant and device level, and we wanted to get a summary line. Using GROUPING SETS, this query will look like this:

db=# SELECT
    merchant,
    device,
    COUNT(id) AS total,
    AVG(charged_amount) AS avg_charged_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM
    sale
GROUP BY
    GROUPING SETS (
        (merchant, device),
        ()
    )

 merchant | device | total |  avg_charged_amount  | unique_users
----------+--------+-------+----------------------+--------------
 Costco   | D1     |     3 | 112.3333333333333333 |            3
 Costco   | D4     |     1 |  25.0000000000000000 |            1
 Costco   | D5     |     2 | 137.5000000000000000 |            2
 Walmart  | D1     |     2 |  77.0000000000000000 |            1
 Walmart  | D2     |     3 |  17.0000000000000000 |            3
 Walmart  | D3     |     2 |  37.0000000000000000 |            1
          |        |    13 |  70.4615384615384615 |            5
(7 rows)

The first 6 lines are similar to the original query. The last line is similar to the results from the summary query we used.

Using GROUPING SETS we get the results we need in just one query instead of two.


Using Grouping Sets in Django

Now that we have the query, we need to find a way to use it with Django. Unfortunately, Django still has no support for grouping sets. On top of that, the query is generated by Django Admin, and it includes predicates from list filters and date hierarchy. We couldn't just use raw SQL.

We need to find a way to modify a given Django QuerySet, and add grouping sets to it.

Since Django has no built-in support for grouping sets, we are forced to manipulate the query. The base query we need to manipulate is the query that Django generates, along with any predicates and annotations added by Django Admin. Eventually, we want to execute the query in the database the same way Django does.

Getting the Query

A nice feature of Django QuerySet is that it provides the generated SQL:

>>> qs = (
>>> ... Sale.objects
>>> ... .values('merchant', 'device')
>>> ... .annotate(**metrics)
>>> )
>>> print(qs.query)
SELECT "sale"."merchant", "sale"."device", COUNT("sale"."id") AS "total",
AVG("sale"."charged_amount") AS "avg_charged_amount",
COUNT(DISTINCT "sale"."user_id") AS "unique_users" FROM "sale"
GROUP BY "sale"."merchant", "sale"."device"

This is a simple query, can we execute it directly in the database?

>>> from django.db import connection
>>>
>>> with connection.cursor() as cursor:
>>> ... cursor.execute(str(qs.query))
>>> ... results = cursor.fetchall()
>>> results
[('Costco', 'D1', 3, 112.3333333333333333, 3),
('Costco', 'D4', 1, 25.0000000000000000, 1),
...
('Walmart', 'D3', 2, 37.0000000000000000, 1)]

This looks like something we can work with, let's dig deeper...

As mentioned before, the QuerySet is generated by Django Admin, and it might include predicates for list filters and date hierarchy. Let's try to execute a query with a predicate on the sold_at date field:

>>> import datetime
>>> from django.db import connection
>>> from django.utils import timezone
>>>
>>> qs = qs.filter(sold_at__lt=timezone.now())
>>> with connection.cursor() as cursor:
>>> ... cursor.execute(str(qs.query))
*** django.db.utils.ProgrammingError: syntax error at or near "00"
LINE 1: ..."sale"."sold_at" >= 2019-03-02 00:00:00+0...

Looks like Django is unable to execute the query as is. The reason for that is that the text generated by str(qs.query) is just a text representation of the query. Under the hood, Django uses proper bind variables (might also be known as substitution variables) to avoid SQL injection.

Much of the Django ORM QuerySet logic is carried out by an internal class called Query. The class is not documented. The only place to learn about it is in the source. One promising function of Query is sql_with_params. Let's use it on the query above, and see what we get:

>>> qs = qs.filter(sold_at__lt=timezone.now())
>>> sql, params = qs.query.sql_with_params()
>>> sql
SELECT "sale"."merchant", "sale"."device", COUNT("sale"."id") AS "total",
AVG("sale"."charged_amount") AS "avg_charged_amount",
COUNT(DISTINCT "sale"."user_id") AS "unique_users" FROM "sale"
WHERE "sale"."sold_at" < %s
GROUP BY "sale"."merchant", "sale"."device" ORDER BY "sale"."merchant", "sale"."device"
>>> params
(datetime.datetime(2019, 3, 2, 0, 0, tzinfo=<UTC>))

The function sql_with_params returns a tuple. The first argument is the SQL query. The second, is a list of parameters to that query.

The keen-eyed might have spotted the placeholder %s in the query text:

WHERE "sale"."sold_at" < %s

This placeholder corresponds to the parameter we got in the second argument. Let's try to execute the query with the placeholder, and the params:

>>> import datetime
>>> from django.db import connection
>>> from django.utils import timezone
>>>
>>> qs = qs.filter(sold_at__lt=timezone.now())
>>> sql, params = qs.query.sql_with_params()
>>>
>>> with connection.cursor() as cursor:
>>> ... cursor.execute(sql, params)
>>> ... results = cursor.fetchall()
>>> results
[('Costco', 'D1', 3, 112.3333333333333333, 3),
('Costco', 'D4', 1, 25.0000000000000000, 1),
...
('Walmart', 'D3', 2, 37.0000000000000000, 1)]

Great! We are now able to execute a query as Django does. We are ready to manipulate the query.

Manipulating the Query

The query generated by Django includes a simple GROUP BY clause:

GROUP BY "sale"."merchant", "sale"."device" ORDER BY "sale"."merchant", "sale"."device"

We want to replace that with the following group by clause:

GROUP BY GROUPING SETS (("sale"."merchant", "sale"."device"), ()) ORDER BY "sale"."merchant", "sale"."device"

This looks like a job for re.

We want catch the grouped by fields between GROUP BY and ORDER BY, and make them the first group in the GROUPING SET expression. Then, we want add the group () for the summary:

import re

sql_with_grouping_sets = re.sub(
    r'GROUP BY (.*) ORDER',
    r'GROUP BY GROUPING SETS (( \1 ), ()) ORDER',
    sql,
)

Now we can take the modified query, and execute it with the params:

>>> with connection.cursor() as cursor:
>>> ... cursor.execute(sql_with_grouping_sets, params)
>>> ... results = cursor.fetchall()
>>> results
[('Costco', 'D1', 3, 112.3333333333333333, 3),
('Costco', 'D4', 1, 25.0000000000000000, 1),
...
('Walmart', 'D3', 2, 37.0000000000000000, 1)]
(None, None, 13, 70.4615384615384615, 5)]

Lo and behold... We got both the results and the summary line in a single query.

Conclusion

Several important issues to consider about this approach:

  • Don't do this!: This is as bad as it gets. This approach is a nice exercise, and a great opportunity to explore the ORM internals, but the implementation is too fragile. When using an internal, undocumented API, there is no guaranty it wont change unexpectedly in the future. Having said that, we decided to use this approach in one of our internal admin pages. It's a very specific scenario involving a queryset that's not used for any user facing features. It helped us cut the page response time exactly in half and we are pleased with the result.

  • Make the sort order deterministic: When using GROUPING SETS (and ROLLUP or CUBE for that matter), you mix more than one level of aggregation in a single query. To be able to fetch the results in a predictable way, it's important to explicitly sort the results. For example, in the query above, to make sure the summary row is the first row, add the following sort order qs.order_by( F('merchant').desc(nulls_last=False) ).




Similar articles