Understand Group by in Django with SQL

Django QuerySets and SQL side by side


Aggregation is a source of confusion in any type of ORM and Django is no different. The documentation provides a variety of examples and cheat-sheets that demonstrate how to group and aggregate data using the ORM, but I decided to approach this from a different angle.

In this article I put QuerySets and SQL side by side. If SQL is where you are most comfortable, this is the Django GROUP BY cheat-sheet for you.

Image by <a href="https://unsplash.com/photos/D4YrzSwyIEc">Jason Leung</a>
Image by Jason Leung

Table of Contents


How to Group By in Django

To demonstrate different GROUP BY queries, I will use models from Django's built-in django.contrib.auth app.

>>> from django.contrib.auth.models import User

Django ORM produces SQL statements with long aliases. For brevity, I will show a cleaned-up, but equivalent, version of what Django executes.

SQL Logging

To see the SQL actually executed by Django, you can turn on SQL logging in the Django settings.

How to Count Rows

Let's count how many users we have:

SELECT
    COUNT(*)
FROM
    auth_user;
User.objects.count()

Counting rows is so common that Django includes a function for it right on the QuerySet. Unlike other QuerySets we'll see next, count returns a number.

How to Use Aggregate Functions

Django offers two more ways to count rows in a table.

We'll start with aggregate:

SELECT
    COUNT(id) AS id__count
FROM
    auth_user;
from django.db.models import Count

User.objects.aggregate(Count('id'))

To use aggregate we imported the aggregate function Count. The function accepts an expression to count. In this case, we used the name of the primary key column id to count all the rows in the table.

Aggregate NULL

Aggregations ignore NULL values. For more on how aggregations handle NULL, see 12 Common Mistakes and Missed Optimization Opportunities in SQL.

The result of aggregate is a dict:

>>> from django.db.models import Count
>>> User.objects.aggregate(Count('id'))
{"id__count": 891}

The name of the key is derived from the name of the field and the name of the aggregate. In this case, it's id__count. It's a good idea not to rely on this naming convention, and instead provide your own name:

SELECT
    COUNT(id) as total
FROM
    auth_user;
>>> from django.db.models import Count
>>> User.objects.aggregate(total=Count('id'))
{"total": 891}

The name of the argument to aggregate is also the name of the key in the resulting dictionary.

How to Group By

Using aggregate we got the result of applying the aggregate function on the entire table. This is useful, but usually we want to apply the aggregation on groups of rows.

Let's count users by their active status:

SELECT
    is_active,
    COUNT(id) AS total
FROM
    auth_user
GROUP BY
    is_active
(User.objects
.values('is_active')
.annotate(total=Count('id')))

This time we used the function annotate. To produce a GROUP BY we use a combination of values and annotate:

  • values('is_active'): what to group by
  • annotate(total=Count('id')): what to aggregate

The order is important: failing to call values before annotate will not produce aggregate results.

Just like aggregate, the name of the argument to annotate is the key in the result of the evaluated QuerySet. In this case it's total.

How to Filter a QuerySet With Group By

To apply aggregation on a filtered query you can use filter anywhere in the query. For example, count only staff users by their active status:

SELECT
    is_active,
    COUNT(id) AS total
FROM
    auth_user
WHERE
    is_staff = True
GROUP BY
    is_active
(User.objects
.values('is_active')
.filter(is_staff=True)
.annotate(total=Count('id')))

How to Sort a QuerySet With Group By

Like filter, to sort a queryset use order_by anywhere in the query:

SELECT
    is_active,
    COUNT(id) AS total
FROM
    auth_user
GROUP BY
    is_active
ORDER BY
    is_active,
    total
(User.objects
.values('is_active')
.annotate(total=Count('id'))
.order_by('is_staff', 'total'))

Notice that you can sort by both the GROUP BY key and the aggregate field.

How to Combine Multiple Aggregations

To produce multiple aggregations of the same group, add multiple annotations:

SELECT
    is_active,
    COUNT(id) AS total,
    MAX(date_joined) AS last_joined
FROM
    auth_user
GROUP BY
    is_active
from django.db.models import Max

(User.objects
.values('is_active')
.annotate(
    total=Count('id'),
    last_joined=Max('date_joined'),
))

The query will produce the number of active and inactive users, and the last date a user joined in each group.

How to Group by Multiple Fields

Just like performing multiple aggregations, we might also want to group by multiple fields. For example, group by active status and staff status:

SELECT
    is_active,
    is_staff,
    COUNT(id) AS total
FROM
    auth_user
GROUP BY
    is_active,
    is_staff
(User.objects
.values('is_active', 'is_staff')
.annotate(total=Count('id')))

The result of this query includes is_active, is_staff and the number of users in each group.

How to Group by an Expression

Another common use case for GROUP BY is to group by an expression. For example, count the number of users that joined each year:

SELECT
    EXTRACT('year' FROM date_joined),
    COUNT(id) AS total
FROM
    auth_user
GROUP BY
    EXTRACT('year' FROM date_joined)
(User.objects
.values('date_joined__year')
.annotate(total=Count('id')))

Notice that to get the year from the date we used the special expression <field>__year in the first call to values(). The result of the query is a dict, and the name of the key will be date_joined__year.

Sometimes, the built-in expressions are not enough, and you need to aggregate on a more complicated expression. For example, group by users that have logged in since they signed-up:

SELECT
  last_login > date_joined AS logged_since_joined,
  COUNT(id) AS total
FROM
  auth_user
GROUP BY
  last_login > date_joined
from django.db.models import (
    ExpressionWrapper,
    Q, F, BooleanField,
)

(User.objects
.annotate(
    logged_since_joined=ExpressionWrapper(
        Q(last_login__gt=F('date_joined')),
        output_field=BooleanField(),
    )
)
.values('logged_since_joined')
.annotate(total=Count('id'))
.values('logged_since_joined', 'total')

The expression here is fairly complicated. We first use annotate to built the expression, and we mark it as a GROUP BY key by referencing the expression in the following call to values(). From here on, it's exactly the same.

How to Use Conditional Aggregation

Using conditional aggregation, you can aggregate only a part of the group. Conditions come in handy when you have multiple aggregates. For example, count the number of staff and non-staff users by the year they signed-up:

SELECT
    EXTRACT('year' FROM date_joined),

    COUNT(id) FILTER (
        WHERE is_staff = True
    ) AS staff_users,

    COUNT(id) FILTER (
        WHERE is_staff = False
    ) AS non_staff_users

FROM
    auth_user
GROUP BY
    EXTRACT('year' FROM date_joined)
from django.db.models import F, Q

(User.objects
.values('date_joined__year')
.annotate(
    staff_users=(
        Count('id', filter=Q(is_staff=True))
    ),
    non_staff_users=(
        Count('id', filter=Q(is_staff=False))
    ),
))

The SQL above is from PostgreSQL, which along with SQLite is currently the only database backend that supports the FILTER syntax shortcut (formally called "selective aggregates"). For other database backends, the ORM will use CASE ... WHEN instead.

tip

I previously wrote about aggregations with filters. Check out my 9 Django tips for working with databases.

How to Use Having

The HAVING clause is used to filter on the result of an aggregate function. For example, find the years in which more than a 100 users joined:

SELECT
    is_active,
    COUNT(id) AS total
FROM
    auth_user
GROUP BY
    is_active
HAVING
    COUNT(id) > 100
(User.objects
.annotate(year_joined=F('date_joined__year'))
.values('is_active')
.annotate(total=Count('id'))
.filter(total__gt=100))

The filter on the annotated field total added an HAVING clause in the generated SQL.

How to Group by Distinct

For some aggregate functions such as COUNT, it is sometimes desirable to only count distinct occurrences. For example, how many different last names are there per user active status:

SELECT
    is_active,
    COUNT(id) AS total,
    COUNT(DISTINCT last_name) AS unique_names
FROM
    auth_user
GROUP BY
    is_active
(User.objects
.values('is_active')
.annotate(
    total=Count('id'),
    unique_names=Count('last_name', distinct=True),
))

Notice the use of distinct=True in the call to Count.

How to Create Expressions Using Aggregate Fields

Aggregate fields are often just the first step to a greater question. For example, what is the percent of unique last names by user active status:

SELECT
    is_active,
    COUNT(id) AS total,
    COUNT(DISTINCT last_name) AS unique_names,
    (COUNT(DISTINCT last_name)::float
        / COUNT(id)::float) AS pct_unique_names
FROM
    auth_user
GROUP BY
    is_active
from django.db.models import FloatField
from django.db.models.functions import Cast

(User.objects
.values('is_active')
.annotate(
    total=Count('id'),
    unique_names=Count('last_name', distinct=True),
)
.annotate(pct_unique_names=(
    Cast('unique_names', FloatField())
    / Cast('total', FloatField())
))

The first annotate() defines the aggregate fields. The second annotate() uses the aggregate function to construct an expression.

How to Group By Across Relations

So far we've used only data in a single model, but aggregates are often used across relations. The simpler scenario is of a one-to-one or a foreign key relation. For example, say we have a UserProfile with a one-to-one relationship to the User, and we want to count users by the type of profile:

SELECT
    p.type,
    COUNT(u.id) AS total
FROM
    auth_user u
    JOIN user_profile p ON u.id = p.user_id
GROUP BY
    p.type
(User.objects
.values('user_profile__type')
.annotate(total=Count('id')))

Just like GROUP BY expressions, using relations in values will group by that field. Note that the name of the user profile type in the result will be 'user_profile__type'.

How to Group By a Many to Many Relationship

A more complicated type of relation is the many to many relationship. For example, count in how many groups each user is a member:

SELECT
    u.id,
    COUNT(ug.group_id) AS memberships
FROM
    auth_user
    LEFT OUTER JOIN auth_user_groups ug ON (
        u.id = ug.user_id
    )
GROUP BY
    u.id
(User.objects
.annotate(memberships=Count('groups'))
.values('id', 'memberships'))

A user can be a member of more than one group. To count the number of groups the user is member of we used the related name "groups" in the User model. If the related name is not explicitly set (and not explicitly disabled), Django will automatically generate a name in the format {related model model}_set. For example, group_set.


Going Further

To dig deeper into the ORM and GROUP BY in particular, check out these links:




Similar articles