A few weeks ago we encountered a major performance regression in one of our main admin pages. The page took more than 10 seconds to load (at best) and hit the query execution timeout at worst.
The page was an admin list view of a transactions model, one of the main models in our app. The model is used by support personal on a daily basis. It has millions of rows, and used several joins to display relevant information.
The most common use for the page was to filter transactions by a certain period, most commonly the last day. We used Django admin date hierarchy to drill down on the creation date of the records. When we investigated the issue, we found that the date hierarchy was the cause for most of the time spent loading the admin page.
Identifying the Problem
The filtered URL looked like this:
/admin/transactions/created__year=2017&created__month=11
We identified the "heavy" query as the one fetching the data to populate the list. The query performed the join with all the lookup tables and applied the filters we listed in the ModelAdmin. The relevant WHERE clause was:
WHERE created BETWEEN
'2017-01-01T00:00:00+00:00'::timestamptz AND
'2017-12-31T23:59:59.999999+00:00'::timestamptz
AND EXTRACT('month' FROM created AT TIME ZONE 'UTC') = 11)
When we inspected the execution plan of this query we found this snippet at the bottom:
Index Scan using ix on transactions_transaction (cost=0.43..90663.65 rows=4561 width=471)
Index Cond: ((created >= '2017-01-01 02:00:00+02'::timestamp with time zone)
AND (created <= '2018-01-01 01:59:59.999999+02'::timestamp with time zone))
Filter: (date_part('month'::text, timezone('UTC'::text, created)) = '11'::double precision)
PostgreSQL decided to use the index on the created
column. The estimate was 4,561 rows and the cost estimate maxed at 90,000. This estimate was wildly inaccurate which made us drew our first conclusion:
The estimate made by the database is very low and inaccurate.
We first suspected that the low estimate was due to stale statistics on the column. We gathered stats on the column and tried again:
haki=# analyze transaction_transaction (created);
No change. Estimate remained the same.
We decided to give the query another look. We found this snippet in the WHERE clause a bit odd:
WHERE
created BETWEEN
'2017-01-01T00:00:00+00:00'::timestamptz
AND '2017-12-31T23:59:59.999999+00:00'::timestamptz
AND EXTRACT('month' FROM created AT TIME ZONE 'UTC') = 11
Django applied a filter on the full year and then used EXTRACT
to filter only the month.
We decided to check what happens if we remove the EXTRACT
function and instead simplify the condition:
WHERE created BETWEEN
'2017-11-01T00:00:00+00:00'::timestamptz
AND '2017-11-30T23:59:59.999999+00:00'::timestamptz
The execution plan for that query was:
Index Scan using ix on transactions_transaction (cost=0.43..1265.75 rows=13716 width=471)
Index Cond: ((created >= ‘2017–11–01 02:00:00+02'::timestamp with time zone)
AND (created <= ‘2017–12–01 01:59:59.999999+02'::timestamp with time zone))
The estimate is still low but the cost estimate is now significantly lower. This execution plan helped us reach our second conclusion:
The way Django apply the filter makes it difficult for the database to optimize the query.
The Problem With the Way Date Hierarchy is Implemented
Django filters the queryset for a given level in the date hierarchy using a database function to extract the relevant date part.
A function is opaque to the database optimizer. If you have a range-based (BTREE) index on the field, using EXTRACT
does not limit the range at all. The index is not utilized properly which might lead to a sub-optimal execution plan.
Once we had a better understanding of the problem we started discussing possible solutions.
Function based index
A function based index is an index on an expression. In our case, an appropriate function based index might look like this:
CREATE INDEX transactions_transaction_created_month_brin
ON transactions_transaction
USING BRIN(EXTRACT('month' FROM created AT TIME ZONE'UTC'));
The index made the query run faster but it came at a cost.
The downside to this approach is having to maintain additional indexes for each level of the hierarchy (day and month). Additional indexes slow down insert and update operations, and take up space.
Another downside is the index size. We used a BRIN index in this case to minimize the size of the index. The table is naturally clustered by the creation date so this is an ideal use case for a BRIN index. When this is not the case, a similar BTREE index can become quite heavy.
see also
More on how BRIN indexes work in 9-django-tips-for-working-with-databases.
Simplify the Condition Used by Django Date Hierarchy
We decided to see if we can simplify the condition used by Django to apply the date hierarchy.
To implement the filter differently we first need to understand how Django admin applies filters on a queryset.
In django/contrib/admin/views/main.py there is a function called get_filters. It might look scary at first but what it does is:
- Extract all the query parameters from the URL.
- Get all the
ListFilter
declared in theModelAdmin
and apply them one by one to the parameter list. - Each
ListFilter
receive the parameter list, takes what it needs and remove the value from the parameter list. - Any parameters left after all
ListFilter
's were applied are processed using the "default" Django filter. This is why, for example, it's possible to filter the queryset directly from the URL even when aListFilter
is not explicitly defined.
If you look at the date hierarchy query parameters you'll see that there is nothing special about them, they are just regular URL params. This sparked an idea:
Implement a ListFilter to grab the relevant date hierarchy parameters from the parameter list and apply a custom filter on the queryset.
The Implementation
Let's start by grabbing the date hierarchy fields from the parameter list:
class RangeBasedDateHierarchyListFilter(admin.ListFilter):
title = ''
def __init__(self, request, params, model, model_admin):
self.date_hierarchy_field = model_admin.date_hierarchy
self.date_hierarchy = {}
date_hierarchy_field_re = re.compile(
r'^{}__(day|month|year)$'.format(self.date_hierarchy_field)
)
for param in list(params.keys()):
match = date_hierarchy_field_re.match(param)
if match:
period = match.group(1)
self.date_hierarchy[period] = int(params.pop(param))
- We take the
date_hierarchy
field name from themodel_admin.
In our case it wascreated
. - We create a Regex pattern to identify the parameters in the URL. The pattern is always the name of the date hierarchy field + the period (day, month, year).
In our case the possible parameters arecreated__day
,created__month
andcreated__year
. - We iterate the parameter list, pop any date hierarchy parameter that match our criteria and store the period and the value in a dict.
Now, this is where all the magic happens:
class RangeBasedDateHierarchyListFilter(admin.ListFilter):
# ...
def queryset(self, request, queryset):
tz = timezone.get_default_timezone()
from_date, to_date = get_date_range_for_hierarchy(self.date_hierarchy, tz)
return queryset.filter(**{
'{}__gte'.format(self.date_hierarchy_field): from_date,
'{}__lt'.format(self.date_hierarchy_field): to_date,
})
Django will call the queryset
function of our ListFilter
with a queryset, and the function is expected to return the filtered queryset.
In the example above this is happening in a separate function (so we can test it, you know...):
def get_date_range_for_hierarchy(date_hierarchy, tz):
"""Generate date range for date hierarchy.
date_hierarchy <dict>:
year (int)
month (int or None)
day (int or None)
tz <timezone or None>:
The timezone in which to generate the datetimes.
If None, the datetimes will be naive.
Returns (tuple):
from_date (datetime.datetime, aware if tz is set) inclusive
to_date (datetime.datetime, aware if tz is set) exclusive
"""
from_date = datetime.datetime(
date_hierarchy['year'],
date_hierarchy.get('month', 1),
date_hierarchy.get('day', 1),
)
if tz:
from_date = tz.localize(from_date)
if 'day' in date_hierarchy:
to_date = from_date + datetime.timedelta(days=1)
elif 'month' in date_hierarchy:
assert from_date.day == 1
to_date = (from_date + datetime.timedelta(days=32)).replace(day=1)
else:
to_date = from_date.replace(year=from_date.year + 1)
return from_date, to_date
The function receives the dict we constructed in __init__
and returns a date range to filter on. The queryset
function then applies a simpler range filter that our database can better utilize.
To use the simplified range condition in our ModelAdmin
we need to add it as a list_filter
:
class TransactionAdmin(admin.ModelAdmin):
date_hierarchy = 'created'
list_filter = (
#...
RangeBasedDateHierarchyListFilter,
)
The Result
After we deployed this change the performance of the page improved drastically. Our database was happy, the support team was happy and so were we.
source code
See our package django-lightweight-date-hierarchy on github and pypi.