A few weeks ago we had some trouble with an "Export to Excel" functionality in one of our systems. In the process of resolving this issue, we made some interesting discoveries and came up with original solutions.
This article is based on the actual issue we used to track this incident over a period of two days. We go through the process of identifying the problem, experimenting with and benchmarking different solutions, until eventually deploying to production.
These are the main takeaways described in this article:
- Generating xlsx files can consume a significant amount of resources.
- Under some circumstances, better performance can be gained by not using `prefetch_related`.
- `pyexcelerate` is a fast package for creating simple Excel files.
- `tablib` (and `django-import-export`) can be patched to use `pyexcelerate` and produce Excel files faster.
A few weeks ago we started getting complaints from users about slow response time from one of our systems. A quick glance at the server metrics showed higher than normal CPU usage. This system is mostly IO intensive, so high CPU usage is not something we experience regularly.
The first thing we did was identify the worker process consuming high CPU using `htop`. After getting the process identifier (PID), we used py-spy to get a glance at what it was doing:
$ py-spy -p 8187 -r 1000
This command samples the process 1,000 times per second and provides a `top`-like view of the results:
Total Samples 17974
GIL: 0.00%, Active: 0.00%, Threads: 1
OwnTime TotalTime Function (filename:line)
0.000s 173.7s get_response (django/core/handlers/base.py:75)
0.000s 173.7s inner (django/core/handlers/exception.py:34)
0.000s 173.7s __call__ (django/utils/deprecation.py:94)
0.000s 173.7s __call__ (django/core/handlers/wsgi.py:141)
0.000s 173.6s view (django/views/generic/base.py:71)
0.000s 173.6s _get_response (django/core/handlers/base.py:113)
0.000s 173.6s dispatch (django/contrib/auth/mixins.py:52)
0.000s 173.6s dispatch (django/contrib/auth/mixins.py:109)
0.000s 173.6s dispatch (django/views/generic/base.py:97)
0.050s 173.6s get (dashboard/views/list_views.py:100)
0.000s 94.69s get_resource_to_export (dashboard/views/list_views.py:70)
0.000s 94.69s export (dashboard/views/list_views.py:73)
0.000s 94.68s export (dashboard/resources.py:215)
0.000s 83.81s __iter__ (django/db/models/query.py:274)
0.040s 82.73s _fetch_all (django/db/models/query.py:1242)
0.000s 78.84s export (dashboard/views/list_views.py:74)
0.000s 70.58s __iter__ (django/db/models/query.py:55)
0.000s 68.98s execute_sql (django/db/models/sql/compiler.py:1100)
68.81s 68.81s _execute (django/db/backends/utils.py:84)
0.000s 68.81s _execute_with_wrappers (django/db/backends/utils.py:76)
0.000s 68.81s execute (django/db/backends/utils.py:67)
0.000s 50.11s save (tablib/packages/openpyxl3/workbook.py:186)
0.000s 50.11s export_set (tablib/formats/_xlsx.py:46)
0.000s 46.41s save (tablib/packages/openpyxl3/writer/excel.py:124)
0.000s 46.41s save_workbook (tablib/packages/openpyxl3/writer/excel.py:141)
0.000s 42.40s _fetch_all (django/db/models/query.py:1244)
0.000s 42.40s _prefetch_related_objects (django/db/models/query.py:771)
0.000s 42.38s prefetch_related_objects (django/db/models/query.py:1625)
0.000s 41.94s prefetch_one_level (django/db/models/query.py:1738)
0.000s 41.25s get_prefetch_queryset (django/db/models/fields/related_descriptors.py:627)
0.000s 32.30s _write_worksheets (tablib/packages/openpyxl3/writer/excel.py:91)
After monitoring this view for a minute or two, we had a few insights:
- A lot of time is spent fetching data.
- A lot of time is spent on some call to `prefetch_related`.
- The problem is in the dashboard, and more specifically in the view that exports data.
With these insights, we wanted to move on and identify the exact view, so we turned to the nginx access log:
$ journalctl -u nginx -r | grep dashboard
We managed to identify several endpoints that were taking very long to execute. Some of them finished in just under 60 seconds; others were killed by PostgreSQL after hitting the `statement_timeout` limit and returned a 500 status code.
At this point we had a pretty good idea where the problem was, but we were still clueless as to why. The next step was to inspect the problematic code and try to reproduce the issue.
Exporting a QuerySet to Excel
The system is used to report and track violations in public transportation. During an inspection, the inspector documents different types of violations such as a dirty bus, a bus running late, etc. The models for this system look roughly like this:
class ViolationType(models.Model):
name = models.TextField()
class Inspection(models.Model):
id = models.AutoField(primary_key=True)
class Violation(models.Model):
inspection = models.ForeignKey(Inspection, on_delete=models.PROTECT)
violation_type = models.ForeignKey(ViolationType, on_delete=models.PROTECT)
comments = models.TextField()
Every once in a while, a back office user would download the inspection information to Excel for further analysis.
The report includes a lot of information about the inspection, but most importantly, it includes a list of the violation types for each inspection:
inspection, violations
1, dirty floors | full trash can
2, full trash can | no light | missing signs
Using django-import-export
To produce the Excel report we use a package called `django-import-export`. Using the package, we define a `ModelResource` that can produce an Excel file from a queryset:
from import_export import resources, fields, widgets
from .models import Inspection, Violation
class InspectionResource(resources.ModelResource):
violations = fields.Field(
widget=widgets.ManyToManyWidget(Violation, field='violation_type')
)
class Meta:
model = Inspection
fields = (
'id',
'violations',
)
The query produced by this `ModelResource` causes an N+1 queries issue, so before we ever deployed it to production we patched it and added `prefetch_related`:
from django.db.models import Prefetch
from import_export import resources, fields, widgets
from .models import Inspection, Violation
class InspectionResource(resources.ModelResource):
violations = fields.Field(
widget=widgets.ManyToManyWidget(Violation, field='violation_type')
)
class Meta:
model = Inspection
fields = (
'id',
'violations',
)
def export(self, queryset=None):
queryset = (
queryset
.prefetch_related(Prefetch(
'violations',
queryset=Violation.objects.select_related('violation_type'),
to_attr='prefetched_violations',
))
)
return super().export(list(queryset))
def dehydrate_violations(self, inspection: Inspection) -> str:
return '\n'.join(
v.violation_type.name
for v in inspection.prefetched_violations
)
To use `prefetch_related` in a `ModelResource` we had to make the following changes:
- Override `export` and adjust the query to prefetch the violations using `prefetch_related`. We use the `Prefetch` object because we needed to customize the prefetch query, and add the violation type name from a related table.
- Evaluate the query and have the export function return a list instead of a queryset. `django-import-export` uses `iterator` to speed up the query. Using `iterator()`, the ORM uses a cursor to iterate over the data in chunks and reduce memory usage. While this is usually useful, Django is unable to use `iterator()` with `prefetch_related`.
- Add a custom `dehydrate_` function for the violations field that renders a newline-delimited list of violation type names.
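The N+1 pattern that `prefetch_related` avoids can be illustrated without Django at all. The following is a toy sketch using the standard library's `sqlite3` (the tables and data are made up for illustration, not taken from the system above): the naive approach issues one query per parent row, while the "prefetch" approach covers all parents with a single extra query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE inspection (id INTEGER PRIMARY KEY);
    CREATE TABLE violation (inspection_id INTEGER, name TEXT);
    INSERT INTO inspection VALUES (1), (2);
    INSERT INTO violation VALUES
        (1, 'dirty floors'), (1, 'full trash can'), (2, 'no light');
""")

# The N+1 pattern: one query for the parent rows, then one more per parent.
queries = 1
inspections = conn.execute('SELECT id FROM inspection').fetchall()
for (pk,) in inspections:
    conn.execute(
        'SELECT name FROM violation WHERE inspection_id = ?', (pk,)
    ).fetchall()
    queries += 1

# The "prefetch" pattern: a single extra query covering all parents at once.
prefetched = conn.execute(
    'SELECT inspection_id, name FROM violation WHERE inspection_id IN (1, 2)'
).fetchall()

print(queries)          # 3 queries for just 2 inspections
print(len(prefetched))  # all 3 child rows in one query
```

With 10,000 inspections the naive version issues 10,001 queries, which is exactly the kind of load the patched resource was meant to prevent.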
Prefetch Related
The resource was used by the view to produce the Excel report:
from django.http import HttpRequest, HttpResponse
from .models import Inspection
from .resources import InspectionResource
LIMIT = 10000
def export_to_excel(request: HttpRequest) -> HttpResponse:
inspections = Inspection.objects.all()
# Apply some filter on the queryset based on request
data = InspectionResource().export(inspections[:LIMIT]).xlsx
response = HttpResponse(data, content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
response['Content-Disposition'] = 'attachment; filename=export.xlsx'
return response
The view takes a request, applies some filters on the inspections, and produces the xlsx file using the `ModelResource`.
Finding the Best File Format
Before we can start improving the export process, we need to establish a baseline. To get the timings and identify the hot spots in the call stack we used `cProfile`. To identify and time query execution, we turned SQL logging on in the Django settings:
# settings.py
LOGGING = {
# ...
'loggers': {
'django.db.backends': {
'level': 'DEBUG',
},
# ...
},
}
The benchmark looked like this:
import cProfile
from .models import Inspection
from .resources import InspectionResource
qs = Inspection.objects.all()[:10000]
cProfile.run('InspectionResource().export(qs).xlsx')
These were the results of exporting 10,000 rows in xlsx format using `prefetch_related`:
56830808 function calls (47142920 primitive calls) in 41.574 seconds
select 5.009
prefetch 8.009
56660555 function calls (47149065 primitive calls) in 39.927 seconds
select 2.356
prefetch 7.991
We ran the benchmark twice to make sure the results were not affected by caches. The function took 40s to complete, and only 10s of that (25%) were spent in the database.
At this point, we suspected that the problem might be in the file format. This assumption was supported by the application server's high CPU usage.
Next, we wanted to try the same benchmark, only instead of xlsx we produced a csv:
cProfile.run('InspectionResource().export(qs).csv')
These were the results of exporting 10,000 rows in csv format using `prefetch_related`:
9179705 function calls (9107672 primitive calls) in 17.429 seconds
select 1.970
prefetch 8.343
Wow! That's a big improvement. This confirmed our suspicion that the actual production of the xlsx was the problem.
Before we moved on, we wanted to check another file format that might be more useful to our users, the old xls format:
cProfile.run('InspectionResource().export(qs).xls')
These were the results of exporting 10,000 rows in xls format using `prefetch_related`:
16317592 function calls (15745704 primitive calls) in 20.694 seconds
select 1.922
prefetch 7.976
OK, so that's surprising. I'm not familiar with the internals of the Microsoft Office file formats, but it seems like the old format is only a little bit slower than the csv format, and much faster than the new xlsx format.
These benchmark results brought up an old dilemma. In the past we served users only csv files, but they complained a lot about trouble opening the files, and about encoding and formatting issues. That was why we decided to produce xlsx in the first place, so at this point, producing xls files seemed like the best solution.
I should tell you now that using the old xls format was a bad decision, but we didn't know that yet.
Improving the Query
After reducing the overall execution time by half, our next targets were the queries. Two queries are executed to produce the dataset for the export. Before any change was made, the "main" query took ~2s and the prefetch ~8s to complete.
The "main" query looked like this:
SELECT
"inspection"."id",
-- around 50 more fields from joined tables
FROM
"inspection"
INNER JOIN "auth_user" ON ("inspection"."user_id" = "auth_user"."id")
-- around 11 more joined tables
The resource used a lot of data from related tables, and the query joined ~12 tables and had many fields listed in the SELECT clause. The table is one of the main tables in the database so it is heavily indexed, and the lookup tables were relatively small so the query didn't take long to complete.
The prefetch query looked like this:
SELECT
"violation"."id",
"violation"."inspection_id",
"violation"."violation_type_id",
"violation_type"."id",
"violation_type"."name",
FROM "violation"
INNER JOIN "violation_type" ON (
"violation"."violation_type_id" = "violation_type"."id"
)
WHERE
"violation"."inspection_id" = ANY([
2814, 9330, 8848, 8971, 9372, 9084, 78, 3896, 2609, 5177, 2866
-- another 10,000 inspection IDs
1399, 9348, 914, 8884, 9082, 3356, 2896, 742, 9432, 8926, 9153
])
This query seems innocent, but in fact, it took ~8s to complete. The execution plan of this query looked like this:
Nested Loop (cost=1000.28..2040346.39 rows=26741 width=181)
-> Gather (cost=1000.00..2032378.29 rows=26741 width=115)
Workers Planned: 2
-> Parallel Seq Scan on violation (cost=0.00..2028704.19 rows=11142 width=115)
Filter: (inspection_id = ANY ('{2814,9330,....,8926,9153}'::integer[]))
-> Index Scan using violationtype_pkey on violationtype (cost=0.28..0.30 rows=1 width=66)
Index Cond: (id = violation.violation_type_id)
I trimmed the execution plan for brevity, but the `Filter` line was three or four pages long, filled with IDs. This got us thinking: is it possible that this huge `ANY` filter is what's causing us trouble?
Replacing prefetch_related with Subquery and OuterRef
To answer this question, we decided to try to implement the query without `prefetch_related`. Instead, we used the new `Subquery` expression.
Using `Subquery`, the query using the ORM looked like this:
from django.db.models import OuterRef, Subquery, Value
from django.contrib.postgres.aggregates import ArrayAgg
inspections = inspections.annotate(violations_csv=Subquery(
Violation.objects
# Reference the inspection ID of the outer table, inspection.
.filter(inspection_id=OuterRef('id'))
# Prevent Django from adding a group by column.
.annotate(dummy=Value('1')).values('dummy')
# Construct an array of violation names.
.annotate(violations=ArrayAgg('violation_type__name', distinct=True))
.values('violations')
))
If you have never experimented with `Subquery`, there is a lot to take in here. Before we break it down, this is what the query looks like:
SELECT
"inspection"."id",
(
SELECT
ARRAY_AGG(DISTINCT U2."name") AS "violations"
FROM
"violation" U0
INNER JOIN "violationtype" U2 ON (U0."violation_type_id" = U2."id")
WHERE
U0."inspection_id" = ("inspection"."id")
) AS "violations_csv",
-- around 50 more fields from joined tables
FROM
"inspection"
INNER JOIN "auth_user" ON ("inspection"."user_id" = "auth_user"."id")
-- around 11 more joined tables
Now, let's break it down:
- `Subquery` is a query expression that can only exist inside another query. In this case, the outer query is `inspection`.
- `Subquery` is used in `annotate`, so the result of the subquery is stored in another column for each row.
- We added a dummy annotation to prevent Django from grouping the results. The subquery is executed for each inspection; this is what the filter using `OuterRef` does. For this reason, we don't need to group by any other column.
- The subquery must return at most one row, so we group the names into an array using `ARRAY_AGG`.
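The shape of this correlated subquery can be reproduced with the standard library's `sqlite3` (a toy sketch with made-up data, not the production schema). SQLite has no `ARRAY_AGG`, so `GROUP_CONCAT` stands in for it here; the important part is the inner query referencing the outer row's `id`, just like `OuterRef` does:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE inspection (id INTEGER PRIMARY KEY);
    CREATE TABLE violation_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE violation (inspection_id INTEGER, violation_type_id INTEGER);
    INSERT INTO inspection VALUES (1), (2);
    INSERT INTO violation_type VALUES (1, 'dirty floors'), (2, 'full trash can');
    INSERT INTO violation VALUES (1, 1), (1, 2), (2, 2);
""")

# A correlated subquery producing one aggregated value per inspection row,
# the same shape the ORM Subquery generates (GROUP_CONCAT replaces ARRAY_AGG).
rows = conn.execute("""
    SELECT
        inspection.id,
        (
            SELECT GROUP_CONCAT(vt.name, ' | ')
            FROM violation v
            JOIN violation_type vt ON v.violation_type_id = vt.id
            WHERE v.inspection_id = inspection.id
        ) AS violations_csv
    FROM inspection
""").fetchall()

print(rows)
```

Each outer row triggers one execution of the inner query, which is also why the database may fall back to a nested loop when the outer query is large.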
After all this hard work, we were keen to see if this is the silver bullet we were waiting for, but in fact, when we executed this on 10,000 rows it choked. To see it through, we executed the export function with only 1,000 rows.
These were the results of exporting 1,000 rows in xls format using subquery:
1571053 function calls (1514505 primitive calls) in 60.962 seconds
select 59.917
The query is now crazy slow. I won't paste the execution plan because there were so many other tables, but PostgreSQL used a nested loop join on the top level of the query to produce the value for this field. Surprisingly, the database did a significantly worse job than the ORM did in this case.
Using an Iterator
Before we completely abandoned this solution, we wanted to check one last thing. We previously mentioned that `django-import-export` uses `iterator()` to create a cursor over the results. We also mentioned that using `prefetch_related` prevents us from using `iterator()`. Well, we no longer use `prefetch_related`, so we might as well check if using `iterator()` makes any difference.
These were the results of exporting 1,000 rows in xls format using subquery and iterator:
1571580 function calls (1514788 primitive calls) in 62.130 seconds
select 60.618
The iterator made no difference.
Simplifying the Query
In a final attempt to get something out of this expedition, we wanted to see if the complexity of the query prevented PostgreSQL from finding an optimal execution plan. To do that, we could have adjusted the database parameters `from_collapse_limit` and `join_collapse_limit` and let PostgreSQL take all the time and resources it needs to find an optimal execution plan, but instead, we decided to strip all other fields from the resource besides `id` and `violations`.
These were the results of exporting 1,000 rows containing only the id and violations fields in xls format using subquery and iterator:
6937 function calls (6350 primitive calls) in 57.280 seconds
select 57.255
No change, this is officially a dead end!
Manual Prefetch
After a quick lunch break we decided it was time to pull out the big guns. If Django's prefetch implementation wasn't working for us, and PostgreSQL was unable to produce a decent execution plan, we would just have to do it ourselves.
To implement our own "prefetch" we needed to adjust some of the other functions in the resource:
from django.contrib.postgres.aggregates import ArrayAgg
from import_export import resources, fields
from .models import Inspection, Violation
class InspectionResource(resources.ModelResource):
    violations = fields.Field()
    class Meta:
        model = Inspection
        fields = (
            'id',
            'violations',
        )
    def export(self, queryset=None):
        # Manually prefetch the violations.
        self.prefetched_violations = dict(
            Violation.objects
            .filter(inspection_id__in=(
                queryset
                # Clean all joins.
                .select_related(None)
                .values_list('pk')
            ))
            # Group by inspection and aggregate the names into an array.
            .values('inspection_id')
            .annotate(
                violations_csv=ArrayAgg('violation_type__name'),
            )
            .values_list(
                'inspection_id',
                'violations_csv',
            )
        )
        return super().export(queryset)
    def dehydrate_violations(self, inspection: Inspection) -> str:
        return '\n'.join(self.prefetched_violations.get(inspection.id, []))
This looks like a lot, but it's actually not:
- We create our own "prefetch related" dict, `prefetched_violations`:
    - The key is the inspection ID, and the value is an array containing the violation names (`violations_csv`).
    - To fetch only the relevant violations, we filter them using `queryset`, keeping only the necessary inspections.
    - We executed `select_related(None)` to remove all previously set `select_related` tables, and make the ORM remove any unnecessary joins.
- We return the original queryset to the `export` function, which produces the Excel file.
- To construct the value for the `violations` field, we use the `prefetched_violations` dict we populated during `export`. This is the "lookup" part of the prefetch. When using Django's `prefetch_related` we have access to this value on the instance; when we do it manually we have to look it up ourselves.
- Once again, since we no longer use Django's `prefetch_related`, we are able to use an iterator. So, instead of evaluating the query, we return a queryset.
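The lookup part of the manual prefetch boils down to plain dictionary grouping. A minimal pure-Python sketch (with made-up rows standing in for the query results) shows the pattern: group the child rows by their parent key once, then every per-row lookup is O(1):

```python
from collections import defaultdict

# Rows as they would come back from the aggregation query:
# (inspection_id, violation name).
violation_rows = [
    (1, 'dirty floors'),
    (1, 'full trash can'),
    (2, 'no light'),
]

# Build the lookup dict once -- this plays the role of
# self.prefetched_violations in the resource above.
prefetched = defaultdict(list)
for inspection_id, name in violation_rows:
    prefetched[inspection_id].append(name)

def dehydrate_violations(inspection_id):
    # Mirrors the dehydrate_ function: missing keys fall back to an empty list.
    return '\n'.join(prefetched.get(inspection_id, []))

print(dehydrate_violations(1))   # dirty floors\nfull trash can
print(dehydrate_violations(99))  # empty string for inspections with no violations
```

This is essentially what Django's `prefetch_related` does for you behind the scenes; doing it by hand just gives us control over the fetch query.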
We had already been disappointed after putting in a lot of effort the last time; let's see if this time the hard work paid off.
These were the results of exporting 10,000 rows in xls format using manual prefetch and iterator:
15281887 function calls (14721333 primitive calls) in 11.411 seconds
select 0.833
manual prefetch 0.107
Compared to the 40 seconds we started with, this is an overall improvement of just over 70%. 20s were shaved off by switching to the xls format, and another 10s by doing the prefetch manually.
We are ready for production!
Trouble in Paradise
Quickly after rolling out the new version to production we started getting complaints from users not being able to open the file.
Remember I told you using xls was a bad idea? Well, when users started downloading the xls files they got a nasty message saying the file was corrupt, and Excel, thank god, managed to salvage some of the data (which is way worse!).
One might ask, "but how come you didn't catch this in QA?". Well, that's just another reason we hate working with Excel. When we tested it locally on our Linux desktops using LibreOffice, it worked just fine.
So let's recap:
- xlsx is slow and consumes a lot of CPU.
- xls is not supported by the Excel version used by our users.
- csv has many encoding and formatting issues, and proved to be unusable in the past.
Using a Different Excel Writer
As always, when all options suck and the future is looking bleak, we turned to Google.
A quick search of "python excel performance" brought up this gist which compares 4 different Excel writers in Python (gotta love the internet!).
These are the benchmark results:
# Source: https://gist.github.com/jmcnamara/ba25c2bf4ba0777065eb
Versions:
python : 2.7.2
openpyxl : 2.2.1
pyexcelerate: 0.6.6
xlsxwriter : 0.7.2
xlwt : 1.0.0
Dimensions:
Rows = 10000
Cols = 50
Times:
pyexcelerate : 10.63
xlwt : 16.93
xlsxwriter (optimised): 20.37
xlsxwriter : 24.24
openpyxl (optimised): 26.63
openpyxl : 35.75
According to the results, there is a big difference between the xlsx libraries.
As mentioned before, we use `django-import-export` to produce Excel files from Django models and querysets. Under the hood, `django-import-export` uses the popular `tablib` package to do the actual export.
Tablib offers export and import capabilities to and from many formats, but it doesn't do any of the heavy lifting itself. To produce xlsx files, tablib uses the package `openpyxl`.
A Faster Excel Writer in Python
Looking back at the benchmark results, `openpyxl` is the slowest among all the packages. It looks like by switching to the fastest implementation, `pyexcelerate`, we might be able to gain some significant improvement in the export process.
The package `pyexcelerate` looked great from the start. The tag line is just what we needed:
PyExcelerate is a Python library for writing Excel-compatible XLSX spreadsheet files, with an emphasis on speed.
Even the snarky subtitles on the "Usage" section in the README were just what we wanted: fast, faster and fastest!
With such promising benchmarks and README, we had to try it out!
Patching tablib
We already have an entire system built on top of django-import-export
and tablib
, and we didn't want to start making changes everywhere. So instead, we looked for a way to patch tablib, and make it use pyexcelerate
instead of openpyxl
.
After some digging, we found that tablib uses an internal function called `_register_formats` to add export and import formats such as csv, xls and xlsx. To get the list of available formats, tablib imports a collection called `available` from the module `formats`. The contents of the file `formats/__init__.py`, where the collection is defined, look like this:
# -*- coding: utf-8 -*-
""" Tablib - formats
"""
from . import _csv as csv
from . import _json as json
from . import _xls as xls
from . import _yaml as yaml
from . import _tsv as tsv
from . import _html as html
from . import _xlsx as xlsx
from . import _ods as ods
from . import _dbf as dbf
from . import _latex as latex
from . import _df as df
from . import _rst as rst
from . import _jira as jira
available = (json, xls, yaml, csv, dbf, tsv, html, jira, latex, xlsx, ods, df, rst)
The interesting part is the contents of the file `_xlsx.py`. The file defines some functions used to export and import Excel files using `openpyxl`.
To patch `tablib`, we first need to implement an interface similar to the one in `_xlsx.py` using `pyexcelerate`, and then register it in `tablib`.
Let's start by implementing `_xlsx.py` using `pyexcelerate`:
# fast_xlsx.py
import itertools
from io import BytesIO
from tablib.formats._xlsx import * # noqa
from pyexcelerate import Workbook
# Override the default xlsx implementation
title = 'xlsx'
def export_set(dataset, freeze_panes=True):
"""Returns XLSX representation of Dataset."""
title = dataset.title or 'Sheet1'
wb = Workbook()
wb.new_sheet(title, data=itertools.chain([dataset.headers], dataset))
stream = BytesIO()
wb.save(stream)
return stream.getvalue()
def export_book(databook, freeze_panes=True):
"""Returns XLSX representation of DataBook."""
assert len(databook._datasets) == 1
return export_set(databook._datasets[0], freeze_panes)
def dset_sheet(*args, **kwargs):
assert False, 'How did you get here?'
This is a simple implementation of the main functions. It lacks some functionality, such as multiple sheets, but it was fine for our needs.
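The only non-obvious trick in `export_set` is how the header row is glued onto the data. `itertools.chain` yields the header row first and then each data row lazily, without building an intermediate list:

```python
import itertools

# Stand-ins for dataset.headers and the dataset rows.
headers = ['id', 'violations']
rows = [[1, 'dirty floors'], [2, 'no light']]

# chain yields the header row, then each data row, lazily --
# the sheet writer can consume it without an extra copy of the data.
sheet_data = itertools.chain([headers], rows)
print(list(sheet_data))
```

For a 10,000-row export this keeps the header concatenation free of any per-row copying.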
Next, we need to make `tablib` register this file instead of the existing xlsx format. To do that, we created a new file called `monkeypatches.py`:
# monkeypatches.py
import tablib
from . import fast_xlsx
# Override default xlsx format with a faster implementation
# using `pyexcelerate` (export only).
tablib.formats.available += (fast_xlsx, )
To apply the patch to `tablib`, we import our implementation and add it to the available formats list. We then import this file in the module's `__init__.py` so every time the system starts up, `tablib` is patched.
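The registration trick itself is not tablib-specific. A generic sketch (using `types.SimpleNamespace` as a stand-in for a format module; the registry and names here are hypothetical, not tablib's actual internals) shows the idea: a library that iterates over a registry of module-like objects will happily pick up anything appended to it, as long as it exposes the expected interface:

```python
import types

# Stand-in for a registry like tablib.formats.available.
registry = []

# A module-like object exposing the interface the consumer expects.
fast_xlsx = types.SimpleNamespace(
    title='xlsx',
    export_set=lambda dataset: b'fake-xlsx-bytes',
)

registry.append(fast_xlsx)

# A consumer that resolves formats by title and prefers the most
# recently registered one now gets the patched implementation.
chosen = [fmt for fmt in registry if fmt.title == 'xlsx'][-1]
print(chosen.export_set(None))
```

The fragile part of any such patch is that it depends on the library's internals; pinning the `tablib` version is a sensible precaution here.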
Now for the moment of truth: did all this hard work finally pay off?
These were the results of exporting 10,000 rows in xlsx format with `pyexcelerate`, using manual prefetch and iterator:
13627507 function calls (13566956 primitive calls) in 10.944 seconds
select 0.137
manual prefetch 2.219
The hard work definitely paid off! Just so we have an honest comparison, these are the results of exporting 10,000 rows in xlsx format without patching `tablib`, using manual prefetch and iterator:
55982358 function calls (46155371 primitive calls) in 29.965 seconds
select 0.137
manual prefetch 1.724
That's a 64% improvement compared to the default implementation provided by `tablib`, and a roughly 73% improvement compared to the 40s we started with.
Results Summary
This is a summary of all the results mentioned in the article:

| Time    | Rows   | Format | Method                                                        |
|---------|--------|--------|---------------------------------------------------------------|
| 39.927s | 10,000 | xlsx   | `prefetch_related` (Django)                                   |
| 17.429s | 10,000 | csv    | `prefetch_related` (Django)                                   |
| 20.694s | 10,000 | xls    | `prefetch_related` (Django)                                   |
| 60.962s | 1,000  | xls    | subquery                                                      |
| 62.130s | 1,000  | xls    | subquery and iterator                                         |
| 57.280s | 1,000  | xls    | simplified query, subquery and iterator                       |
| 29.965s | 10,000 | xlsx   | default `tablib` implementation, manual prefetch and iterator |
| 11.411s | 10,000 | xls    | manual prefetch and iterator                                  |
| 10.944s | 10,000 | xlsx   | `pyexcelerate`, manual prefetch and iterator                  |
Epilogue
We try to study every incident and take action to prevent similar incidents from happening in the future. During this incident, some of our users did experience slowness for a short period of time; however, the "Export to Excel" functionality did not actually kill our app.
Following this incident, there are a few open questions we haven't had the chance to fully explore yet:
- Why was the prefetch query so slow? The difference boils down to executing `Model.objects.filter(fk__in=[1, 2, 3, ..., 10000])` vs executing `Model.objects.filter(fk__in=OtherModel.objects.filter( ... ).values_list('pk'))`. When we tried to compare the two in the database, we found no difference, but the built-in `prefetch_related` was significantly slower. Is it possible that time is being spent generating the query in Python?
- Can `openpyxl3` performance be improved? When I talked to John, the author of the Excel writers benchmark, he mentioned that `openpyxl3` can be faster if `lxml` is installed.
- Is xlsx really the best format? Can we eliminate some of the problems we had with csv by switching to a different textual format such as tsv?
If you have the answer to any of these questions, feel free to share it with me and I'll be happy to post the response.
UPDATED: Aug 19, 2019
Comments From Readers
A reader from lobste.rs ran a quick benchmark to check how much faster `openpyxl` can get using `lxml`. These were their results:
Versions:
python: 3.6.8
Dimensions:
Cols = 50
Sheets = 1
Proportion text = 0.10
optimised = True
Rows = 10000
Times:
openpyxl: 2.6.3 using LXML True: 3.70
openpyxl: 2.6.3 using LXML False: 6.03
Rows = 1000
Times:
openpyxl: 2.6.3 using LXML True: 0.37
openpyxl: 2.6.3 using LXML False: 0.57
This benchmark shows that `openpyxl` can be made almost twice as fast just by installing `lxml`. However, `pyexcelerate` improved the speed by a factor of 3.
Many readers on Reddit and Lobsters suggested that a better approach would be to generate the Excel file on the client side using Javascript. This is definitely something worth considering when designing a new system, even though I think this approach might be problematic for very large files.