Concurrency is not very intuitive. You need to train your brain to consider what happens when multiple processes execute a certain code block at the same time. There are several issues I often encounter:
-
Failing to recognize potential concurrency issues: It's not uncommon for both beginner and seasoned developers to completely miss a potential concurrency problem. When this happens, and the concurrency issue end up causing bugs, it's usually very hard to trace and debug.
-
Dismiss concurrency issues due to low likelihood: If you recognized a potential concurrency issue, at some point you probably thought to yourself "what are the chances if this happening...". It's very tempting to dismiss concurrency issues when the likelihood is low. However, I personally found that concurrency issues tend to creep up at the worst time - when your system is under significant load and you have very little time (and grace) to come up with a solution.
-
Abusing locks: If you recognized a potential issue and decided to handle it properly, your next step will usually involve some kind of lock. Sometimes locks are necessary, but more often than not they can be avoided, or replaced by more permissive locks.
In this article I present common concurrency challenges and how to overcome them with minimal locking.
Table of Contents
- A Simple Web Application
- Naive Implementation
- Handling Possible Collisions
- Time-of-check to Time-of-use
- Locking
- Lock in the Database
- Asking for Forgiveness
- Asking for Forgiveness in PostgreSQL
- Identifying Race Conditions
- Select for Update
- Increment in the Database
- Update and Immediately Return
- Take Away
A Simple Web Application
A URL shortener provides short URLs that redirects to other URLs. There are several reasons why you would want that:
- Include links in space-constrained places: Links in SMS messages, Tweets etc.
- Track the number of clicks: Ads, campaigns, newsletter links, promotional emails etc.
To most developers a URL shortener sounds like a straight forward project. This is why it makes a great example to demonstrate common concurrency issues, and how easy they are to miss and get wrong. I'm using Python, Django and PostgreSQL, but the concepts apply to any programming language and RDBMs.
To build your URL shortener start by creating a new Django project:
$ python -m venv venv
$ source venv/bin/activate
$ pip install django
$ django-admin startproject project
$ cd project
This will create a Python virtual environment, install the latest Django version and create a Django project named project
.
For the URL shortener, add a new app called shorturl
in the project:
$ python manage.py startapp shorturl
Next, register the new app in settings.py
:
# settings.py
INSTALLED_APPS = [
# ...
"shorturl.apps.ShortUrlConfig",
]
Django uses SQLlite by default. If you want to configure Django to use PostgreSQL instead, install psycopg and create a database:
$ pip install psycopg2
$ createdb -O yourdbuser shorturl
Then edit the following parameters in settings.py
:
# settings.py
DATABASES = {
"default": {
"ENGINE": "django.db.backends.postgresql",
"NAME": "shorturl",
"USER": "yourdbuser",
}
}
Finally, run the initial migrations:
$ python manage.py migrate
You are now ready for the interesting part!
Naive Implementation
A short URL is composed of a short unique identifier that points to some target URL, and a counter to keep track of the number of hits. A Django model for a short URL can look like this:
from django.db import models
class ShortUrl(models.Model):
key: str = models.CharField(max_length=20, unique=True)
target_url: str = models.URLField()
hits: int = models.PositiveIntegerField()
A simple function to create a new short URL can look like this:
import secrets, string
CHARACTERS = string.ascii_letters + string.digits
class ShortUrl(models.Model):
# ...
@classmethod
def create(cls, target_url: str) -> ShortUrl:
key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
return cls.objects.create(key=key, target_url=target_url, hits=0)
The function accepts a target URL, generates a random key from a list of possible characters and saves a new ShortUrl
to the database. You can now use this function to create a new short URL:
>>> shorturl = ShortURL.create('https://hakibenita.com')
>>> vars(shorturl)
{'_state': <django.db.models.base.ModelState at 0x7fd5e05558a0>,
'id': 1,
'created_at': datetime.datetime(2022, 4, 29, 8, 2, 18, 615165, tzinfo=datetime.timezone.utc),
'key': 'c6UFG',
'target_url': 'https://hakibenita.com',
'hits': 0}
This seems innocent enough, so what can possibly go wrong?
Handling Possible Collisions
Say your URL shortener becomes a wild success and you have millions of new short URLs created every day. At some point, the function that generates the random key may produce a key that already exist:
>>> shorturl = ShortUrl.create('https://hakibenita.com/tag/django')
IntegrityError: duplicate key value violates unique constraint "shorturl_shorturl_key_uk"
DETAIL: Key (key)=(c6UFG) already exists.
Notice that the function generated the random key c6UFG
which is similar to a short URL you previously created. The key
column has a unique constraint defined on it, so you got a database error.
In order to avoid a unique constraint violation, you might try to check the key in advance, like this:
class ShortUrl(models.Model):
@classmethod
def create(cls, target_url: str) -> ShortUrl:
while True:
key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
if not cls.objects.filter(key=key).exists():
break
return cls.objects.create(key=key, target_url=target_url, hits=0)
Instead of creating the short URL immediately, you first check that the random key you generated does not already exist. If you find that is does, you keep iterating on random keys until you find one that doesn't.
Aside from the fact that this function can now potentially go into an infinite loop, there is another problem!
Time-of-check to Time-of-use
The purpose of checking that the key does not exist in advance was to avoid a database error, but is it really impossible to end up with a unique constraint violation this way? Consider the following scenario:
Process 1 | Process 2 |
---|---|
Check key c6UFG -> ✅ |
|
Check key c6UFG -> ✅ |
|
Use Key c6UFG |
|
Use key c6UFG -> 💥 Already exists |
Let's break it down:
- Process #1 generates the random key
c6UFG
and checks that it does not already exist - Before Process #1 has a chance to write the new shorturl to the database, process #2 generates the same key
c6UFG
and checks that it does not already exist. At this point, it doesn't! - Process #2 writes the short URL to the database and completes successfully
- Process #1 now tries to save a short URL with the same key
c6UFG
, which it checked in advance, and fails with a unique constraint violation
This is a very common concurrency issue commonly referred to as "time-of-check to time-of-use", or "TOCTOU". The name describes the issue pretty well - the problem is caused when another process changes the data between the time a process checked a value until the time it used it. In this case, Process #2 added a short URL with the same key after Process #1 had checked it, but before it used it.
Locking
Whenever you have a problem with multiple processes accessing the same resource at the same time, the most intuitive solutions is a lock. But, what exactly should you lock?
A simple architecture for a web application usually contains a web application process and a database:
If that's your setup, you might be able to obtain a lock at the process level and make sure you are the only one accessing a certain resource at a time.
However, a common setup for Django (as well as other web applications) is to run with multiple worker processes:
With multiple worker processes on the same server it's no longer enough to lock a resource within a single process. However, since the two processes are running on the same server, you might be tempted to try and find a solution at the operating system level. But, is it enough?
If your system is running on multiple servers, with multiple worker processes on each one, even the OS can't save you. You now might think the lowest common denominator is the application itself. You can spend days trying to come up with an original way to coordinate a lock between the servers, but would that solve your problem?
Modern systems can run multiple applications on top of the same database. We for example, do it with Django admin.
At this point it becomes clearer that the lowest common denominator, the resource that all servers, processes and applications share, is the database. If you want to "lock" a resource, you better do it in the database.
Lock in the Database
Now that you know where to lock, there is another challenge. If you were updating an existing row in the database you could have locked that specific row, but this is not the case. You want to create a new row, for a new short URL, so what can you possibly lock?
One option is to lock the entire table:
from django.db import connection, transaction
class ShortUrl(models.Model):
@classmethod
def create(target_url: str) -> ShortUrl:
with transaction.atomic(), connection.cursor() as cursor:
cursor.execute('LOCK TABLE shorturl_shorturl IN EXCLUSIVE MODE;')
while True:
key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
if not cls.objects.filter(key=key).exists():
break
return cls.objects.create(key=key, target_url=target_url, hits=0)
To prevent other processes from creating short URLs and potentially causing a unique constraint violation, you locked the entire shorturl table.
First, to obtain a lock in the database you need to operate inside a database transaction:
with transaction.atomic():
In most cases, Django operates in "autocommit" mode, meaning it implicitly opens a transaction for every command, and commits immediately after. In this case you want to execute multiple commands in the same database transaction, so you explicitly control the transaction using the transaction.atomic
context.
The Django ORM does not provide functions for locking tables. To lock a table in the database you need to use a cursor and execute raw SQL:
with transaction.atomic(), connection.cursor() as cursor:
cursor.execute('LOCK TABLE shorturl_shorturl IN EXCLUSIVE MODE;')
After you obtained an exclusive lock on the table, no other transaction can obtain the same lock until you release it. This guarantees that the data cannot change between the time you checked the values and the time you used them.
So, problem solved?
Asking for Forgiveness
Imagine your app makes it to the top page of a very popular news site. In a matter of minutes you start getting thousands of hits, and users are creating thousands of new short URLs. Now imagine your system can only create a single short URL from a single user at a time, and all other users need to wait in line. Is that acceptable?
When you locked the entire table you made sure no other transaction can make changes to the table until you are done. This means adding new short URLs is now safe, but it can also get pretty slow when you have many concurrent requests.
What if you could ditch the lock? What if you could make your function safe without preventing multiple users from creating short URLs at the same time? Have another look at the exception you got in the beginning:
>>> shorturl = ShortUrl.create('https://hakibenita.com/tag/django')
IntegrityError: duplicate key value violates unique constraint "shorturl_shorturl_key_uk"
DETAIL: Key (key)=(c6UFG) already exists.
When the function attempted to create a short URL with a key that already existed, the database returned an error. This is because the key
column has a unique constraint defined on it. So, if the database is already making sure that there are no duplicates, why not rely on it?
With this in mind, you can now change your function to handle the unique constraint violation:
from django.db import IntegrityError
class ShortUrl(models.Model):
@classmethod
def create(cls, target_url: str) -> ShortUrl:
while True:
key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
try:
return ShortUrl.objects.create(key=key, target_url=target_url, hits=0)
except IntegrityError:
# Key exists, try try again!
continue
The function now generates a random key and attempts to create a new short URL. If the command fails with an IntegrityError
it means a short URL with the same key
already exists. In this case, the function will generate another random key
and try again until it succeeds.
The first thing you might notice is that there is no explicit lock, and there is no explicit database transaction. Multiple processes can now safely create short URLs without getting an IntegrityError
.
According to this quote from the Python glossary, this approach is also much more "pythonic":
EAFP
Common Python coding style [ ... ] This clean and fast style is characterized by the presence of many try and except statements.
The term EAFP stands for "Easier to ask for forgiveness than permission", and this is exactly what you did. Instead of checking in advance, you tried to write the row to the database and handled the exception. The opposite of EAFP is LBYL, "Look Before You Leap", which is what you did when you checked the values in advance.
Python as a language encourages "asking or forgiveness" (EAFP), as opposed to other languages such as JavaScript or Java that encourage you to check values in advance, the LBYL style.
Asking for Forgiveness in PostgreSQL
Sometimes it's necessary to generate short URLs as part of another transaction. For example, if you include the short URL in a notification you save to the database, your code can look like this:
>>> with transaction.atomic():
... shorturl = ShortUrl.create('https://hakibenita.com')
... notification = Notification.create(to='...', message='...')
To make sure both the notification and the short URL are created together, you execute the code inside a database transaction. This is a perfectly valid use of database transactions - this is exactly what they are for!
If you use PostgreSQL, under some circumstances you might encounter this error:
InternalError: current transaction is aborted, commands ignored until end of transaction block
In PostgreSQL, when there is an error during a transaction, all other commands are aborted until the end of the transaction. Now recall how ShortUrl.create
is implemented - you iterate over random keys and attempt to create until you don't get an IntegrityError
. This means that if you execute your function inside a transaction and it did trigger an IntegrityError
, PostgreSQL will abort the transaction, and you won't be able to proceed with the rest of the code.
To accommodate possible exceptions within transaction in PostgreSQL, you can use another transaction:
from django.db import transaction, IntegrityError
class ShortUrl(models.Model):
@classmethod
def create(cls, target_url: str) -> ShortUrl:
while True:
key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
try:
# In PostgreSQL, when an SQL command fails (usually due to
# IntegrityError) it is not possible to execute other commands
# until the end of the atomic block. To be able to retry different
# keys multiple times after, we execute the command in its own
# atomic block.
# https://docs.djangoproject.com/en/4.0/topics/db/transactions/#handling-exceptions-within-postgresql-transactions
with transaction.atomic():
return ShortUrl.objects.create(key=key, target_url=target_url, hits=0)
except IntegrityError:
# Key exists, try try again!
continue
The additional transaction introduces very little overhead, it simply restricts the effect a possible exception can have on any outer transactions.
Identifying Race Conditions
Now that you have all of these short URLs in your system, it's time to actually use them. The URL shortener system includes a view that redirects a short URL to its target URL, and increments the counter. The view can look like this:
# views.py
from django.http import HttpRequest, HttpResponse, HttpResponseRedirect, Http404
from django.views.decorators.http import require_http_methods
from .models import ShortUrl
@require_http_methods(["GET"])
def resolve_short_url(request: HttpRequest, key: str) -> HttpResponse:
try:
shorturl = ShortUrl.resolve(key)
except ShortUrl.DoesNotExist:
raise Http404()
return HttpResponseRedirect(shorturl.target_url)
The view attempts to "resolve" a key to a ShortURL
instance, and if it finds one, redirects to it.
A naive implementation of the function ShortUrl.resolve
can look like this:
class ShortUrl(models.Model):
@classmethod
def resolve(cls, key: str) -> ShortUrl:
shorturl = cls.objects.get(key=key)
shorturl.hits += 1
shorturl.save(update_fields=['hits'])
return shorturl
The function accepts the key as argument and attempts to find a short URL with that key. If the key is not found, .get(key=key)
throws a ShortUrl.DoesNotExist
exception and the view will return a 404 response. If a short URL is found, the hit counter is incremented and saved, the object is returned to the view and the user is redirected to the target URL.
So, where is the problem?
As you already experienced when you created the short URL, concurrency issues often require special attention. Imagine what will happen if multiple users are trying to resolve the same key at the same time. Consider the following scenario:
Process 1 | Process 2 | Hits |
---|---|---|
Select hits -> 0 | 0 | |
Select hits -> 0 | 0 | |
Update hits -> 1 | 1 | |
💥 Update hits -> 1 | 1 |
In this scenario, two processes are resolving the same URL at the same time. The short URL is resolved twice, but the hits counter is 1. This is incorrect!
Select for Update
The problem with the naive implementation is that when multiple concurrent processes resolve the same short URL at the same time, the counter can get out of sync. Each of the processes is incrementing the counter based on the value it received when it fetched the row from the database, and that's how the counter gets out of sync.
What if you could lock the row to prevent multiple processes from selecting and updating it at the same time? Consider the following implementation:
from django.db import transaction
class ShortUrl(models.Model):
@classmethod
def resolve(cls, key: str) -> ShortUrl:
with transaction.atomic():
shorturl = (
cls.objects
.select_for_update()
.get(key=key)
)
shorturl.hits = shorturl.hits + 1
shorturl.save(update_fields=['hits'])
return shorturl
The function now opens a database transaction, and uses select_for_update
to lock the row. If the lock is obtained, other processes cannot obtain the same lock on the row until the transaction is finished. This means the counter can no longer get out of sync, because only a single process can fetch and update it at the same time. But it also means that any concurrent processes must either wait or fail.
Imagine you launch a big campaign to hundreds of thousands of users. To check how effective your campaign is, you use your URL shortener to keep track of how many users clicked the links. Immediately when you send out the campaign it lands in your users' emails and thousands of them click the links. Now imagine each user needs to wait in line until the previous one fetched the row and updated the hit counter in the database. Sounds like it might be a problem...
Increment in the Database
Using select_for_update
you solved the problem of the hit counter going out of sync, but your system is now doing a very poor job at the one thing it should be doing very well - redirect short URLs!
The main issue with the previous approach is that you incremented the counter based on what you fetched. With many concurrent processes operating at the same time, it is very much possible that since you fetched the row, the counter was incremented multiple times by other processes and you just don't know about it.
What if instead of incrementing the counter based on what you have stored in memory, you instruct the database to update based on what it currently has stored?
from django.db.models import F
class ShortUrl(models.Model):
@classmethod
def resolve(cls, key: str) -> ShortUrl:
shorturl = cls.objects.get(key=key)
shorturl.hits = F('hits') + 1
shorturl.save(update_fields=['hits'])
return shorturl
The function now uses an F expression to update the counter relative to what is in the database.
The difference seems very mild, so the best way to understand it is to look at the SQL generated by the update commands. The naive approach will execute the following command:
UPDATE shorturl_shorturl
SET hits = 1
WHERE id = 154;
This will update hits to 1, regardless of the current value of hits in the database.
The function using the F expression will execute the following command:
UPDATE shorturl_shorturl
SET hits = hits + 1
WHERE id = 154;
The hit counter is now incremented by one and not set to a fixed value. This is how using an F expression solves the problem without obtaining an explicit lock on the row.
Update and Immediately Return
Using F expressions solved the problem without an exclusive lock, but there are still two minor downsides to this approach:
-
There are two round trips to the database: the function first access the database to fetch the short URL by key, and then access it again to update the row.
-
The hit counter on the returned instance is not updated: the function is incrementing the value in the database, but the short URL object it returns stores the hits counter prior to when the hits counter was incremented.
Normally, the F expression solution with two round trips to the database is good enough, no reason to go through the trouble of trying to optimize it. However in this case, the system should be able to accommodate sudden bursts and redirect very quickly, so it might be worth the trouble.
To solve both issues, you can extend the solution beyond Django's ORM built-in capabilities with some SQL magic:
from django.db import transaction
class ShortUrl(models.Model):
@classmethod
def resolve(cls, key: str) -> ShortUrl:
short_url = cls.objects.raw('''
UPDATE shorturl_shorturl
SET hits = hits + 1
WHERE key = %s
RETURNING *
''', [key])
if not short_url:
raise cls.DoesNotExist()
return short_url[0]
Let's break it down:
-
Use
RETURNING
to get updated results: In PostgreSQL and SQLite you can return the rows affected by an UPDATE statement. The rows returned by theRETURNING
clause are the updated rows. Even though you only update the columnhits
, you can use*
to return the entire row with the updated values. -
Use
.raw
to construct aShortUrl
instance: Django ORM allows you to construct ORM objects from raw SQL. -
If no rows were affected, raise
DoesNotExist
: if no rows were affected it means there is no short URL for the provided key. To mimic Django's behavior in this case, raise aDoesNotExist
exception, otherwise return the object.
Binding the table name
Under some circumstances you might want to avoid explicitly using the name of the table. Mainly if you have a reuseable model and you cannot be sure what the name of the database table is.
Using string concatenation to set the name of the table is not acceptable as it puts the query at risk of SQL Injection. Using psycopg, the Python driver for PostgreSQL, there is a way to safely bind identifiers such as table names:
from psycopg2.sql import SQL, Identifier
@classmethod
def resolve(cls, key: str) -> ShortUrl:
short_url = cls.objects.raw(
raw_query=SQL("""
UPDATE {}
SET hits = hits + 1
WHERE key = %s
RETURNING *
""").format(Identifier(cls._meta.db_table)),
params=(key,),
)
if not short_url:
raise cls.DoesNotExist()
return short_url[0]
The function now uses the name of the table as parameter.
Check out Preventing SQL Injection Attacks With Python for more about how to safely compose SQL for PostgreSQL in Python using Psycopg2.
Take Away
Throughout this article you used several different approaches to solve very common concurrency issues in two seemingly simple tasks:
- Create short URL
- ⛔ #1: Failing to recognize possible collisions
- ⛔ #2: Time-of-check time-of-use (TOCTOU)
- ✅ #1: Lock!
- ✅ #2: Ask for forgiveness
- Increment hit counter
- ⛔ #3: Ignore race conditions
- ✅ #3: Select for update
- ✅ #4: Increment in the database
- ✅ #5: Update and immediately return
Some of these approaches were fragile, and broke under even minor loads, and others has their advantages and disadvantages.
The main take away is this:
-
Keep concurrency in mind
Concurrency issues are very hard to miss, and even harder to debug. Recognizing concurrency issues requires special attention and awareness. This article covers the awareness part, so now you just need to pay attention. -
Don't let probabilities distract you
It's very tempting to dismiss concurrency issues when the likelihood is low. Hopefully now that you know a few more ways to handle concurrency you are in a better position to resist the temptation. -
Avoid locks if possible
Locks slow things down and cause contention so it's best to avoid them when possible. Now you know how.
For more advanced approaches for managing concurrency in Django, check out How to Manage Concurrency in Django Models.