One Database Transaction Too Many

How I told hundreds of users they got paid when they didn't!


Have you ever wondered how bugs are born? I'm not talking about the trivial kind you can catch with simple unit testing. I'm talking about bugs that may not be apparent at first sight, but are so obvious in retrospect.

This is a story about how I accidentally sent hundreds of users messages saying they got paid when they didn't!

What it feels like when you realize you made a mistake
Illustration by Milica Vezmar Basara (https://www.abstrakt.design/)

The Story

We have a process in the system where we pay out money to merchants and other types of users. The payout process is a pretty big deal for most users because this is how they get paid.

Creating a Payout

To facilitate the payout process we have a Django model called PayoutProcess. To create a new payout we use a function that looks roughly like this:

from __future__ import annotations
from django.db import models, transaction as db_transaction

class PayoutProcess(models.Model):

    #... fields

    @classmethod
    def create(cls, to: User, amount: int) -> PayoutProcess:
        # ... Validate input ...

        with db_transaction.atomic():
            payout = cls.objects.create(
                to=to,
                amount=amount,
                status='pending',
            )

            # Create related objects etc...
        return payout

The simplified version of this function creates a new instance of a payout process and returns it. In real life this function validates the input and creates several related objects. To make sure all of the related objects are created along with the payout process instance, we use a database transaction.

The new instance now represents a payout process in the system, and the payout module is responsible for completing the payout. A payout can be fulfilled in many different ways, such as by bank transfer, credit card and other methods. Not all payout methods are immediate, so a payout is an asynchronous process that can take some time to complete.

When the payout is eventually paid, the module updates the instance status:

class PayoutProcess(models.Model):

    @classmethod
    def mark_paid(cls, pk: int) -> PayoutProcess:
        with db_transaction.atomic():
            payout = cls.objects.select_for_update().get(pk=pk)
            if payout.status != 'pending':
                raise StateError()
            payout.status = 'paid'
            payout.save()
        return payout

The function fetches the payout, checks its state and marks it as paid. So far so good!

Sending Notifications

At some point our staffers came to us with an idea. They said it would be nice if the system notified users when they got paid. We thought this was a great idea! Who doesn't like to get a message saying they got some $$$?

The payout module is a core module in our system. We have payouts going out to different types of users and top level apps use it to create payouts in different contexts. For example, one app sends out commission payments to merchants, and another issues payments to business partners.

To keep the payout module independent and decoupled from the apps that use it, the top level apps are the ones sending the notification to the users. The problem is that after the top level app creates the payout process, the payout module handles the actual payment internally, and the top level app has no way of knowing about it unless it constantly monitors the status.


Top level app creates a payout

To have the top level app respond to changes in the payout module, we need a mechanism to let the top level app know that something has changed. The tricky part is that the top level apps depend on the payout module, so the payout module cannot depend back on them. That would cause a circular dependency.


Top level app gets notified when the payout is paid

One way to avoid circular dependencies and keep modules decoupled in Django is to use signals:

# payouts/signals.py
from django.dispatch import Signal

payout_paid = Signal()

After declaring the signal we can send it when a payout is paid. This is done by the model:

from . import signals

class PayoutProcess(models.Model):

    @classmethod
    def mark_paid(cls, pk: int) -> PayoutProcess:
        with db_transaction.atomic():
            payout = cls.objects.select_for_update().get(pk=pk)
            if payout.status != 'pending':
                raise StateError()
            payout.status = 'paid'
            payout.save()

        signals.payout_paid.send_robust(sender=PayoutProcess, payout=payout)
        return payout

Now a top level app can listen to this signal, and send a notification to the user:

from django.dispatch import receiver

import payout.signals
import payout.models
from .models import MerchantPayoutProcess

@receiver(payout.signals.payout_paid)
def on_merchant_was_paid(sender, payout: payout.models.PayoutProcess, **kwargs) -> None:
    try:
        p = MerchantPayoutProcess.objects.get(payout_id=payout.id)
    except MerchantPayoutProcess.DoesNotExist:
        # Not a merchant payout
        return

    p.user.email_user(f'Dear merchant, you got paid {payout.amount}$!')

When the signal receiver is triggered, it first checks whether the payout is one of its own by fetching the related object, in this case a payout to a merchant. If it is, it sends a notification to the user.

N receivers

With this scheme, if you have N receivers, then every dispatch causes N-1 useless queries. This can be avoided by adding a bit of context to the signal.

dispatch_uid

It's usually a good idea to set dispatch_uid on signal receivers. The documentation explains it well.
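Both notes can be sketched together. The following is a hypothetical variation on the code above, not the original implementation: the `payout_type` value passed with the signal is an assumed piece of context that lets receivers bail out early, and `dispatch_uid` guards against double registration:

```python
# payouts/models.py (sketch) -- send the signal with a bit of context
# so receivers can bail out without a database query. The `payout_type`
# argument is hypothetical, not part of the original code.
signals.payout_paid.send_robust(
    sender=PayoutProcess,
    payout=payout,
    payout_type='merchant',
)

# merchants/receivers.py (sketch) -- `dispatch_uid` prevents the receiver
# from being connected twice if this module is imported more than once.
@receiver(payout.signals.payout_paid, dispatch_uid='merchants.on_merchant_was_paid')
def on_merchant_was_paid(sender, payout, payout_type, **kwargs):
    if payout_type != 'merchant':
        # Not a merchant payout -- return before hitting the database.
        return
    ...
```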

The benefit of using signals this way is that the low level payout module can communicate with apps that depend on it, without forming a dependency back on them. This pattern eliminates the circular dependency and keeps low level modules independent and decoupled.

Working in Bulk

This design was working pretty well - payouts were flying out and the users were happy.

At some point the staffers came back with another idea. They said that work is picking up and they now want to automate and streamline some of it. They asked if it was possible to mark payouts as paid in bulk. After a quick discussion we decided it's best to have the bulk process "all or nothing", meaning, if the operation fails for even one of the payouts in the bulk, it should not be applied to any of them.

We figured this would be a straightforward task, all we have to do is execute the command on all of the given payouts inside of a database transaction:

from django.db import transaction as db_transaction

def mark_paid_in_bulk(payout_ids: list[int]) -> None:
    with db_transaction.atomic():
        for pk in payout_ids:
            PayoutProcess.mark_paid(pk)

The bulk process simply iterates over payout IDs and marks each one as paid. To make sure the process is atomic, or "all or nothing", we wrap the loop in a database transaction.

Easy, right? This is where it gets hairy.

The Bug

This bulk process was also working great for a while. Staffers would upload Excel files (what else), and the system would go over the payouts and mark them all as paid.

One day, the person who usually does this was on holiday and asked someone else to do it instead. The stand-in prepared the Excel file and uploaded it to the system, but since they were not familiar with the process, they made some mistakes with the amounts. As a result, the system rejected some of the payouts.

Now what does a normal person do when a system reports an error? They try again and again...

At some point we started getting complaints from users saying they were getting a ton of messages that they got paid. Some were happy, but others opened the app to check the details, saw that they were in fact not paid, and realized it must be a mistake.

At this point hundreds of users had gotten these messages, but none of them had actually been paid! So what was causing the issue? How were notifications being sent when all of the payouts were still marked as pending? A closer look at the implementation of the bulk process revealed the problem.

Nested Transactions

The function that marks a payout as paid is executed inside of a database transaction. To make sure that the signal is sent only when the payout status is committed to the database, the signal is sent after the transaction is completed:

@classmethod
def mark_paid(cls, pk: int) -> PayoutProcess:
    with db_transaction.atomic():
        payout = cls.objects.select_for_update().get(pk=pk)
        if payout.status != 'pending':
            raise StateError()
        payout.status = 'paid'
        payout.save()

    signals.payout_paid.send_robust(sender=PayoutProcess, payout=payout)
    return payout

When this function is executed on a single payout it works as expected. But what happens if we add the bulk process:

with db_transaction.atomic():
    for pk in payout_ids:

        # inline `mark_paid()`
        with db_transaction.atomic():
            payout = PayoutProcess.objects.select_for_update().get(pk=pk)
            if payout.status != 'pending':
                raise StateError()
            payout.status = 'paid'
            payout.save()

        signals.payout_paid.send_robust(sender=PayoutProcess, payout=payout)

Ha! The bulk process is using its own database transaction! When the signal is sent the payout can still be rolled back if a later payout in the bulk fails.

Just to illustrate, if we mark three payouts in bulk and we fail to mark the third one, all three payouts are rolled back, but notifications for the first two had already been sent:

>>> from django.db import transaction as db_transaction
>>> with db_transaction.atomic():
...     for fail in [False, False, True]:
...         with db_transaction.atomic():
...             if fail:
...                 raise Exception('Failed!')
...         print('Message sent!')
...
Message sent!
Message sent!
Exception: Failed!

Notice how the first two messages were sent even though the third failure caused the outer transaction to roll back all three.


Remedies

The single mark_paid function assumes that it is not executed inside a database transaction, but it does not check or enforce that in any way. This is a problem.

Assert Atomic Block

Before Django 3.2 we had some cases where we wanted to make sure a function is executed, or not executed, inside a database transaction. We ended up implementing two functions:

# common/db.py
from django.db import connection

def assert_is_in_atomic_block() -> None:
    assert connection.in_atomic_block, (
        'This function must be run inside of a DB transaction.'
    )

def assert_is_not_in_atomic_block() -> None:
    assert not connection.in_atomic_block, (
        'This function must not be run inside of a DB transaction.'
    )

Using these utility functions we could prevent some code from being executed inside a database transaction:

import common.db

def do_not_run_inside_a_db_transaction():
    common.db.assert_is_not_in_atomic_block()
    # Rest of function goes here...

Running this code block inside of an atomic block will now trigger an assertion error at runtime:

>>> from django.db import transaction as db_transaction
>>> with db_transaction.atomic():
...     do_not_run_inside_a_db_transaction()
AssertionError: This function must not be run inside of a DB transaction.

The main downside to this approach is that unless explicitly stated otherwise, tests run inside a database transaction. This causes any test that exercises code calling these assertions to fail. To overcome that we ended up patching these functions in tests:

@pytest.fixture(scope='session', autouse=True)
def patch_is_in_db_transaction():
    # Patch atomic transaction check in tests.
    # The checks can't be run in tests because tests are always wrapped in a transaction.
    patch_in = mock.patch('common.db.assert_is_in_atomic_block')
    patch_not_in = mock.patch('common.db.assert_is_not_in_atomic_block')
    with patch_in, patch_not_in:
        yield

This function creates a fixture which is automatically applied for the entire test session. The fixture mocks the two functions and disables their functionality.

Durable Transaction

Starting with Django 3.2, there is another way to prevent a transaction from being executed inside of another transaction, by marking a transaction as "durable":

with db_transaction.atomic(durable=True):
    payout = cls.objects.select_for_update().get(pk=pk)
    if payout.status != 'pending':
        raise StateError()
    payout.status = 'paid'
    payout.save()

signals.payout_paid.send_robust(sender=PayoutProcess, payout=payout)

If you try to open a durable transaction inside of another transaction, a RuntimeError is raised:

>>> from django.db import transaction as db_transaction
>>> with db_transaction.atomic():
...     with db_transaction.atomic(durable=True):
...             pass
...
RuntimeError: A durable atomic block cannot be nested within another atomic block.

Using a durable transaction may have prevented this issue from happening, but it would have also made the bulk feature impossible, or at least very complicated to implement!

Sending Signal on Commit

Another way to tackle this issue is to make sure the signal is only sent after the overall transaction has been successfully committed. One way of doing this is using on_commit.

Using on_commit we can register a function to be executed only when the transaction is actually committed. To illustrate how using on_commit can solve the issue, consider the following example:

>>> from django.db import transaction as db_transaction
>>> with db_transaction.atomic():
...     for i, fail in enumerate([False, False, True], 1):
...         with db_transaction.atomic():
...             print(f'processing {i}...')
...             if fail:
...                 raise Exception('Failed!')
...             db_transaction.on_commit(lambda: print('Message sent!'))
processing 1...
processing 2...
processing 3...
Exception: Failed!

In the example we loop over three values where the third one is expected to fail. To print the message only when the transaction is committed successfully we use on_commit. Notice in the output that three items were processed, but since the third one failed, the entire process failed and none of the messages were sent.

To illustrate what happens when all items succeed, consider the following example:

>>> from django.db import transaction as db_transaction
>>> with db_transaction.atomic():
...     for i, fail in enumerate([False, False, False], 1):
...         with db_transaction.atomic():
...             print(f'processing {i}...')
...             if fail:
...                 raise Exception('Failed!')
...             db_transaction.on_commit(lambda: print('Message sent!'))
processing 1...
processing 2...
processing 3...
Message sent!
Message sent!
Message sent!

Amazing! Three items were processed and three messages were sent. We can now apply a similar fix to our payout module:

from . import signals

class PayoutProcess(models.Model):

    @classmethod
    def mark_paid(cls, pk: int) -> PayoutProcess:
        with db_transaction.atomic():
            payout = cls.objects.select_for_update().get(pk=pk)
            if payout.status != 'pending':
                raise StateError()
            payout.status = 'paid'
            payout.save()

        db_transaction.on_commit(
            lambda: signals.payout_paid.send_robust(sender=PayoutProcess, payout=payout)
        )
        return payout

When a payout is marked as paid the function now sends the signal only when the transaction is committed. This makes the function safe to execute inside of another transaction!

Using a Queue

When dealing with such problems it's not uncommon to immediately think about queues. As a thought exercise, let's examine two common patterns often referred to as "queues".

Async Tasks

Async task runners such as Celery are very popular. They let you execute tasks asynchronously, now, at a later time or at a predetermined time. Using async tasks in this case would not solve the issue:

  • Fire an async task in on_commit
    If we set aside the fact that the payout module is not the one sending the messages, the outcome in this case is exactly the same as sending a signal in on_commit and firing an async task from the receiver (which is what we do).

  • Fire an async task instead of sending a signal
    This will suffer from the same issue as the signal. If the bulk process fails, the task has already been fired and the message will be sent anyway.

  • Schedule an async task for a later time and check the status before sending
    This may work in some cases, but there are other problems:

    • We have a race: How long after the payout was processed should the task be executed? 1s? 10s? 1m? What if the bulk process takes 2m to run? When the task fires, the transaction will not be committed yet and the message will not be sent. What do you do then?

    • We do extra work: You now have to fetch the payout again before sending the message.

    • We send the message later: If we wait, users can receive the message a few minutes or even hours after they were paid. This may not be a big deal in some cases, but in others sending the message close to when the event occurred may be crucial.

Another downside to using an async task runner is that now you need to have an async task runner. If you already have one it may not be so bad, but if you don't, it can be a pain to set up and operate.

Transactional Queue

If you decide to implement a queue in the database, you are likely one step closer to a proper solution. Instead of using signals, you can stage tasks in a database table that acts as a queue.

The main benefit of using a queue table in the database is that the task is added in the same transaction as the change itself, so it becomes visible only when the transaction is committed. This plays very nicely with the overall transaction management of the process, and guarantees that the task is only added when it should be.

The challenging part is how to make sure the tasks are being picked up shortly after they were added to the queue. If you are using a cron job to process tasks, the sending may be delayed up to the cron job's repeat interval. If you use database triggers, LISTEN/NOTIFY or similar to trigger processing of tasks then the delay can be shorter.
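To illustrate the idea without the payout module itself, here is a minimal, hypothetical sketch using SQLite from the standard library (the real system uses Django and a different database). The notification task is staged in the same transaction as the status update, so when the bulk fails, the rollback removes the staged tasks as well:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE payout (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE task_queue (id INTEGER PRIMARY KEY, payload TEXT)")
# Only two pending payouts exist; the bulk below will try to mark three.
conn.execute("INSERT INTO payout (status) VALUES ('pending'), ('pending')")
conn.commit()

try:
    with conn:  # one transaction for the whole bulk: commit on success, rollback on error
        for pk in (1, 2, 3):
            cur = conn.execute(
                "UPDATE payout SET status = 'paid' WHERE id = ? AND status = 'pending'",
                (pk,),
            )
            if cur.rowcount != 1:
                raise Exception(f'payout {pk} is not pending')
            # Stage the notification in the same transaction as the update.
            conn.execute(
                "INSERT INTO task_queue (payload) VALUES (?)",
                (f'notify payout {pk}',),
            )
except Exception as e:
    print(e)  # payout 3 is not pending

# The failed bulk was rolled back: no statuses changed, no tasks staged.
print(conn.execute("SELECT COUNT(*) FROM task_queue").fetchone()[0])  # 0
```

Unlike a signal fired mid-transaction, a staged task can never describe a change that was later rolled back.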


Testing

We ended up implementing the on_commit solution because it required very few changes to existing code. However, after we finished making the changes to the code, we faced yet another challenge - the tests!

Testing with Django

Our tests included scenarios to make sure a notification is sent when a payout is paid:

def test_should_send_notification(db, mailoutbox, merchant_user: User) -> None:
    comm = MerchantCommission.create_payout(merchant_user, amount=100_00)
    PayoutProcess.mark_paid(comm.payout_process_id)
    assert len(mailoutbox) == 1

After we made the change to send the signal in on_commit, all of these tests failed. After some debugging we found that the receiver function we registered for the signal was not being executed, but only in the tests!

The fact that the on_commit handler is not fired is not surprising if you know how tests are executed. To speed things up, Django starts a database transaction at the beginning of every test and then rolls it back immediately after. Executing tests in this manner is a fast way to prevent tests that change data in the database from affecting each other.

To make it possible to test things that are triggered in on_commit without using slow transactional tests, Django 3.2 added a new context manager called captureOnCommitCallbacks (Ticket #30457):

from django.core import mail
from django.test import TestCase

class TestPayoutProcess(TestCase):

    def test_should_send_notification(self, merchant_user: User) -> None:
        comm = MerchantCommission.create_payout(merchant_user, amount=100_00)
        with self.captureOnCommitCallbacks(execute=True):
            PayoutProcess.mark_paid(comm.payout_process_id)
        assert len(mail.outbox) == 1

The context manager is available on instances of TestCase, and when execute=True any on_commit handlers will also be executed, not just captured.

Testing with Pytest

Unfortunately, we are no longer using Django's TestCase directly; we are using pytest, and we were not in a position to start rewriting stuff. Lucky for us, pytest-django implemented equivalent functionality. A quick upgrade to pytest-django version 4.4 and we were ready to go:

def test_should_send_notification(
    db,
    mailoutbox,
    django_capture_on_commit_callbacks,
    merchant_user: User,
) -> None:
    comm = MerchantCommission.create_payout(merchant_user, amount=100_00)
    with django_capture_on_commit_callbacks(execute=True):
        PayoutProcess.mark_paid(comm.payout_process_id)
    assert len(mailoutbox) == 1

The fixture django_capture_on_commit_callbacks is based on the Django function. Once you inject it you can use it just like you would the Django one.


The "bug" caused by this nested transaction ended up causing some users to get multiple messages saying they got paid, but all of these users were eventually paid.


Thoughts on Django Signals

As described in this story, Django signals are useful for implementing interactions between modules without creating explicit dependencies between them. The official documentation about signals also provides this as the main reason for using them:

Django includes a “signal dispatcher” which allows decoupled applications to get notified when actions occur elsewhere in the framework. In a nutshell, signals allow certain senders to notify a set of receivers that some action has taken place. They’re especially useful when many pieces of code may be interested in the same events.

If you look at how signals are implemented in Django you'll find that there is not a lot of magic under the hood. The function connect adds a function to a list of receivers, and when a signal is sent (via send or send_robust) the signal object iterates over the list of receiver functions and executes them one by one.
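The mechanism can be sketched in a few lines of plain Python. This is a simplified, hypothetical reimplementation of the pattern, not Django's actual code; it also shows why send_robust is "robust": a failing receiver does not stop the others:

```python
class Signal:
    """A stripped-down sketch of the signal dispatcher pattern."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        # connect() simply appends the function to a list of receivers.
        self._receivers.append(receiver)

    def send_robust(self, sender, **kwargs):
        # Call each receiver in order, capturing exceptions instead of
        # letting them propagate to the caller.
        responses = []
        for receiver in self._receivers:
            try:
                responses.append((receiver, receiver(sender, **kwargs)))
            except Exception as err:
                responses.append((receiver, err))
        return responses


def broken_receiver(sender, **kwargs):
    raise Exception('receiver crashed')

def healthy_receiver(sender, **kwargs):
    return f'notified about payout {kwargs["payout_id"]}'

payout_paid = Signal()
payout_paid.connect(broken_receiver)
payout_paid.connect(healthy_receiver)

# The first receiver raises, but the second one still runs.
responses = payout_paid.send_robust(sender='PayoutProcess', payout_id=42)
```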

This is very similar to a pub-sub pattern, but it lacks some of the guarantees of more advanced implementations. One of the main disadvantages of Django signals is that there is no guarantee that a "message" ever reaches its destination. If, for example, the server crashes while a signal is being broadcast, some receivers may not be executed, and they will not be retried when the service starts up again. This can become a problem if you rely on signals exclusively to trigger certain actions in the system.



