Synchronous domain events

Correct before fast

Written by Timo Rieber on 21 October 2025

In cloudapps, a course registration triggers a receivable, an invoice, and a confirmation email. That chain ran through an async event store. Commands published domain events into a shared queue, a scheduler popped them every five seconds, background threads dispatched them to handlers. While refactoring the payment domain, we looked at that machinery and asked what it was actually buying us.

The answer: not much. A five-second window where a registration existed but its invoice didn’t. It never caused an incident. But it could have, and the infrastructure keeping that window open was surprisingly heavy for what it did.

What the scheduler cost

The gap itself wasn’t the main problem - we’d never seen it bite. The problem was the machinery required to maintain it.

The dispatch infrastructure was a mutable list behind a threading lock:

import logging

class NotificationApplicationService:
    def publish_notifications(self) -> None:
        # Called by the scheduler every five seconds, once per tenant.
        with self.__lock:
            while self.event_store.has_next():
                event = self.event_store.pop_next()
                event_name = fullname(event)
                for listener in self.listeners:
                    if event_name not in listener.interested_in():
                        continue
                    try:
                        listener.handle(event)
                    except BaseException as exc:
                        # A failing handler is logged and swallowed.
                        logging.warning(...)
Python

A failing handler logged a warning and continued. The scheduler called this for every tenant, every five seconds. Sixty-six lines of event store and notification service, a threading lock, a scheduler job, a publish_notifications method on every engine. Even the acceptance tests bypassed the scheduler entirely - they called publish_notifications directly after each command because testing through a five-second timer wasn’t practical.

That was the tell. The handlers creating receivables and sending emails weren’t slow. They weren’t accessing external APIs with unpredictable latency. They were running queries and inserts against the same database the command had just written to. The async architecture was solving a throughput problem that didn’t exist.
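
A typical handler in that chain, sketched with hypothetical names - the event name, table, and fields are made up, but the interested_in/handle shape matches the dispatch loop above:

class CreateReceivableOnRegistration:
    """Illustrative listener: a registration event becomes a receivable row
    in the same database the command just wrote to."""

    def __init__(self, connection):
        self.connection = connection

    def interested_in(self) -> set[str]:
        # Fully qualified event name, as used by the dispatch loop.
        return {"registration.events.CourseRegistered"}

    def handle(self, event) -> None:
        # No external API, no unpredictable latency - just another insert.
        self.connection.execute(
            "INSERT INTO receivables (registration_id, amount) VALUES (?, ?)",
            (event.registration_id, event.amount),
        )
Python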

Dispatch on the spot

One commit replaced the entire dispatch infrastructure. The CommandProcessor now publishes events right after the command’s transaction commits:

class CommandProcessor:
    def handle[R](self, command: Command[R]) -> R:
        recorded_events = []

        # Record every event published while the command executes.
        with self.transaction_manager.start_or_join():
            with event_publisher.listen(
                lambda e: recorded_events.append(e)
            ):
                result = command.execute(self.command_context)

        # Dispatch synchronously, once the command's transaction has committed.
        self.event_notifier.notify(recorded_events)
        return result
Python

Events still accumulate during command execution and dispatch to the same listeners. But no queue sits between the command and its handlers. notify() iterates registered handlers in a direct loop. The event store, the notification service, the scheduler job - gone. 54 files changed in a single commit, 253 lines removed, 139 added. The bulk of the deletions were test fixtures wiring up infrastructure that no longer existed.
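
The only indirection left is event_publisher.listen, which temporarily registers a callback while the command runs. Conceptually it can be as small as this sketch - publish() and the module-level callback list are assumptions, not the actual implementation:

from contextlib import contextmanager
from typing import Any, Callable, Iterator

# In-process publisher: domain code calls publish(), and listen() registers
# a callback only for the duration of one command.
_callbacks: list[Callable[[Any], None]] = []

def publish(event: Any) -> None:
    for callback in _callbacks:
        callback(event)

@contextmanager
def listen(callback: Callable[[Any], None]) -> Iterator[None]:
    _callbacks.append(callback)
    try:
        yield
    finally:
        _callbacks.remove(callback)
Python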

If a command fails, its transaction rolls back and no events fire - the handlers never see an event for something that didn’t happen. The other direction is different: if a handler fails after the command’s transaction committed, the registration stands. The handler’s own error handling applies, and the event notifier logs the failure and continues to the next listener. That’s not the same guarantee a saga or outbox pattern gives you. But for handlers that are database writes against the same system, failures are exceptions in the literal sense - connection loss, schema bugs - not a normal operating mode you’d design a retry queue around.
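
In code, that policy is just a try/except around each listener inside notify(). A sketch - the event-name filtering mirrors the old dispatch loop and is an assumption about the real notifier:

import logging

class EventNotifier:
    """Sketch of synchronous dispatch: a failing listener is logged
    and the remaining listeners still run."""

    def __init__(self, listeners: list):
        self.listeners = listeners

    def notify(self, recorded_events: list) -> None:
        for event in recorded_events:
            event_name = f"{type(event).__module__}.{type(event).__qualname__}"
            for listener in self.listeners:
                if event_name not in listener.interested_in():
                    continue
                try:
                    listener.handle(event)
                except Exception:
                    # The command already committed; log and move on.
                    logging.exception(
                        "listener %r failed for %s", listener, event_name
                    )
Python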

Where async belongs

We never measured a performance difference. The handlers that moved from deferred to synchronous dispatch were doing database writes against the same database as the commands, and emails in cloudapps are domain objects anyway - a handler creates a mail record, and a separate queue delivers it asynchronously. The latency cost we’d assumed synchronous dispatch would bring never showed up.
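
The part that stays asynchronous is delivery: a worker drains pending mail records on its own schedule. A minimal sketch of that split, with a hypothetical mail_outbox table and smtp_client:

import time

def deliver_pending_mail(connection, smtp_client, poll_seconds: int = 30) -> None:
    """Sketch of the asynchronous half: the synchronous handler only inserted
    a 'pending' mail record; this worker sends it later and marks it 'sent'."""
    while True:
        rows = connection.execute(
            "SELECT id, recipient, body FROM mail_outbox WHERE status = 'pending'"
        ).fetchall()
        for mail_id, recipient, body in rows:
            smtp_client.send(recipient, body)
            connection.execute(
                "UPDATE mail_outbox SET status = 'sent' WHERE id = ?",
                (mail_id,),
            )
        time.sleep(poll_seconds)
Python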

Async wasn’t solving a latency constraint here. It was the default, and the default carried complexity we had stopped questioning until we looked at it fresh. For analytics, audit logs, or side effects where arrival time doesn’t matter, eventual consistency is fine. But when the handler books revenue against a receivable, the gap between “registered” and “invoiced” is a latent billing bug, even if it never fired.