I walked away from a job interview a few years ago on this point.

One of the technical questions was "if you have a db and a message queue, how do you get your update to alter both or neither (i.e. transactionally)"?

I thought about it for a couple of minutes, then came back with something like "I can't, and you can't either." Then I proposed the usual spiel about using a replicated-state-machine/write-ahead-log/event-sourcing (whatever it might be called at the time) and leaning into eventual consistency as the only practical solution.

He asked if I'd heard about the outbox pattern, so I let him describe it. Sure enough it sounded like this article. The secret to transacting across the database D and the message queue Q:

  (D,Q)
is to split D into two parts (the State and the Outbox), transact across those instead

  (S,O)    Q
and then just pretend that you have a transaction across D and Q.

One of my favorite pieces of technical writing is Brandur Leach’s “Transactionally Staged Job Drains in Postgres” where he reasons through the outbox pattern from first principles. I remember reading it for the first time and feeling like I had been let in on a big secret. Clever, simple, powerful. I still use the pattern all the time.

With an inbox/outbox pattern it's possible. The incoming message might be processed more than once, and an outgoing message might be sent more than once. That's the limitation, and the system needs to be able to handle it.

If you can't de-duplicate messages it's not possible, that's true.

I'm not following. Doesn't the outbox pattern just pass the buck?

The motive seems to be a naive process that enqueues a message and then commits to a database - two independent actions. But a well-behaved process would commit to a database, and then only if successful enqueue a message. That's better but still not atomic - commit, crash, and no message queued.

So the solution is a two-table write - the outbox pattern. But the process that reads the outbox must commit both a query and delete before sending the message. That's the same risk as the agreement well-behaved program - commit, crash, and no message queued. Except now you introduced another pipeline element so your overall complexity increases, and so too risk.

What if you never delete messages from the outbox? Well, what you have now is no longer an outbox nor a database nor useful for large volumes. What if you implement a database to track procesed messages. Return to square one - that's the same problem you were initially trying to solve.

What if you fetch, enqueue, and then delete? Ohh... that works. In case of a crash the message remains in the outbox. It may be processed in duplicate, but eventually if successfully it will be deleted from the outbox.

The message broker then receives a possibly duplicate message. It must consult its internal database, and if the message is unique, route it. So right back at square one. Can't have atomicity and uniqueness.

Outbox's power is that it turns an atomicity problem into an idempotency problem. You atomically write to the outbox, then you have an idempotent "workflow" that processes events from the outbox. This turns "at most once" semantics (where an event could be dropped entirely) to "at least once" semantics (where the event processing could run multiple times). For many systems, that's a big improvement.

The outbox is basically a local queue in front of the remote queue.

That's a good tradeoff I suppose. I've been racking my brain trying to find a solution recently that solves both of these but haven't been able to.

What I had landed on was idempotency on a best effort basis and just made the event processing safely retryable without violating any system invariants.

Too much work. Don't try to act as the sender yourself. The outbox pattern leaves that as an exercise for the reader. How many receivers do you have? Do they come and go while you're trying to send messages? Do they need to find out about old messages that were send before they came online?

There is much better alternative than this motte-and-bailey argument of "outbox GUARANTEES blah for a distributed system - but only within a single node".

Just write down what happened in Kafka. N followers read from Kafka to find out what happened.

Kafka is actually distributed tech. You can lose nodes and keep operating.

No need to design for atomicity. You either wrote to Kafka or you didn't.

The message is written to the outbox table in the same transaction as the database changes. Only if the transaction completes, the message is actually created, and other tables are updated.

In a second step the message is taken from the outbox and gets sent to the queue/broker. Only after it was sent out, the message is removed from the outbox. If the sending fails, it stays in the outbox and is retried. If the deletion of the message from the outbox fails after sending, it's getting re-sent later. So you can get a duplicated out-message.

Message brokers usually don't de-duplicate messages, they don't have a database that keeps messages, the receivers need to do that. Either with idempotency, or by tracking message ids. Event sourcing brokers can de-duplicate, because it can stores all messages.

If you never delete messages from the outbox, then they are re-sent all the time. You are going to notice such a bug really quickly.

Inbox pattern works very similarly, but the other way around.

No, you're right. It basically just passes the buck. But the general idea is that if your transaction succeeds, you KNOW that there is a durable record that some external thing needs to end up in a message bus. And then something else can sit there and spin retries until it happens. It gives you the opportunity for retrying getting it onto the message bus, out of band of the process that is trying to initiate the enqueue.

And the outbox pattern isn't bs - it DOES help a lot in practice. But exactly how much it _guarantees_ something happens is of course still quite limited. And yes as you note it's an At-least-once strategy.

Argument against: The system can't guarantee the message queue receives it, if the transaction is considered finished after the outbox commit.

Scenario: The system turns out to have a data dependent bug that prevents that message from being received by the message broker.

[flagged]

> Sure enough it sounded like this article

FWIW The article literally talks about the challenges with getting this to actually work and recommends removing it and just using the DB for everything.

But that's what the outbox pattern is. You take the problem of transacting between more than one system, and by "just using the db", you declare the problem solved, leaving the communication with other systems as an exercise for the reader.

From the end of the article:

  The enqueue_workflow UDF creates this row in the same transaction as the user database update, guaranteeing atomicity

Your right! But the outbox pattern is good enough for a lot of purposes. The outbox pattern works if the only reason the write to the 2nd system can fail is because of transient issues. It will keep trying until the system is back up.

If the 2nd system write can fail for non-transient reasons, the outbox pattern doesn’t work and you need either 2 phase commit or a distributed saga.

I wrote about this here a few years ago.

https://linuxblog.io/the-two-generals-problem/

The point is that the "outbox pattern" is not an atomic transaction. You fundamentally don't have those in the distributed world, via the CAP theorem, and if you want anything close to the guarantees a local transactional database gives you in a distributed system you have to design your schema for it.

Distributed coherency is not something you can abstract away, the abstractions all leak.

It's an odd question but it's not impossible. Some database products have a full transactional message queue product built in, at which point it's easy. This might sound like "cheating" but why? The assumption that MQs and databases are necessarily different or run in different transactional domains is one you can void by just spending some money.

[boilerplate] Disclosure: I work part time in the Oracle DB team and opinions are my own. [/boilerplate]

This feature is one big reason so many companies use Oracle, it offers this out of the box. It has AQ (Advanced Queuing) and the more modern TxEQ which is all built on the same underlying mechanisms as the relational database engine, so queue pushes and pops are atomic with other transactions.

Postgres has an extension that claims to add an MQ too but I don't consider it safe to use personally, because it doesn't implement proper locking/dequeuing. Instead you get a visibility timeout, so you have to choose how long a message remains dequeued before it goes back onto the queue automatically. That's a harsh choice - in the case of unexpectedly slow message processing a second worker might start processing a message that's already in flight, causing data corruption or business correctness problems (e.g. double charging a customer).

A proper MQ product like TxEQ doesn't have this problem because dequeueing is implemented as you'd expect, so a message that's dq'd into a transaction remains invisible to other workers until either the transaction commits, rolls back or the session is terminated due to abandonment (client no longer responds to pings). You can't get multiple workers processing a message simultaneously unless there's a split brain scenario (really rare in practice and a fundamental limit).

Also useful: AQ/TxEQ are full spec-compliant message queue brokers that support the standard feature sets and semantics you normally need, like exception queues. PGMQ lacks these.

And finally Oracle DB scales horizontally as does the integrated MQ, so it's reasonable to have very high traffic apps that use integrated MQ/DB transactions. The newer TxEQ feature uses a similar scaling design as Kafka.

So it's interesting that this is being used as a technical interview question when the answer would seem trivial to any bank DBA.

I hadn't heard of throwing money at it before.

Maybe "Two Generals" doesn't work, but "Two Rich Generals" does.

If you compare prices vs cloud managed Postgres it's not that expensive.

Or rather, cloud managed Postgres is expensive, especially once you get into the cloud-specific forks of it that try to make it scale, because AWS/Azure/etc know that people will pay a lot of money for the Postgres brand but don't want to admin it themselves.

So the moment you commit to paying for a managed database you should check out the prices to rent an Oracle DB and see how it compares, especially because they flex well so on the smaller end it can end up being cheaper as you're renting only part of a machine. Plus a lot of times people will tell you that Postgres can do this or that, but then it requires some custom extension that is not necessarily available in your cloud's managed product. A lot of stuff that's extensions in Postgres are out of the box features in Oracle e.g. message queues or JavaScript support.

How much does it cost to rent a db that can do CAP?

What do you mean by do CAP?

If you mean the CAP theorem then that's an impossibility result, so...

Despite the person you responded to playing the fool, I found this discussion very interesting. I had never thought about a built in queue as a solution here.

Why not just put the message queue in the same db

That's what I generally choose. You don't need to worry about distributed system semantics, if you choose to not make the system distributed.

However the way Postgres keeps around obsolete rows (deleted or modified) until they're vacuumed can cause problems for high throughput queues. So for those systems the complexity might be worth it. But I bet 90% of the time the choice to use a separate queue is premature optimization. And hopefully OrioleDB (undo based storage engine for postgres) will avoid most of these pitfalls reducing the need for separate queues even further.

Step 1: identify that you and at least one other node are separated by distance, and some lossy communication channel, and therefore form a distributed system.

Step 2: propose a source of truth that everyone can listen to. Hearing the same facts in the same order should put everyone in the same state (eventual consistency)

Step 3 (you are here): try to do better than EC, by merging the external queue into one of the nodes, making it the master.

Step 4: Now there's no distance between the nodes, so no need to solve the distributed systems problem and you can retire the queue.

eventual consistency as generally used doesn't guarantee that events are presented in the same order. I use 'monotonic consistency' for that, but idk how common that is.

Order independence/monotonicity is strong EC rather than regular EC.

Yes, same-ordering gives you EC, not the other round.

That's what the post is about! Once you're doing that, you really do have transactions between the state and the queue.

The bridge between inbox/outbox and queue is not perfect. But it derisks the process a lot. It is much saver to insert a (ideally idempotent) message into the database and then (without transaction) confirm it to the queue than running the whole business process. The likelihood the process will fail is much higher than the inbox / outbox. These patterns also keep your brokers queue empty and allows you to gracefully shutdown your systems.

At that point why not just keep the broker shutdown and have the parties read from the DB and update an ownership column? Personally, I find the whole message queue thing the typical waste of time, absurdly complex and pointlessly corrupting component you get when you pretend you are Twitter and have the problem of maybe delivering a lot of stuff single entry style instead of provable having handled a small amount of stuff that needs to be double entry handled because however small a professional gig is it still uses real accounting.

I somewhat agree with 'just use the db' over 'use the db as the state and the outbox'.

And message "queues" are probably a waste of time too.

Where you get a real benefit is in using a proper append-only ledger. This is a solved problem. Paxos and Raft both give you this on the theoretical side, and systems like Kafka give you practical implementations.

"Pull-based" systems are far, far easier to reason about. E.g., I'm going to update my packages now. I'm going to pull from git now. I'm going to GET news.ycombinator.com now. Imagine the opposite - news.ycombinator deciding to push the frontpage to whichever device I'm using, at the precise time I'm hoping to read it.

So pull is better, but if you can only pull, then how does anyone change any state? Push a new message into Kafka, and let it handle the switch from push to pull.

It may be absurdly complex, but it's the least absurdly complex option if you want to distribute. And if you don't want to distribute, you don't need outbox.

I agree with you. For one, if you are small, build a good monolith and call it a day. So much more efficient than anything else.

Stream based systems where you maintain your own curser are a strong architectural decision similiar to a messaging systems. They also have their downsides.

Lastly on the last sentence: as soon as you need reliable processing of external input or output, inbox/outbox are needed. You are distributed because of your payment processor, because of your user email sending, etc. You do not want to block your core job processor just because the email server is overloaded right now.

If you write a good monolith, you would do exactly that. You message between the subsystems and you are good with just adjust the "ownership" within a transaction.

If you are distributed you have the problem of shared databases. It would break your Microservice ownership etc if you all operate on one database. It is an anti pattern. For very good reasons.

In the past there have been distributed transactions between databases or other systems but they fell out of love due to their proprietary and limited nature (e.g. Microsofts MSDTC)

Is there a reason why two-phase commit can't work with a DB and a message queue? DB2 and MQ Series used to support this (though they called it "XA" transactions and you had to compile support into the drivers which felt a bit sketchy - late 90s I think). Should I have been suspicious of this?

I have the starting state:

  DB={}    Q={}
I would like to either remain in the starting state, or enter a new state:

  DB={Bob paid $15}  Q={Bob paid $15}
But this is Two Generals, which is impossible.

If you invoke 2PC, you want the states to progress thus:

  DB={locked}        Q={locked}

  DB={locked;        Q={locked;
      Bob paid $15}     Bob paid $15}

  DB={locked;        Q={locked;
      Bob paid $15;     Bob paid $15;
      unlocked}         unlocked}
A strictly harder problem, right?

Ah, you're assuming distributed, which I wasn't. We are not at home to partition tolerance :)

Makes sense in the context of the original post though.

But you are assuming distributed transactions as soon as you have two systems participating in the transaction.

It's a recipe for deadlocks and even live locks.

That's a reason industry moved away from this. Bc when it works it's magic. But when the problems start it's pure hell.

Yeah I once saw a system designed about 2000 or so that pulled messages from an MQ queue and updated a database all within a single transaction managed by COM+. To be honest the distributed transaction side of it seemed more bother than it was worth...

Yeah, for a lot of systems it really is more trouble than it's worth - I remember enjoying this article when it came out: https://www.enterpriseintegrationpatterns.com/ramblings/18_s...

[dead]

Just post to the database then asynch send to message queue. Messages should still be idempotent by the consumer but at least this follows rest and is transactional.

It’s simple and easy to follow. At scale use multi tenancy.

The way they worded that question is bad and, as you say, the outbox pattern does not transactionally update the queue itself. The outbox pattern is nevertheless very useful.

It's a bit of trick that the outbox to queue part of it likely needs to support "at least once but duplicates possible" into the queue.

"Send multiple times from D to Q and deduplicate with a UUID" (idempotency) is well short of "insert into both D and Q or neither" (atomicity)

What are you saying here? I'm pointing out that you need to be ok with the lack of exactly once transaction between O and Q. Maybe you're agreeing and simply saying that's a fine?

Every item will be written to the queue exactly once (as the update is transactional). Queue processing may need at-least-once semantics, yes, depending on what exactly you're doing.

The queue write is not in the transaction. The proposed trick is that that is ok because an outbox is able to be transacted on. It kicks the can some what...

I envy you DB + distributed systems specialists. Reminds me I still have a lot to learn.

Youtube. Design interviews and anything about how cloud services work etc.

One major trick in distributed systems is to always attempt things in the same order. And then locally, you just store what you’ve seen, for “a long time”. That takes care of a lot of transactional issues — idempotency, retries, exactly-once delivery with no distributed locks, etc.

But as someone who builds distributed systems, I can tell you that transactions should be local. Anytime you want to lock something across the network (eg Canisters in ICP) so you can read it, that’s probably a code smell. You probably want to have evented reactive things ripple out, you do need idempotency, but you shouldn’t design your system to read remote state if you can help it. Only subscribe to remote messages.

"One major trick in distributed systems is to always attempt things in the same order"

This is inportant in DBs in general to avoid deadlocks by two requests taking locks in different order.