My suggestion would be even simpler:
MQTT -> Postgres (+ S3 for archive)
> 1. my "fear" would be that if I use the same Postgres for the queue and for my business database...
This is a feature, not a bug. It lets you pair the handling of a message with the business data changes that result from it, in the same transaction. This isn't quite "exactly-once" handling, but it's really, really close!
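For concreteness, a minimal sketch of that pairing (table and column names like `messages` and `accounts` are invented for illustration): claim one pending row, apply the business change it describes, and mark it handled, all inside a single transaction.

```sql
-- Sketch only: "messages" and "accounts" are invented names.
-- Claim one pending message, apply the business change it describes,
-- and mark it handled, all in one transaction. If anything fails,
-- everything rolls back and the message stays pending.
BEGIN;

WITH msg AS (
    SELECT id, account_id, amount
    FROM messages
    WHERE handled_at IS NULL
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
),
applied AS (
    UPDATE accounts a
    SET balance = a.balance + msg.amount
    FROM msg
    WHERE a.account_id = msg.account_id
    RETURNING msg.id AS msg_id
)
UPDATE messages m
SET handled_at = now()
FROM applied
WHERE m.id = applied.msg_id;

COMMIT;
```

If the handler dies before COMMIT, the row lock is released and the message simply gets picked up again later, which is where the "really close to exactly-once" behavior comes from.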
> 2. also that since it would write messages in the queue and then delete them, there would be a lot of GC/Vacuuming
Generally it's best practice in this case to never delete messages from a SQL "queue", but to mark them consumed in place and periodically archive them to a long-term storage table (rough sketch below). This provides in-context historical data which can be super helpful when you need to write a script to undo or mitigate bad code which resulted in data corruption.
Alternatively, when you need to roll back to a previous state, this often gives you a "poor woman's undo": restore a time-stamped backup, copy over the messages which arrived since the restoration point, then let the engine run forward and process those messages. (This is a simplification of course, and not always directly possible, but data recovery is often a matter of mitigations and least-bad choices.)
Basically, saving all your messages provides both efficiency and data recovery optionality.
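A rough sketch of the toggle-and-archive pattern described above, with invented table names (`messages`, `messages_archive`):

```sql
-- Sketch only: table names are invented.
-- Handlers flip handled_at instead of deleting:
UPDATE messages SET handled_at = now() WHERE id = $1;

-- A periodic job (nightly, say) moves old handled rows to long-term storage
-- in one statement, assuming messages_archive has the same columns:
WITH moved AS (
    DELETE FROM messages
    WHERE handled_at < now() - interval '30 days'
    RETURNING *
)
INSERT INTO messages_archive
SELECT * FROM moved;
```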
> 3...
Legit concern, particularly if you're trying to design your service abstraction to match an eventual evolution of your data platform.
> 4. don't provide "fanout" for multiple things
What they do provide is running multiple handlers against the same queue, where you might have n handlers (each with its own "handled_at" timestamp column in the DB) running at different priorities. This doesn't allow for workflows (i.e. a cleanup step), but it does allow different processes to run on the same queue with different privileges or priorities. So a slow process (archiving?) could run opportunistically or in batches, while time-sensitive work (alerts, outlier detection, etc.) can always run instantly. Or archiving can be done by a process which lacks access to any user data, to algorithmically enforce PCI boundaries. Etc.
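A sketch of the n-handlers-one-queue idea, assuming the per-handler timestamp columns described above (all names invented):

```sql
-- Sketch only: two handlers, each with its own "handled_at"-style column.
ALTER TABLE messages
    ADD COLUMN alert_handled_at   timestamptz,
    ADD COLUMN archive_handled_at timestamptz;

-- Time-sensitive handler (alerts, outlier detection): small claims, polls constantly,
-- then sets alert_handled_at = now() for the row it processed.
SELECT id, payload
FROM messages
WHERE alert_handled_at IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Slow handler (archiving): big batches, runs opportunistically, and can be a
-- separate role that is only GRANTed the columns it actually needs.
SELECT id
FROM messages
WHERE archive_handled_at IS NULL
ORDER BY id
LIMIT 1000
FOR UPDATE SKIP LOCKED;
```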
> This is a feature, not a bug. In this way you can pair the handling of the message with the business data changes which result in the same transaction.
That’s a particularly nasty trap. Devs will start using this everywhere, and it makes it very hard to move beyond Postgres when you need to.
I’d keep a small transactional outbox for when you really need it and encourage devs to use it only when absolutely necessary.
I’m currently cleaning up an application that has reached the limit of vertical scaling with Postgres. A significant part of that is because it uses Postgres for every background work queue. Every insert into the queue is in a transaction: do you really want to roll back your change because a notification job couldn’t be enqueued? Probably not. But the ability is there, and it’s so easy to use that it gets overused.
Now I get to go back through hundreds of cases and try to determine whether the transactional insert was intentional or just someone not thinking.
The problem is that either you have this feature or you don't; misusing it is a separate problem. Not having a feature sucks, and most distributed databases will even give you options for consistent (slow-ass) reads.
If you have a database that supports transactions and something like SKIP LOCKED, you always have the option of building a transactional outbox when you need it.
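Something like this, as a sketch (`accounts` and `outbox` are invented names, and the broker on the other end is whatever you actually use):

```sql
-- Producer side: the business change and the outbox row commit atomically.
BEGIN;
UPDATE accounts SET balance = balance + 42 WHERE account_id = 'abc';
INSERT INTO outbox (topic, payload)
VALUES ('account_credited', '{"account_id": "abc", "amount": 42}');
COMMIT;

-- Relay side: claim a batch of unsent rows without blocking other relays,
-- publish them to the broker, then mark them sent.
BEGIN;
SELECT id, topic, payload
FROM outbox
WHERE sent_at IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;
-- ...publish, then, for the ids returned above:
UPDATE outbox SET sent_at = now() WHERE id IN (1, 2, 3);
COMMIT;
```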
> Generally it's best practice in this case to never delete messages from a SQL "queue", but toggle them in-place to consumed and periodically archive to a long-term storage table.
Ignoring the potential uses for this data, what you suggested has the exact same effect on Postgres at a tuple level. An UPDATE is essentially the same as a DELETE + INSERT, due to its MVCC implementation. The only way around this is with a HOT update, which requires (among other things) that no indexed columns were updated. Since presumably in this schema you’d have a column like is_complete or is_deleted, and a partial index on it, as soon as you toggle it, it can’t do a HOT update, so the concerns about vacuum still apply.
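To make that concrete, a sketch of the kind of schema being described (names invented): the partial index's predicate references `is_complete`, so flipping that flag touches an "indexed" column and disqualifies the row from a HOT update.

```sql
-- Sketch of the schema in question (names invented).
CREATE TABLE queue (
    id          bigserial PRIMARY KEY,
    payload     jsonb,
    is_complete boolean NOT NULL DEFAULT false
);

-- Partial index so "find pending work" stays cheap:
CREATE INDEX queue_pending_idx ON queue (id) WHERE NOT is_complete;

-- This touches a column referenced by the partial index's predicate, so it
-- can't be a HOT update: it writes a new tuple version and leaves the old
-- one behind for vacuum, much like a DELETE + INSERT would.
UPDATE queue SET is_complete = true WHERE id = 1;
```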
> This is a feature, not a bug.
Until your Postgres instance goes down (even for reasons unrelated to Postgres itself), and then you have no fallback or queue for elasticity.