For large applications in a service-oriented architecture, I leverage Kafka 100% of the time. With Confluent Cloud and Amazon MSK, infra is relatively trivial to maintain. There's really no reason to use anything else for this.
For smaller projects that are really just "job queues," I tend to use Amazon SQS or RabbitMQ.
But just for clarity, Kafka is not really a message queue -- it's a persistent, structured log that can be used as one. Each consumer tracks an offset into the log, and you can replay messages by resetting that offset. With a queue, once you pop an item off, it's no longer in the queue and is gone once it's consumed; with Kafka, you leave the message where it is and move an offset instead. This means, for example, that you can have many, many clients read from the same topic without issue.
SQS and other MQs don't have that persistence -- once you consume the message and ack it, the message disappears, and you can't "replay" it via the queue system. You have to re-submit the message to process it again. This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.
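The difference is easy to see in a few lines of plain Python. This is a toy model, not a real Kafka or SQS client: a log consumer just moves a cursor, so rewinding is free, while a queue consumer destroys the message on receipt.

```python
# Toy illustration of log vs. queue semantics (not a real Kafka/SQS API).
from collections import deque

class Log:
    """Kafka-style: messages stay put; each consumer tracks its own offset."""
    def __init__(self):
        self.entries = []
        self.offsets = {}          # consumer name -> next offset to read

    def append(self, msg):
        self.entries.append(msg)

    def poll(self, consumer):
        off = self.offsets.get(consumer, 0)
        if off >= len(self.entries):
            return None
        self.offsets[consumer] = off + 1
        return self.entries[off]

    def seek(self, consumer, offset):
        self.offsets[consumer] = offset   # "replay" = reset the offset

class Queue:
    """SQS-style: consuming a message removes it for everyone."""
    def __init__(self):
        self.items = deque()

    def send(self, msg):
        self.items.append(msg)

    def receive(self):
        return self.items.popleft() if self.items else None

log = Log()
q = Queue()
for m in ("a", "b"):
    log.append(m)
    q.send(m)

# Two independent consumers can each read the whole log...
assert [log.poll("x"), log.poll("x")] == ["a", "b"]
assert [log.poll("y"), log.poll("y")] == ["a", "b"]
log.seek("x", 0)                  # ...and replay from the start.
assert log.poll("x") == "a"

# The queue hands each message out once, then it's gone.
assert q.receive() == "a"
assert q.receive() == "b"
assert q.receive() is None
```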
There are pros and cons to either mechanism, and there's significant overlap in the usage of the two systems, but they are designed to serve different purposes.
The analogy I tend to use is that Kafka is like reading a book. You read a page, you turn the page. But if you get confused, you can flip back and reread a previous page. An MQ like RabbitMQ or Sidekiq is more like the line at the grocery store: once the customer pays, they walk out and they're gone. You can't go back and re-process their cart.
Again, pros and cons to both approaches.
"What didn't work out?" -- I've learned in my career that, in general, I really like replayability, so Kafka is typically my first choice, unless I know that re-creating the messages are trivial, in which case I am more inclined to lean toward RabbitMQ or SQS. I've been bitten several times by MQs where I can't easily recreate the queue, and I lose critical messages.
"Where did you regret adding complexity?" -- Again, smaller systems that are just "job queues" (versus service-to-service async communication) don't need a whole lot of complexity. So I've learned that if it's a small system, go with an MQ first (any of them are fine), and go with Kafka only if you start scaling beyond a single simple system.
"And if you stuck with a DB-based queue -- did it scale?" -- I've done this in the past. It scales until it doesn't. Given my experience with MQs and Kafka, I feel it's a trivial amount of work to set up an MQ/Kafka, and I don't get anything extra by using a DB-based queue. I personally would avoid these, unless you have a compelling reason to use it (eg, your DB isn't huge, and you can save money).
> This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.
It depends on your use case (or maybe what you mean by "client"). If I just have a bunch of messages that need to be processed by "some" client, then having the message disappear once a client has processed it is exactly what you want.
Absolutely, if you only ever have one client, SQS or a message queue is perfectly fine!
We build applications very differently. SQS queues with 1000s of clients have been a go to for me for over a decade. And the opposite as well — 1000s of queues (one per client device, they’re free). Zero maintenance, zero cost when unused. Absurd scalability.
Certainly. There are many paths to victory here.
One thing to consider is whether you _want_ your producers to be aware of the clients or not. If you use SQS, then your producer needs to be aware of where it's sending the message. In event-driven architecture, ideally producers don't care who's listening. They just broadcast a message: "Hey, this thing just happened." And anyone who wants to subscribe can subscribe. The analogy is a radio tower -- the radio broadcaster has no idea who's listening, but thousands and thousands of people can tune in and listen.
Contrast that with making a phone call, where you have to know whom you're dialing and you can only talk to one person at a time.
There are pros and cons to both, but there's tremendous value in large applications for making the producer responsible for producing, but not having to worry about who is consuming. Particularly in organizations with large teams where coordinating that kind of thing can be a big pain.
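The radio-tower idea is easy to see in code. Here's a toy in-process pub/sub (not a real broker API; the topic and event names are made up): the producer publishes to a topic name and never holds a reference to any consumer.

```python
# Toy pub/sub: the producer knows only the topic name, never the subscribers.
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> list of callbacks

def subscribe(topic, callback):
    subscribers[topic].append(callback)

def publish(topic, event):
    # The producer just broadcasts; zero, one, or many listeners may react.
    for cb in subscribers[topic]:
        cb(event)

seen_by_billing, seen_by_email = [], []
subscribe("order.created", seen_by_billing.append)
subscribe("order.created", seen_by_email.append)

publish("order.created", {"order_id": 42})   # producer unaware of who's listening
assert seen_by_billing == [{"order_id": 42}]
assert seen_by_email == [{"order_id": 42}]
```

A new team can add a subscriber to "order.created" without the producing team changing a line of code, which is exactly the coordination win in large organizations.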
But you're absolutely right: queues/topics are basically free, and you can have as many as you want! I've certainly done it the SQS way that you describe many times!
As I mentioned, there are many paths to victory. Mine works really well for me, and it sounds like yours works really well for you. That's fantastic :)
Hey, I'm curious how the consumers of those queues typically consume their data -- is it a job that polls, another piece of tech that helps scale up for bursts of queue traffic, etc.? We're using the Google equivalent, and I'm finding that a lot of compute resources are being used on both the publisher and subscriber sides. The use cases I'm talking about are mostly systems trying to stay in sync with some data, where the source system is the system of record and consumers use it for read-only purposes of some kind.
On the producer side I’d expect to see change data capture being directed to a queue fairly efficiently, but perhaps you have some intermediary that’s running between the system of record and the queue? The latter works, but yeah it eats compute.
On the consumer side, the duty cycle drives the design. If it's a steady flow, then a polling listener is easy to right-size. If the flow is episodic (long idle periods with unpredictable spikes of high load), one option is to put an alarm on the queue that fires when it goes from empty to non-empty, and handle that alarm by starting the processing machinery. That avoids the cost of constantly polling during dead time.
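In AWS terms that's typically a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric triggering a scaling action or Lambda. Stripped of the cloud parts, the shape of the pattern is just this (toy sketch; the class and callback names are made up):

```python
# Toy model of "wake the consumers when the queue goes empty -> non-empty",
# instead of polling continuously during dead time. Names are illustrative.
from collections import deque

class AlarmedQueue:
    def __init__(self, on_nonempty):
        self.items = deque()
        self.on_nonempty = on_nonempty   # e.g. "start the worker fleet"

    def send(self, msg):
        was_empty = not self.items
        self.items.append(msg)
        if was_empty:
            self.on_nonempty()           # alarm fires only on the transition

    def drain(self):
        while self.items:
            yield self.items.popleft()

wakeups = []
q = AlarmedQueue(on_nonempty=lambda: wakeups.append("start workers"))

q.send("job-1")       # empty -> non-empty: alarm fires once
q.send("job-2")       # already non-empty: no new alarm
assert wakeups == ["start workers"]
assert list(q.drain()) == ["job-1", "job-2"]
```

The key property is that the alarm is edge-triggered: a burst of a million messages still costs you one wakeup, and an idle week costs you nothing.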