I am about to start a project. I know I want an event-sourced architecture. That is, the system is designed around a queue, and all actors push to and pull from that queue. This article gives me some pause.
Performance isn't a big deal for me. I had assumed that Kafka would give me things like decoupling, retry, dead-lettering, logging, schema validation, schema versioning, and exactly-once processing.
I like Postgres, and obviously I can write a queue on top of it, but that seems like quite a lot of effort?
Kafka also doesn't give you all those things. E.g. there is no automatic dead-lettering, so a consumer that throws an exception will retry endlessly and block all progress on that partition. Kafka only stores bytes, so schema is up to you. Exactly-once is good, but there are caveats: you have to use Kafka transactions, which work significantly differently from normal operation, and any external system may observe at-least-once semantics instead. Similar exactly-once semantics would also be trivial in an RDBMS (i.e. produce and consume in the same transaction).
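To illustrate that last point, here's a minimal sketch (Python with psycopg2; the inbox/outbox tables are hypothetical) of consuming and producing in one Postgres transaction, so both commit or neither does:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection string

    with conn:  # psycopg2: commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Claim one message; SKIP LOCKED lets concurrent workers coexist.
            cur.execute("""
                DELETE FROM inbox
                WHERE id = (SELECT id FROM inbox
                            ORDER BY id
                            FOR UPDATE SKIP LOCKED
                            LIMIT 1)
                RETURNING id, payload
            """)
            row = cur.fetchone()
            if row is not None:
                msg_id, payload = row
                # "Produce" in the same transaction: the consume and the
                # downstream write succeed or fail together.
                cur.execute(
                    "INSERT INTO outbox (source_id, result) VALUES (%s, %s)",
                    (msg_id, payload),
                )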
If you plan on retaining your topics indefinitely, schema evolution can become painful since you can't update existing records. Changing the number of partitions in a topic is also painful, and choosing the initial count is difficult. You might want to build your own infrastructure for rewriting a topic and directing new writes to the new topic without duplication.
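The core of such a rewrite tool is just a copy loop; here's a rough sketch with the confluent-kafka client (topic and group names are made up, and the dedup/cutover logic is omitted):

    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "topic-rewriter",   # hypothetical group id
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    consumer.subscribe(["events-v1"])   # old topic
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        # Re-key or re-schema here as needed, then write to the new topic.
        producer.produce("events-v2", key=msg.key(), value=msg.value())
        producer.poll(0)        # serve delivery callbacks
        consumer.commit(msg)    # commit after produce: at-least-once copy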
Kafka isn't really a replacement for a database or anything high-level like a ledger. It's really a replicated log, which is a low-level primitive that will take significant work to build into something else.
I want to rewrite some of my setup. We're doing IoT, and I was planning on:
MQTT -> Redpanda (for message logs and replay, etc) -> Postgres/Timescaledb (for data) + S3 (for archive)
(and possibly Flink/RisingWave/Arroyo somewhere, in order to do alerting, incrementally updated materialized views, etc.)
This seems "simple enough" (though I don't have any experience with Redpanda), but it is indeed one more moving part compared to MQTT -> Postgres (as a queue) -> Postgres/Timescaledb + S3.
Questions:
1. My "fear" would be that if I use the same Postgres for the queue and for my business database, the "message ingestion" part could sometimes block the "business" part (locks, etc.)? Also, when I want to update the schema of my database without "stopping" the inflow of messages, I'm not sure that would be easy?
2. Also, since the queue would write messages and then delete them, there would be a lot of GC/vacuuming to do, compared to my business database, which is mostly append-only?
3. And if I split the "Postgres queue" from the "Postgres database" as two different processes, I have "one less tech to learn", but I still have to get used to pgmq, integrate it, etc. Is that really much easier than adding Redpanda?
4. I guess most Postgres queues are also "simple" and don't provide "fanout" to multiple consumers (e.g. I want to take one of my IoT messages, clean it up, store it in my TimescaleDB, archive it to S3, and also run an alert detector on it; see the sketch after this list).
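To be concrete about 4: the fanout I'd otherwise hand-roll is roughly an append-only log plus a per-consumer cursor. A sketch in Python/psycopg2, with made-up table names:

    import psycopg2

    def poll(conn, consumer_name, handle, batch=100):
        """Deliver new log entries to one named consumer (e.g.
        'timescale-writer', 's3-archiver', 'alert-detector');
        each consumer tracks its own cursor."""
        with conn:  # one transaction: cursor advances only if handle() succeeds
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT last_id FROM consumer_offsets"
                    " WHERE consumer = %s FOR UPDATE",
                    (consumer_name,),
                )
                (last_id,) = cur.fetchone()
                cur.execute(
                    "SELECT id, payload FROM message_log"
                    " WHERE id > %s ORDER BY id LIMIT %s",
                    (last_id, batch),
                )
                rows = cur.fetchall()
                for _msg_id, payload in rows:
                    handle(payload)
                if rows:
                    cur.execute(
                        "UPDATE consumer_offsets SET last_id = %s"
                        " WHERE consumer = %s",
                        (rows[-1][0], consumer_name),
                    )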
What would be the recommendation?
Very interesting.
I need a durable queue but not indefinitely. Max a couple of hours.
What I want is Google PubSub but open source so I can self host.
At small scale, Beanstalkd (https://beanstalkd.github.io/) can get you pretty far.
At larger scale, RabbitMQ can handle some pretty heavy workloads.
If you need a durable log (which it sounds like you do, if you're going with event sourcing) that has those features, I'd suggest Apache Pulsar. You effectively get streams with message-queue semantics (per-message acks, retries, DLQ, etc.) from one system. It supports many different subscription types, so you can use it for a bunch of different use cases. Running it on your own is a bit of a beast, though, and there's really only one hosted provider in the game (StreamNative).
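To give a feel for those semantics, here's a rough sketch with the Pulsar Python client: a Shared subscription behaves like a work queue with per-message acks, and a negative ack triggers redelivery (with a dead-letter policy configured, repeated failures get routed to a DLQ topic instead of retrying forever). Topic and subscription names here are made up:

    import pulsar

    def process(data: bytes) -> None:
        print(data)  # stand-in for real work

    client = pulsar.Client("pulsar://localhost:6650")
    consumer = client.subscribe(
        "persistent://public/default/events",      # hypothetical topic
        "my-subscription",
        consumer_type=pulsar.ConsumerType.Shared,  # queue-like competing consumers
    )

    while True:
        msg = consumer.receive()
        try:
            process(msg.data())
            consumer.acknowledge(msg)           # per-message ack
        except Exception:
            consumer.negative_acknowledge(msg)  # redelivered later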
Note that Kafka has recently started investing in queues with KIP-932, but it's still a long way from implementing all of those features.
A standalone Pulsar is actually a great way to learn Pulsar. It's one command to get started: bin/pulsar standalone
It can also be used in production. You do not have to build a distributed Pulsar cluster immediately. I have multiple projects running on a standalone Pulsar cluster, because it's easy to set up and requires almost no maintenance. Doing it that way also makes compliance requirements for isolation simpler, with fewer fights: everyone understands host/VM isolation, few understand Pulsar tenant isolation.
If you want a distributed Apache Pulsar cluster, be prepared to work for that. We run a cluster on bare metal. We considered Kubernetes, but performance was lacking. We are not Kubernetes experts.
If you build it right, the underlying storage engine for your event stream should be swappable for any other event-stream tech. It could be SQLite, PSQL, Kafka, Kinesis, SQS, Rabbit, Redis... really, anything can serve this need. The right tool will appear once you dial in your architecture. Treat storage as a black-box API with "push", "pop", etc. commands. When your initial engine falls over, switch to a new one and expose the same API.
The bigger question to ask is: will this storage engine be used to persist and retain data forever (like a database), or will it be used more for temporary transit of data from one spot to another?
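A sketch of what that black box might look like in Python (the names are just illustrative):

    from typing import Optional, Protocol

    class EventStore(Protocol):
        """The only surface the rest of the system sees."""
        def push(self, payload: bytes) -> None: ...
        def pop(self) -> Optional[bytes]: ...

    class InMemoryStore:
        """Toy engine; a PostgresStore, SQSStore, etc. would expose the
        same two methods, so swapping engines never touches business code."""
        def __init__(self) -> None:
            self._items: list[bytes] = []

        def push(self, payload: bytes) -> None:
            self._items.append(payload)

        def pop(self) -> Optional[bytes]:
            return self._items.pop(0) if self._items else None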
Event-sourcing != queue.
Event sourcing is when you buy something, get a receipt, and stick it in a shoebox for tax time.
A queue is when you're handed receipts and you look at them in the correct order before throwing each one away.
True.
I think my system is sort of both. I want to put events in a queue for a finite amount of time, process them as a single consolidated set, and then drop them all from the queue.
> I had assumed that Kafka would give me things like decoupling, retry, dead-lettering, logging, schema validation, schema versioning, exactly once processing.
If you don't need a lot of perf but you place a premium on ergonomics and correctness, this sounds more like you want a workflow engine? https://github.com/meirwah/awesome-workflow-engines
One thing I learned with Kafka and Cassandra is that you are locked into a design pretty early on. Then the business changes their mind, it takes a great deal of rework, and then they accuse you of being incompetent, because they are used to SQL projects that have far more flexibility.
Perhaps I do. I know that I don't want a system defined as a graph in YAML, or no-code. Those options are over-engineered for my use case. I'm pretty comfortable building some Docker containers and operating them, and this is the approach I want to use.
I'm checking out the list.
If what you want is a queue, Kafka might be overkill for your needs. It's a great tool, but it definitely has a lot of complexity relative to a straightforward queue system.
It might look like a lot of effort, but if you follow a tutorial or YouTube video step by step, you will be surprised.
It’s mostly registering the Postgres database functions, which is a one-time task.
There are also pre-made Postgres extensions that already run the queue.
These days I would consider starting with self-hosted Supabase, which has Postgres ready to tweak.
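For the extension route, pgmq exposes the whole queue as plain SQL functions; a rough sketch from Python (the queue name is made up; check the pgmq docs for the exact signatures):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pgmq.create('iot_events')")  # one-time setup
            cur.execute("SELECT pgmq.send('iot_events', %s::jsonb)",
                        ('{"device": 1, "temp": 21.5}',))
            # Read one message with a 30-second visibility timeout.
            cur.execute("SELECT msg_id, message FROM pgmq.read('iot_events', 30, 1)")
            row = cur.fetchone()
            if row is not None:
                msg_id, message = row
                # ... process the message, then archive (or delete) it ...
                cur.execute("SELECT pgmq.archive('iot_events', %s)", (msg_id,))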