I've been looking at migrating to Temporal, but this looks interesting.
For context, we have a simple (read: home-built) "durable" worker setup that uses BullMQ for scheduling/queueing, but all of the actual jobs are Postgres-based.
Due to the cron-like nature of our many disparate jobs (bespoke AI-native workflows), our workers scale up and down basically on the hour, every hour.
Temporal is the obvious solution, but it will take some rearchitecting to get our jobs to fit their structure. We're also concerned with some of their limits (payload size, language restrictions, etc.).
Looking at DBOS, it's unclear from the docs how to scale the workers:
> DBOS is just a library for your program to import, so it can run with any Python/Node program.
In our ideal case, we can add DBOS to our main application for scheduling jobs, and then have a simple worker app that scales independently.
How "easy" would it be to migrate our current system to DBOS?
As another commenter said, Temporal is quite tricky to self-host and scale in a cost-effective manner. This is also reflected in their cloud pricing (which should've been the warning sign to us, tbh).
Overall it's a pretty heavy/expensive solution, and I've come to the conclusion that its usage is best limited to lower-frequency and/or higher-"value" (e.g. revenue or risk) tasks.
Orchestrating a food delivery that's paying you $3 of service fees - good use case. Orchestrating some high frequency task that pays you $3 / month - not so good.
This was my problem with Dagster, too. All the documentation and all the examples encourage you to split items into small discrete tasks. Then you realize that their cloud pricing is absolutely bonkers if you go over the paltry 30k credits… unless you sign up for a meaty annual enterprise contract. Got a $500 bill for going something like 13k executions over the limit, i.e. fewer than 45k total executions in a month. Just for comparison, our main product's Sidekiq queue processes tens of millions of jobs every single day. Just a silly imbalance. I ended up having to combine a bunch of tasks to the point that I started asking myself why I was even bothering with it at all.
I'd love to learn more about what you're building--just reach out at peter.kraft@dbos.dev.
One option is that you have DBOS workflows that schedule and submit jobs to an external worker app. Another option is that your workers use DBOS queues (https://docs.dbos.dev/python/tutorials/queue-tutorial). I'd have to better understand your use case to figure out what would be the best fit.
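To give a rough idea of the second option, a worker using a DBOS queue looks something like this. This is a simplified sketch, not a drop-in implementation: the app name, database URL, queue name, and function names are placeholders, and the exact config keys should be checked against the queue tutorial.

    from dbos import DBOS, DBOSConfig, Queue

    # Placeholder config: point DBOS at the same Postgres you already use.
    config: DBOSConfig = {
        "name": "worker-app",
        "database_url": "postgresql://user:pass@localhost:5432/appdb",
    }
    DBOS(config=config)

    queue = Queue("ai_jobs")  # placeholder queue name

    @DBOS.step()
    def run_job(job_id: str) -> str:
        # The actual Postgres-based work goes here.
        return f"done: {job_id}"

    @DBOS.workflow()
    def run_job_workflow(job_id: str) -> str:
        return run_job(job_id)

    if __name__ == "__main__":
        DBOS.launch()  # this process now pulls and executes enqueued workflows
        # Enqueueing from anywhere in the app returns a durable handle:
        handle = queue.enqueue(run_job_workflow, "job-123")
        print(handle.get_result())

Because the queue state lives in Postgres, you can run as many copies of this worker process as you want and they'll share the queue.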
I'm also interested in what you think the best practices could be for having (auto-scaling) worker instances that pick up DBOS workflows and execute them.
Do you think an app's (e.g. FastAPI) backend should be the DBOS Client, submitting workflows to the DBOS instance? And then we could have multiple DBOS instances, each picking up jobs from a queue?
Yeah, I think in that case you should have auto-scaling DBOS workers all pulling from a queue and a FastAPI backend using the DBOS client to submit jobs to the queue.
Queue docs: https://docs.dbos.dev/python/tutorials/queue-tutorial Client docs: https://docs.dbos.dev/python/reference/client
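Concretely, the FastAPI side would look roughly like the sketch below. It's simplified: the connection string, queue name, and workflow name are placeholders and must match what the workers register, and the exact option names should be checked against the client reference.

    from dbos import DBOSClient, EnqueueOptions
    from fastapi import FastAPI

    app = FastAPI()
    # Point the client at the same Postgres database the DBOS workers use.
    client = DBOSClient("postgresql://user:pass@localhost:5432/appdb")

    @app.post("/jobs/{job_id}")
    def submit_job(job_id: str) -> dict:
        # Enqueue by name; the auto-scaling workers pull it off the "ai_jobs" queue.
        options: EnqueueOptions = {
            "queue_name": "ai_jobs",
            "workflow_name": "run_job_workflow",
        }
        handle = client.enqueue(options, job_id)
        # For long jobs you'd return immediately and fetch the result later;
        # blocking here just keeps the sketch short.
        return {"result": handle.get_result()}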
Unless you're planning on using their (Temporal's) SaaS, you're in for building a very large database cluster if you need any real scale.
(source: I run way more Cassandra than I ever thought reasonable)
Just got roped into setting up an on-prem Temporal cluster myself :(
What causes the need for massive database clusters? Now I'm worried this is going to fall apart on us in a very big way
Take a look at the official “basic scaling” guide, especially the metric on state transitions per second.
To get an idea of what you’ll need that metric to be, try running 1/10th of your workload as a benchmark against it (rough driver sketch at the end of this comment).
For our particular setup to handle barely 5,000 of these, we need almost 100 CPUs just for Cassandra. To double that, it’s 200 CPUs just for the database.
Oh, and make sure you get your history shard count right, as you can’t change it later without rebuilding the cluster.
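For the benchmark, the driver can be as dumb as a loop of workflow starts with the Python SDK while you watch the state-transitions/second metric on the server. The workflow name, task queue, and rates below are placeholders; you'd point it at workflows your own workers register.

    import asyncio
    import uuid

    from temporalio.client import Client

    async def main() -> None:
        client = await Client.connect("localhost:7233")
        # Fire roughly 1/10th of your production start rate at the cluster,
        # then watch the state transitions/second metric server-side.
        for i in range(10_000):
            await client.start_workflow(
                "YourWorkflow",            # placeholder: a workflow your workers register
                f"payload-{i}",
                id=f"bench-{uuid.uuid4()}",
                task_queue="bench-queue",  # placeholder task queue with workers attached
            )
            await asyncio.sleep(0.01)      # ~100 starts/sec; tune to your target rate

    if __name__ == "__main__":
        asyncio.run(main())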
Maybe it makes sense for low-volume, high-value jobs (e.g. Uber trips), but for high-volume, low-value work it doesn't work out economically.
We are likely to drop it.