On my current team we run a centralized task scheduler, used by other products in our company, that manages on the order of 30M schedules. It's a home-grown distributed system built on top of Postgres and Cassandra, with a full control plane and data plane. It's been pretty fun to work on.
There are two main differences between our system and the one in the post:
- In our scheduler, the actual cron (aka recurrence rule) is stored along with the task information. That is, you specify a recurrence (like "every 5 minutes" or "every second Tuesday at 2am") and the task will run according to that schedule. We try to support most of the RRULE specification. [1] If you want a task to run just once at some point in the future, you can totally do that too, but it's not our most common use case internally.
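Full RRULE evaluation is a lot richer than this (libraries like python-dateutil implement most of the spec), but the core "compute the next fire time" operation for the simple fixed-interval case can be sketched in a few lines of stdlib Python. This is just an illustration, not our actual implementation:

```python
from datetime import datetime, timedelta

def next_fire(dtstart: datetime, interval: timedelta, after: datetime) -> datetime:
    """Next occurrence of a fixed-interval schedule strictly after `after`.

    Occurrences are dtstart, dtstart + interval, dtstart + 2*interval, ...
    """
    if after < dtstart:
        return dtstart
    # How many whole intervals have elapsed since dtstart; the next
    # occurrence is one interval past that.
    n = (after - dtstart) // interval + 1
    return dtstart + n * interval

# "every 5 minutes", anchored at midnight
start = datetime(2024, 1, 1)
print(next_fire(start, timedelta(minutes=5), datetime(2024, 1, 1, 0, 3)))
# -> 2024-01-01 00:05:00
```

Calendar-based rules ("second Tuesday at 2am", BYDAY/BYSETPOS, etc.) can't be reduced to arithmetic like this, which is a big part of why supporting most of RRULE is real work.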
- Our scheduler doesn't perform a wide variety of tasks. To maximize flexibility and system throughput, it does just one thing: when a schedule is "due", it puts a message onto a queue. (Internally we have two queueing systems it interops with -- an older one built on top of Redis, and a newer one built on PG + S3.) Other teams consume from those queues and do the real work (sending emails, generating reports, etc). The queueing systems offer a number of delivery options (delayed messages, TTLs, retries, dead-letter queues) so the scheduling system doesn't have to handle those concerns itself.
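The real system is a distributed service backed by PG/Cassandra, not a single loop, but the core contract -- when a schedule is due, enqueue a message and compute the next fire time -- can be sketched like this. The in-memory heap, the tuple layout, and the message fields are all illustrative stand-ins:

```python
import heapq
from datetime import datetime, timedelta, timezone

def tick(heap, enqueue, now):
    """Drain every schedule due at `now`.

    `heap` holds (next_fire, schedule_id, interval) tuples; `enqueue` is the
    hand-off to the queueing system -- the scheduler never does the real work.
    """
    while heap and heap[0][0] <= now:
        fire_at, sid, interval = heapq.heappop(heap)
        enqueue({"schedule_id": sid, "fired_at": fire_at.isoformat()})
        # Reschedule for the next occurrence.
        heapq.heappush(heap, (fire_at + interval, sid, interval))

# usage: one schedule, due right now
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
heap = [(now, "send-report", timedelta(minutes=5))]
heapq.heapify(heap)
sent = []
tick(heap, sent.append, now)
print(sent)  # one message enqueued; heap now holds the 00:05 occurrence
```

Keeping the scheduler this narrow is what lets the queueing systems own retries, TTLs, and dead-lettering instead of duplicating that logic per task type.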
Ironically, because supporting a high throughput of scheduled jobs has been our biggest priority, visibility into individual task executions is a bit limited in our system today. For example, our API doesn't expose when a schedule last ran, though that's on our longer-term roadmap.
[1] https://icalendar.org/iCalendar-RFC-5545/3-8-5-3-recurrence-...