Out of interest, did you try changing the value of commit_delay? This parameter allows multiple transactions to be written together under heavy load.

IMHO, with this type of issue it is often more likely blowing through the multixact cache, or the query planner reverting to a SEQSCAN due to the number of locks, or mxact ID exhaustion, etc. It is most likely not a WAL flush problem that commit_delay would help with.
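
One quick way to tell the difference, assuming you can query the server while it is stalled, is to look at what the backends are actually waiting on; WAL-related wait events point at flushing, MultiXact-related ones point at the SLRU lookups. Just a rough check, not something from the linked thread:

    -- Group the non-idle backends by what they are waiting on.
    -- WALWrite/WALSync suggests a WAL flush bottleneck; MultiXact*
    -- wait events suggest multixact offset/member lookups instead.
    SELECT wait_event_type, wait_event, count(*)
      FROM pg_stat_activity
     WHERE state <> 'idle'
     GROUP BY 1, 2
     ORDER BY count(*) DESC;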

From the linked thread [1]:

> I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a small amount but this is not that helpful so problems are still very apparent.

> The queue table itself fits in RAM (with 2M hugepages) and during the wait, all the performance counters drop to almost 0 - no disk read or write (semi-expected due to the table fitting in memory) with 100% buffer hit rate in pg_top and row read around 100/s which is much smaller than expected.

Bullet points 2 and 3 from here [2] are what first came to mind, due to the 100% buffer hit rate.
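
If either counter really is getting close, the per-database ages make it obvious. A minimal check, assuming a version new enough to have mxid_age():

    -- Distance of each database from XID / MXID wraparound (both are 32-bit counters).
    SELECT datname,
           age(datfrozenxid)    AS xid_age,
           mxid_age(datminmxid) AS mxid_age
      FROM pg_database
     ORDER BY mxid_age DESC;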

Note that vacuuming every 30 minutes only "helps a small amount", but the worst case is:

    25,000 tps * 60 sec * 30 min * 250 rows == 11,250,000,000 IDs
    (assuming the worst case of every client locking conflicting rows)

Even just:

    25,000 tps * 60 sec * 30 min == 45,000,000

is only two orders of magnitude away from blowing through the 32-bit transaction IDs:

    45,000,000
    4,294,967,296

But XID exhaustion is not as hidden as MXID exhaustion: it will block all writes, while the harder-to-see MXID exhaustion will only block some writes.
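
The thresholds where autovacuum steps in to prevent either kind of wraparound are visible in pg_settings; the point is just to compare them against the ages above:

    -- Autovacuum forces aggressive freezing once age()/mxid_age() pass these.
    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('autovacuum_freeze_max_age',
                    'autovacuum_multixact_freeze_max_age');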

IMHO, if I were writing this, and knowing that you are writing an orchestration platform, getting rid of the long-term transactions in favor of just a status column would be better; row-level locks are writing to the row anyway, actually twice:

    tuple lock -> write row lock to xmax column -> release tuple lock.
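
You can watch that xmax write happen; a tiny sketch, with the table and id made up for illustration:

    BEGIN;
    -- Taking the row lock stamps this transaction's xid into the row's xmax.
    SELECT xmin, xmax, id FROM queue WHERE id = 1 FOR UPDATE;
    -- From another session, while this transaction is still open:
    --   SELECT xmax FROM queue WHERE id = 1;   -- shows the locker's xid
    COMMIT;
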
Long-lived transactions are always problematic for scaling, and that status column would allow for more recovery options, etc.
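
The usual shape of that is to claim a batch and flip the status in one short transaction, then do the real work with no transaction held open. A rough sketch only, with the table and column names invented:

    -- Claim up to 250 queued rows in one short transaction and mark them running.
    UPDATE queue q
       SET status = 'running', claimed_at = now()
     WHERE q.id IN (SELECT id
                      FROM queue
                     WHERE status = 'queued'
                     ORDER BY id
                     LIMIT 250
                       FOR UPDATE SKIP LOCKED)
    RETURNING q.id;
    -- Work on the returned ids outside of any transaction, then set the
    -- status to 'done' (or back to 'queued' on failure/timeout).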

But to be honest, popping the leftmost node off a red-black tree, like the Linux scheduler does, is probably so much better than fighting this, IMHO.

This opinion assumes I am reading this right from the linked issue [1]:

> SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes (10-20 minutes) before completing

There is an undocumented function, pg_get_multixact_members() [3], that can help troubleshoot this; since many people are using hosted Postgres, the tools available to look into the above problems can be limited.
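
Roughly, you take the xmax of a stuck row and, if it is a multixact, feed it to that function to see the member transactions and their lock modes. Since it is undocumented this is only a sketch, and the table name and id are made up:

    -- The xmax of a locked row is a multixact id when multiple lockers are involved.
    SELECT xmax FROM queue WHERE id = 42;
    -- Suppose that returned 123456; list its member xids and their lock modes:
    SELECT * FROM pg_get_multixact_members('123456');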

It does appear that Amazon documents a bit about the above here [4].

[1] https://postgrespro.com/list/thread-id/2505440
[2] https://www.postgresql.org/docs/current/routine-vacuuming.ht...
[3] https://doxygen.postgresql.org/multixact_8c.html#adf3c97f22b...
[4] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...