> Generation is speedy at about ~4 seconds per generation

May I ask on which GPU & VRAM?

edit: oh, unless you just meant through Hugging Face's UI

The open weights variant is "coming soon" so the only option is hosted right now.

It is through the Replicate UI listed, which goes through Black Forest Labs's infra, so you would likely get the same results from their API.