The samples featured elsewhere seem to be from a larger model?
Having tested this locally, I find it still sounds quite mechanical, and it fails catastrophically on simple phrases with numbers ("easy as 1-2-3"). If the 80M model can improve on this and keep the expressiveness seen in the Reddit post, that looks promising.