Hacker News

Yeah, in that way it's a lot like image generation. Maybe a single output is good in isolation, but if you want to generate a series maintaining some kind of consistent style, it's very much like a lottery. The models don't have dials to control emphasis, cadence, emotiveness, accent, etc., so they guess from the content. For example, imagine a serious scene that calls for a somber tone, but then one of the characters makes a dark or ironic joke. A human would maintain the same reading voice, but these models would instead switch to a much more chipper register for that one line.