If you're preserving accurate brightness, then yes, obviously 99% white needs 1 black pixel out of 100 on average, accounting for gamma. There's no line to draw; that's already part of the definition. (You can increase contrast for artistic effect, but that's a different conversation.)
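To make the gamma part concrete, here's a minimal sketch (Python, assuming an sRGB transfer function; the function names are mine): the white-pixel fraction has to match the target brightness in linear light, so if the target gray is given as an encoded value you convert it first, and if it's given as a linear brightness the fraction equals it directly.

```python
# Minimal sketch, assuming sRGB encoding: fraction of white pixels needed so
# a black/white dithered field averages, in linear light, to a target gray.

def srgb_to_linear(v: float) -> float:
    """Standard sRGB electro-optical transfer function, v in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def white_fraction(encoded_gray: float) -> float:
    # White contributes linear 1.0 and black 0.0, so the field's average
    # linear brightness is exactly the fraction of white pixels.
    return srgb_to_linear(encoded_gray)

print(white_fraction(0.99))  # ~0.977 white, ~0.023 black for an encoded 0.99
# For a target given directly as 99% linear brightness, the fraction is
# simply 0.99: 1 black pixel out of 100, as above.
```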
And it doesn't matter whether you have patterns or not. What matters is that when you look at different dithering algorithms, it's incredibly clear that some (like truly random noise) make detail very difficult to see, while others (like error diffusion) make it much easier to see.
Just look at https://en.wikipedia.org/wiki/Dither#Algorithms and observe the difference between "Random", "Ordered (void-and-cluster)", and "Floyd–Steinberg". The level of resolvable detail obviously increases across those three. It's not subjective; it's literally the level of signal vs. noise. How do you quantify that as a metric?
E.g. one way would be to take every 4x4 block of pixels, measure how far the dithered image's brightness deviates from the original's within that block, and then minimize the sum of those deviations, or the sum of their squares, or something. But 4x4 is totally arbitrary, so I'm looking for something more elegant and generalizable. The point is, it shouldn't be dependent on human perception. Detail is detail. Signal is signal. So how do you prove which algorithm preserves the most detail, or prove that an algorithm preserves the maximum possible detail?
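As a rough sketch of that blockwise idea (grayscale numpy arrays in [0, 1]; the block size B and the squared-error choice are exactly the arbitrary knobs in question, and the function name is mine):

```python
import numpy as np

def block_error(original: np.ndarray, dithered: np.ndarray, B: int = 4) -> float:
    """Sum of squared differences between the per-block mean brightness of
    the original and of the dithered image, over non-overlapping BxB blocks."""
    h, w = original.shape
    h, w = h - h % B, w - w % B  # crop to a multiple of the block size
    total = 0.0
    for y in range(0, h, B):
        for x in range(0, w, B):
            diff = original[y:y+B, x:x+B].mean() - dithered[y:y+B, x:x+B].mean()
            total += diff * diff
    return total
```

Sweeping B would at least show how much the ranking of algorithms depends on that arbitrary scale, which is the part I'd like to get rid of.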
Let's say you have a photo of a starry night sky, and a photo of a slightly brighter sky with no visible stars. If you do "fully accurate on average" dithering, the dithered outputs would be identical. But in that context, the difference between "sky with dots" and "sky without dots" is more important than the difference between "dark sky" and "very slightly less dark sky", so I would say a dithering algorithm that discards the very slight error in shade in favor of better accuracy in texture is objectively better.
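A toy version of that, just to show why an averages-only score can't see the difference (made-up 8x8 patches, numpy):

```python
import numpy as np

# Made-up patches: a "starry" sky and a flat sky with the same mean brightness.
starry = np.zeros((8, 8))
starry[2, 3] = starry[5, 6] = 1.0       # two bright "stars"
flat = np.full((8, 8), starry.mean())   # same average brightness, no texture

dithered = np.zeros((8, 8))
dithered[2, 3] = dithered[5, 6] = 1.0   # an output that kept the stars

for name, target in [("starry sky", starry), ("flat sky", flat)]:
    err = (target.mean() - dithered.mean()) ** 2
    print(f"average-brightness error vs {name}: {err}")
# Both errors are 0.0: by average brightness alone, this output is a "perfect"
# rendering of either sky, which is exactly the shade-vs-texture blind spot.
```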
On that Wikipedia page, compare Floyd–Steinberg vs. Gradient-based. In my opinion, gradient-based better preserves detail in high-contrast areas (e.g. the eyelid), whereas FS better preserves detail in low-contrast areas (e.g. the jawline between the neck and the cheek).
You're talking about artistic tradeoffs. That's fine.
I'm asking how you measure it in the first place, so you can even define the tradeoffs quantitatively.
You say that, in your opinion, different algorithms preserve detail better in different areas. My question is: how do we define that numerically, so it's not a matter of opinion? If it depends on contrast levels, you can then test with images of different contrast levels.
It doesn't seem unreasonable that we should be able to define metrics for these things.