Let's say the original image is solid light gray, maybe 99% white. The palette you have for dithering is only pure white and pure black. What's the "best" possible output? Would it be pure white, or would it be 1 black pixel in every 10x10 square? Pure white is more accurate to the "shape" or "details" of the original, but sparse black pixels make a more accurate "color". Whatever your answer is, would it stay the same for 50% white? What about 99.99% black? Where do you draw the line?
Let's say that it's ~94% white. I think it's reasonable to have 1 black pixel in every 4x4 square on average -- that doesn't feel too sparse to me. But if it's literally just black pixels spaced on an even grid, that would look like a pattern to most people, which would still give the impression of detail that isn't there. So how do you space them out? More even spacing gives the impression of a pattern that isn't there, but more random spacing gives the impression of "clustering" and thus a texture that isn't there. There's literally no solution other than subjectively choosing a tradeoff between patterns and clumps.
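To make those two extremes concrete, here's a minimal Python/NumPy sketch of a flat image dithered to the ~1-in-16 black density above, once on a perfectly even grid and once with independent random thresholding (the names grid_dither and noise_dither are just illustrative):

```python
import numpy as np

# Flat light-gray image that should come out ~6% black (1 black pixel per 4x4).
h, w = 64, 64
black_fraction = 1 / 16

# Extreme 1: perfectly even spacing -- one black pixel at a fixed spot in every
# 4x4 block. Correct density, but it reads as a regular pattern.
grid_dither = np.ones((h, w), dtype=np.uint8)
grid_dither[::4, ::4] = 0

# Extreme 2: independent random thresholding -- correct density on average, but
# the black pixels clump and leave gaps, which reads as texture.
rng = np.random.default_rng(0)
noise_dither = (rng.random((h, w)) >= black_fraction).astype(np.uint8)

print("grid black fraction: ", 1 - grid_dither.mean())
print("noise black fraction:", 1 - noise_dither.mean())
```

Masks like the void-and-cluster one mentioned below are basically attempts to sit somewhere between these two extremes.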
If you're preserving accurate brightness, then yes, obviously 99% white needs 1 black pixel out of 100 on average, accounting for gamma. There's no line to draw. That's already part of the definition. (You can increase contrast for artistic effect, but that's a different conversation.)
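Concretely, "accounting for gamma" matters for the arithmetic: if the 99% is a stored sRGB code value rather than linear light, the black-pixel fraction has to be computed after linearizing, since the eye averages the emitted (linear) light of the black and white pixels. A rough sketch using the standard sRGB transfer function:

```python
# Sketch: stored sRGB gray level -> black-pixel fraction for a pure B/W dither,
# assuming the spatial averaging of black/white pixels happens in linear light.
def srgb_to_linear(c: float) -> float:
    # Standard sRGB EOTF (piecewise: linear toe, then power 2.4).
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

stored = 0.99                    # "99% white" as an sRGB code value
linear = srgb_to_linear(stored)  # ~0.977 in linear light
print(f"black-pixel fraction needed: {1.0 - linear:.4f}")  # ~0.023, not 0.01
```

If the 99% is already a linear-light measurement, then 1 in 100 is exact by definition.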
And it doesn't matter if you have patterns or not. What matters is that when you look at different dithering algorithms, it's incredibly clear that some (like truly random noise) make detail very difficult to see, while others (like error diffusion) make it much easier to see.
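By "error diffusion" I mean the Floyd–Steinberg family that comes up below; for reference, a minimal (slow, unoptimized) sketch of the standard kernel, quantizing to a pure black/white palette:

```python
import numpy as np

def floyd_steinberg(gray: np.ndarray) -> np.ndarray:
    """Floyd-Steinberg error diffusion to a 1-bit palette.
    `gray` is a 2D float array in [0, 1]; returns an array of 0.0s and 1.0s."""
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            out[y, x] = new
            err = old - new
            # Push the quantization error onto not-yet-visited neighbours
            # with the classic 7/16, 3/16, 5/16, 1/16 weights.
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out
```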
Just look at: https://en.wikipedia.org/wiki/Dither#Algorithms and observe the difference between "Random", "Ordered (void-and-cluster)", and "Floyd-Steinberg". The level of resolvable detail is obviously increasing. It's not subjective; it's literally the level of signal vs noise. How do you quantify that as a metric?
E.g. one way would be to take every 4x4 block of pixels, calculate how far the average brightness of the dithered block deviates from the same block in the original, and minimize the sum of those deviations, or the sum of their squares, or something. But 4x4 is totally arbitrary, so I'm looking for something more elegant and generalizable. The point is, it shouldn't be dependent on human perception. Detail is detail. Signal is signal. So how do you prove which algorithm preserves the most detail, or prove that an algorithm preserves maximum possible detail?
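For whatever it's worth, that blockwise version is easy to state in code; a sketch with the block size left as a parameter, since the 4x4 really is arbitrary (blockwise_brightness_error is just an illustrative name):

```python
import numpy as np

def blockwise_brightness_error(original: np.ndarray,
                               dithered: np.ndarray,
                               block: int = 4) -> float:
    """Sum of squared differences between the per-block mean brightness of the
    original and of the dithered image. Both inputs are 2D arrays in [0, 1]."""
    h, w = original.shape
    h_crop, w_crop = h - h % block, w - w % block  # drop ragged edges

    def block_means(img: np.ndarray) -> np.ndarray:
        img = img[:h_crop, :w_crop].astype(np.float64)
        return img.reshape(h_crop // block, block,
                           w_crop // block, block).mean(axis=(1, 3))

    diff = block_means(original) - block_means(dithered)
    return float(np.sum(diff ** 2))
```

Small blocks punish lost fine structure while large blocks only punish drift in average brightness, so sweeping the block size is one crude way to see where an algorithm makes that trade.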
Let's say you have a photo of a starry night sky, and a photo of a slightly brighter sky with no visible stars. If you do "fully accurate on average" dithering, the two dithered outputs would be essentially identical. But in that context, the difference between "sky with dots" and "sky without dots" is more important than the difference between "dark sky" and "very slightly less dark sky". In that context, I would say a dithering algorithm that discards the very slight error in shade in favor of better accuracy in texture is objectively better.
On that Wikipedia page, compare Floyd–Steinberg vs Gradient-based. In my opinion, Gradient-based better preserves detail in high-contrast areas (e.g. the eyelid), whereas FS better preserves detail in low-contrast areas (e.g. the jawline between the neck and the cheek).
You're talking about artistic tradeoffs. That's fine.
I'm asking, how do you quantitatively measure in the first place so you can even define the tradeoffs quantitatively?
You say that, in your opinion, different algorithms preserve detail better in different areas. My question is: how do we define that numerically so it's not a matter of opinion? If it depends on contrast levels, you can then test with images of different contrast levels.
It doesn't seem unreasonable that we should be able to define metrics for these things.