The float comparison slider is great.

One thing from practical experience: the quality gap between model sizes shows up in a way benchmarks don't capture. I have a system where a smaller model generates plans and a larger model can override them. On any single output the two look comparable. The difference shows up 3-4 steps later: the small model makes a decision that sounds reasonable but compounds into a bad plan. Perplexity won't catch that, and KL divergence won't either, because both measure one prediction at a time.
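
A toy sketch of why a per-prediction metric can miss this, assuming (unrealistically) that each step of a plan is sound with a fixed probability, independent of the others. The accuracies and the `plan_success_rate` helper are made up for illustration, not measurements from any real system.

```python
def plan_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step of a plan is sound, assuming independent steps."""
    return per_step_accuracy ** steps


# Hypothetical per-step accuracies: close enough to look comparable on one output.
LARGE, SMALL = 0.98, 0.95

for steps in (1, 4, 8, 16):
    large = plan_success_rate(LARGE, steps)
    small = plan_success_rate(SMALL, steps)
    print(f"{steps:>2} steps: large={large:.3f} small={small:.3f} gap={large - small:.3f}")
```

Under these assumptions the gap at one step is small, and it widens as the plan gets longer. A metric averaged over single predictions only ever sees the one-step row.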