Hacker News

I like that you go beyond prompt engineering and "LLM as a judge" and use fine-tuned (?) ModernBERT and Llama models.

In your previous post, you mentioned that you "score 20+ dimensions". Are these dimensions generic across all use cases and users, or do you fine-tune individually for each user?