Exploring State-of-the-Art LLMs as Judges
Benchmarks for Glider, FlowJudge, Phi-3.5-mini, Selene, GPT-4o, and Claude 3.5 Sonnet as judges across general rubrics and red-team safety tasks. Where small fine-tuned models hold up, where they don't, and what the latency and memory tradeoffs look like.
Galtea Team · May 2, 2025 · 12 minutes