May 2, 2025

Exploring state-of-the-art LLMs as Judges


The idea of using large language models (LLMs) as judges to evaluate other models is becoming more popular. As LLMs improve, they offer a faster and more scalable way to judge model responses than human evaluation, which takes time and careful oversight.

In this research, we look at several state-of-the-art judge models and, building on the work of Deshpande et al. (2024) [1], test how well they perform across different datasets. We focus on standard evaluations, understood as pairs of user inputs and model outputs, where the judge model is expected to infer a correct score for each model output based on a particular rubric and pass criteria.
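
To make this setup concrete, the sketch below shows one way such an evaluation instance can be assembled into a judge prompt. The template wording, field names, and 1-5 scale are illustrative assumptions, not the exact prompts used by the models in this study.

```python
# Minimal sketch of a rubric-based judge prompt. The template wording and
# the 1-5 scale are illustrative assumptions, not the exact prompts used here.
JUDGE_TEMPLATE = """You are an impartial evaluator.

User input:
{user_input}

Model output:
{model_output}

Rubric:
{rubric}

Pass criteria:
{pass_criteria}

Reply with a single integer score from 1 (worst) to 5 (best)."""


def build_judge_prompt(user_input: str, model_output: str,
                       rubric: str, pass_criteria: str) -> str:
    """Assemble one (input, output) evaluation pair into a judge prompt."""
    return JUDGE_TEMPLATE.format(
        user_input=user_input,
        model_output=model_output,
        rubric=rubric,
        pass_criteria=pass_criteria,
    )
```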

Lastly, we turn our attention to red teaming applications. We begin by wrangling the data into the input-output format required for the evaluations. From there, we explore how different models respond to challenging or risky prompts, using safeness and harmlessness as the key evaluation metrics.

To measure the alignment between the model-assigned scores and the reference labels, we use two main metrics:

  • Pearson Correlation Coefficient: Measures the strength and direction of the linear relationship between two sets of values.
  • Macro F1 Score: Calculates the F1 score independently for each class and then averages them, combining precision and recall into a single value.

These metrics provide a view of both the scoring consistency and classification accuracy of the models across different evaluation sets.
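
Both metrics can be computed off the shelf. The following minimal sketch uses scipy and scikit-learn, with toy score arrays standing in for real judge outputs and reference labels.

```python
# Alignment between judge-assigned scores and human reference labels.
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

reference = [4, 2, 5, 3, 1, 4]  # human-annotated scores (toy values)
predicted = [4, 3, 5, 2, 1, 4]  # judge-model scores (toy values)

# Strength and direction of the linear relationship between the two series.
pearson_r, _ = pearsonr(reference, predicted)

# F1 computed independently per class (each score value is a class),
# then averaged, combining precision and recall into a single value.
macro_f1 = f1_score(reference, predicted, average="macro")

print(f"Pearson r: {pearson_r:.3f}")
print(f"Macro F1:  {macro_f1:.3f}")
```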

Apart from the evaluation metrics considered in the research by Deshpande et al. (2024) [1], we also explored additional performance indicators related to token usage and processing time, which provide a more detailed view of model efficiency and response quality across different datasets. In addition, we measured the size of each model and the memory required for inference, offering further insight into the computational efficiency and resource demands of the evaluated systems.
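
As a rough illustration of how these efficiency indicators can be collected, the sketch below times a single judge call and records token counts. Here `judge_model.generate` and `tokenizer.encode` are hypothetical placeholders for whatever inference stack is in use, not a specific API, and the aggregated figures reported in the tables below may normalize these quantities differently.

```python
# Per-sample efficiency bookkeeping around a hypothetical judge call.
import time

def timed_judgement(judge_model, tokenizer, prompt: str) -> dict:
    input_tokens = len(tokenizer.encode(prompt))   # hypothetical tokenizer API
    start = time.perf_counter()
    output = judge_model.generate(prompt)          # hypothetical inference call
    elapsed = time.perf_counter() - start
    output_tokens = len(tokenizer.encode(output))
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": elapsed,                          # seconds per sample
        "throughput_tok_s": output_tokens / elapsed,   # output tokens per second
    }
```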


Models


  • Microsoft Phi-3.5-mini-instruct model (Microsoft, June 2024) [2]
  • Glider, a fine-tuned version of Microsoft Phi-3.5 Mini (PatronusAI, December 2024) [1]
  • FlowJudge v0.1, whose architecture comes from the Phi-3.5-mini model (FlowAI, August 2024) [3]
  • GPT-4o (OpenAI, May 2024) [4]
  • Claude 3.5-sonnet (Anthropic, October 2024) [5]
  • Selene-1-Mini-Llama-3.1-8B (AtlaAI, 2025) [6]


General Rubrics


We evaluate the performance of four “small” language models: Glider [1], FlowJudge [3], Phi-3.5-mini [2] and Selene-1-Mini-Llama-3.1-8B [6], across various datasets. Our focus is on the numerical score assigned to each model’s response, which reflects how well the answer aligns with a given rubric and meets the pass criteria. These scores are compared against human-annotated reference scores provided in the datasets.

As shown in the heatmap, Glider and Selene consistently outperform both FlowJudge and Phi-3.5-mini across most of the evaluation datasets. While this performance gap is noteworthy, Glider and Selene require over 15 GiB of cache, more than double the approximately 7.2 GiB used by FlowJudge and Phi-3.5-mini, highlighting a trade-off between performance and resource consumption. The same holds for total memory during inference: Glider and Selene each require 16.1 GiB, nearly double the 8.3 GiB and 8.2 GiB consumed by Phi-3.5-mini and FlowJudge respectively, underscoring their heavier resource footprint during inference operations.

These results suggest that Glider and Selene offer a better balance of performance and speed, at the cost of memory efficiency. Selene also significantly outperforms all other models on latency, making it a strong candidate for process automation. Finally, we highlight Selene’s ability to evaluate datasets in multiple languages, achieving a Pearson correlation coefficient far superior to that of the other three models on multilingual data, as shown in Figure 1.

Figure 1: Models' performance across different datasets

| Model | Dataset | Mean Input Tokens | Mean Output Tokens | Total Time (s) | Input Tokens/Time | Throughput | Latency (s) | Invalid Outputs |
|---|---|---|---|---|---|---|---|---|
| Phi-3.5-mini-instruct | biggenbench | 1425 | 349 | 32992 | 0.043 | 0.011 | 78.55 | 28 |
| Phi-3.5-mini-instruct | summeval-coherence | 1305 | 272 | 46715 | 0.028 | 0.006 | 29.11 | 13 |
| Phi-3.5-mini-instruct | summeval-fluency | 1289 | 344 | 47260 | 0.027 | 0.007 | 29.41 | 154 |
| Phi-3.5-mini-instruct | summeval-consistency | 1321 | 288 | 47561 | 0.028 | 0.006 | 29.70 | 12 |
| Phi-3.5-mini-instruct | summeval-relevance | 1300 | 398 | 47545 | 0.027 | 0.008 | 29.64 | 2 |
| Phi-3.5-mini-instruct | feedback-bench | 1008 | 398 | 23389 | 0.043 | 0.017 | 25.99 | 6 |
| Flow-Judge-v0.1 | biggenbench | 1425 | 312 | 30942 | 0.046 | 0.010 | 73.67 | 17 |
| Flow-Judge-v0.1 | summeval-coherence | 1305 | 270 | 37120 | 0.035 | 0.007 | 23.17 | 2 |
| Flow-Judge-v0.1 | summeval-fluency | 1289 | 227 | 37990 | 0.034 | 0.006 | 23.69 | 15 |
| Flow-Judge-v0.1 | summeval-consistency | 1321 | 315 | 36867 | 0.036 | 0.009 | 23.04 | 1 |
| Flow-Judge-v0.1 | summeval-relevance | 1300 | 321 | 37354 | 0.035 | 0.009 | 23.34 | 1 |
| Flow-Judge-v0.1 | feedback-bench | 1010 | 264 | 20923 | 0.048 | 0.013 | 20.51 | 5 |
| Glider | biggenbench | 1261 | 262 | 30645 | 0.041 | 0.009 | 72.96 | 29 |
| Glider | summeval-coherence | 1141 | 219 | 28084 | 0.041 | 0.008 | 17.35 | 0 |
| Glider | summeval-fluency | 1125 | 210 | 22776 | 0.049 | 0.009 | 14.08 | 1 |
| Glider | summeval-consistency | 1157 | 218 | 22118 | 0.052 | 0.010 | 13.60 | 1 |
| Glider | summeval-relevance | 1136 | 224 | 34369 | 0.033 | 0.007 | 21.24 | 0 |
| Glider | feedback-bench | 848 | 223 | 21770 | 0.039 | 0.010 | 20.45 | 27 |
| Selene-1-Mini-Llama-3.1-8B | biggenbench | 886 | 171 | 3828 | 0.23 | 0.04 | 9.05 | 6 |
| Selene-1-Mini-Llama-3.1-8B | feedback-bench | 630 | 138 | 5388 | 0.12 | 0.03 | 5.39 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-coherence | 862 | 129 | 8888 | 0.10 | 0.01 | 5.55 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-consistency | 882 | 125 | 8714 | 0.10 | 0.01 | 5.45 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-fluency | 851 | 116 | 7877 | 0.11 | 0.01 | 4.90 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-relevance | 887 | 129 | 9075 | 0.10 | 0.01 | 5.67 | 0 |


Red Teaming


In the context of red teaming, we evaluate the performance of the judge models using two key safety-related metrics: safeness and harmlessness. In addition to Glider, FlowJudge, Phi-3.5-mini, and Selene-1-Mini-Llama-3.1-8B, we also include GPT-4o and Claude 3.5 Sonnet in this study.

The evaluation datasets consist of user prompts paired with model-generated responses. Each response is assessed against a binary rubric that labels it as either safe or unsafe, or harmless or harmful, depending on the dataset; a sketch of how such binary verdicts can be post-processed follows the dataset list below.

The two datasets used for this analysis are:

  • PKU Alignment Dataset (700 instances analyzed)
    This dataset includes a mix of safe and unsafe outputs, designed to evaluate a model’s ability to identify and flag risky or inappropriate behavior in nuanced cases.
  • Harmful Dataset (700 instances analyzed)
    This dataset contains only harmful outputs, curated to test whether a model can correctly detect clearly problematic or dangerous content, making it a more straightforward but important benchmark for red-teaming evaluations.
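
Since the rubric is binary, each judge's free-text verdict must be mapped onto one of the two labels, with anything unrecognizable counted as an invalid output (reported in the last column of Figure 2). The sketch below shows one plausible mapping; the label handling is an assumption about the post-processing, not the exact pipeline used in this research.

```python
# Map a judge's free-text verdict onto a binary rubric label.
# Unrecognized outputs return None and are tallied as invalid.
from typing import Optional

def parse_binary_verdict(raw_output: str,
                         positive: str = "safe",
                         negative: str = "unsafe") -> Optional[str]:
    """Return the recognized label in the judge output, else None."""
    text = raw_output.strip().lower()
    # Check the positive label first, but guard against substring clashes
    # ("safe" is contained in "unsafe").
    if positive in text and negative not in text:
        return positive
    if negative in text:
        return negative
    return None  # counted as an invalid output

# Usage: parse_binary_verdict("The response is unsafe.") -> "unsafe"
# For the Harmful dataset: parse_binary_verdict(out, "harmless", "harmful")
```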

Within the group of small language models, Glider, followed by Selene-1-Mini-Llama-3.1-8B, performs slightly better than FlowJudge and Phi-3.5-mini. However, GPT-4o and Claude 3.5 Sonnet far exceed all four smaller models on these metrics.

According to the metrics, larger and more advanced models match human judgment more closely: they are more accurate and better at identifying safe or unsafe responses. This shows that a sizable performance gap remains between smaller models and the most powerful ones, and it highlights the trade-off between resource consumption and evaluation quality when choosing models for real-world use.

While GPT-4o and Claude 3.5 Sonnet perform best, Glider and Selene-1-Mini-Llama-3.1-8B remain strong options. On the one hand, Glider closely follows the top models and consistently demonstrates its effectiveness across a wide range of tasks and datasets. On the other hand, Selene shows more consistent metrics, with smaller gaps between datasets; on the PKU Alignment dataset in particular, it retains its ability to distinguish safe from unsafe responses, whereas most models, even the advanced GPT-4o and Claude 3.5 Sonnet, show a drop in performance.

Figure 2: Models' performance across different datasets

| Model | Dataset | Mean Input Tokens | Mean Output Tokens | Total Time (s) | Input Tokens/Time | Throughput | Latency (s) | Invalid Outputs |
|---|---|---|---|---|---|---|---|---|
| claude-3-5-sonnet | harmful_merge | 550 | 126 | 3675 | 0.15 | 0.03 | 4.64 | 0 |
| claude-3-5-sonnet | pku_alignment | 579 | 143 | 3973 | 0.15 | 0.04 | 5.06 | 0 |
| Flow-Judge-v0.1 | harmful_merge | 637 | 142 | 3728 | 0.17 | 0.04 | 5.32 | 0 |
| Flow-Judge-v0.1 | pku_alignment | 664 | 128 | 3828 | 0.17 | 0.03 | 5.47 | 0 |
| Glider | harmful_merge | 417 | 101 | 2002 | 0.21 | 0.05 | 2.86 | 0 |
| Glider | pku_alignment | 444 | 112 | 2103 | 0.21 | 0.05 | 3.00 | 0 |
| Phi-3.5-mini | harmful_merge | 637 | 123 | 2801 | 0.23 | 0.04 | 4.00 | 83 |
| Phi-3.5-mini | pku_alignment | 664 | 142 | 3815 | 0.17 | 0.04 | 5.44 | 68 |
| gpt-4o | harmful_merge | 422 | 132 | 2218 | 0.19 | 0.06 | 3.13 | 0 |
| gpt-4o | pku_alignment | 447 | 148 | 2571 | 0.17 | 0.06 | 3.64 | 0 |
| Selene-1-Mini-Llama-3.1-8B | harmful_merge | 310 | 101 | 1807 | 0.17 | 0.06 | 2.57 | 0 |
| Selene-1-Mini-Llama-3.1-8B | pku_alignment | 336 | 97 | 1822 | 0.18 | 0.05 | 2.60 | 0 |


Recap and Future Steps


In this study, we evaluated the performance of several large language models (LLMs) acting as automated judges across both general evaluation tasks and red teaming scenarios. Our results demonstrate that Glider (a fine-tuned version of Phi-3.5 Mini) and Selene-1-Mini-Llama-3.1-8B (a fine-tuned version of Llama-3.1-8B) consistently outperform the other compact models in terms of accuracy, making them strong candidates for cost-sensitive deployments. Nevertheless, they are more demanding in both storage and memory usage during inference, requiring 16.1 GiB of memory compared to 8.3 GiB for Phi-3.5-mini and 8.2 GiB for FlowJudge, and a notable performance gap remains when they are benchmarked against frontier models like GPT-4o and Claude 3.5 Sonnet.

Beyond model comparisons, our findings reinforce the promise of LLM-as-a-judge systems as scalable, cost-efficient tools for model evaluation. At the same time, they underline the need for ongoing validation against high-quality, labeled data to ensure these systems remain robust and trustworthy.

Looking ahead, we see several promising directions for future work:

  • Synthetic Dataset Generation

    Creating reliable, domain-specific synthetic datasets is critical to effectively train and evaluate LLMs, particularly in specialized or underrepresented areas.

  • Enhanced Fine-Tuning Techniques

    Exploring fine-tuning strategies using safety-focused or rubric-aligned datasets may help close the performance gap between smaller models and top-tier alternatives.

  • Prompt Sensitivity Analysis

    Investigating how different prompt phrasings affect judge model outputs will help us better understand their reliability and adaptability.

  • Multilingual Generalization

    Extending evaluation to multiple languages will be key to ensuring the global relevance and fairness of judge models in diverse linguistic contexts.


Conclusions


The primary objective of this research was to deliver a comprehensive assessment of state-of-the-art LLMs, not just in terms of raw accuracy, but across a broader spectrum of practical considerations like latency, token usage, and computational efficiency. Gaining an in-depth understanding of these LLMs provides valuable insights into their respective strengths and weaknesses, which enables us to make smarter choices in developing and integrating the latest LLM technology into our products.

If you want to see our technology in action, book a demo with us: Galtea Demo


References


  1. Deshpande, D., Ravi, S. S., CH-Wang, S., Mielczarek, B., Kannappan, A., and Qian, R. (2024). GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking. arXiv preprint arXiv:2412.14140. Retrieved from https://arxiv.org/abs/2412.14140

  2. Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219. Retrieved from https://arxiv.org/abs/2404.14219

  3. FlowAI. (2024). Flow Judge: An open small language model for LLM system evaluations. https://www.flow-ai.com/blog/flow-judge. Accessed: Mar 9, 2025.

  4. OpenAI. (2024). ChatGPT (GPT-4o model) [Large language model]. https://platform.openai.com/docs/models/gpt-4o

  5. Anthropic. (2024). Claude 3.5 Sonnet [Large language model]. https://www.anthropic.com/news/claude-3-5-sonnet

  6. Alexandru, A., Calvi, A., Broomfield, H., Golden, J., Dai, K., Leys, M., Burger, M., Bartolo, M., Engeler, R., Pisupati, S., et al. (2025). Atla Selene Mini: A General Purpose Evaluation Model. arXiv preprint arXiv:2501.17195. Retrieved from https://arxiv.org/abs/2501.17195