May 2, 2025

Exploring state-of-the-art LLMs as Judges


The idea of using large language models (LLMs) as judges to evaluate other models is becoming more popular. As LLMs improve, they offer a faster and more scalable way to judge model responses than human evaluation, which takes time and careful oversight.

In this research, we look at several state-of-the-art judge models and, building on the work of Deshpande et al. (2024) [1], test how well they perform across different datasets. We focus on standard evaluations, understood as pairs of user inputs and model outputs, where the judge model is expected to infer a correct score for each model output based on a particular rubric and pass criteria.
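
To make this setup concrete, the sketch below shows one way such an evaluation instance can be assembled into a judge prompt. The template wording, field names, and 1-5 scale are illustrative assumptions, not the exact prompts used by the models in this study.

```python
# Minimal sketch of a rubric-based judge prompt. The template wording and
# the 1-5 scale are illustrative assumptions, not the exact prompts used here.
JUDGE_TEMPLATE = """You are an impartial evaluator.

User input:
{user_input}

Model output:
{model_output}

Rubric:
{rubric}

Pass criteria:
{pass_criteria}

Reply with a single integer score from 1 (worst) to 5 (best)."""


def build_judge_prompt(user_input: str, model_output: str,
                       rubric: str, pass_criteria: str) -> str:
    """Assemble one (input, output) evaluation pair into a judge prompt."""
    return JUDGE_TEMPLATE.format(
        user_input=user_input,
        model_output=model_output,
        rubric=rubric,
        pass_criteria=pass_criteria,
    )
```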

Lastly, we turn our attention to red teaming applications. We begin by wrangling the data into the input-output format required for the evaluations. From there, we explore how different models respond to challenging or risky prompts, using safeness and harmlessness as the key evaluation metrics.

To measure the alignment between the model-assigned scores and the reference labels, we use two main metrics:

  • Pearson Correlation Coefficient: Measures the strength and direction of the linear relationship between two sets of values.
  • Macro F1 Score: Calculates the F1 score independently for each class and then averages them, combining precision and recall into a single value.

These metrics provide a view of both the scoring consistency and classification accuracy of the models across different evaluation sets.
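
Both metrics can be computed off the shelf. The following minimal sketch uses scipy and scikit-learn, with toy score arrays standing in for real judge outputs and reference labels.

```python
# Alignment between judge-assigned scores and human reference labels.
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

reference = [4, 2, 5, 3, 1, 4]  # human-annotated scores (toy values)
predicted = [4, 3, 5, 2, 1, 4]  # judge-model scores (toy values)

# Strength and direction of the linear relationship between the two series.
pearson_r, _ = pearsonr(reference, predicted)

# F1 computed independently per class (each score value is a class),
# then averaged, combining precision and recall into a single value.
macro_f1 = f1_score(reference, predicted, average="macro")

print(f"Pearson r: {pearson_r:.3f}")
print(f"Macro F1:  {macro_f1:.3f}")
```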

Apart from the evaluation metrics considered in the research by Deshpande et al. (2024) [1], we also explored additional performance indicators related to token usage and processing time, which provide a more detailed view of model efficiency and response quality across different datasets. In addition, we measured the size of each model and the memory required for inference, offering further insight into the computational efficiency and resource demands of the evaluated systems.
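
As a rough illustration of how these efficiency indicators can be collected, the sketch below times a single judge call and records token counts. Here `judge_model.generate` and `tokenizer.encode` are hypothetical placeholders for whatever inference stack is in use, not a specific API, and the aggregated figures reported in the tables below may normalize these quantities differently.

```python
# Per-sample efficiency bookkeeping around a hypothetical judge call.
import time

def timed_judgement(judge_model, tokenizer, prompt: str) -> dict:
    input_tokens = len(tokenizer.encode(prompt))   # hypothetical tokenizer API
    start = time.perf_counter()
    output = judge_model.generate(prompt)          # hypothetical inference call
    elapsed = time.perf_counter() - start
    output_tokens = len(tokenizer.encode(output))
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": elapsed,                          # seconds per sample
        "throughput_tok_s": output_tokens / elapsed,   # output tokens per second
    }
```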


Models


  • Microsoft Phi-3.5-mini-instruct model (Microsoft, June 2024) [2]
  • Glider, a fine-tuned version of Microsoft Phi-3.5 Mini (PatronusAI, December 2024) [1]
  • FlowJudge v0.1, whose architecture comes from the Phi-3.5-mini model (FlowAI, August 2024) [3]
  • GPT-4o (OpenAI, May 2024) [4]
  • Claude 3.5-sonnet (Anthropic, October 2024) [5]
  • Selene-1-Mini-Llama-3.1-8B (AtlaAI, 2025) [6]


General Rubrics


We evaluate the performance of four “small” language models: Glider [1], FlowJudge [3], Phi-3.5-mini [2] and Selene-1-Mini-Llama-3.1-8B [6], across various datasets. Our focus is on the numerical score assigned to each model’s response, which reflects how well the answer aligns with a given rubric and meets the pass criteria. These scores are compared against human-annotated reference scores provided in the datasets.

As shown in the heatmap, Glider and Selene consistently outperform both FlowJudge and Phi-3.5-mini across most of the evaluation datasets. While this performance gap is noteworthy, Glider and Selene require over 15 GiB of cache, more than double the approximately 7.2 GiB used by FlowJudge and Phi-3.5-mini, highlighting a trade-off between performance and resource consumption. The same holds for total memory during inference: Glider and Selene each require 16.1 GiB, nearly double the 8.3 GiB and 8.2 GiB consumed by Phi-3.5-mini and FlowJudge respectively, underscoring their heavier resource footprint during inference operations.

These results suggest that Glider and Selene offer a better balance of performance and speed, at the cost of memory efficiency. Selene also significantly outperforms all other models on latency, making it a strong candidate for process automation. Finally, we highlight Selene’s ability to evaluate datasets in multiple languages, achieving a Pearson correlation coefficient far superior to that of the other three models on multilingual data, as shown in Figure 1.

Figure 1: Models' performance across different datasets

| Model | Dataset | Mean Input Tokens | Mean Output Tokens | Total Time (s) | Input Tokens/Time | Throughput | Latency (s) | Invalid Outputs |
|---|---|---|---|---|---|---|---|---|
| Phi-3.5-mini-instruct | biggenbench | 1425 | 349 | 32992 | 0.043 | 0.011 | 78.55 | 28 |
| Phi-3.5-mini-instruct | summeval-coherence | 1305 | 272 | 46715 | 0.028 | 0.006 | 29.11 | 13 |
| Phi-3.5-mini-instruct | summeval-fluency | 1289 | 344 | 47260 | 0.027 | 0.007 | 29.41 | 154 |
| Phi-3.5-mini-instruct | summeval-consistency | 1321 | 288 | 47561 | 0.028 | 0.006 | 29.70 | 12 |
| Phi-3.5-mini-instruct | summeval-relevance | 1300 | 398 | 47545 | 0.027 | 0.008 | 29.64 | 2 |
| Phi-3.5-mini-instruct | feedback-bench | 1008 | 398 | 23389 | 0.043 | 0.017 | 25.99 | 6 |
| Flow-Judge-v0.1 | biggenbench | 1425 | 312 | 30942 | 0.046 | 0.010 | 73.67 | 17 |
| Flow-Judge-v0.1 | summeval-coherence | 1305 | 270 | 37120 | 0.035 | 0.007 | 23.17 | 2 |
| Flow-Judge-v0.1 | summeval-fluency | 1289 | 227 | 37990 | 0.034 | 0.006 | 23.69 | 15 |
| Flow-Judge-v0.1 | summeval-consistency | 1321 | 315 | 36867 | 0.036 | 0.009 | 23.04 | 1 |
| Flow-Judge-v0.1 | summeval-relevance | 1300 | 321 | 37354 | 0.035 | 0.009 | 23.34 | 1 |
| Flow-Judge-v0.1 | feedback-bench | 1010 | 264 | 20923 | 0.048 | 0.013 | 20.51 | 5 |
| Glider | biggenbench | 1261 | 262 | 30645 | 0.041 | 0.009 | 72.96 | 29 |
| Glider | summeval-coherence | 1141 | 219 | 28084 | 0.041 | 0.008 | 17.35 | 0 |
| Glider | summeval-fluency | 1125 | 210 | 22776 | 0.049 | 0.009 | 14.08 | 1 |
| Glider | summeval-consistency | 1157 | 218 | 22118 | 0.052 | 0.010 | 13.60 | 1 |
| Glider | summeval-relevance | 1136 | 224 | 34369 | 0.033 | 0.007 | 21.24 | 0 |
| Glider | feedback-bench | 848 | 223 | 21770 | 0.039 | 0.010 | 20.45 | 27 |
| Selene-1-Mini-Llama-3.1-8B | biggenbench | 886 | 171 | 3828 | 0.23 | 0.04 | 9.05 | 6 |
| Selene-1-Mini-Llama-3.1-8B | feedback-bench | 630 | 138 | 5388 | 0.12 | 0.03 | 5.39 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-coherence | 862 | 129 | 8888 | 0.10 | 0.01 | 5.55 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-consistency | 882 | 125 | 8714 | 0.10 | 0.01 | 5.45 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-fluency | 851 | 116 | 7877 | 0.11 | 0.01 | 4.90 | 0 |
| Selene-1-Mini-Llama-3.1-8B | summeval-relevance | 887 | 129 | 9075 | 0.10 | 0.01 | 5.67 | 0 |


Red Teaming


In the context of red teaming, we evaluate the performance of the judge models using two key safety-related metrics: safeness and harmlessness. In addition to Glider, FlowJudge, Phi-3.5-mini, and Selene-1-Mini-Llama-3.1-8B, we also include GPT-4o and Claude 3.5 Sonnet in this study.

The evaluation datasets consist of user prompts paired with model-generated responses. Each response is assessed against a binary rubric that labels it as either safe or unsafe, or harmless or harmful, depending on the dataset; a sketch of how such binary verdicts can be post-processed follows the dataset list below.

The two datasets used for this analysis are:

  • PKU Alignment Dataset (700 instances analyzed)
    This dataset includes a mix of safe and unsafe outputs, designed to evaluate a model’s ability to identify and flag risky or inappropriate behavior in nuanced cases.
  • Harmful Dataset (700 instances analyzed)
    This dataset contains only harmful outputs, curated to test whether a model can correctly detect clearly problematic or dangerous content, making it a more straightforward but important benchmark for red-teaming evaluations.
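
Since the rubric is binary, each judge's free-text verdict must be mapped onto one of the two labels, with anything unrecognizable counted as an invalid output (reported in the last column of Figure 2). The sketch below shows one plausible mapping; the label handling is an assumption about the post-processing, not the exact pipeline used in this research.

```python
# Map a judge's free-text verdict onto a binary rubric label.
# Unrecognized outputs return None and are tallied as invalid.
from typing import Optional

def parse_binary_verdict(raw_output: str,
                         positive: str = "safe",
                         negative: str = "unsafe") -> Optional[str]:
    """Return the recognized label in the judge output, else None."""
    text = raw_output.strip().lower()
    # Check the positive label first, but guard against substring clashes
    # ("safe" is contained in "unsafe").
    if positive in text and negative not in text:
        return positive
    if negative in text:
        return negative
    return None  # counted as an invalid output

# Usage: parse_binary_verdict("The response is unsafe.") -> "unsafe"
# For the Harmful dataset: parse_binary_verdict(out, "harmless", "harmful")
```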

Within the group of small language models, Glider, followed by Selene-1-Mini-Llama-3.1-8B, performs slightly better than FlowJudge and Phi-3.5-mini. However, GPT-4o and Claude 3.5 Sonnet far exceed all four smaller models on these metrics.

According to the metrics, larger and more advanced models match human judgment more closely: they are more accurate and better at identifying safe or unsafe responses. This shows that a sizable performance gap remains between smaller models and the most powerful ones, and it highlights the trade-off between resource consumption and evaluation quality when choosing models for real-world use.

While GPT-4o and Claude 3.5 Sonnet perform best, Glider and Selene-1-Mini-Llama-3.1-8B remain strong options. On the one hand, Glider closely follows the top models and consistently demonstrates its effectiveness across a wide range of tasks and datasets. On the other hand, Selene shows more consistent metrics, with smaller gaps between datasets; on the PKU Alignment dataset in particular, it retains its ability to distinguish safe from unsafe responses, whereas most models, even the advanced GPT-4o and Claude 3.5 Sonnet, show a drop in performance.

Figure 2: Models' performance across different datasets

| Model | Dataset | Mean Input Tokens | Mean Output Tokens | Total Time (s) | Input Tokens/Time | Throughput | Latency (s) | Invalid Outputs |
|---|---|---|---|---|---|---|---|---|
| claude-3-5-sonnet | harmful_merge | 550 | 126 | 3675 | 0.15 | 0.03 | 4.64 | 0 |
| claude-3-5-sonnet | pku_alignment | 579 | 143 | 3973 | 0.15 | 0.04 | 5.06 | 0 |
| Flow-Judge-v0.1 | harmful_merge | 637 | 142 | 3728 | 0.17 | 0.04 | 5.32 | 0 |
| Flow-Judge-v0.1 | pku_alignment | 664 | 128 | 3828 | 0.17 | 0.03 | 5.47 | 0 |
| Glider | harmful_merge | 417 | 101 | 2002 | 0.21 | 0.05 | 2.86 | 0 |
| Glider | pku_alignment | 444 | 112 | 2103 | 0.21 | 0.05 | 3.00 | 0 |
| Phi-3.5-mini | harmful_merge | 637 | 123 | 2801 | 0.23 | 0.04 | 4.00 | 83 |
| Phi-3.5-mini | pku_alignment | 664 | 142 | 3815 | 0.17 | 0.04 | 5.44 | 68 |
| gpt-4o | harmful_merge | 422 | 132 | 2218 | 0.19 | 0.06 | 3.13 | 0 |
| gpt-4o | pku_alignment | 447 | 148 | 2571 | 0.17 | 0.06 | 3.64 | 0 |
| Selene-1-Mini-Llama-3.1-8B | harmful_merge | 310 | 101 | 1807 | 0.17 | 0.06 | 2.57 | 0 |
| Selene-1-Mini-Llama-3.1-8B | pku_alignment | 336 | 97 | 1822 | 0.18 | 0.05 | 2.60 | 0 |


Recap and Future Steps


In this study, we evaluated the performance of several large language models (LLMs) acting as automated judges across both general evaluation tasks and red teaming scenarios. Our results demonstrate that Glider (a fine-tuned version of Phi-3.5 Mini) and Selene-1-Mini-Llama-3.1-8B (a fine-tuned version of Llama-3.1-8B) consistently outperform the other compact models in terms of accuracy, making them strong candidates for cost-sensitive deployments. Nevertheless, they are more demanding in both storage and memory usage during inference, requiring 16.1 GiB of memory compared to 8.3 GiB for Phi-3.5-mini and 8.2 GiB for FlowJudge, and a notable performance gap remains when they are benchmarked against frontier models like GPT-4o and Claude 3.5 Sonnet.

Beyond model comparisons, our findings reinforce the promise of LLM-as-a-judge systems as scalable, cost-efficient tools for model evaluation. At the same time, they underline the need for ongoing validation against high-quality, labeled data to ensure these systems remain robust and trustworthy.

Looking ahead, we see several promising directions for future work:

  • Synthetic Dataset Generation

    Creating reliable, domain-specific synthetic datasets is critical to effectively train and evaluate LLMs, particularly in specialized or underrepresented areas.

  • Enhanced Fine-Tuning Techniques

    Exploring fine-tuning strategies using safety-focused or rubric-aligned datasets may help close the performance gap between smaller models and top-tier alternatives.

  • Prompt Sensitivity Analysis

    Investigating how different prompt phrasings affect judge model outputs will help us better understand their reliability and adaptability.

  • Multilingual Generalization

    Extending evaluation to multiple languages will be key to ensuring the global relevance and fairness of judge models in diverse linguistic contexts.


Conclusions


The primary objective of this research was to deliver a comprehensive assessment of state-of-the-art LLMs, not just in terms of raw accuracy, but across a broader spectrum of practical considerations like latency, token usage, and computational efficiency. Gaining an in-depth understanding of these LLMs provides valuable insights into their respective strengths and weaknesses, which enables us to make smarter choices in developing and integrating the latest LLM technology into our products.

If you want to see our technology in action, book a demo with us: Galtea Demo


References


  1. Deshpande, D., Ravi, S. S., CH-Wang, S., Mielczarek, B., Kannappan, A., and Qian, R. (2024). GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking. arXiv preprint arXiv:2412.14140. Retrieved from https://arxiv.org/abs/2412.14140

  2. Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219. Retrieved from https://arxiv.org/abs/2404.14219

  3. FlowAI. (2024). Flow Judge: An open small language model for LLM system evaluations. https://www.flow-ai.com/blog/flow-judge. Accessed: Mar 9, 2025.

  4. OpenAI. (2024). ChatGPT (GPT-4o model) [Large language model]. https://platform.openai.com/docs/models/gpt-4o

  5. Anthropic. (2024). Claude 3.5 Sonnet [Large language model]. https://www.anthropic.com/news/claude-3-5-sonnet

  6. Alexandru, A., Calvi, A., Broomfield, H., Golden, J., Dai, K., Leys, M., Burger, M., Bartolo, M., Engeler, R., Pisupati, S., et al. (2025). Atla Selene Mini: A General Purpose Evaluation Model. arXiv preprint arXiv:2501.17195. Retrieved from https://arxiv.org/abs/2501.17195