Topic

LLM evaluations

The core craft of measuring LLM performance: metric design, test data generation, LLM-as-a-judge methods, evaluation frameworks, and how to translate model behavior into business confidence.

Posts on this topic

LLM Evaluation vs Software testing: why your existing QA process doesn't work

The mental models and tools that work for traditional software testing break down on language models. Five assumptions — determinism, binary pass/fail, deploy-triggered failures, enumerable inputs, engineer-defined quality — and what replaces each one.

The complete guide for LLM evaluations in 2026

How to measure whether an LLM application actually works, from traces and golden datasets through judge calibration, CI regression gating, and production monitoring.

LLM as a Judge prompts: templates, rubrics, and best practices

Production-grade prompt templates for the four criteria RAG and agent pipelines need first, with the rubric patterns and anti-patterns that calibration runs consistently surface.

LLM as a Judge: The Complete Guide

LLM-as-a-judge is the practice of using one language model to evaluate another model’s outputs against a rubric, making scalable AI evaluation practical for chatbots, RAG systems, and agents. The article explains the three core judging modes, where LLM judges work well, where they fail, how to write reliable rubrics, and why calibration against a labelled gold set is mandatory before production use.

How to optimize your LLM Judge for AI evaluations (And why most teams get it wrong)

Most teams building LLM evaluation pipelines spend a lot of time on the judge itself: which model to use, how to write the rubric, and which dimensions to score. Almost none of that effort goes into evaluating whether the judge is actually right.

How to create a solid set of test cases to evaluate your GenAI system

Three approaches to test-case generation for GenAI systems — red teaming with curated attack databases, gold-standard generation for RAG and tool calling, and synthetic-user simulation for multi-turn conversations. With concrete examples for each.

Exploring state-of-the-art LLMs as Judges

Benchmarks for Glider, FlowJudge, Phi-3.5-mini, Selene, GPT-4o, and Claude 3.5 Sonnet as judges across general rubrics and red-team safety tasks. Where small fine-tuned models hold up, where they don't, and what the latency and memory tradeoffs look like.

Why Evaluation is the Key to Scaling Generative AI

Most enterprise GenAI projects stall when they try to scale past the MVP. The reason is evaluation strategy. Here's the four-stage view of where teams get stuck, and what it takes to move past Stage 3.

Galtea Team

February 24, 2025

4 minutes