LLM Evaluators

Overview

Traditional metrics like exact match or BLEU score break down for open-ended LLM outputs. A response can be grammatically correct, semantically relevant, and still be factually wrong — or vice versa. LLM-as-a-judge solves this by using a second, high-capability model (the "judge") to evaluate outputs against a rubric you define.

Arthur supports two categories of LLM evaluators:

Category	Description	Examples
Built-in LLM evaluators	Pre-built LLM-as-a-judge templates with optimized prompts	Aspect Critic, Answer Relevance, Context Recall, Goal Accuracy
Built-in ML evaluators	Model-based evaluators requiring no LLM prompt	PII, Toxicity, Prompt Injection
Custom evaluators	Your own judge prompts, score ranges, and pass/fail thresholds	Brand tone, Domain accuracy, Compliance

Both types produce numeric scores that flow into Arthur's continuous evaluation system, trigger alerts, and appear in the dashboard alongside your trace data.

flowchart LR
    A[Agent Trace] --> B[Arthur Evaluator]
    B --> C{Evaluator Type}
    C -->|Built-in| D[Aspect Critic / PII / Toxicity]
    C -->|Custom| E[Your Judge Prompt]
    D --> F[Numeric Score]
    E --> F
    F --> G{Pass / Fail Threshold}
    G -->|Pass| H[Dashboard: Green]
    G -->|Fail| I[Dashboard: Alert + Red]
    H --> J[Continuous Eval System]
    I --> J

How LLM Evals Work

When an evaluator runs against a trace, Arthur:

Extracts variables from the trace span — for example, {{input}}, {{output}}, and {{context}} — using Transforms you configure (see the Transforms guide).
Renders the judge prompt by substituting those variables into your prompt template.
Sends the rendered prompt to the configured judge model (e.g., GPT-4o, Claude 3.5 Sonnet).
Parses the judge's response into a numeric score within your defined range.
Applies your pass/fail threshold to classify the result.
Stores the score on the trace, making it available in the dashboard, alerts, and experiment comparisons.

The judge model never sees your production system prompt or any credentials — it only receives the rendered evaluation prompt.

sequenceDiagram
    participant T as Trace Span
    participant A as Arthur Engine
    participant J as Judge Model
    participant D as Dashboard

    T->>A: Span arrives (input, output, context)
    A->>A: Apply Transforms → extract variables
    A->>A: Render judge prompt with variables
    A->>J: Send rendered prompt
    J-->>A: Return score + reasoning
    A->>A: Apply pass/fail threshold
    A->>D: Store score on trace
    A->>D: Trigger alert if threshold breached

Prerequisites

Before creating an LLM evaluator, make sure you have:

An Arthur workspace with at least one project configured
Tracing enabled — evaluators run against trace spans, so you need data flowing in (Tracing setup guide)
Transforms configured — Transforms map span attributes to the variables your judge prompt references (Transforms guide)
A judge model connected — Arthur uses your configured LLM provider credentials to call the judge model; confirm your API keys are set in workspace settings
Workspace Editor or Admin role — required to create and modify evaluators

Create a Custom LLM Evaluator

Custom evaluators give you full control over the judge prompt, the variables it references, and how scores are interpreted.

Step 1 — Navigate to Evaluators

In the Arthur dashboard, open your project and select Evaluate → Evaluators from the left navigation.

Click + New Evaluator to open the evaluator creation form.

Step 2 — Write your judge prompt

Your judge prompt is a Jinja-style template. Use double-curly-brace syntax to reference variables that Arthur will populate from your trace spans via Transforms.

A well-structured judge prompt has three parts:

Role and task — tell the judge what it is evaluating and why
Rubric — define what each score level means
Output format — instruct the judge to return a structured score

Example: Response Relevance Evaluator

You are an expert evaluator assessing the relevance of an AI assistant's response to a user's question.

User question:
{{ input }}

AI response:
{{ output }}

Score the response on a scale of 1 to 5 using the following rubric:

5 - Fully relevant: The response directly and completely addresses the user's question.
4 - Mostly relevant: The response addresses the main question with minor tangents.
3 - Partially relevant: The response addresses some aspects but misses key parts of the question.
2 - Mostly irrelevant: The response touches on the topic but does not answer the question.
1 - Irrelevant: The response does not address the user's question at all.

Respond with a JSON object in this exact format:
{
  "score": <integer 1-5>,
  "reasoning": "<one sentence explanation>"
}

Tip: Always instruct the judge to return structured output (JSON). Arthur parses the score field automatically. The reasoning field is stored alongside the score and is visible in the trace detail view — it's invaluable for debugging unexpected scores.

Step 4 — Select the judge model

Choose the LLM that will act as the judge. Arthur supports:

OpenAI models (GPT-4o, GPT-4o-mini, GPT-4-turbo)
Anthropic models (Claude 3.5 Sonnet, Claude 3 Haiku)
Custom endpoints (via your workspace's model provider configuration)

For most evaluation tasks, GPT-4o-mini offers a good balance of quality and cost. Use GPT-4o or Claude 3.5 Sonnet for high-stakes evaluations like hallucination detection.

Step 5 — Save the evaluator

Click Save Evaluator. The evaluator is now available to attach to continuous evaluations and experiments.

Use Built-In Evaluator Templates

Arthur ships with production-tested evaluators for the most common LLM quality and safety dimensions. These use optimized prompts developed and validated by the Arthur team — you don't need to write the judge prompt yourself.

Using a built-in template

When creating an evaluator (see Create a Custom LLM Evaluator), the evaluator type dropdown includes all built-in templates. Selecting one pre-fills the judge prompt and required variables — you only need to map the variables to your Transforms and set a pass threshold.

LLM evaluators (require a judge model and use natural language rubrics):

Evaluator name	What it measures	Variables required
`Answer Relevance`	Whether the response addresses the user's question	`query`, `response`
`Answer Correctness`	Factual accuracy of the response against a reference	`query`, `response`, `reference`
`Context Precision`	Whether retrieved context is relevant to the query	`query`, `context`
`Context Recall`	Whether retrieved context covers the answer	`response`, `context`
`Aspect Critic`	Whether the response satisfies a specific dimension	`query`, `response`, aspect definition
`Goal Accuracy`	Whether the agent achieved its stated goal	`query`, `response`
`SQL Semantic Equivalence`	Whether two SQL queries are semantically equivalent	`query`, `reference`
`Topic Adherence Classification`	Whether the response stays on topic	`query`, `response`
`Hedging Classification`	Whether the response hedges appropriately	`response`
`Assistant Compliance Classification`	Whether the assistant followed instructions	`query`, `response`
`Evenhandedness Classification`	Whether the response is balanced and unbiased	`response`
`Topic Adherence Refusal`	Whether the assistant correctly refuses off-topic requests	`query`, `response`

When to use built-in vs. custom

Use built-in LLM templates as a starting point for common quality dimensions — the prompts are optimized and save you iteration time.
Use built-in ML evaluators (PII, Toxicity, Prompt Injection) for safety checks — they use dedicated models, not LLM prompts, and are faster and cheaper to run at scale.
Use custom evaluators for domain-specific quality dimensions: "Does this response follow our brand voice?", "Is the SQL query syntactically valid?", "Does the response cite sources correctly?"

Run Evaluations on Your Traces

Once your evaluator is created, you have two ways to run it:

Option A — Continuous evaluation (recommended for production)

Attach the evaluator to a Continuous Evaluation rule. Arthur will automatically score every new trace that matches your filter criteria as it arrives.

See the Continuous Evaluations guide for full setup instructions. The short version:

Go to Evaluate
Expand the evaluator you have chosen.
Click + New Continuous Eval
Set the Name, the Description and the Transformer.
Save — evaluations begin immediately on incoming traces

Option B — Test Runs (useful for validating your evaluator)

Before enabling an evaluator continuously, use Test Runs to score a small batch of traces manually and verify the evaluator is behaving as expected. Test Runs are scoped to a continuous eval and their results are stored separately — they don't affect production scores.

Open your continuous eval and click Test

Paste up to 50 trace IDs (one per line or comma-separated)
Click Run Test — Arthur scores each trace and shows per-trace scores, reasoning, and pass/fail status

Test Runs are also available via API:

# Create a test run against up to 50 trace IDs
test_run = client.continuous_evals.create_test_run(
    eval_id="your-eval-id",
    trace_ids=[
        "trace-abc123",
        "trace-def456",
        "trace-ghi789",
    ],
)

# Poll for completion
import time
while test_run.status == "running":
    time.sleep(2)
    test_run = client.continuous_evals.get_test_run(test_run.id)

# Fetch results
results = client.continuous_evals.get_test_run_results(test_run.id)
for result in results:
    print(f"Trace {result.trace_id}: score={result.score}, pass={result.passed}")
    print(f"  Reasoning: {result.reason}")

# Create a test run
curl -X POST https://your-arthur-instance.com/api/v1/continuous_evals/{eval_id}/test_runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"trace_ids": ["trace-abc123", "trace-def456", "trace-ghi789"]}'

# Fetch results
curl https://your-arthur-instance.com/api/v1/continuous_evals/test_runs/{test_run_id}/results \
  -H "Authorization: Bearer YOUR_API_KEY"

Interpret Results

After evaluations run, scores appear in two places: the Evaluate Results dashboard and the individual trace detail view.

Dashboard-level metrics

The Evaluate Results view shows aggregate statistics across all evaluated traces:

Metric	Description
Pass rate	Percentage of traces that met or exceeded your pass threshold
Average score	Mean score across all evaluated traces in the time window
Score distribution	Histogram of scores — useful for spotting bimodal distributions
Fail rate over time	Time-series chart — watch for regressions after deployments

Trace-level detail

Click any trace to see:

The rendered judge prompt — exactly what was sent to the judge model
The raw judge response — the full JSON including reasoning
The extracted score and pass/fail classification
The variable values that were substituted (input, output, context)

This trace-level transparency is critical for debugging. If a score looks wrong, the rendered prompt and reasoning tell you exactly why the judge scored it that way — and whether the problem is in your prompt, your Transforms, or your agent's output.

Connecting scores to alerts

If your pass rate drops below a threshold you define, Arthur can fire an alert. Configure this in Alerts → Alert Rules. For example:

Alert when 7-day rolling pass rate for "Hallucination Check" drops below 90%
Alert when any single trace scores 1 on "Response Relevance"

See the Alert Rules guide for configuration details.

Next Steps

You now have a working LLM evaluator producing scores on your traces. Here's where to go next:

Goal	Guide
Understand how variables get into your prompt	Transforms — map trace span attributes to evaluator variables
Automatically score all production traffic	Continuous Evaluations — attach your evaluator to a continuous eval rule
Compare prompt variants using eval scores	Experiments — run A/B tests and use evaluator scores as the comparison metric
Set up alerts on pass rate drops	Alert Rules — configure threshold-based alerts on evaluator metrics
Evaluate retrieval quality in RAG pipelines	RAG Evaluation — run Context Precision, Context Recall, and Answer Correctness evaluators on your retrieval results

Recommended next step: If you're running in production, set up a Continuous Evaluation rule so every trace is scored automatically. If you're still in development, run on-demand evaluations against a representative sample of traces to calibrate your pass threshold before going live.