# LLM Evaluators

How do you create LLM-as-a-judge evaluators using custom prompts to automatically score model outputs? Arthur's LLM Evaluators let you define a judge prompt — a rubric written in natural language — that a capable LLM uses to score your agent's outputs on any dimension you care about: coherence, relevance, factual accuracy, tone, or anything else that rule-based metrics can't capture. This page walks you through creating your first evaluator, configuring score ranges, using Arthur's built-in templates, and wiring results into the continuous evaluation system.

## Overview

Traditional metrics like exact match or BLEU score break down for open-ended LLM outputs. A response can be grammatically correct, semantically relevant, and still be factually wrong — or vice versa. LLM-as-a-judge solves this by using a second, high-capability model (the "judge") to evaluate outputs against a rubric you define.

Arthur supports two categories of LLM evaluators:

| Category                    | Description                                                    | Examples                                                       |
| --------------------------- | -------------------------------------------------------------- | -------------------------------------------------------------- |
| **Built-in LLM evaluators** | Pre-built LLM-as-a-judge templates with optimized prompts      | Aspect Critic, Answer Relevance, Context Recall, Goal Accuracy |
| **Built-in ML evaluators**  | Model-based evaluators requiring no LLM prompt                 | PII, Toxicity, Prompt Injection                                |
| **Custom evaluators**       | Your own judge prompts, score ranges, and pass/fail thresholds | Brand tone, Domain accuracy, Compliance                        |

Both types produce numeric scores that flow into Arthur's continuous evaluation system, trigger alerts, and appear in the dashboard alongside your trace data.

```mermaid
flowchart LR
    A[Agent Trace] --> B[Arthur Evaluator]
    B --> C{Evaluator Type}
    C -->|Built-in| D[Aspect Critic / PII / Toxicity]
    C -->|Custom| E[Your Judge Prompt]
    D --> F[Numeric Score]
    E --> F
    F --> G{Pass / Fail Threshold}
    G -->|Pass| H[Dashboard: Green]
    G -->|Fail| I[Dashboard: Alert + Red]
    H --> J[Continuous Eval System]
    I --> J
```

***

## How LLM Evals Work

When an evaluator runs against a trace, Arthur:

1. **Extracts variables** from the trace span — for example, `{{input}}`, `{{output}}`, and `{{context}}` — using Transforms you configure (see the [Transforms guide](#)).
2. **Renders the judge prompt** by substituting those variables into your prompt template.
3. **Sends the rendered prompt** to the configured judge model (e.g., GPT-4o, Claude 3.5 Sonnet).
4. **Parses the judge's response** into a numeric score within your defined range.
5. **Applies your pass/fail threshold** to classify the result.
6. **Stores the score** on the trace, making it available in the dashboard, alerts, and experiment comparisons.

The judge model never sees your production system prompt or any credentials — it only receives the rendered evaluation prompt.

```mermaid
sequenceDiagram
    participant T as Trace Span
    participant A as Arthur Engine
    participant J as Judge Model
    participant D as Dashboard

    T->>A: Span arrives (input, output, context)
    A->>A: Apply Transforms → extract variables
    A->>A: Render judge prompt with variables
    A->>J: Send rendered prompt
    J-->>A: Return score + reasoning
    A->>A: Apply pass/fail threshold
    A->>D: Store score on trace
    A->>D: Trigger alert if threshold breached
```

***

## Prerequisites

Before creating an LLM evaluator, make sure you have:

* **An Arthur workspace** with at least one project configured
* **Tracing enabled** — evaluators run against trace spans, so you need data flowing in ([Tracing setup guide](#))
* **Transforms configured** — Transforms map span attributes to the variables your judge prompt references ([Transforms guide](#))
* **A judge model connected** — Arthur uses your configured LLM provider credentials to call the judge model; confirm your API keys are set in workspace settings
* **Workspace Editor or Admin role** — required to create and modify evaluators

***

## Create a Custom LLM Evaluator

Custom evaluators give you full control over the judge prompt, the variables it references, and how scores are interpreted.

### Step 1 — Navigate to Evaluators

In the Arthur dashboard, open your project and select **Evaluate → Evaluators** from the left navigation.

<Image align="center" alt="engine_evaluate_evaluators" src="https://files.readme.io/1b5edddceb48570e059ff2d0b690184cf7cc88f5cfbe2d439b04a54252c17667-engine_evaluate_evaluators.png" />

Click **+ New Evaluator** to open the evaluator creation form.

### Step 2 — Write your judge prompt

<Image align="center" src="https://files.readme.io/61a443fd73831f682b45b5c520d9c479286e2bd2095d9c2faf9e2c4c5f93967d-Screenshot_2026-04-23_at_11.16.11.png" />

Your judge prompt is a Jinja-style template. Use double-curly-brace syntax to reference variables that Arthur will populate from your trace spans via Transforms.

A well-structured judge prompt has three parts:

1. **Role and task** — tell the judge what it is evaluating and why
2. **Rubric** — define what each score level means
3. **Output format** — instruct the judge to return a structured score

**Example: Response Relevance Evaluator**

```
You are an expert evaluator assessing the relevance of an AI assistant's response to a user's question.

User question:
{{ input }}

AI response:
{{ output }}

Score the response on a scale of 1 to 5 using the following rubric:

5 - Fully relevant: The response directly and completely addresses the user's question.
4 - Mostly relevant: The response addresses the main question with minor tangents.
3 - Partially relevant: The response addresses some aspects but misses key parts of the question.
2 - Mostly irrelevant: The response touches on the topic but does not answer the question.
1 - Irrelevant: The response does not address the user's question at all.

Respond with a JSON object in this exact format:
{
  "score": <integer 1-5>,
  "reasoning": "<one sentence explanation>"
}
```

> **Tip:** Always instruct the judge to return structured output (JSON). Arthur parses the `score` field automatically. The `reasoning` field is stored alongside the score and is visible in the trace detail view — it's invaluable for debugging unexpected scores.

### Step 4 — Select the judge model

Choose the LLM that will act as the judge. Arthur supports:

* OpenAI models (GPT-4o, GPT-4o-mini, GPT-4-turbo)
* Anthropic models (Claude 3.5 Sonnet, Claude 3 Haiku)
* Custom endpoints (via your workspace's model provider configuration)

For most evaluation tasks, **GPT-4o-mini** offers a good balance of quality and cost. Use **GPT-4o** or **Claude 3.5 Sonnet** for high-stakes evaluations like hallucination detection.

### Step 5 — Save the evaluator

Click **Save Evaluator**. The evaluator is now available to attach to continuous evaluations and experiments.

***

## Use Built-In Evaluator Templates

Arthur ships with production-tested evaluators for the most common LLM quality and safety dimensions. These use optimized prompts developed and validated by the Arthur team — you don't need to write the judge prompt yourself.

### Using a built-in template

When creating an evaluator (see [Create a Custom LLM Evaluator](#create-a-custom-llm-evaluator)), the evaluator type dropdown includes all built-in templates. Selecting one pre-fills the judge prompt and required variables — you only need to map the variables to your Transforms and set a pass threshold.

**LLM evaluators** (require a judge model and use natural language rubrics):

| Evaluator name                        | What it measures                                           | Variables required                     |
| ------------------------------------- | ---------------------------------------------------------- | -------------------------------------- |
| `Answer Relevance`                    | Whether the response addresses the user's question         | `query`, `response`                    |
| `Answer Correctness`                  | Factual accuracy of the response against a reference       | `query`, `response`, `reference`       |
| `Context Precision`                   | Whether retrieved context is relevant to the query         | `query`, `context`                     |
| `Context Recall`                      | Whether retrieved context covers the answer                | `response`, `context`                  |
| `Aspect Critic`                       | Whether the response satisfies a specific dimension        | `query`, `response`, aspect definition |
| `Goal Accuracy`                       | Whether the agent achieved its stated goal                 | `query`, `response`                    |
| `SQL Semantic Equivalence`            | Whether two SQL queries are semantically equivalent        | `query`, `reference`                   |
| `Topic Adherence Classification`      | Whether the response stays on topic                        | `query`, `response`                    |
| `Hedging Classification`              | Whether the response hedges appropriately                  | `response`                             |
| `Assistant Compliance Classification` | Whether the assistant followed instructions                | `query`, `response`                    |
| `Evenhandedness Classification`       | Whether the response is balanced and unbiased              | `response`                             |
| `Topic Adherence Refusal`             | Whether the assistant correctly refuses off-topic requests | `query`, `response`                    |

### When to use built-in vs. custom

* **Use built-in LLM templates** as a starting point for common quality dimensions — the prompts are optimized and save you iteration time.
* **Use built-in ML evaluators** (PII, Toxicity, Prompt Injection) for safety checks — they use dedicated models, not LLM prompts, and are faster and cheaper to run at scale.
* **Use custom evaluators** for domain-specific quality dimensions: "Does this response follow our brand voice?", "Is the SQL query syntactically valid?", "Does the response cite sources correctly?"

***

## Run Evaluations on Your Traces

Once your evaluator is created, you have two ways to run it:

### Option A — Continuous evaluation (recommended for production)

Attach the evaluator to a **Continuous Evaluation** rule. Arthur will automatically score every new trace that matches your filter criteria as it arrives.

See the [Continuous Evaluations guide](#) for full setup instructions. The short version:

1. Go to **Evaluate**
2. Expand the evaluator you have chosen.
3. Click **+ New Continuous Eval**
4. Set the Name, the Description and the Transformer.
5. Save — evaluations begin immediately on incoming traces

### Option B — Test Runs (useful for validating your evaluator)

Before enabling an evaluator continuously, use **Test Runs** to score a small batch of traces manually and verify the evaluator is behaving as expected. Test Runs are scoped to a continuous eval and their results are stored separately — they don't affect production scores.

1. Open your continuous eval and click **Test**

<Image align="center" src="https://files.readme.io/5f8587b2df6fd19b54a1739ebbf9fc494ac617043a2f9915071f549c1d712bd1-Screenshot_2026-04-23_at_11.15.35.png" />

1. Paste up to 50 trace IDs (one per line or comma-separated)
2. Click **Run Test** — Arthur scores each trace and shows per-trace scores, reasoning, and pass/fail status

Test Runs are also available via API:

```python Python SDK
# Create a test run against up to 50 trace IDs
test_run = client.continuous_evals.create_test_run(
    eval_id="your-eval-id",
    trace_ids=[
        "trace-abc123",
        "trace-def456",
        "trace-ghi789",
    ],
)

# Poll for completion
import time
while test_run.status == "running":
    time.sleep(2)
    test_run = client.continuous_evals.get_test_run(test_run.id)

# Fetch results
results = client.continuous_evals.get_test_run_results(test_run.id)
for result in results:
    print(f"Trace {result.trace_id}: score={result.score}, pass={result.passed}")
    print(f"  Reasoning: {result.reason}")
```

```curl cURL
# Create a test run
curl -X POST https://your-arthur-instance.com/api/v1/continuous_evals/{eval_id}/test_runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"trace_ids": ["trace-abc123", "trace-def456", "trace-ghi789"]}'

# Fetch results
curl https://your-arthur-instance.com/api/v1/continuous_evals/test_runs/{test_run_id}/results \
  -H "Authorization: Bearer YOUR_API_KEY"
```

***

## Interpret Results

After evaluations run, scores appear in two places: the **Evaluate Results** dashboard and the **individual trace detail view**.

<Image align="center" alt="engine_evaluate_results" src="https://files.readme.io/2dfdd337b2cdc2f1de38eaac58efb5bf08f26bbf6f74d1f752eec0937be57a4c-Screenshot_2026-04-23_at_11.21.03.png" />

### Dashboard-level metrics

The Evaluate Results view shows aggregate statistics across all evaluated traces:

| Metric                  | Description                                                     |
| ----------------------- | --------------------------------------------------------------- |
| **Pass rate**           | Percentage of traces that met or exceeded your pass threshold   |
| **Average score**       | Mean score across all evaluated traces in the time window       |
| **Score distribution**  | Histogram of scores — useful for spotting bimodal distributions |
| **Fail rate over time** | Time-series chart — watch for regressions after deployments     |

### Trace-level detail

Click any trace to see:

* The **rendered judge prompt** — exactly what was sent to the judge model
* The **raw judge response** — the full JSON including reasoning
* The **extracted score** and pass/fail classification
* The **variable values** that were substituted (input, output, context)

This trace-level transparency is critical for debugging. If a score looks wrong, the rendered prompt and reasoning tell you exactly why the judge scored it that way — and whether the problem is in your prompt, your Transforms, or your agent's output.

<Image align="center" src="https://files.readme.io/3ddc4305c7a9eaeab700326551d5a5f4af9e7d59882500154934762db9a98829-Screenshot_2026-04-23_at_11.20.58.png" />

### Connecting scores to alerts

If your pass rate drops below a threshold you define, Arthur can fire an alert. Configure this in **Alerts → Alert Rules**. For example:

* Alert when 7-day rolling pass rate for "Hallucination Check" drops below 90%
* Alert when any single trace scores 1 on "Response Relevance"

See the [Alert Rules guide](#) for configuration details.

***

## Next Steps

You now have a working LLM evaluator producing scores on your traces. Here's where to go next:

| Goal                                              | Guide                                                                                                                               |
| ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| **Understand how variables get into your prompt** | [Transforms](#) — map trace span attributes to evaluator variables                                                                  |
| **Automatically score all production traffic**    | [Continuous Evaluations](#) — attach your evaluator to a continuous eval rule                                                       |
| **Compare prompt variants using eval scores**     | [Experiments](#) — run A/B tests and use evaluator scores as the comparison metric                                                  |
| **Set up alerts on pass rate drops**              | [Alert Rules](#) — configure threshold-based alerts on evaluator metrics                                                            |
| **Evaluate retrieval quality in RAG pipelines**   | [RAG Evaluation](rag-guide.md) — run Context Precision, Context Recall, and Answer Correctness evaluators on your retrieval results |

> **Recommended next step:** If you're running in production, set up a Continuous Evaluation rule so every trace is scored automatically. If you're still in development, run on-demand evaluations against a representative sample of traces to calibrate your pass threshold before going live.