LLM Evaluators

How do you create LLM-as-a-judge evaluators using custom prompts to automatically score model outputs? Arthur's LLM Evaluators let you define a judge prompt — a rubric written in natural language — that a capable LLM uses to score your agent's outputs on any dimension you care about: coherence, relevance, factual accuracy, tone, or anything else that rule-based metrics can't capture. This page walks you through creating your first evaluator, configuring score ranges, using Arthur's built-in templates, and wiring results into the continuous evaluation system.

Overview

Traditional metrics like exact match or BLEU score break down for open-ended LLM outputs. A response can be grammatically correct, semantically relevant, and still be factually wrong — or vice versa. LLM-as-a-judge solves this by using a second, high-capability model (the "judge") to evaluate outputs against a rubric you define.

Arthur supports two categories of LLM evaluators:

CategoryDescriptionExamples
Built-in LLM evaluatorsPre-built LLM-as-a-judge templates with optimized promptsAspect Critic, Answer Relevance, Context Recall, Goal Accuracy
Built-in ML evaluatorsModel-based evaluators requiring no LLM promptPII, Toxicity, Prompt Injection
Custom evaluatorsYour own judge prompts, score ranges, and pass/fail thresholdsBrand tone, Domain accuracy, Compliance

Both types produce numeric scores that flow into Arthur's continuous evaluation system, trigger alerts, and appear in the dashboard alongside your trace data.

flowchart LR
    A[Agent Trace] --> B[Arthur Evaluator]
    B --> C{Evaluator Type}
    C -->|Built-in| D[Aspect Critic / PII / Toxicity]
    C -->|Custom| E[Your Judge Prompt]
    D --> F[Numeric Score]
    E --> F
    F --> G{Pass / Fail Threshold}
    G -->|Pass| H[Dashboard: Green]
    G -->|Fail| I[Dashboard: Alert + Red]
    H --> J[Continuous Eval System]
    I --> J

How LLM Evals Work

When an evaluator runs against a trace, Arthur:

  1. Extracts variables from the trace span — for example, {{input}}, {{output}}, and {{context}} — using Transforms you configure (see the Transforms guide).
  2. Renders the judge prompt by substituting those variables into your prompt template.
  3. Sends the rendered prompt to the configured judge model (e.g., GPT-4o, Claude 3.5 Sonnet).
  4. Parses the judge's response into a numeric score within your defined range.
  5. Applies your pass/fail threshold to classify the result.
  6. Stores the score on the trace, making it available in the dashboard, alerts, and experiment comparisons.

The judge model never sees your production system prompt or any credentials — it only receives the rendered evaluation prompt.

sequenceDiagram
    participant T as Trace Span
    participant A as Arthur Engine
    participant J as Judge Model
    participant D as Dashboard

    T->>A: Span arrives (input, output, context)
    A->>A: Apply Transforms → extract variables
    A->>A: Render judge prompt with variables
    A->>J: Send rendered prompt
    J-->>A: Return score + reasoning
    A->>A: Apply pass/fail threshold
    A->>D: Store score on trace
    A->>D: Trigger alert if threshold breached

Prerequisites

Before creating an LLM evaluator, make sure you have:

  • An Arthur workspace with at least one project configured
  • Tracing enabled — evaluators run against trace spans, so you need data flowing in (Tracing setup guide)
  • Transforms configured — Transforms map span attributes to the variables your judge prompt references (Transforms guide)
  • A judge model connected — Arthur uses your configured LLM provider credentials to call the judge model; confirm your API keys are set in workspace settings
  • Workspace Editor or Admin role — required to create and modify evaluators

Create a Custom LLM Evaluator

Custom evaluators give you full control over the judge prompt, the variables it references, and how scores are interpreted.

Step 1 — Navigate to Evaluators

In the Arthur dashboard, open your project and select Evaluate → Evaluators from the left navigation.

engine_evaluate_evaluators

Click + New Evaluator to open the evaluator creation form.

Step 2 — Write your judge prompt

Your judge prompt is a Jinja-style template. Use double-curly-brace syntax to reference variables that Arthur will populate from your trace spans via Transforms.

A well-structured judge prompt has three parts:

  1. Role and task — tell the judge what it is evaluating and why
  2. Rubric — define what each score level means
  3. Output format — instruct the judge to return a structured score

Example: Response Relevance Evaluator

You are an expert evaluator assessing the relevance of an AI assistant's response to a user's question.

User question:
{{ input }}

AI response:
{{ output }}

Score the response on a scale of 1 to 5 using the following rubric:

5 - Fully relevant: The response directly and completely addresses the user's question.
4 - Mostly relevant: The response addresses the main question with minor tangents.
3 - Partially relevant: The response addresses some aspects but misses key parts of the question.
2 - Mostly irrelevant: The response touches on the topic but does not answer the question.
1 - Irrelevant: The response does not address the user's question at all.

Respond with a JSON object in this exact format:
{
  "score": <integer 1-5>,
  "reasoning": "<one sentence explanation>"
}

Tip: Always instruct the judge to return structured output (JSON). Arthur parses the score field automatically. The reasoning field is stored alongside the score and is visible in the trace detail view — it's invaluable for debugging unexpected scores.

Step 4 — Select the judge model

Choose the LLM that will act as the judge. Arthur supports:

  • OpenAI models (GPT-4o, GPT-4o-mini, GPT-4-turbo)
  • Anthropic models (Claude 3.5 Sonnet, Claude 3 Haiku)
  • Custom endpoints (via your workspace's model provider configuration)

For most evaluation tasks, GPT-4o-mini offers a good balance of quality and cost. Use GPT-4o or Claude 3.5 Sonnet for high-stakes evaluations like hallucination detection.

Step 5 — Save the evaluator

Click Save Evaluator. The evaluator is now available to attach to continuous evaluations and experiments.


Use Built-In Evaluator Templates

Arthur ships with production-tested evaluators for the most common LLM quality and safety dimensions. These use optimized prompts developed and validated by the Arthur team — you don't need to write the judge prompt yourself.

Using a built-in template

When creating an evaluator (see Create a Custom LLM Evaluator), the evaluator type dropdown includes all built-in templates. Selecting one pre-fills the judge prompt and required variables — you only need to map the variables to your Transforms and set a pass threshold.

LLM evaluators (require a judge model and use natural language rubrics):

Evaluator nameWhat it measuresVariables required
Answer RelevanceWhether the response addresses the user's questionquery, response
Answer CorrectnessFactual accuracy of the response against a referencequery, response, reference
Context PrecisionWhether retrieved context is relevant to the queryquery, context
Context RecallWhether retrieved context covers the answerresponse, context
Aspect CriticWhether the response satisfies a specific dimensionquery, response, aspect definition
Goal AccuracyWhether the agent achieved its stated goalquery, response
SQL Semantic EquivalenceWhether two SQL queries are semantically equivalentquery, reference
Topic Adherence ClassificationWhether the response stays on topicquery, response
Hedging ClassificationWhether the response hedges appropriatelyresponse
Assistant Compliance ClassificationWhether the assistant followed instructionsquery, response
Evenhandedness ClassificationWhether the response is balanced and unbiasedresponse
Topic Adherence RefusalWhether the assistant correctly refuses off-topic requestsquery, response

When to use built-in vs. custom

  • Use built-in LLM templates as a starting point for common quality dimensions — the prompts are optimized and save you iteration time.
  • Use built-in ML evaluators (PII, Toxicity, Prompt Injection) for safety checks — they use dedicated models, not LLM prompts, and are faster and cheaper to run at scale.
  • Use custom evaluators for domain-specific quality dimensions: "Does this response follow our brand voice?", "Is the SQL query syntactically valid?", "Does the response cite sources correctly?"

Run Evaluations on Your Traces

Once your evaluator is created, you have two ways to run it:

Option A — Continuous evaluation (recommended for production)

Attach the evaluator to a Continuous Evaluation rule. Arthur will automatically score every new trace that matches your filter criteria as it arrives.

See the Continuous Evaluations guide for full setup instructions. The short version:

  1. Go to Evaluate
  2. Expand the evaluator you have chosen.
  3. Click + New Continuous Eval
  4. Set the Name, the Description and the Transformer.
  5. Save — evaluations begin immediately on incoming traces

Option B — Test Runs (useful for validating your evaluator)

Before enabling an evaluator continuously, use Test Runs to score a small batch of traces manually and verify the evaluator is behaving as expected. Test Runs are scoped to a continuous eval and their results are stored separately — they don't affect production scores.

  1. Open your continuous eval and click Test
  1. Paste up to 50 trace IDs (one per line or comma-separated)
  2. Click Run Test — Arthur scores each trace and shows per-trace scores, reasoning, and pass/fail status

Test Runs are also available via API:

# Create a test run against up to 50 trace IDs
test_run = client.continuous_evals.create_test_run(
    eval_id="your-eval-id",
    trace_ids=[
        "trace-abc123",
        "trace-def456",
        "trace-ghi789",
    ],
)

# Poll for completion
import time
while test_run.status == "running":
    time.sleep(2)
    test_run = client.continuous_evals.get_test_run(test_run.id)

# Fetch results
results = client.continuous_evals.get_test_run_results(test_run.id)
for result in results:
    print(f"Trace {result.trace_id}: score={result.score}, pass={result.passed}")
    print(f"  Reasoning: {result.reason}")
# Create a test run
curl -X POST https://your-arthur-instance.com/api/v1/continuous_evals/{eval_id}/test_runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"trace_ids": ["trace-abc123", "trace-def456", "trace-ghi789"]}'

# Fetch results
curl https://your-arthur-instance.com/api/v1/continuous_evals/test_runs/{test_run_id}/results \
  -H "Authorization: Bearer YOUR_API_KEY"

Interpret Results

After evaluations run, scores appear in two places: the Evaluate Results dashboard and the individual trace detail view.

engine_evaluate_results

Dashboard-level metrics

The Evaluate Results view shows aggregate statistics across all evaluated traces:

MetricDescription
Pass ratePercentage of traces that met or exceeded your pass threshold
Average scoreMean score across all evaluated traces in the time window
Score distributionHistogram of scores — useful for spotting bimodal distributions
Fail rate over timeTime-series chart — watch for regressions after deployments

Trace-level detail

Click any trace to see:

  • The rendered judge prompt — exactly what was sent to the judge model
  • The raw judge response — the full JSON including reasoning
  • The extracted score and pass/fail classification
  • The variable values that were substituted (input, output, context)

This trace-level transparency is critical for debugging. If a score looks wrong, the rendered prompt and reasoning tell you exactly why the judge scored it that way — and whether the problem is in your prompt, your Transforms, or your agent's output.

Connecting scores to alerts

If your pass rate drops below a threshold you define, Arthur can fire an alert. Configure this in Alerts → Alert Rules. For example:

  • Alert when 7-day rolling pass rate for "Hallucination Check" drops below 90%
  • Alert when any single trace scores 1 on "Response Relevance"

See the Alert Rules guide for configuration details.


Next Steps

You now have a working LLM evaluator producing scores on your traces. Here's where to go next:

GoalGuide
Understand how variables get into your promptTransforms — map trace span attributes to evaluator variables
Automatically score all production trafficContinuous Evaluations — attach your evaluator to a continuous eval rule
Compare prompt variants using eval scoresExperiments — run A/B tests and use evaluator scores as the comparison metric
Set up alerts on pass rate dropsAlert Rules — configure threshold-based alerts on evaluator metrics
Evaluate retrieval quality in RAG pipelinesRAG Evaluation — run Context Precision, Context Recall, and Answer Correctness evaluators on your retrieval results

Recommended next step: If you're running in production, set up a Continuous Evaluation rule so every trace is scored automatically. If you're still in development, run on-demand evaluations against a representative sample of traces to calibrate your pass threshold before going live.