Quickstart: Evaluate Your First LLM Call

In your first 30 minutes with Arthur, you will deploy the Arthur Engine, install the Python SDK, send a single LLM call through Arthur, and see the captured trace appear in the Arthur UI. Scoring that trace requires configuring evaluators, which is covered at the end of this page. The steps below run in sequence.

Prerequisites

Before you begin, confirm you have the following:

| Requirement | Details |
| --- | --- |
| Arthur account | Sign up at platform.arthur.ai if you don't have one |
| Deployment target | A host environment for the engine: local Docker, Kubernetes, or a cloud account (AWS, GCP, Azure) |
| Python | Version 3.12 or later |
| pip | Latest version recommended (pip install --upgrade pip) |
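
The Python requirement can be checked from the interpreter itself before installing anything. The snippet below is a minimal sketch; the helper name `meets_minimum` is illustrative and not part of the Arthur SDK.

```python
import sys

# Minimum version taken from the prerequisites listed above.
MINIMUM_PYTHON = (3, 12)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the given version tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```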

Deploy the Engine

Deploy the Arthur Engine from the platform UI. The wizard walks you through configuration and generates an install command tailored to your infrastructure.

  1. Navigate to Engines Management in your workspace: https://platform.arthur.ai/workspaces/{your_workspace_id}/engines

  2. Click + ENGINE and step through the wizard.

  3. On the Select Install Method step, choose your target environment:

    | Method | Details |
    | --- | --- |
    | Docker | Ideal for local development or single-server deployments |
    | AWS | CloudFormation-based deploy (GPU or CPU stack). See AWS deployment guide |
    | Kubernetes | Helm chart deploy on any Kubernetes distribution. See Kubernetes deployment guide |
    | GCP | Deploy to Google Cloud Platform using Cloud Run |
    | Azure | Deploy to Azure using Container Instances |

  4. On the Install step, the platform provides your pre-configured install command (for Docker) or links to deployment documentation with your generated client secret (for AWS and Kubernetes). Run the command or follow the linked guide.

  5. Once the engine connects, click Continue to Project Setup.


Install the SDK

Install the Arthur Observability SDK with the extra for your LLM framework:

# OpenAI
pip install "arthur-observability-sdk[openai]"

# LangChain
pip install "arthur-observability-sdk[langchain]"

# Anthropic
pip install "arthur-observability-sdk[anthropic]"

# Mastra users: install the Arthur exporter
npm install @mastra/arthur

Note: The SDK supports 30+ frameworks. See the full list of extras.


Instrument Your First Call

The SDK auto-instruments your LLM framework — no manual payload construction required. Every call is automatically captured as a trace and sent to Arthur.

Step 1 — Create a task and get your API key

A task represents your LLM application in Arthur. You need to create one before you can send traces.

  1. Open the Arthur UI at your engine's address (e.g. http://localhost:3030 for Docker) and log in with your GENAI_ENGINE_ADMIN_KEY. For Docker, the default is changeme_genai_engine_admin_key unless you changed it in docker-compose.yml.
  2. On the home page, create a new task. Note the task ID — it appears in the URL: /tasks/{id}.
  3. Go to Settings → API Keys and create an API key.
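
The API key, engine address, and task ID collected in this step can be kept out of source code by reading them from the environment. The following is a minimal sketch, not part of the SDK: the helper names are hypothetical, and the variable names `ARTHUR_API_KEY`, `ARTHUR_BASE_URL`, and `ARTHUR_TASK_ID` are assumed here for Python because they are the ones the Mastra example in this guide reads.

```python
import os

# Assumed variable names, borrowed from the Mastra example in this guide.
REQUIRED_VARS = {
    "api_key": "ARTHUR_API_KEY",
    "base_url": "ARTHUR_BASE_URL",
    "task_id": "ARTHUR_TASK_ID",
}


def missing_vars(env):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS.values() if not env.get(name)]


def load_arthur_settings(env=None):
    """Collect the three connection values, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

The returned dictionary uses the same keys as the `Arthur(...)` arguments shown in Step 2, so it could be passed as `Arthur(**load_arthur_settings())`.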

Step 2 — Initialize Arthur and instrument your framework

from arthur_observability_sdk import Arthur

arthur = Arthur(
    api_key="YOUR_API_KEY",   # from Settings → API Keys
    base_url="http://localhost:3030",
    task_id="YOUR_TASK_ID",   # from the task URL: /tasks/{id}
)

# OpenAI
arthur.instrument_openai()

# — or Anthropic —
# arthur.instrument_anthropic()

Mastra (TypeScript)

import { ArthurExporter } from '@mastra/arthur';
import { Mastra } from '@mastra/core/mastra';

const mastra = new Mastra({
  observability: {
    configs: {
      arthur: {
        exporters: [
          new ArthurExporter({
            apiKey: process.env.ARTHUR_API_KEY,   // from Settings → API Keys
            endpoint: process.env.ARTHUR_BASE_URL,
            taskId: process.env.ARTHUR_TASK_ID,   // from the task URL: /tasks/{id} — or set ARTHUR_TASK_ID env var
          }),
        ],
      },
    },
  },
});

Step 3 — Make your first LLM call

Make your LLM call as normal. The SDK captures it automatically.

OpenAI

import openai

client = openai.OpenAI()

with arthur.attributes(session_id="quickstart-session-001", user_id="user-1"):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
    print(response.choices[0].message.content)

Anthropic

import anthropic

client = anthropic.Anthropic()

with arthur.attributes(session_id="quickstart-session-001", user_id="user-1"):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
    print(response.content[0].text)

Mastra (TypeScript)

// Mastra traces every agent/LLM call automatically once ArthurExporter is wired in.
// No additional instrumentation is needed; use your Mastra agent as normal.
const agent = mastra.getAgent('my-agent');
const response = await agent.generate('What is the capital of France?');
console.log(response.text);
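
The Python examples above hard-code `session_id="quickstart-session-001"`, so repeated runs all land in the same session. If you would rather have one session per run, a unique ID can be generated first. This is a small sketch; `new_session_id` is a hypothetical helper, not an SDK function.

```python
import uuid


def new_session_id(prefix="quickstart-session"):
    """Return the prefix followed by a random 8-character hex suffix."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


if __name__ == "__main__":
    print(new_session_id())
```

The result can be passed in place of the literal, for example `arthur.attributes(session_id=new_session_id(), user_id="user-1")`. Note the printed value, since you will need it to filter traces later.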

Step 4 — Flush and shut down

Call shutdown() before your process exits to ensure all pending traces are sent.

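Python

arthur.shutdown()

Mastra (TypeScript)

// `exporter` is the ArthurExporter instance passed to Mastra in Step 2;
// assign it to a variable there so you can reference it here.
await exporter.shutdown(); // flush and terminate
// or just flush without shutting down:
// await exporter.flush();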
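
If an exception is raised before `shutdown()` is reached, pending traces may never be sent. One way to guard against that in Python is a small context manager that always calls `shutdown()` on exit. The sketch below is illustrative: `flushed_on_exit` and `FakeClient` are hypothetical names, and `FakeClient` only stands in for the object returned by `Arthur(...)` so that the example runs without the SDK installed.

```python
from contextlib import contextmanager


@contextmanager
def flushed_on_exit(client):
    """Yield `client` and always call its shutdown() afterwards, even on error."""
    try:
        yield client
    finally:
        client.shutdown()


class FakeClient:
    """Stand-in for the Arthur client; it only counts shutdown() calls."""

    def __init__(self):
        self.shutdown_calls = 0

    def shutdown(self):
        self.shutdown_calls += 1


if __name__ == "__main__":
    fake = FakeClient()
    try:
        with flushed_on_exit(fake):
            raise ValueError("simulated failure during an LLM call")
    except ValueError:
        pass
    print("shutdown calls:", fake.shutdown_calls)
```

With the real client, the same pattern would read `with flushed_on_exit(arthur): ...` around your LLM calls.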

View Results in Arthur

Once the SDK has flushed its pending data, your trace is available in the Arthur UI.

Navigate to your trace

  1. Open the Arthur UI at http://localhost:3030.
  2. Select your task from the home page.
  3. Click Traces in the left sidebar — this opens /tasks/{YOUR_TASK_ID}/traces.
  4. Your trace appears in the list. Filter by session_id = quickstart-session-001 to find it.
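
To jump straight to the traces page from a script, the link can be assembled from the same values passed to the SDK. This is a minimal sketch that assumes the `/tasks/{id}/traces` path pattern from step 3 above; `traces_url` is a hypothetical helper.

```python
def traces_url(base_url, task_id):
    """Build the traces page URL, tolerating a trailing slash on base_url."""
    return f"{base_url.rstrip('/')}/tasks/{task_id}/traces"


if __name__ == "__main__":
    print(traces_url("http://localhost:3030", "YOUR_TASK_ID"))
```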

What you'll see in the trace

Click a trace row to open the detail view. You will see:

| Field | What it shows |
| --- | --- |
| Input / Output | The exact prompt and response that were captured |
| Latency | Time taken for the LLM call |
| Token counts | Prompt and completion token usage |
| Session / User | The session_id and user_id attributes you set |

Note: Scores require evaluators

Traces don't automatically include evaluation scores. To score traces for toxicity, PII, and other criteria, configure evaluators under Evaluate in the task sidebar.

Confirm what you just accomplished

flowchart LR
    A[Install SDK] --> B[Initialize Arthur]
    B --> C[Instrument Framework]
    C --> D[Make LLM Call]
    D --> E[arthur.shutdown]
    E --> F[View Trace in Arthur]

Next Steps

Now that traces are flowing into Arthur, here is where to go next, depending on your goal:

Evaluate at scale

Connect Arthur to your production LLM calls so every response is scored automatically. See Continuous Evaluation → to learn how to instrument your application with a single decorator or middleware hook.

Set up evaluators

No evaluators are active by default — you configure the ones you need for your task. Arthur ships with templates for a wide range of criteria, including Answer Correctness, Answer Relevance, Context Precision, Context Recall, Goal Accuracy, Toxicity, Topic Adherence, SQL Semantic Equivalence, and more. Configure them under Evaluate in the task sidebar. See Configuring Evaluators →.

Set up alerts

Get notified when scores drop below a threshold. Arthur's alert rules let you define conditions and route notifications to Slack, PagerDuty, or email. See Alert Rules →.

Evaluate agentic workflows

If you are building multi-step agents, Arthur can trace tool calls, sub-agent invocations, and LLM calls within a single session. See Agent Evaluation →.

Invite your team

Add colleagues to your workspace so they can view evaluations and configure evaluators together. See Managing Workspace Access →.


Stuck? If you did not see a score after completing these steps, check that your API key has evaluations:write permission in Settings → API Keys, and that your workspace_id is correct. You can also reach the Arthur team at [email protected].