Quickstart: Evaluate Your First LLM Call
In your first 30 minutes with Arthur, you will deploy the Arthur Engine, install the Python SDK, send a single LLM call through Arthur, and see the captured trace appear in the Arthur UI. Evaluation scores are added once you configure evaluators for your task. This page walks through each step in sequence.
Prerequisites
Before you begin, confirm you have the following:
| Requirement | Details |
|---|---|
| Arthur account | Sign up at platform.arthur.ai if you don't have one |
| Deployment target | A host environment for the engine — local Docker, Kubernetes, or a cloud account (AWS, GCP, Azure) |
| Python | Version 3.12 or later |
| pip | Latest version recommended (pip install --upgrade pip) |
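As a quick preflight, the Python requirement in the table above can be checked from the interpreter you plan to use. This is a minimal sketch; the helper name `python_is_supported` is illustrative and not part of the Arthur SDK.

```python
import sys

MIN_PYTHON = (3, 12)  # minimum version listed in the prerequisites table


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python 3.12+ required, found {found}")
```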
Deploy the Engine
Deploy the Arthur Engine from the platform UI. The wizard walks you through configuration and generates an install command tailored to your infrastructure.
- Navigate to Engines Management in your workspace: https://platform.arthur.ai/workspaces/{your_workspace_id}/engines
- Click + ENGINE and step through the wizard.
- On the Select Install Method step, choose your target environment:
| Method | Details |
|---|---|
| Docker | Ideal for local development or single-server deployments |
| AWS | CloudFormation-based deploy (GPU or CPU stack). See AWS deployment guide |
| Kubernetes | Helm chart deploy on any Kubernetes distribution. See Kubernetes deployment guide |
| GCP | Deploy to Google Cloud Platform using Cloud Run |
| Azure | Deploy to Azure using Container Instances |
- On the Install step, the platform provides your pre-configured install command (for Docker) or links to deployment documentation with your generated client secret (for AWS and Kubernetes). Run the command or follow the linked guide.
- Once the engine connects, click Continue to Project Setup.
Install the SDK
Install the Arthur Observability SDK with the extra for your LLM framework:
# OpenAI
pip install "arthur-observability-sdk[openai]"
# LangChain
pip install "arthur-observability-sdk[langchain]"
# Anthropic
pip install "arthur-observability-sdk[anthropic]"
# Mastra users: install the Arthur exporter
npm install @mastra/arthur
The SDK supports 30+ frameworks. See the full list of extras.
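To confirm the install landed in the environment you expect, a small check like the following may help. It is a sketch using only the standard library; the module name `arthur_observability_sdk` is taken from the import statement used later on this page, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Module name taken from the import statement used later on this page.
if is_installed("arthur_observability_sdk"):
    print("Arthur SDK found")
else:
    print("Arthur SDK not found; re-run the pip install command above")
```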
Instrument Your First Call
The SDK auto-instruments your LLM framework — no manual payload construction required. Every call is automatically captured as a trace and sent to Arthur.
Step 1 — Create a task and get your API key
A task represents your LLM application in Arthur. You need to create one before you can send traces.
- Open the Arthur UI at your engine's address (e.g. http://localhost:3030 for Docker) and log in with your GENAI_ENGINE_ADMIN_KEY. For Docker, the default is changeme_genai_engine_admin_key unless you changed it in docker-compose.yml.
- On the home page, create a new task. Note the task ID; it appears in the URL: /tasks/{id}.
- Go to Settings → API Keys and create an API key.
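Rather than pasting the API key and task ID from the steps above into source code, you may prefer to read them from environment variables. The sketch below assumes the variable names ARTHUR_API_KEY, ARTHUR_BASE_URL, and ARTHUR_TASK_ID (the names used in the Mastra example on this page); `ArthurSettings` and `load_settings` are hypothetical helpers, not part of the SDK.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ArthurSettings:
    api_key: str
    base_url: str
    task_id: str


def load_settings(env=None):
    """Read Arthur connection settings from environment variables.

    Raises ValueError naming every required variable that is missing or empty.
    """
    env = os.environ if env is None else env
    required = ("ARTHUR_API_KEY", "ARTHUR_TASK_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return ArthurSettings(
        api_key=env["ARTHUR_API_KEY"],
        # Assumption: fall back to the local Docker address used on this page.
        base_url=env.get("ARTHUR_BASE_URL", "http://localhost:3030"),
        task_id=env["ARTHUR_TASK_ID"],
    )
```

The resulting fields can then be passed as the api_key, base_url, and task_id arguments when you initialize the client in the next step.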
Step 2 — Initialize Arthur and instrument your framework
from arthur_observability_sdk import Arthur
arthur = Arthur(
    api_key="YOUR_API_KEY",  # from Settings → API Keys
    base_url="http://localhost:3030",
    task_id="YOUR_TASK_ID",  # from the task URL: /tasks/{id}
)
# OpenAI
arthur.instrument_openai()
# — or Anthropic —
# arthur.instrument_anthropic()
Mastra
import { ArthurExporter } from '@mastra/arthur';
import { Mastra } from '@mastra/core/mastra';
const mastra = new Mastra({
  observability: {
    configs: {
      arthur: {
        exporters: [
          new ArthurExporter({
            apiKey: process.env.ARTHUR_API_KEY, // from Settings → API Keys
            endpoint: process.env.ARTHUR_BASE_URL,
            taskId: process.env.ARTHUR_TASK_ID, // from the task URL: /tasks/{id}; or set the ARTHUR_TASK_ID env var
          }),
        ],
      },
    },
  },
});
Step 3 — Make your first LLM call
Make your LLM call as normal. The SDK captures it automatically.
OpenAI
import openai
client = openai.OpenAI()
with arthur.attributes(session_id="quickstart-session-001", user_id="user-1"):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
print(response.choices[0].message.content)
Mastra
// Mastra traces every agent/LLM call automatically once ArthurExporter is wired in.
// No additional instrumentation needed; just use your Mastra agent as normal.
const agent = mastra.getAgent('my-agent');
const response = await agent.generate('What is the capital of France?');
console.log(response.text);
Anthropic
import anthropic
client = anthropic.Anthropic()
with arthur.attributes(session_id="quickstart-session-001", user_id="user-1"):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": "What is the capital of France?"}],
    )
print(response.content[0].text)
Step 4 — Flush and shut down
Call shutdown() before your process exits to ensure all pending traces are sent.
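If your script can raise before it reaches the shutdown call, pending traces may never be flushed. One way to guard against that is a small context manager that always calls shutdown() on exit. This is a sketch only: `shutdown_on_exit` is a hypothetical helper, and `FakeClient` stands in for the Arthur client so the example runs without the SDK installed.

```python
from contextlib import contextmanager


@contextmanager
def shutdown_on_exit(client):
    """Yield the client, then call client.shutdown() even if the body raises."""
    try:
        yield client
    finally:
        client.shutdown()


class FakeClient:
    """Stand-in for the Arthur client so this sketch runs without the SDK."""

    def __init__(self):
        self.shutdown_called = False

    def shutdown(self):
        self.shutdown_called = True


fake = FakeClient()
try:
    with shutdown_on_exit(fake):
        raise RuntimeError("simulated failure during an LLM call")
except RuntimeError:
    pass

print(fake.shutdown_called)  # True
```

In a real script you would pass your initialized Arthur client instead of `FakeClient` and place your LLM calls inside the `with` block.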
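arthur.shutdown()
Mastra
// `exporter` is the ArthurExporter instance passed to Mastra in Step 2; keep a reference to it.
await exporter.shutdown(); // flush and terminate
// or just flush without shutting down:
// await exporter.flush();
View Results in Arthur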
Once shutdown() has sent all pending traces, your trace is available in the Arthur UI.
Navigate to your trace
- Open the Arthur UI at http://localhost:3030.
- Select your task from the home page.
- Click Traces in the left sidebar; this opens /tasks/{YOUR_TASK_ID}/traces.
- Your trace appears in the list. Filter by session_id = quickstart-session-001 to find it.
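Because the example uses a fixed session_id, repeated runs will all match the same filter. If you want each run to be easy to isolate in the Traces list, you could generate a unique session ID per run. A minimal sketch, where `make_session_id` is a hypothetical helper:

```python
import uuid


def make_session_id(prefix="quickstart-session"):
    """Return a session ID with a random 8-character hex suffix, unique per call."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


session_id = make_session_id()
# Pass this value as session_id=... when making the call, then filter on it in Traces.
print(session_id)
```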
What you'll see in the trace
Click a trace row to open the detail view. You will see:
| Field | What it shows |
|---|---|
| Input / Output | The exact prompt and response that were captured |
| Latency | Time taken for the LLM call |
| Token counts | Prompt and completion token usage |
| Session / User | The session_id and user_id attributes you set |
Scores require evaluators: traces don't automatically include evaluation scores. To score traces for toxicity, PII, and other criteria, configure evaluators under Evaluate in the task sidebar.
Confirm what you just accomplished
flowchart LR
A[Install SDK] --> B[Initialize Arthur]
B --> C[Instrument Framework]
C --> D[Make LLM Call]
D --> E[arthur.shutdown]
E --> F[View Trace in Arthur]
Next Steps
Now that you have a working evaluation pipeline, here is where to go next depending on your goal:
Evaluate at scale
Connect Arthur to your production LLM calls so every response is scored automatically. See Continuous Evaluation → to learn how to instrument your application with a single decorator or middleware hook.
Set up evaluators
No evaluators are active by default — you configure the ones you need for your task. Arthur ships with templates for a wide range of criteria, including Answer Correctness, Answer Relevance, Context Precision, Context Recall, Goal Accuracy, Toxicity, Topic Adherence, SQL Semantic Equivalence, and more. Configure them under Evaluate in the task sidebar. See Configuring Evaluators →.
Set up alerts
Get notified when scores drop below a threshold. Arthur's alert rules let you define conditions and route notifications to Slack, PagerDuty, or email. See Alert Rules →.
Evaluate agentic workflows
If you are building multi-step agents, Arthur can trace tool calls, sub-agent invocations, and LLM calls within a single session. See Agent Evaluation →.
Invite your team
Add colleagues to your workspace so they can view evaluations and configure scorers together. See Managing Workspace Access →.
Stuck? If you did not see a score after completing these steps, check that your API key has evaluations:write permission in Settings → API Keys, and that your workspace_id is correct. You can also reach the Arthur team at [email protected].