Datasets (Engine)
Overview
Evaluation datasets in the Arthur Engine make your eval runs reproducible. You upload a named, versioned dataset to a specific task, then reference that dataset by ID and version number in every evaluation run. Because datasets are scoped to a task and each version is immutable once created, two runs that point at the same dataset version always operate on identical inputs, which gives you an apples-to-apples comparison across model changes, prompt updates, or scorer configurations.
This page covers engine-side datasets — task-scoped datasets used to power evaluations and benchmarks in the Arthur Engine. These are distinct from platform-side datasets used for model monitoring. For platform-side datasets, see the Datasets (Platform) page.
flowchart LR
A["Raw test cases<br>or CSV"] --> B["Create Dataset<br>POST /datasets"]
B --> C["Named Dataset<br>with dataset_id"]
C --> D["Add a Version<br>POST /datasets/:id/versions"]
D --> E["Versioned Dataset<br>v1, v2, ..."]
E --> F["Reference in<br>Eval Run"]
F --> G["Reproducible<br>Eval Results"]
Engine base URL
All Engine API calls on this page use http://localhost:3030 as the default base URL. Set ARTHUR_BASE_URL in your environment to override this for staging or production deployments.
Prerequisites
Before you create your first dataset, make sure you have:
- An Arthur Engine instance running and reachable (default: http://localhost:3030)
- An API key with DATASET_WRITE permission, set as ARTHUR_API_KEY in your environment or passed directly
- A task already created in the Engine. Datasets are scoped to a task; see Tasks if you need to create one first
- Your test cases ready — either as a list of column/value pairs, a CSV, or a JSON array
Create a Dataset
The create-then-version workflow is the foundation of reproducible evals. You first register a named dataset under a task, then upload one or more versioned snapshots of the actual data. This separation lets you evolve your dataset over time without breaking references to earlier versions.
Step 1 — Register the dataset name
Call POST /api/v2/tasks/{task_id}/datasets to register a new dataset under your task. This creates the dataset record and returns a dataset_id you'll use in all subsequent calls.
import requests
import os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
TASK_ID = "your-task-id"
headers = {
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
}
payload = {
"name": "customer-support-qa-bench",
"description": "Golden set of customer support Q&A pairs for regression testing",
}
response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/tasks/{TASK_ID}/datasets",
json=payload,
headers=headers,
)
response.raise_for_status()
dataset = response.json()
dataset_id = dataset["id"]
print(f"Created dataset: {dataset_id}")

const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const TASK_ID = "your-task-id";
const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/tasks/${TASK_ID}/datasets`,
{
method: "POST",
headers: {
Authorization: `Bearer ${ARTHUR_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
name: "customer-support-qa-bench",
description: "Golden set of customer support Q&A pairs for regression testing",
}),
}
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const dataset = await response.json();
console.log("Created dataset:", dataset.id);

curl -X POST http://localhost:3030/api/v2/tasks/{task_id}/datasets \
-H "Authorization: Bearer $ARTHUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "customer-support-qa-bench",
"description": "Golden set of customer support Q&A pairs for regression testing"
}'

The response includes the dataset's id, which is the dataset_id you'll need for the next step:
{
"id": "ds_01hx9z3k2m4n5p6q7r8s9t0u",
"task_id": "your-task-id",
"name": "customer-support-qa-bench",
"description": "Golden set of customer support Q&A pairs for regression testing",
"created_at": 1717243200000,
"updated_at": 1717243200000,
"latest_version_number": null
}

Step 2 — Upload your first version
With the dataset_id in hand, upload your actual test cases as the first version. Each version is an immutable snapshot of the data at that point in time.
Rows use a column/value format: each row is an object with a data array of {column_name, column_value} pairs. You define your own column names — there are no reserved field names.
import requests
import os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
headers = {
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
}
payload = {
"rows_to_add": [
{
"data": [
{"column_name": "input", "column_value": "How do I reset my password?"},
{"column_name": "expected_output", "column_value": "Visit account settings and click 'Forgot Password'."},
{"column_name": "category", "column_value": "account"},
]
},
{
"data": [
{"column_name": "input", "column_value": "What is your refund policy?"},
{"column_name": "expected_output", "column_value": "We offer a 30-day money-back guarantee on all plans."},
{"column_name": "category", "column_value": "billing"},
]
},
]
}
response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions",
json=payload,
headers=headers,
)
response.raise_for_status()
version = response.json()
print(f"Created version: {version['version_number']}")

const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u";
const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}/versions`,
{
method: "POST",
headers: {
Authorization: `Bearer ${ARTHUR_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
rows_to_add: [
{
data: [
{ column_name: "input", column_value: "How do I reset my password?" },
{ column_name: "expected_output", column_value: "Visit account settings and click 'Forgot Password'." },
{ column_name: "category", column_value: "account" },
],
},
{
data: [
{ column_name: "input", column_value: "What is your refund policy?" },
{ column_name: "expected_output", column_value: "We offer a 30-day money-back guarantee on all plans." },
{ column_name: "category", column_value: "billing" },
],
},
],
}),
}
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const version = await response.json();
console.log("Created version:", version.version_number);

curl -X POST http://localhost:3030/api/v2/datasets/{dataset_id}/versions \
-H "Authorization: Bearer $ARTHUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"rows_to_add": [
{
"data": [
{"column_name": "input", "column_value": "How do I reset my password?"},
{"column_name": "expected_output", "column_value": "Visit account settings and click Forgot Password."},
{"column_name": "category", "column_value": "account"}
]
}
]
}'
Row schema
Each row is a list of {column_name, column_value} pairs. You define the column names, so use whatever makes sense for your eval (e.g. input, expected_output, context, category). All values are strings. Column names are inferred from the first version and carried forward.
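If your test cases already live in ordinary Python dictionaries, a small converter can produce the column/value row shape described above. The following is a minimal sketch; the helper names to_row and from_row are hypothetical and not part of any Arthur SDK, and values are cast with str() because the row schema states that all values are strings.

```python
from typing import Any, Dict, List


def to_row(record: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Convert a flat dict into the {"data": [{column_name, column_value}, ...]} shape."""
    return {
        "data": [
            {"column_name": name, "column_value": str(value)}
            for name, value in record.items()
        ]
    }


def from_row(row: Dict[str, List[Dict[str, str]]]) -> Dict[str, str]:
    """Inverse of to_row: flatten a row back into a column -> value dict."""
    return {cell["column_name"]: cell["column_value"] for cell in row["data"]}


# Hypothetical test cases held as plain dicts.
records = [
    {"input": "How do I reset my password?", "category": "account"},
    {"input": "What is your refund policy?", "category": "billing"},
]

# This list can be sent as the "rows_to_add" field when creating a version.
rows_to_add = [to_row(record) for record in records]
```

Keeping the conversion in one place means a column rename only has to happen in your source records, not in hand-written payloads.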
UI: Create a dataset
In the Arthur dashboard, open your task and navigate to Datasets in the left navigation. Click + New Dataset, enter a name and optional description, then click Create.
Once created, open the dataset and either:
- Import CSV — upload a CSV file; Arthur auto-detects the delimiter and maps columns
- Add rows manually — enter values cell by cell in the table editor
- Generate synthetic data — use the AI generation flow (see Generate Synthetic Data)
When you're done editing, click Save as New Version to commit the snapshot.
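If you would rather script a CSV import than use the dashboard, one option is to build the rows_to_add payload yourself and post it to the versions endpoint. The sketch below only covers the conversion step and uses Python's standard csv module; the function name csv_to_rows is hypothetical, it assumes a well-formed comma-delimited file with a header row, and it does not replicate the dashboard's delimiter auto-detection.

```python
import csv
import io
from typing import Dict, List


def csv_to_rows(csv_text: str) -> List[Dict[str, list]]:
    """Turn CSV text (with a header row) into a list of column/value row objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "data": [
                # Missing trailing fields come back as None; send them as empty strings.
                {"column_name": name, "column_value": value or ""}
                for name, value in record.items()
            ]
        }
        for record in reader
    ]


sample_csv = (
    "input,expected_output,category\n"
    "How do I reset my password?,Visit account settings.,account\n"
    "What is your refund policy?,We offer a 30-day guarantee.,billing\n"
)

# Payload for POST /api/v2/datasets/{dataset_id}/versions
payload = {"rows_to_add": csv_to_rows(sample_csv)}
```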
Version Your Datasets
Versioning is what makes eval runs reproducible. Every time you save a new snapshot of your test cases, the Engine assigns it an incrementing integer version number (1, 2, 3, …). Older versions are never modified or deleted when you add a new one.
How versioning works
flowchart TD
DS["Dataset: customer-support-qa-bench<br>dataset_id: ds_01hx..."]
DS --> V1["Version 1<br>50 rows — initial golden set<br>created: 2024-06-01"]
DS --> V2["Version 2<br>75 rows — added billing cases<br>created: 2024-07-15"]
DS --> V3["Version 3<br>75 rows — corrected 3 expected outputs<br>created: 2024-08-02"]
V1 --> R1["Eval Run A<br>model: gpt-4o-mini"]
V2 --> R2["Eval Run B<br>model: gpt-4o-mini"]
V3 --> R3["Eval Run C<br>model: gpt-4-turbo"]
Because each eval run records the exact dataset_id + version_number it used, you can always re-run any historical configuration and get the same inputs.
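On the client side, it can help to treat the dataset_id and version_number pair as a single pinned value that you store next to your eval configuration. The sketch below is one way to do that; the DatasetRef class is hypothetical and not an Engine type, and the URL it builds follows the version-fetch path used elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRef:
    """An immutable pointer to one exact dataset version."""

    dataset_id: str
    version_number: int

    def version_url(self, base_url: str = "http://localhost:3030") -> str:
        """Build the URL for fetching this exact version's rows."""
        return (
            f"{base_url.rstrip('/')}/api/v2/datasets/"
            f"{self.dataset_id}/versions/{self.version_number}"
        )


# Pin a run to version 2 of the example dataset.
ref = DatasetRef("ds_01hx9z3k2m4n5p6q7r8s9t0u", 2)
print(ref.version_url())
```

Because the dataclass is frozen, a pinned reference cannot be changed by accident after it has been recorded.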
UI: Browse version history
From the dataset detail view (Evals → Dataset → click a dataset), click the Versions button in the top-right header. This opens a side drawer listing all versions with timestamps. Click any version to switch to it — the row table updates to show that version's contents. The current version number is reflected in the URL as ?version=<number>.
Add a new version
New versions support incremental updates: add rows, delete rows by ID or by filter, and update existing rows — all in a single call. The Engine creates a new immutable snapshot reflecting the result.
import requests, os, json
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
with open("new_cases.json") as f:
    new_rows = json.load(f)  # list of {"data": [{column_name, column_value}, ...]}
response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions",
json={
"rows_to_add": new_rows,
"rows_to_delete": ["row-id-to-remove"], # optional
"rows_to_delete_filter": [ # optional: delete by column value
{"column_name": "category", "column_value": "deprecated"}
],
},
headers={
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
},
)
response.raise_for_status()
version = response.json()
print(f"New version: {version['version_number']}")

const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u";
const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}/versions`,
{
method: "POST",
headers: {
Authorization: `Bearer ${ARTHUR_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
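rows_to_add: newRows, // newRows: an array of { data: [{ column_name, column_value }, ...] } objects defined by the caller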
rows_to_delete: ["row-id-to-remove"],
}),
}
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const version = await response.json();
console.log("New version:", version.version_number);

curl -X POST http://localhost:3030/api/v2/datasets/{dataset_id}/versions \
-H "Authorization: Bearer $ARTHUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"rows_to_add": [...],
"rows_to_delete": ["row-id-to-remove"]
}'

List all versions of a dataset
import requests, os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
response = requests.get(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions",
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)
response.raise_for_status()
for v in response.json()["versions"]:
    print(f"v{v['version_number']} — {v['column_names']} — created {v['created_at']}")

const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}/versions`,
{ headers: { Authorization: `Bearer ${ARTHUR_API_KEY}` } }
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const { versions } = await response.json();
versions.forEach((v) =>
console.log(`v${v.version_number} — created ${v.created_at}`)
);

curl "http://localhost:3030/api/v2/datasets/{dataset_id}/versions" \
-H "Authorization: Bearer $ARTHUR_API_KEY"
Versions are immutable
Once a version is created, its rows cannot be edited. To correct a mistake, create a new version with the corrected rows and update your eval runs to reference the new version number.
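After you publish a corrected version, it is often useful to confirm exactly what changed relative to the previous one. The sketch below compares the rows of two versions as they appear in the version-fetch response (a rows list whose items carry a data array); the helper names row_key and diff_versions are hypothetical, and rows are compared by content only, so a corrected row shows up as one removal plus one addition.

```python
from typing import Dict, List, Tuple

RowKey = Tuple[Tuple[str, str], ...]


def row_key(row: dict) -> RowKey:
    """Order-independent fingerprint of a row's column/value pairs."""
    return tuple(sorted((c["column_name"], c["column_value"]) for c in row["data"]))


def diff_versions(old_rows: List[dict], new_rows: List[dict]) -> Dict[str, List[RowKey]]:
    """Return the row fingerprints added to and removed from the newer version."""
    old_keys = {row_key(r) for r in old_rows}
    new_keys = {row_key(r) for r in new_rows}
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
    }


# Illustrative rows standing in for two fetched versions.
v1_rows = [{"data": [{"column_name": "input", "column_value": "refund?"},
                     {"column_name": "expected_output", "column_value": "14 days"}]}]
v2_rows = [{"data": [{"column_name": "input", "column_value": "refund?"},
                     {"column_name": "expected_output", "column_value": "30 days"}]}]

changes = diff_versions(v1_rows, v2_rows)
print(f"{len(changes['added'])} added, {len(changes['removed'])} removed")
```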
Browse and Inspect
Once you have datasets and versions, you'll want to list them, inspect their contents, and confirm the right data is in place before kicking off an eval run.
UI: Browse datasets
Navigate to Evals → Dataset in the left sidebar. The list view shows all datasets for the current task with name, latest version number, and last updated time. Use the search bar to filter by name.
Search datasets for a task
import requests, os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
TASK_ID = "your-task-id"
response = requests.get(
f"{ARTHUR_BASE_URL}/api/v2/tasks/{TASK_ID}/datasets/search",
params={"page": 0, "page_size": 20},
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)
response.raise_for_status()
for ds in response.json()["datasets"]:
    print(f"{ds['name']} — id: {ds['id']} — latest version: {ds['latest_version_number']}")

const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/tasks/${TASK_ID}/datasets/search?page=0&page_size=20`,
{ headers: { Authorization: `Bearer ${ARTHUR_API_KEY}` } }
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const { datasets } = await response.json();
datasets.forEach((ds) =>
console.log(`${ds.name} — id: ${ds.id} — latest version: ${ds.latest_version_number}`)
);

curl "http://localhost:3030/api/v2/tasks/{task_id}/datasets/search?page=0&page_size=20" \
-H "Authorization: Bearer $ARTHUR_API_KEY"

You can filter by name substring or specific IDs:
# Filter by name
response = requests.get(
f"{ARTHUR_BASE_URL}/api/v2/tasks/{TASK_ID}/datasets/search",
params={"name": "qa-bench", "sort_by": "updated_at", "sort_order": "desc"},
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)

Fetch a specific version's rows
Use GET /api/v2/datasets/{dataset_id}/versions/{version_number} to retrieve the contents of a specific version, passing the integer version number in the path. If you don't know the latest version number, call the versions list endpoint with the latest_version_only=true query parameter to find the most recent snapshot.
import requests, os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
VERSION = 1 # integer version number
response = requests.get(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions/{VERSION}",
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)
response.raise_for_status()
data = response.json()
print(f"Version {data['version_number']} — columns: {data['column_names']}")
for row in data["rows"][:3]:
    print(row["data"])

const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}/versions/${VERSION}`,
{ headers: { Authorization: `Bearer ${ARTHUR_API_KEY}` } }
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const data = await response.json();
console.log(`Version ${data.version_number} — columns: ${data.column_names}`);
data.rows.slice(0, 3).forEach((row) => console.log(row.data));

curl "http://localhost:3030/api/v2/datasets/{dataset_id}/versions/{version_number}" \
-H "Authorization: Bearer $ARTHUR_API_KEY"

To get the latest version without knowing its number:
response = requests.get(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions",
params={"latest_version_only": "true"},
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)
response.raise_for_status()
latest = response.json()["versions"][0]
print(f"Latest is v{latest['version_number']}")
Tip: Pin versions in production
During active development, fetching with latest_version_only=true means you always test against the most current dataset. Once a benchmark is stable, pin to a specific version number so results remain comparable across runs.
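The choice between tracking the latest version and pinning a fixed one can be made explicit in a small helper. The sketch below works on the versions list returned by the list endpoint (items with a version_number field); the function name resolve_version is hypothetical, and failing loudly on a missing pinned version is a design choice here, not documented Engine behavior.

```python
from typing import List, Optional


def resolve_version(versions: List[dict], pinned: Optional[int] = None) -> int:
    """Return the pinned version if given, otherwise the highest version number."""
    available = {v["version_number"] for v in versions}
    if not available:
        raise ValueError("dataset has no versions yet")
    if pinned is None:
        # Development mode: follow whatever the newest snapshot is.
        return max(available)
    if pinned not in available:
        raise ValueError(f"pinned version {pinned} does not exist")
    # Benchmark mode: always use the same snapshot.
    return pinned


# Illustrative list response content.
versions = [{"version_number": 1}, {"version_number": 3}, {"version_number": 2}]

print(resolve_version(versions))            # follows the latest version
print(resolve_version(versions, pinned=2))  # stays on the pinned version
```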
Generate Synthetic Data
If your team doesn't have labeled examples yet, Arthur can generate synthetic test cases using an LLM.
UI: Generate synthetic data
From the dataset detail view (Evals → Dataset → click a dataset), click the Generate button in the header. This opens a two-phase modal:
- Configure — describe the dataset's purpose, define each column, set the number of rows (max 25), and choose a model
- Canvas — a chat interface where you review the generated rows and send follow-up messages to refine them (add more rows, adjust outputs, change categories, etc.)
When you're satisfied, confirm to add the rows to the current dataset, then save as a new version to commit them.
Synthetic generation is tied to an existing dataset version: you describe the dataset's purpose and columns, and Arthur generates rows that you can review before committing them as a new version. Generation is conversational, so you can send follow-up messages to refine, add, or remove rows before saving.
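Because a single generation request is capped at 25 rows, a larger target has to be split across several requests. The sketch below only computes the per-request num_rows values; the function name batch_sizes is hypothetical, and it assumes that issuing several independent generation requests is acceptable for your use case.

```python
from typing import List

MAX_ROWS_PER_REQUEST = 25  # per-request cap stated on this page


def batch_sizes(total_rows: int, max_per_request: int = MAX_ROWS_PER_REQUEST) -> List[int]:
    """Split a target row count into per-request num_rows values."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    full_batches, remainder = divmod(total_rows, max_per_request)
    return [max_per_request] * full_batches + ([remainder] if remainder else [])


# 60 rows are requested as two full batches and one partial batch.
print(batch_sizes(60))
```

Each value in the returned list would be used as num_rows in its own generation request.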
Start a generation session
import requests, os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
VERSION = 1 # version to base generation on
payload = {
"dataset_purpose": "Customer support Q&A pairs for testing an AI assistant's ability to handle account and billing questions",
"column_descriptions": [
{"column_name": "input", "description": "A customer question about account management or billing"},
{"column_name": "expected_output", "description": "The correct, concise answer an agent should give"},
{"column_name": "category", "description": "One of: account, billing, cancellation"},
],
"num_rows": 10, # max 25 per request
"model_provider": "openai",
"model_name": "gpt-4o",
}
response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions/{VERSION}/generate-synthetic",
json=payload,
headers={
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
},
)
response.raise_for_status()
result = response.json()
print(f"Generated {len(result['rows'])} rows")
print(result["assistant_message"]["content"])

curl -X POST \
"http://localhost:3030/api/v2/datasets/{dataset_id}/versions/{version_number}/generate-synthetic" \
-H "Authorization: Bearer $ARTHUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"dataset_purpose": "Customer support Q&A pairs for testing an AI assistant",
"column_descriptions": [
{"column_name": "input", "description": "A customer question"},
{"column_name": "expected_output", "description": "The correct answer"}
],
"num_rows": 10,
"model_provider": "openai",
"model_name": "gpt-4o"
}'

Refine with follow-up messages
# Continue the conversation to refine the generated rows
refine_response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions/{VERSION}/generate-synthetic/message",
json={
"message": "Add 5 more rows focused on password reset edge cases, and make the expected outputs more concise",
"current_rows": result["rows"],
"conversation_history": [result["assistant_message"]],
"model_provider": "openai",
"model_name": "gpt-4o",
},
headers={
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
},
)
refine_response.raise_for_status()
refined = refine_response.json()
print(f"Rows added: {len(refined['rows_added'])}, modified: {len(refined['rows_modified'])}")

Promote synthetic rows into a version
The generation endpoints return rows but do not automatically create a dataset version. Review the output, then commit it:
curated_rows = refined["rows"] # inspect and filter as needed
version_response = requests.post(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}/versions",
json={"rows_to_add": curated_rows},
headers={
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
},
)
version_response.raise_for_status()
print(f"Saved as version {version_response.json()['version_number']}")
Review synthetic data before using it in benchmarks
Synthetic examples can contain factual errors or unrealistic edge cases. Always review a sample before promoting synthetic rows into a dataset version you'll use for official benchmarks.
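Part of that review can be automated before a human looks at the rows. The sketch below drops rows with blank required columns and exact duplicates; the function name curate_rows is hypothetical, and it assumes generated rows use the same data array of column/value pairs as the rows you upload. It does not check factual correctness, which still needs manual review.

```python
from typing import Iterable, List


def curate_rows(rows: List[dict], required_columns: Iterable[str]) -> List[dict]:
    """Keep rows whose required columns are non-blank, dropping exact duplicates."""
    required = list(required_columns)
    seen = set()
    kept = []
    for row in rows:
        values = {c["column_name"]: c["column_value"] for c in row["data"]}
        # Skip rows where any required column is missing or only whitespace.
        if any(not str(values.get(col) or "").strip() for col in required):
            continue
        fingerprint = tuple(sorted(values.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(row)
    return kept


# Illustrative generated rows: one good, one duplicate, one with a blank answer.
generated = [
    {"data": [{"column_name": "input", "column_value": "Reset password?"},
              {"column_name": "expected_output", "column_value": "Use Forgot Password."}]},
    {"data": [{"column_name": "input", "column_value": "Reset password?"},
              {"column_name": "expected_output", "column_value": "Use Forgot Password."}]},
    {"data": [{"column_name": "input", "column_value": "Cancel plan?"},
              {"column_name": "expected_output", "column_value": "  "}]},
]

curated = curate_rows(generated, required_columns=["input", "expected_output"])
print(f"Kept {len(curated)} of {len(generated)} rows")
```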
Update and Delete
UI: Update or delete a dataset
From the Evals → Dataset list view, each row has action buttons:
- Edit icon — opens the edit modal to update the dataset name or description
- Delete icon — shows a confirmation dialog, then permanently deletes the dataset and all its versions
Update dataset metadata
You can update a dataset's name, description, or metadata at any time without affecting its versions or any eval runs that reference it by ID.
import requests, os
ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
DATASET_ID = "ds_01hx9z3k2m4n5p6q7r8s9t0u"
response = requests.patch(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}",
json={"description": "Golden set v2 — expanded to 75 cases including billing edge cases"},
headers={
"Authorization": f"Bearer {ARTHUR_API_KEY}",
"Content-Type": "application/json",
},
)
response.raise_for_status()
print("Dataset updated:", response.json())

const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}`,
{
method: "PATCH",
headers: {
Authorization: `Bearer ${ARTHUR_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
description: "Golden set v2 — expanded to 75 cases including billing edge cases",
}),
}
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
console.log("Dataset updated:", await response.json());

curl -X PATCH http://localhost:3030/api/v2/datasets/{dataset_id} \
-H "Authorization: Bearer $ARTHUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"description": "Golden set v2 — expanded to 75 cases including billing edge cases"}'

Delete a dataset
Deleting a dataset removes it and all its versions permanently. Eval runs that previously referenced this dataset will retain their recorded results, but you will no longer be able to re-run them against the original data.
response = requests.delete(
f"{ARTHUR_BASE_URL}/api/v2/datasets/{DATASET_ID}",
headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
)
response.raise_for_status()
print("Dataset deleted.")

const response = await fetch(
`${ARTHUR_BASE_URL}/api/v2/datasets/${DATASET_ID}`,
{
method: "DELETE",
headers: { Authorization: `Bearer ${ARTHUR_API_KEY}` },
}
);
if (!response.ok) throw new Error(`HTTP ${response.status}`);
console.log("Dataset deleted.");

curl -X DELETE "http://localhost:3030/api/v2/datasets/{dataset_id}" \
-H "Authorization: Bearer $ARTHUR_API_KEY"
Deletion is permanent
There is no soft-delete or recycle bin. If you need to retire a dataset without losing access to its data, consider updating its name to include a [DEPRECATED] prefix instead of deleting it.
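The rename suggested above can be expressed as a small payload builder for the metadata update call. The sketch below only constructs the request body; the function name deprecation_patch is hypothetical, and it assumes the name field is accepted by PATCH /api/v2/datasets/{dataset_id}, as the update section describes.

```python
DEPRECATED_PREFIX = "[DEPRECATED]"


def deprecation_patch(current_name: str) -> dict:
    """Build a PATCH body that marks a dataset as deprecated by renaming it."""
    if current_name.startswith(DEPRECATED_PREFIX):
        # Already marked; keep the name unchanged so repeated calls are harmless.
        return {"name": current_name}
    return {"name": f"{DEPRECATED_PREFIX} {current_name}"}


print(deprecation_patch("customer-support-qa-bench"))
```

The returned dict can be passed as the JSON body of the same PATCH request used to update a description.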
Next Steps
Now that you have a named, versioned dataset, you're ready to put it to work:
| What to do next | Where to go |
|---|---|
| Run an evaluation against your dataset | Evaluation Runs |
| Configure scorers to grade your model's outputs | Scorers |
| Set up automated eval pipelines in CI | CI/CD Integration |
| Manage tasks that datasets are scoped to | Tasks |
| Learn about platform-side datasets for model monitoring | Datasets (Platform) |
Summary of what you did on this page:
- Registered a named dataset under a task with POST /api/v2/tasks/{task_id}/datasets
- Uploaded test cases as version 1 with POST /api/v2/datasets/{dataset_id}/versions
- Learned how to add new versions with incremental row updates
- Searched and inspected dataset contents via GET /api/v2/tasks/{task_id}/datasets/search
- Generated synthetic test cases using the conversational generation API
- Updated dataset metadata and deleted datasets when needed