RAG
Overview
To evaluate your RAG pipeline in Arthur, you connect a vector database, define search settings, and run an experiment that applies those settings to a test dataset — scoring each result with the LLM evaluators you choose. Arthur runs the retrieval and evaluation steps for every row in your dataset, so you can compare different search configurations (keyword vs. vector vs. hybrid, different top-k values, different collections) against the same questions.
RAG systems fail in two distinct, often invisible ways:
| Failure Mode | What Goes Wrong | Symptom |
|---|---|---|
| Retrieval quality | The wrong chunks are fetched — the retrieved context is irrelevant, incomplete, or noisy | The model answers confidently but from the wrong source material |
| Generation faithfulness | The right chunks are fetched but the model ignores or contradicts them | The answer sounds plausible but isn't grounded in what was retrieved |
Arthur evaluates both dimensions by letting you attach any LLM evaluator to an experiment — including built-in templates like Context Precision, Context Recall, and Answer Relevance — so you can pinpoint exactly where your pipeline breaks down.
flowchart LR
    Q[User Query] --> R[Retriever]
    R --> C[Retrieved Chunks]
    C --> G[Generator / LLM]
    G --> A[Answer]
    C --> RE[Retrieval Evaluators]
    RE --> RS["Context Precision<br>Context Recall"]
    A --> FE[Faithfulness Evaluators]
    C --> FE
    FE --> FS["Answer Relevance<br>Custom Evals"]
    RS --> EX[Experiment Results]
    FS --> EX
How RAG Evaluation Works
An experiment ties together three things:
- RAG configurations — one or more search setups (provider, collection, search type, parameters) to test
- A dataset — rows of test queries, with optional ground truth or expected outputs
- Evaluators — LLM-as-judge evaluators that score each retrieved result
For every row in your dataset, Arthur runs the retrieval step for each configuration, then scores the output with each evaluator. Results are grouped by configuration so you can compare them directly.
flowchart TD
    A[RAG Experiment] --> B[Config A<br>vector search top-5]
    A --> C[Config B<br>hybrid search top-10]
    B --> D[Run retrieval<br>for each test row]
    C --> E[Run retrieval<br>for each test row]
    D --> F[Score with<br>selected evaluators]
    E --> F
    F --> G[Results per config<br>pass/fail counts per eval]
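The flow above amounts to a nested loop over rows, configurations, and evaluators. The sketch below only illustrates that structure and is not Arthur's implementation: `run_experiment`, `toy_retrieve`, and `has_results` are hypothetical stand-ins, and the `question` column name is an assumption.

```python
from collections import defaultdict


def run_experiment(rows, configs, evaluators, retrieve):
    """Run every config on every row, then score each result with every evaluator.

    rows:       list of dicts with a "question" key (assumed column name)
    configs:    dict of config name -> settings dict
    evaluators: dict of evaluator name -> callable(row, chunks) -> bool
    retrieve:   callable(settings, query) -> list of chunks (stand-in for the vector DB)
    """
    counts = defaultdict(lambda: defaultdict(lambda: {"passed": 0, "failed": 0}))
    for row in rows:
        for config_name, settings in configs.items():
            chunks = retrieve(settings, row["question"])
            for eval_name, evaluate in evaluators.items():
                outcome = "passed" if evaluate(row, chunks) else "failed"
                counts[config_name][eval_name][outcome] += 1
    # Convert nested defaultdicts to plain dicts, grouped by configuration
    return {config: dict(per_eval) for config, per_eval in counts.items()}


# Toy stand-ins so the sketch runs without a vector database or an LLM judge
corpus = [
    "reset your password in settings",
    "billing runs monthly",
    "contact support by email",
]


def toy_retrieve(settings, query):
    words = query.lower().split()
    hits = [doc for doc in corpus if any(word in doc for word in words)]
    return hits[: settings["limit"]]


def has_results(row, chunks):
    return len(chunks) > 0


summary = run_experiment(
    rows=[{"question": "password reset"}, {"question": "refund policy"}],
    configs={"top1": {"limit": 1}, "top3": {"limit": 3}},
    evaluators={"Has Results": has_results},
    retrieve=toy_retrieve,
)
print(summary)
```

Each configuration sees the same rows, which is what makes the per-configuration pass/fail counts comparable.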
Evaluators for RAG — use the built-in LLM templates from the evaluator library:
- Context Precision — were the retrieved chunks actually relevant?
- Context Recall — did the retrieved chunks cover the necessary information?
- Answer Relevance — does the answer address the question?
- Or any custom evaluator you've defined
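For intuition about what the two retrieval evaluators ask, here is a set-based analogue of precision and recall. This is an assumption-laden simplification: Arthur's built-in evaluators are LLM judges, not set comparisons, and the function names and chunk IDs below are hypothetical.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c3", "c9"}          # what a perfect retriever would return

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(precision, recall)
```

A noisy retriever lowers precision, and an incomplete one lowers recall, which matches the two questions in the list above.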
Prerequisites
- An Arthur Engine instance running and reachable (default: http://localhost:3030)
- An API key, set as ARTHUR_API_KEY in your environment
- A Weaviate vector database instance (currently the only supported provider) with:
  - Host URL
  - API key
  - At least one populated collection
- A test dataset already created in Arthur (see Datasets) with at minimum a column containing your test queries
- Evaluators configured for your task (see LLM Evaluators)
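A quick way to confirm the first two prerequisites before calling the API is to load them in one place and fail early. This is a minimal sketch: `load_arthur_config` is a hypothetical helper, and `ARTHUR_BASE_URL` is the optional environment variable that the API examples in this guide read.

```python
import os

DEFAULT_BASE_URL = "http://localhost:3030"


def load_arthur_config(env=None):
    """Read connection settings from the environment and build request headers."""
    env = os.environ if env is None else env
    api_key = env.get("ARTHUR_API_KEY")
    if not api_key:
        raise RuntimeError("ARTHUR_API_KEY is not set")
    # Strip a trailing slash so URLs can be built with f"{base_url}/api/v1/..."
    base_url = env.get("ARTHUR_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return {
        "base_url": base_url,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    }


# Passing a dict instead of os.environ makes the helper easy to try out
config = load_arthur_config({"ARTHUR_API_KEY": "example-key"})
print(config["base_url"])
```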
Step 1 — Connect a RAG Provider
A RAG provider is a connection to your vector database. You create it once per task and reuse it across experiments.
UI
Navigate to RAG → RAG Configurations in the left sidebar. Click + Configuration.
Fill in:
- Name: a label for this connection (e.g., prod-weaviate)
- Host URL: your Weaviate instance URL (with or without https://)
- API key: your Weaviate API key
Optionally click Test Connection to verify the credentials before saving.
API
import os

import requests

ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
TASK_ID = "your-task-id"

response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_providers",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "prod-weaviate",
        "description": "Production Weaviate cluster",
        "authentication_config": {
            "authentication_method": "api_key_authentication",
            "rag_provider": "weaviate",
            "host_url": "https://my-cluster.weaviate.network",
            "api_key": "your-weaviate-api-key",
        },
    },
)
response.raise_for_status()
provider_id = response.json()["id"]
print(f"Created provider: {provider_id}")

const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const TASK_ID = "your-task-id";

const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_providers`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "prod-weaviate",
      authentication_config: {
        authentication_method: "api_key_authentication",
        rag_provider: "weaviate",
        host_url: "https://my-cluster.weaviate.network",
        api_key: "your-weaviate-api-key",
      },
    }),
  }
);
const provider = await response.json();
console.log("Created provider:", provider.id);

curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_providers \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-weaviate",
    "authentication_config": {
      "authentication_method": "api_key_authentication",
      "rag_provider": "weaviate",
      "host_url": "https://my-cluster.weaviate.network",
      "api_key": "your-weaviate-api-key"
    }
  }'

Step 2 — Create Search Settings
Search settings define how Arthur queries your vector database: which collection to search, which search method to use, and the parameters for that method. Each saved configuration is versioned — you can update it and old versions remain unchanged.
Arthur supports three search methods:
| Method | search_kind | Best for |
|---|---|---|
| Vector similarity | vector_similarity_text_search | Semantic similarity, embedding-based retrieval |
| Keyword (BM25) | keyword_search | Exact term matching, structured queries |
| Hybrid | hybrid_search | Blend of vector + keyword (most flexible) |
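Hybrid search blends the other two methods, and its `alpha` parameter (1.0 = pure vector, 0.0 = pure keyword) sets the mix. The sketch below illustrates the idea with a simplified score blend: each method's scores are min-max normalized and combined as alpha * vector + (1 - alpha) * keyword. It is an intuition aid under those assumptions, not the fusion code that Weaviate or Arthur runs; the function names, document IDs, and scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a dict of doc -> score into the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def blend(vector_scores, keyword_scores, alpha):
    """Rank documents by alpha * vector + (1 - alpha) * keyword (normalized)."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    # Highest fused score first; ties broken by document ID for stable output
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}     # similarity scores
keyword_scores = {"b": 4.0, "c": 12.0, "d": 2.0}   # BM25-style scores

print(blend(vector_scores, keyword_scores, alpha=1.0))  # vector ranking wins
print(blend(vector_scores, keyword_scores, alpha=0.0))  # keyword ranking wins
print(blend(vector_scores, keyword_scores, alpha=0.7))  # weighted mix
```

Sweeping `alpha` in separate search configurations and comparing them in one experiment is a practical way to pick a value for your data.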
UI
In RAG → RAG Experiments, click Create RAG Configuration. Select your provider, choose a collection from the dropdown (auto-loaded from Weaviate), select a search method, and configure the parameters.
API
# Vector similarity search configuration
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-vector-top5",
        "description": "Vector search over support docs collection, top 5 results",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "vector_similarity_text_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 5,
            "certainty": 0.7,  # min similarity score (0–1)
            "return_properties": ["text", "source", "title"],
            "return_metadata": ["distance", "certainty", "score"],
        },
    },
)
response.raise_for_status()
settings_id = response.json()["id"]
settings_version = response.json()["latest_version_number"]
vector_settings_id = settings_id  # same ID, under the name used in the Step 5 example

# Hybrid search configuration
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-hybrid-top10",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "hybrid_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 10,
            "alpha": 0.7,  # 1.0 = pure vector, 0.0 = pure keyword
            "return_properties": ["text", "source"],
        },
    },
)
response.raise_for_status()
hybrid_settings_id = response.json()["id"]  # used in the Step 5 example

curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_search_settings \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-docs-vector-top5",
    "rag_provider_id": "your-provider-id",
    "settings": {
      "search_kind": "vector_similarity_text_search",
      "rag_provider": "weaviate",
      "collection_name": "SupportDocs",
      "limit": 5,
      "certainty": 0.7,
      "return_properties": ["text", "source"]
    }
  }'

Search settings parameters
Vector similarity (vector_similarity_text_search):
| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| certainty | float (0–1) | Min similarity threshold (mutually exclusive with distance) |
| distance | float | Max distance threshold (mutually exclusive with certainty) |
| return_properties | string[] | Which object properties to return |
| return_metadata | string[] | Metadata to return (distance, certainty, score, etc.) |
| offset | int | Skip first N results |
| include_vector | bool | Include embedding vectors in response |
Keyword / BM25 (keyword_search):
| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| and_operator | bool | All tokens must match (mutually exclusive with minimum_match_or_operator) |
| minimum_match_or_operator | int | Minimum number of tokens that must match |
Hybrid (hybrid_search):
| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| alpha | float (0–1) | Balance: 1.0 = pure vector, 0.0 = pure keyword. Default: 0.7 |
| query_properties | string[] | Apply keyword search to a subset of properties |
| fusion_type | string | Fusion algorithm (default: Relative Score Fusion) |
| max_vector_distance | float | Max threshold for the vector component |
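Several of the parameters above constrain each other, so it can help to check a settings payload locally before sending it. The following is a minimal client-side sketch based only on the constraints stated in these tables; `validate_search_settings` is a hypothetical helper, and Arthur's own server-side validation may be stricter or behave differently.

```python
def validate_search_settings(settings):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    kind = settings.get("search_kind")
    if kind == "vector_similarity_text_search":
        if "certainty" in settings and "distance" in settings:
            problems.append("certainty and distance are mutually exclusive")
        certainty = settings.get("certainty")
        if certainty is not None and not 0 <= certainty <= 1:
            problems.append("certainty must be between 0 and 1")
    elif kind == "keyword_search":
        # Interpretation: a conflict exists when and_operator is enabled
        # and a minimum match count is also supplied.
        if settings.get("and_operator") and settings.get("minimum_match_or_operator") is not None:
            problems.append("and_operator and minimum_match_or_operator are mutually exclusive")
    elif kind == "hybrid_search":
        alpha = settings.get("alpha")
        if alpha is not None and not 0 <= alpha <= 1:
            problems.append("alpha must be between 0 and 1")
    else:
        problems.append(f"unknown search_kind: {kind!r}")
    return problems


conflicting = {
    "search_kind": "vector_similarity_text_search",
    "collection_name": "SupportDocs",
    "limit": 5,
    "certainty": 0.7,
    "distance": 0.3,
}
print(validate_search_settings(conflicting))
```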
Step 3 — Test Retrieval with RAG Search Panels
Before running a full experiment, use the RAG Search Panels to interactively test your search settings against real queries. This lets you verify that the right chunks are being retrieved before committing to a full dataset run.
UI
Navigate to RAG → RAG Experiments. The page shows up to 5 search panels side by side. Each panel lets you:
- Select a provider and collection
- Choose a search method and configure its parameters
- Enter a query and click Run — results appear immediately with metadata (distance, certainty, score)
- Optionally save the panel configuration as a named Search Settings config
Run the same query across multiple panels simultaneously with Run All to compare how different configurations retrieve for the same input.
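When comparing panels, one simple number to look at is how much two result sets overlap. The sketch below computes a Jaccard overlap over chunk identifiers; it is only an illustration, and the helper name and IDs are hypothetical (the panels themselves do not report this value).

```python
def jaccard_overlap(results_a, results_b):
    """Share of distinct chunks that both result lists have in common (0..1)."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0  # two empty result sets are trivially identical
    return len(a & b) / len(a | b)


vector_hits = ["doc-1", "doc-2", "doc-3"]
hybrid_hits = ["doc-2", "doc-3", "doc-4", "doc-5"]

overlap = jaccard_overlap(vector_hits, hybrid_hits)
only_vector = sorted(set(vector_hits) - set(hybrid_hits))
only_hybrid = sorted(set(hybrid_hits) - set(vector_hits))
print(overlap, only_vector, only_hybrid)
```

A low overlap means the configurations surface different material, so the chunks unique to each side are the ones worth reading before you commit to a full experiment.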
Step 4 — Create a RAG Notebook (Optional)
A RAG Notebook is a saved, reusable experiment template. It stores your RAG configuration choices, dataset selection, and evaluator assignments so you can re-run the same experiment setup without reconfiguring from scratch each time.
Notebooks are optional — you can run experiments directly without one — but they're useful for recurring evaluation setups like nightly regression runs or benchmarks you revisit after each pipeline change.
UI
Navigate to RAG → RAG Notebooks and click Create Notebook. Give it a name, then open it to configure:
- Which RAG search configurations to test
- Which dataset and version to use
- Which evaluators to run and how to map their variables
A notebook's configuration can be partially filled in and saved at any time — it only needs to be complete when you click Run.
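If you build notebook state programmatically, a small check can tell you what is still missing before a run. This sketch assumes that a complete state needs the three parts listed above, using the key names from this guide's API example (`rag_configs`, `dataset_ref`, `eval_list`); the helper itself is hypothetical and the server may apply additional rules.

```python
# Key names taken from the notebook state example in this guide (an assumption
# about what "complete" means, not a documented schema).
REQUIRED_STATE_KEYS = ("rag_configs", "dataset_ref", "eval_list")


def missing_state_parts(state):
    """Return the required keys that are absent or empty in a notebook state dict."""
    return [key for key in REQUIRED_STATE_KEYS if not state.get(key)]


draft_state = {
    "rag_configs": [
        {"type": "saved", "setting_configuration_id": "your-settings-id", "version": 1}
    ],
    "eval_list": [],
}

print(missing_state_parts(draft_state))
```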
API
# Create a notebook
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_notebooks",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "Support docs weekly benchmark",
        "description": "Runs every Monday against the golden Q&A dataset",
    },
)
notebook_id = response.json()["id"]

# Save experiment state to the notebook
requests.put(
    f"{ARTHUR_BASE_URL}/api/v1/rag_notebooks/{notebook_id}/state",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "state": {
            "rag_configs": [
                {
                    "type": "saved",
                    "setting_configuration_id": settings_id,
                    "version": settings_version,
                    "query_column": {"dataset_column": "question"},
                }
            ],
            "dataset_ref": {
                "id": "your-dataset-id",
                "name": "support-qa-golden",
                "version": 1,
            },
            "eval_list": [
                {"name": "Context Precision", "version": 1},
                {"name": "Answer Relevance", "version": 1},
            ],
        }
    },
)

Step 5 — Run a RAG Experiment
An experiment applies your RAG configurations to every row in your dataset, runs the selected evaluators, and stores per-row and aggregate results.
UI
From RAG → RAG Experiments, click Create Experiment. The creation flow has these steps:
- Name and description
- Dataset — select a dataset and version; choose which column contains the queries
- RAG configurations — select saved configurations or define inline ones; you can run multiple configurations in a single experiment to compare them
- Evaluators — select which evaluators to run and map their variables to dataset columns or RAG output fields
- Review and run
API
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_experiments",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "vector-vs-hybrid-comparison",
        "description": "Compare top-5 vector search against top-10 hybrid",
        "dataset_ref": {
            "id": "your-dataset-id",
            "name": "support-qa-golden",
            "version": 1,
        },
        "rag_configs": [
            {
                "type": "saved",
                "setting_configuration_id": vector_settings_id,
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
            {
                "type": "saved",
                "setting_configuration_id": hybrid_settings_id,
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
        ],
        "eval_list": [
            {"name": "Context Precision", "version": 1},
            {"name": "Answer Relevance", "version": 1},
        ],
    },
)
response.raise_for_status()
experiment = response.json()
experiment_id = experiment["id"]
print(f"Experiment started: {experiment_id} — status: {experiment['status']}")

const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_experiments`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "vector-vs-hybrid-comparison",
      dataset_ref: {
        id: "your-dataset-id",
        name: "support-qa-golden",
        version: 1,
      },
      rag_configs: [
        {
          type: "saved",
          setting_configuration_id: vectorSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
        {
          type: "saved",
          setting_configuration_id: hybridSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
      ],
      eval_list: [
        { name: "Context Precision", version: 1 },
        { name: "Answer Relevance", version: 1 },
      ],
    }),
  }
);
const experiment = await response.json();
console.log(`Experiment started: ${experiment.id} — status: ${experiment.status}`);

curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_experiments \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vector-vs-hybrid-comparison",
    "dataset_ref": {"id": "your-dataset-id", "name": "support-qa-golden", "version": 1},
    "rag_configs": [
      {
        "type": "saved",
        "setting_configuration_id": "vector-settings-id",
        "version": 1,
        "query_column": {"dataset_column": "question"}
      }
    ],
    "eval_list": [
      {"name": "Context Precision", "version": 1}
    ]
  }'

Poll for completion
import time

# Poll every 5 seconds until the experiment reaches a terminal status
while True:
    resp = requests.get(
        f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}",
        headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    )
    resp.raise_for_status()
    status = resp.json()["status"]
    print(f"Status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(5)

result = resp.json()

Experiment status values: queued → running → completed or failed.
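A polling loop with no upper bound will wait forever if an experiment never leaves queued or running. The sketch below adds a timeout around the same status check. `wait_for_experiment` is a hypothetical helper, not part of any Arthur client, and the default timeout and interval are arbitrary; the status callable is injected so the logic can be tried without a running server.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_experiment(fetch_status, timeout_s=600.0, interval_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated status sequence standing in for repeated GET requests
statuses = iter(["queued", "running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)
```

In real use, `fetch_status` would wrap the GET request on the experiment endpoint and return the `status` field of the response.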
View Results
UI
From RAG → RAG Experiments, click your experiment.
Once a RAG experiment completes, the results page shows a summary of how each configuration performed across all test cases. At a glance you can see the overall pass rate per evaluator, which configurations passed or failed each test case, and the cost per row. The experiment header shows total duration, number of test cases run, and the dataset used — making it easy to compare runs over time.
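The per-evaluator pass rate can also be recomputed from the API response. This is a minimal sketch, assuming the `summary_results` shape that the completed experiment response uses in this guide. The `pass_rates` helper is hypothetical, and leaving errored rows out of the denominator is a design choice made here, not necessarily how the results page calculates it.

```python
def pass_rates(experiment):
    """Map rag_config_key -> evaluator name -> pass rate (None if nothing was scored)."""
    rates = {}
    for summary in experiment["summary_results"]["rag_eval_summaries"]:
        per_eval = {}
        for result in summary["eval_results"]:
            # Only rows that produced a pass/fail verdict count toward the rate
            scored = result["passed_count"] + result["failed_count"]
            per_eval[result["eval_name"]] = (
                result["passed_count"] / scored if scored else None
            )
        rates[summary["rag_config_key"]] = per_eval
    return rates


example_response = {
    "status": "completed",
    "summary_results": {
        "rag_eval_summaries": [
            {
                "rag_config_key": "saved:uuid:1",
                "rag_config_type": "saved",
                "eval_results": [
                    {
                        "eval_name": "Context Precision",
                        "passed_count": 38,
                        "failed_count": 12,
                        "total_count": 50,
                        "error_count": 0,
                    }
                ],
            }
        ]
    },
}

rates = pass_rates(example_response)
print(rates)
```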
Aggregate results
The completed experiment response includes a summary of pass/fail counts per configuration per evaluator:
{
  "id": "...",
  "status": "completed",
  "summary_results": {
    "rag_eval_summaries": [
      {
        "rag_config_key": "saved:uuid:1",
        "rag_config_type": "saved",
        "eval_results": [
          {
            "eval_name": "Context Precision",
            "passed_count": 38,
            "failed_count": 12,
            "total_count": 50,
            "error_count": 0
          }
        ]
      }
    ]
  }
}

Per-row results
resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/test_cases",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)
resp.raise_for_status()

# Each test case holds one result per RAG configuration
for case in resp.json()["data"]:
    print(f"\nRow: {case['dataset_row_id']} — status: {case['status']}")
    for rag_result in case["rag_results"]:
        config = rag_result["rag_config_key"]
        query = rag_result["query_text"]
        objects = rag_result["output"]["response"]["objects"]
        print(f"  Config {config}: {len(objects)} chunks retrieved for '{query}'")
        # Show the first two retrieved chunks with their scores
        for obj in objects[:2]:
            score = obj.get("metadata", {}).get("score")
            print(f"    score={score} — {str(obj['properties'])[:80]}")

Per-configuration results
To get results for a specific RAG configuration only:
rag_config_key = "saved:your-settings-id:1"  # format: saved:{id}:{version} or unsaved:{uuid}
resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/rag_configs/{rag_config_key}/results",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)

Next Steps
| Goal | Where to go |
|---|---|
| Build the evaluators you want to run on RAG experiments | LLM Evaluators — create Context Precision, Answer Relevance, and custom judge prompts |
| Set up test datasets for RAG experiments | Datasets — create versioned datasets with your test queries |
| Automate RAG evaluation in CI | CI/CD Integration — trigger experiments on every pipeline change |