RAG

Overview

To evaluate your RAG pipeline in Arthur, you connect a vector database, define search settings, and run an experiment that applies those settings to a test dataset — scoring each result with the LLM evaluators you choose. Arthur runs the retrieval and evaluation steps for every row in your dataset, so you can compare different search configurations (keyword vs. vector vs. hybrid, different top-k values, different collections) against the same questions.

RAG systems fail in two distinct, often invisible ways:

| Failure Mode | What Goes Wrong | Symptom |
|---|---|---|
| Retrieval quality | The wrong chunks are fetched: the retrieved context is irrelevant, incomplete, or noisy | The model answers confidently but from the wrong source material |
| Generation faithfulness | The right chunks are fetched but the model ignores or contradicts them | The answer sounds plausible but isn't grounded in what was retrieved |

Arthur evaluates both dimensions by letting you attach any LLM evaluator to an experiment — including built-in templates like Context Precision, Context Recall, and Answer Relevance — so you can pinpoint exactly where your pipeline breaks down.

flowchart LR
    Q[User Query] --> R[Retriever]
    R --> C[Retrieved Chunks]
    C --> G[Generator / LLM]
    G --> A[Answer]

    C --> RE[Retrieval Evaluators]
    RE --> RS["Context Precision<br>Context Recall"]

    A --> FE[Faithfulness Evaluators]
    C --> FE
    FE --> FS["Answer Relevance<br>Custom Evals"]

    RS --> EX[Experiment Results]
    FS --> EX

How RAG Evaluation Works

An experiment ties together three things:

  1. RAG configurations — one or more search setups (provider, collection, search type, parameters) to test
  2. A dataset — rows of test queries, with optional ground truth or expected outputs
  3. Evaluators — LLM-as-judge evaluators that score each retrieved result

For every row in your dataset, Arthur runs the retrieval step for each configuration, then scores the output with each evaluator. Results are grouped by configuration so you can compare them directly.

flowchart TD
    A[RAG Experiment] --> B[Config A<br>vector search top-5]
    A --> C[Config B<br>hybrid search top-10]

    B --> D[Run retrieval<br>for each test row]
    C --> E[Run retrieval<br>for each test row]

    D --> F[Score with<br>selected evaluators]
    E --> F

    F --> G[Results per config<br>pass/fail counts per eval]
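
The flow above can be sketched in a few lines of Python. This is only an illustration of the row × configuration × evaluator loop: `run_experiment`, the `Retriever` and `Evaluator` callables, and the shape of the returned summary are hypothetical stand-ins, since Arthur performs these steps server-side.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-ins: a retriever maps a query to chunks, an evaluator
# maps (query, chunks) to pass/fail. Arthur runs these steps server-side.
Retriever = Callable[[str], List[str]]
Evaluator = Callable[[str, List[str]], bool]


def run_experiment(
    rows: List[dict],
    configs: Dict[str, Retriever],
    evaluators: Dict[str, Evaluator],
    query_column: str = "question",
) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Return pass/fail counts grouped by configuration, then by evaluator."""
    summary = defaultdict(lambda: defaultdict(lambda: {"passed": 0, "failed": 0}))
    for row in rows:
        query = row[query_column]
        for config_name, retrieve in configs.items():
            chunks = retrieve(query)  # retrieval runs once per row per config
            for eval_name, evaluate in evaluators.items():
                key = "passed" if evaluate(query, chunks) else "failed"
                summary[config_name][eval_name][key] += 1
    # Convert the nested defaultdicts to plain dicts for printing or comparison
    return {c: {e: dict(v) for e, v in evals.items()} for c, evals in summary.items()}
```

Because every configuration sees the same rows and the same evaluators, the per-configuration counts are directly comparable.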

Evaluators for RAG — use the built-in LLM templates from the evaluator library:

  • Context Precision — were the retrieved chunks actually relevant?
  • Context Recall — did the retrieved chunks cover the necessary information?
  • Answer Relevance — does the answer address the question?
  • Or any custom evaluator you've defined
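
To make the difference between precision and recall concrete, here is a simplified, set-based analogue. It is not how Arthur's templates work (those are LLM-as-judge evaluators); the function names and the idea of labeling chunks by ID are assumptions made for illustration.

```python
from typing import Iterable, Set


def context_precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunk IDs that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant chunk IDs that were retrieved."""
    if not relevant:
        return 1.0  # nothing needed to be found
    return len(set(retrieved) & relevant) / len(relevant)
```

Raising top-k typically helps recall and hurts precision, which is why comparing both across configurations is useful.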

Prerequisites

  • An Arthur Engine instance running and reachable (default: http://localhost:3030)
  • An API key — set as ARTHUR_API_KEY in your environment
  • A Weaviate vector database instance (currently the only supported provider) with:
    • Host URL
    • API key
    • At least one populated collection
  • A test dataset already created in Arthur (see Datasets) with at minimum a column containing your test queries
  • Evaluators configured for your task (see LLM Evaluators)
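
A small helper can check the first two prerequisites before any request is sent. This is a sketch: `load_arthur_config` is a hypothetical name, and `ARTHUR_BASE_URL` is the optional override used by the API examples in this guide.

```python
from typing import Dict, Mapping


def load_arthur_config(env: Mapping[str, str]) -> Dict[str, object]:
    """Read connection settings from an environment-like mapping."""
    api_key = env.get("ARTHUR_API_KEY")
    if not api_key:
        raise RuntimeError("ARTHUR_API_KEY is not set")
    # Fall back to the documented default address when no override is given
    base_url = env.get("ARTHUR_BASE_URL", "http://localhost:3030").rstrip("/")
    return {
        "base_url": base_url,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    }
```

Calling `load_arthur_config(os.environ)` at the top of a script fails fast with a clear message instead of a `KeyError` or a 401 later on.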

Step 1 — Connect a RAG Provider

A RAG provider is a connection to your vector database. You create it once per task and reuse it across experiments.

UI

Navigate to RAG → RAG Configurations in the left sidebar. Click + Configuration.

Fill in:

  • Name — a label for this connection (e.g., prod-weaviate)
  • Host URL — your Weaviate instance URL (with or without https://)
  • API key — your Weaviate API key

Optionally click Test Connection to verify the credentials before saving.

API

import os

import requests

ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
TASK_ID = "your-task-id"

response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_providers",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "prod-weaviate",
        "description": "Production Weaviate cluster",
        "authentication_config": {
            "authentication_method": "api_key_authentication",
            "rag_provider": "weaviate",
            "host_url": "https://my-cluster.weaviate.network",
            "api_key": "your-weaviate-api-key",
        },
    },
)
response.raise_for_status()
provider_id = response.json()["id"]
print(f"Created provider: {provider_id}")

// JavaScript
const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const TASK_ID = "your-task-id";

const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_providers`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "prod-weaviate",
      authentication_config: {
        authentication_method: "api_key_authentication",
        rag_provider: "weaviate",
        host_url: "https://my-cluster.weaviate.network",
        api_key: "your-weaviate-api-key",
      },
    }),
  }
);
const provider = await response.json();
console.log("Created provider:", provider.id);

# cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_providers \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-weaviate",
    "authentication_config": {
      "authentication_method": "api_key_authentication",
      "rag_provider": "weaviate",
      "host_url": "https://my-cluster.weaviate.network",
      "api_key": "your-weaviate-api-key"
    }
  }'

Step 2 — Create Search Settings

Search settings define how Arthur queries your vector database: which collection to search, which search method to use, and the parameters for that method. Each saved configuration is versioned — you can update it and old versions remain unchanged.

Arthur supports three search methods:

| Method | search_kind | Best for |
|---|---|---|
| Vector similarity | vector_similarity_text_search | Semantic similarity, embedding-based retrieval |
| Keyword (BM25) | keyword_search | Exact term matching, structured queries |
| Hybrid | hybrid_search | Blend of vector + keyword (most flexible) |
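
To build intuition for how a hybrid search blends the two signals, the sketch below normalizes each score list to the 0–1 range and mixes them with a weight `alpha` (1.0 = pure vector, 0.0 = pure keyword). It is a simplified illustration, not Weaviate's actual fusion implementation; `min_max_normalize` and `hybrid_scores` are hypothetical helpers.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    vector: Dict[str, float], keyword: Dict[str, float], alpha: float = 0.7
) -> Dict[str, float]:
    """Blend normalized vector and keyword scores with weight alpha."""
    v = min_max_normalize(vector)
    k = min_max_normalize(keyword)
    # A document missing from one result list contributes 0 for that component
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
```

With `alpha=1.0` the ranking follows the vector scores alone; lowering it lets exact term matches pull documents up the list.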

UI

In RAG → RAG Experiments, click Create RAG Configuration. Select your provider, choose a collection from the dropdown (auto-loaded from Weaviate), select a search method, and configure the parameters.

API

# Vector similarity search configuration
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-vector-top5",
        "description": "Vector search over support docs collection, top 5 results",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "vector_similarity_text_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 5,
            "certainty": 0.7,          # min similarity score (0–1)
            "return_properties": ["text", "source", "title"],
            "return_metadata": ["distance", "certainty", "score"],
        },
    },
)
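response.raise_for_status()
settings_id = response.json()["id"]
settings_version = response.json()["latest_version_number"]
vector_settings_id = settings_id  # referenced again when creating the experiment in Step 5

# Hybrid search configuration
response = requests.post(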
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-hybrid-top10",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "hybrid_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 10,
            "alpha": 0.7,              # 1.0 = pure vector, 0.0 = pure keyword
            "return_properties": ["text", "source"],
        },
    },
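)
response.raise_for_status()
hybrid_settings_id = response.json()["id"]  # referenced again when creating the experiment in Step 5

# cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_search_settings \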
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-docs-vector-top5",
    "rag_provider_id": "your-provider-id",
    "settings": {
      "search_kind": "vector_similarity_text_search",
      "rag_provider": "weaviate",
      "collection_name": "SupportDocs",
      "limit": 5,
      "certainty": 0.7,
      "return_properties": ["text", "source"]
    }
  }'

Search settings parameters

Vector similarity (vector_similarity_text_search):

| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| certainty | float (0–1) | Min similarity threshold (mutually exclusive with distance) |
| distance | float | Max distance threshold (mutually exclusive with certainty) |
| return_properties | string[] | Which object properties to return |
| return_metadata | string[] | Metadata to return (distance, certainty, score, etc.) |
| offset | int | Skip first N results |
| include_vector | bool | Include embedding vectors in response |

Keyword / BM25 (keyword_search):

| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| and_operator | bool | All tokens must match (mutually exclusive with minimum_match_or_operator) |
| minimum_match_or_operator | int | Minimum number of tokens that must match |

Hybrid (hybrid_search):

| Parameter | Type | Description |
|---|---|---|
| collection_name | string | Weaviate collection to search |
| limit | int | Max results to return |
| alpha | float (0–1) | Balance: 1.0 = pure vector, 0.0 = pure keyword. Default: 0.7 |
| query_properties | string[] | Apply keyword search to a subset of properties |
| fusion_type | string | Fusion algorithm (default: Relative Score Fusion) |
| max_vector_distance | float | Max threshold for the vector component |
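
The constraints listed in these tables can be checked client-side before a settings payload is posted. The sketch below covers only the rules stated above (the known `search_kind` values, the mutually exclusive pairs, and the 0–1 ranges); `validate_search_settings` is a hypothetical helper, and the server may enforce additional rules.

```python
from typing import List

SEARCH_KINDS = {"vector_similarity_text_search", "keyword_search", "hybrid_search"}


def validate_search_settings(settings: dict) -> List[str]:
    """Return a list of problems found; an empty list means no issue detected."""
    problems = []
    kind = settings.get("search_kind")
    if kind not in SEARCH_KINDS:
        problems.append(f"unknown search_kind: {kind!r}")
    if kind == "vector_similarity_text_search":
        if "certainty" in settings and "distance" in settings:
            problems.append("certainty and distance are mutually exclusive")
        certainty = settings.get("certainty")
        if certainty is not None and not 0 <= certainty <= 1:
            problems.append("certainty must be between 0 and 1")
    if kind == "keyword_search":
        if "and_operator" in settings and "minimum_match_or_operator" in settings:
            problems.append(
                "and_operator and minimum_match_or_operator are mutually exclusive"
            )
    if kind == "hybrid_search":
        alpha = settings.get("alpha")
        if alpha is not None and not 0 <= alpha <= 1:
            problems.append("alpha must be between 0 and 1")
    return problems
```

Running this over the `settings` dict from the examples in this step should return an empty list.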

Step 3 — Test Retrieval with RAG Search Panels

Before running a full experiment, use the RAG Search Panels to interactively test your search settings against real queries. This lets you verify that the right chunks are being retrieved before committing to a full dataset run.

UI

Navigate to RAG → RAG Experiments. The page shows up to 5 search panels side by side. Each panel lets you:

  1. Select a provider and collection
  2. Choose a search method and configure its parameters
  3. Enter a query and click Run — results appear immediately with metadata (distance, certainty, score)
  4. Optionally save the panel configuration as a named Search Settings config

Run the same query across multiple panels simultaneously with Run All to compare how different configurations retrieve for the same input.


Step 4 — Create a RAG Notebook (Optional)

A RAG Notebook is a saved, reusable experiment template. It stores your RAG configuration choices, dataset selection, and evaluator assignments so you can re-run the same experiment setup without reconfiguring from scratch each time.

Notebooks are optional — you can run experiments directly without one — but they're useful for recurring evaluation setups like nightly regression runs or benchmarks you revisit after each pipeline change.

UI

Navigate to RAG → RAG Notebooks and click Create Notebook. Give it a name, then open it to configure:

  • Which RAG search configurations to test
  • Which dataset and version to use
  • Which evaluators to run and how to map their variables

A notebook's configuration can be partially filled in and saved at any time — it only needs to be complete when you click Run.
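
A script that saves notebook state can apply the same "complete before running" idea. The sketch below assumes a state is complete when `rag_configs`, `dataset_ref`, and `eval_list` (the keys used in the API example in this step) are all present and non-empty; `missing_notebook_fields` is a hypothetical helper, and the actual server-side rule may differ.

```python
from typing import List

# Keys taken from the notebook state payload; treating all three as required
# for a run is an assumption.
REQUIRED_STATE_KEYS = ("rag_configs", "dataset_ref", "eval_list")


def missing_notebook_fields(state: dict) -> List[str]:
    """Return the required keys that are absent or empty in a notebook state."""
    return [key for key in REQUIRED_STATE_KEYS if not state.get(key)]
```

A partially filled state such as `{"rag_configs": [...]}` can still be saved; the helper only reports what remains before a run.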

API

# Create a notebook
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_notebooks",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "Support docs weekly benchmark",
        "description": "Runs every Monday against the golden Q&A dataset",
    },
)
notebook_id = response.json()["id"]

# Save experiment state to the notebook
requests.put(
    f"{ARTHUR_BASE_URL}/api/v1/rag_notebooks/{notebook_id}/state",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "state": {
            "rag_configs": [
                {
                    "type": "saved",
                    "setting_configuration_id": settings_id,
                    "version": settings_version,
                    "query_column": {"dataset_column": "question"},
                }
            ],
            "dataset_ref": {
                "id": "your-dataset-id",
                "name": "support-qa-golden",
                "version": 1,
            },
            "eval_list": [
                {"name": "Context Precision", "version": 1},
                {"name": "Answer Relevance", "version": 1},
            ],
        }
    },
)

Step 5 — Run a RAG Experiment

An experiment applies your RAG configurations to every row in your dataset, runs the selected evaluators, and stores per-row and aggregate results.

UI

From RAG → RAG Experiments, click Create Experiment. The creation flow has these steps:

  1. Name and description
  2. Dataset — select a dataset and version; choose which column contains the queries
  3. RAG configurations — select saved configurations or define inline ones; you can run multiple configurations in a single experiment to compare them
  4. Evaluators — select which evaluators to run and map their variables to dataset columns or RAG output fields
  5. Review and run

API

response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_experiments",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "vector-vs-hybrid-comparison",
        "description": "Compare top-5 vector search against top-10 hybrid",
        "dataset_ref": {
            "id": "your-dataset-id",
            "name": "support-qa-golden",
            "version": 1,
        },
        "rag_configs": [
            {
                "type": "saved",
                "setting_configuration_id": vector_settings_id,
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
            {
                "type": "saved",
                "setting_configuration_id": hybrid_settings_id,
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
        ],
        "eval_list": [
            {"name": "Context Precision", "version": 1},
            {"name": "Answer Relevance", "version": 1},
        ],
    },
)
response.raise_for_status()
experiment = response.json()
experiment_id = experiment["id"]
print(f"Experiment started: {experiment_id} — status: {experiment['status']}")

// JavaScript
const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_experiments`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "vector-vs-hybrid-comparison",
      dataset_ref: {
        id: "your-dataset-id",
        name: "support-qa-golden",
        version: 1,
      },
      rag_configs: [
        {
          type: "saved",
          setting_configuration_id: vectorSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
        {
          type: "saved",
          setting_configuration_id: hybridSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
      ],
      eval_list: [
        { name: "Context Precision", version: 1 },
        { name: "Answer Relevance", version: 1 },
      ],
    }),
  }
);
const experiment = await response.json();
console.log(`Experiment started: ${experiment.id} — status: ${experiment.status}`);

# cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_experiments \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vector-vs-hybrid-comparison",
    "dataset_ref": {"id": "your-dataset-id", "name": "support-qa-golden", "version": 1},
    "rag_configs": [
      {
        "type": "saved",
        "setting_configuration_id": "vector-settings-id",
        "version": 1,
        "query_column": {"dataset_column": "question"}
      }
    ],
    "eval_list": [
      {"name": "Context Precision", "version": 1}
    ]
  }'
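
The request bodies above repeat the same structure for each configuration and evaluator. A small builder keeps them consistent when comparing many configurations. `saved_config` and `build_experiment_payload` are hypothetical helpers; the field names mirror the examples in this step.

```python
from typing import List, Optional, Tuple


def saved_config(settings_id: str, version: int, query_column: str = "question") -> dict:
    """One entry of rag_configs that points at a saved search settings version."""
    return {
        "type": "saved",
        "setting_configuration_id": settings_id,
        "version": version,
        "query_column": {"dataset_column": query_column},
    }


def build_experiment_payload(
    name: str,
    dataset_ref: dict,
    saved_settings: List[Tuple[str, int]],
    evals: List[Tuple[str, int]],
    description: Optional[str] = None,
) -> dict:
    """Assemble the JSON body for creating a RAG experiment."""
    payload = {
        "name": name,
        "dataset_ref": dataset_ref,
        "rag_configs": [saved_config(sid, ver) for sid, ver in saved_settings],
        "eval_list": [{"name": n, "version": v} for n, v in evals],
    }
    if description is not None:
        payload["description"] = description
    return payload
```

The result can be passed as `json=` to the same `requests.post` call shown in the Python example.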

Poll for completion

import time

while True:
    resp = requests.get(
        f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}",
        headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    )
    resp.raise_for_status()  # stop polling if the request itself fails
    status = resp.json()["status"]
    print(f"Status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(5)

result = resp.json()

Experiment status values: queued → running → completed or failed.
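
For unattended runs, such as CI, a polling loop usually needs an upper time bound. The sketch below adds a timeout and takes the status lookup as a callable, so it can be tested without a server. `wait_for_status` and its parameters are hypothetical; only the terminal statuses `completed` and `failed` come from this guide.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_status(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a terminal status is seen or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"experiment still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

Here `get_status` would wrap the GET request from the polling example and return `resp.json()["status"]`.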


View Results

UI

From RAG → RAG Experiments, click your experiment.

Once a RAG experiment completes, the results page shows a summary of how each configuration performed across all test cases. At a glance you can see the overall pass rate per evaluator, which configurations passed or failed each test case, and the cost per row. The experiment header shows total duration, number of test cases run, and the dataset used — making it easy to compare runs over time.

Aggregate results

The completed experiment response includes a summary of pass/fail counts per configuration per evaluator:

{
  "id": "...",
  "status": "completed",
  "summary_results": {
    "rag_eval_summaries": [
      {
        "rag_config_key": "saved:uuid:1",
        "rag_config_type": "saved",
        "eval_results": [
          {
            "eval_name": "Context Precision",
            "passed_count": 38,
            "failed_count": 12,
            "total_count": 50,
            "error_count": 0
          }
        ]
      }
    ]
  }
}
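
The counts in this summary convert directly into pass rates for side-by-side comparison. A minimal sketch, assuming the response shape shown above; `pass_rates` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple


def pass_rates(experiment: dict) -> Dict[Tuple[str, str], Optional[float]]:
    """Map (rag_config_key, eval_name) to passed_count / total_count."""
    rates = {}
    for config in experiment["summary_results"]["rag_eval_summaries"]:
        for result in config["eval_results"]:
            total = result["total_count"]
            key = (config["rag_config_key"], result["eval_name"])
            # None marks evaluators that scored no rows, avoiding division by zero
            rates[key] = result["passed_count"] / total if total else None
    return rates
```

For the example response above, the only entry is Context Precision at 38 of 50 rows, a pass rate of 0.76.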

Per-row results

resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/test_cases",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)

for case in resp.json()["data"]:
    print(f"\nRow: {case['dataset_row_id']} — status: {case['status']}")
    for rag_result in case["rag_results"]:
        config = rag_result["rag_config_key"]
        query = rag_result["query_text"]
        objects = rag_result["output"]["response"]["objects"]
        print(f"  Config {config}: {len(objects)} chunks retrieved for '{query}'")
        for obj in objects[:2]:
            score = obj.get("metadata", {}).get("score")
            print(f"    score={score} — {str(obj['properties'])[:80]}")
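
The request above fetches only the first page. To walk every test case, a generator can keep requesting pages until one comes back short or empty. This sketch assumes zero-based page numbers (as in the `page: 0` parameter above) and that a short or empty `data` list marks the end; both are assumptions about the API, and `iter_test_cases` is a hypothetical helper.

```python
from typing import Callable, Iterator, List


def iter_test_cases(
    fetch_page: Callable[[int, int], List[dict]], page_size: int = 20
) -> Iterator[dict]:
    """Yield items from successive pages until a short or empty page appears."""
    page = 0
    while True:
        items = fetch_page(page, page_size)
        if not items:
            return
        yield from items
        if len(items) < page_size:
            return  # last page reached
        page += 1
```

In practice `fetch_page` would wrap the GET request above, passing `page` and `page_size` as query parameters and returning `resp.json()["data"]`.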

Per-configuration results

To get results for a specific RAG configuration only:

rag_config_key = "saved:your-settings-id:1"  # format: saved:{id}:{version} or unsaved:{uuid}

resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/rag_configs/{rag_config_key}/results",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)
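
Because the configuration key has a fixed format (`saved:{id}:{version}` or `unsaved:{uuid}`), it can be built and parsed with small helpers. The function names are hypothetical, and the sketch assumes IDs do not themselves contain a trailing `:{digits}` segment.

```python
def build_saved_key(settings_id: str, version: int) -> str:
    """Build the key for a saved search settings version."""
    return f"saved:{settings_id}:{version}"


def parse_config_key(key: str) -> dict:
    """Split a rag_config_key into its type, id, and (for saved keys) version."""
    kind, _, rest = key.partition(":")
    if kind == "saved":
        settings_id, sep, version = rest.rpartition(":")
        if not sep or not settings_id or not version.isdigit():
            raise ValueError(f"malformed saved key: {key!r}")
        return {"type": "saved", "id": settings_id, "version": int(version)}
    if kind == "unsaved" and rest:
        return {"type": "unsaved", "id": rest}
    raise ValueError(f"unrecognized config key: {key!r}")
```

Parsing the `rag_config_key` values from the aggregate summary makes it easy to map results back to the settings that produced them.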

Next Steps

| Goal | Where to go |
|---|---|
| Build the evaluators you want to run on RAG experiments | LLM Evaluators: create Context Precision, Answer Relevance, and custom judge prompts |
| Set up test datasets for RAG experiments | Datasets: create versioned datasets with your test queries |
| Automate RAG evaluation in CI | CI/CD Integration: trigger experiments on every pipeline change |