# RAG

## Overview

To evaluate your RAG pipeline in Arthur, you connect a vector database, define search settings, and run an experiment that applies those settings to a test dataset — scoring each result with the LLM evaluators you choose. Arthur runs the retrieval and evaluation steps for every row in your dataset, so you can compare different search configurations (keyword vs. vector vs. hybrid, different top-k values, different collections) against the same questions.

RAG systems fail in two distinct, often invisible ways:

| Failure Mode                | What Goes Wrong                                                                          | Symptom                                                              |
| --------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Retrieval quality**       | The wrong chunks are fetched — the retrieved context is irrelevant, incomplete, or noisy | The model answers confidently but from the wrong source material     |
| **Generation faithfulness** | The right chunks are fetched but the model ignores or contradicts them                   | The answer sounds plausible but isn't grounded in what was retrieved |

Arthur evaluates both dimensions by letting you attach any LLM evaluator to an experiment — including built-in templates like **Context Precision**, **Context Recall**, and **Answer Relevance** — so you can pinpoint exactly where your pipeline breaks down.

```mermaid
flowchart LR
    Q[User Query] --> R[Retriever]
    R --> C[Retrieved Chunks]
    C --> G[Generator / LLM]
    G --> A[Answer]

    C --> RE[Retrieval Evaluators]
    RE --> RS["Context Precision<br>Context Recall"]

    A --> FE[Faithfulness Evaluators]
    C --> FE
    FE --> FS["Answer Relevance<br>Custom Evals"]

    RS --> EX[Experiment Results]
    FS --> EX
```

***

## How RAG Evaluation Works

An experiment ties together three things:

1. **RAG configurations** — one or more search setups (provider, collection, search type, parameters) to test
2. **A dataset** — rows of test queries, with optional ground truth or expected outputs
3. **Evaluators** — LLM-as-judge evaluators that score each retrieved result

<Image align="center" src="https://files.readme.io/dadc895baa37cdc9dedb9ef519abf259b88340a5b9a4d63a19d1297df21c1ca9-Screenshot_2026-04-23_at_12.01.36.png" />

For every row in your dataset, Arthur runs the retrieval step for each configuration, then scores the output with each evaluator. Results are grouped by configuration so you can compare them directly.

```mermaid
flowchart TD
    A[RAG Experiment] --> B[Config A<br>vector search top-5]
    A --> C[Config B<br>hybrid search top-10]

    B --> D[Run retrieval<br>for each test row]
    C --> E[Run retrieval<br>for each test row]

    D --> F[Score with<br>selected evaluators]
    E --> F

    F --> G[Results per config<br>pass/fail counts per eval]
```
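
The flow in the diagram can be sketched in plain Python. This only illustrates the loop structure and is not Arthur's implementation: `run_experiment`, the toy retrievers, and the toy judge are all hypothetical stand-ins.

```python
from collections import defaultdict


def run_experiment(rows, configs, evaluators):
    """Run every config on every row and tally pass/fail per evaluator.

    configs:    name -> function(query) returning a list of chunks
    evaluators: name -> function(row, chunks) returning True (pass) or False
    """
    summary = defaultdict(lambda: defaultdict(lambda: {"passed": 0, "failed": 0}))
    for row in rows:
        for config_name, retrieve in configs.items():
            chunks = retrieve(row["question"])
            for eval_name, judge in evaluators.items():
                outcome = "passed" if judge(row, chunks) else "failed"
                summary[config_name][eval_name][outcome] += 1
    # Convert the nested defaultdicts to plain dicts for easy printing.
    return {cfg: dict(evals) for cfg, evals in summary.items()}


# Toy stand-ins: a two-document corpus and two retrieval strategies.
CORPUS = ["reset your password in settings", "billing runs monthly"]
configs = {
    "keyword": lambda q: [d for d in CORPUS if any(w in d.split() for w in q.split())],
    "return-all": lambda q: list(CORPUS),
}
evaluators = {
    # Passes when at least one chunk came back and every chunk mentions the expected term.
    "all-chunks-relevant": lambda row, chunks: bool(chunks)
    and all(row["expected"] in c for c in chunks),
}
rows = [{"question": "password reset", "expected": "password"}]

results = run_experiment(rows, configs, evaluators)
print(results)
```

Grouping the tallies by configuration first and evaluator second mirrors how the results are compared: same questions, same judges, different retrieval setups.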

**Evaluators for RAG** — use the built-in LLM templates from the evaluator library:

* **Context Precision** — were the retrieved chunks actually relevant?
* **Context Recall** — did the retrieved chunks cover the necessary information?
* **Answer Relevance** — does the answer address the question?
* Or any custom evaluator you've defined
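
For intuition about what the first two templates measure, here is a set-based sketch. It assumes you already know which chunk IDs are relevant for a query. Arthur's built-in evaluators are LLM judges and do not necessarily compute these ratios; the function names below are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)


retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}
print(context_precision(retrieved, relevant))  # 2 of the 4 retrieved chunks are relevant
print(context_recall(retrieved, relevant))     # 2 of the 3 relevant chunks were retrieved
```

Raising `limit` tends to help recall and hurt precision, which is why comparing several top-k values on the same dataset is useful.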

***

## Prerequisites

* An Arthur Engine instance running and reachable (default: `http://localhost:3030`)
* An API key — set as `ARTHUR_API_KEY` in your environment
* A **Weaviate** vector database instance (currently the only supported provider) with:
  * Host URL
  * API key
  * At least one populated collection
* A **test dataset** already created in Arthur (see [Datasets](https://docs.arthur.ai/docs/datasets-engine)) with at minimum a column containing your test queries
* **Evaluators** configured for your task (see [LLM Evaluators](https://docs.arthur.ai/docs/llm-evaluators))
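
Before making any API calls, it can help to fail early if the environment is not set up. The sketch below reads `ARTHUR_API_KEY` and the optional `ARTHUR_BASE_URL` used by the examples on this page; `load_arthur_settings` is a hypothetical helper, not part of any Arthur package.

```python
import os


def load_arthur_settings(env=os.environ):
    """Return (base_url, headers), or raise if the API key is missing."""
    api_key = env.get("ARTHUR_API_KEY")
    if not api_key:
        raise RuntimeError("Set ARTHUR_API_KEY before calling the Arthur API")
    base_url = env.get("ARTHUR_BASE_URL", "http://localhost:3030").rstrip("/")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return base_url, headers


# Demonstrated with an explicit mapping so the example runs without real credentials.
base_url, headers = load_arthur_settings({"ARTHUR_API_KEY": "example-key"})
print(base_url)
```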

***

## Step 1 — Connect a RAG Provider

A RAG provider is a connection to your vector database. You create it once per task and reuse it across experiments.

### UI

Navigate to **RAG → RAG Configurations** in the left sidebar. Click **+ Configuration**.

<Image align="center" src="https://files.readme.io/177c1e31e7d907c53a131e6c3481cced3a20f828d481c138cb4988584e405ade-Screenshot_2026-04-23_at_12.02.17.png" />

Fill in:

* **Name** — a label for this connection (e.g., `prod-weaviate`)
* **Host URL** — your Weaviate instance URL (with or without `https://`)
* **API key** — your Weaviate API key

Optionally click **Test Connection** to verify the credentials before saving.

### API

```python Python SDK
import os

import requests

ARTHUR_BASE_URL = os.environ.get("ARTHUR_BASE_URL", "http://localhost:3030")
ARTHUR_API_KEY = os.environ["ARTHUR_API_KEY"]
TASK_ID = "your-task-id"

response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_providers",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "prod-weaviate",
        "description": "Production Weaviate cluster",
        "authentication_config": {
            "authentication_method": "api_key_authentication",
            "rag_provider": "weaviate",
            "host_url": "https://my-cluster.weaviate.network",
            "api_key": "your-weaviate-api-key",
        },
    },
)
response.raise_for_status()
provider_id = response.json()["id"]
print(f"Created provider: {provider_id}")
```

```javascript JavaScript
const ARTHUR_BASE_URL = process.env.ARTHUR_BASE_URL ?? "http://localhost:3030";
const ARTHUR_API_KEY = process.env.ARTHUR_API_KEY;
const TASK_ID = "your-task-id";

const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_providers`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "prod-weaviate",
      authentication_config: {
        authentication_method: "api_key_authentication",
        rag_provider: "weaviate",
        host_url: "https://my-cluster.weaviate.network",
        api_key: "your-weaviate-api-key",
      },
    }),
  }
);
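if (!response.ok) {
  // Surface HTTP errors instead of reading an error body as a provider
  throw new Error(`Provider creation failed with status ${response.status}`);
}
const provider = await response.json();
console.log("Created provider:", provider.id);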
```

```curl cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_providers \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "prod-weaviate",
    "authentication_config": {
      "authentication_method": "api_key_authentication",
      "rag_provider": "weaviate",
      "host_url": "https://my-cluster.weaviate.network",
      "api_key": "your-weaviate-api-key"
    }
  }'
```

***

## Step 2 — Create Search Settings

Search settings define how Arthur queries your vector database: which collection to search, which search method to use, and the parameters for that method. Each saved configuration is versioned — you can update it and old versions remain unchanged.

Arthur supports three search methods:

| Method            | `search_kind`                   | Best for                                       |
| ----------------- | ------------------------------- | ---------------------------------------------- |
| Vector similarity | `vector_similarity_text_search` | Semantic similarity, embedding-based retrieval |
| Keyword (BM25)    | `keyword_search`                | Exact term matching, structured queries        |
| Hybrid            | `hybrid_search`                 | Blend of vector + keyword (most flexible)      |

### UI

In **RAG → RAG Experiments**, click **Create RAG Configuration**. Select your provider, choose a collection from the dropdown (auto-loaded from Weaviate), select a search method, and configure the parameters.

### API

```python Python SDK
# Vector similarity search configuration
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-vector-top5",
        "description": "Vector search over support docs collection, top 5 results",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "vector_similarity_text_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 5,
            "certainty": 0.7,          # min similarity score (0–1)
            "return_properties": ["text", "source", "title"],
            "return_metadata": ["distance", "certainty", "score"],
        },
    },
)
response.raise_for_status()
settings = response.json()  # parse the body once and reuse it
settings_id = settings["id"]
settings_version = settings["latest_version_number"]
```

```python Hybrid search
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_search_settings",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "support-docs-hybrid-top10",
        "rag_provider_id": provider_id,
        "settings": {
            "search_kind": "hybrid_search",
            "rag_provider": "weaviate",
            "collection_name": "SupportDocs",
            "limit": 10,
            "alpha": 0.7,              # 1.0 = pure vector, 0.0 = pure keyword
            "return_properties": ["text", "source"],
        },
    },
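)
response.raise_for_status()
hybrid_settings_id = response.json()["id"]  # referenced again in Step 5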
```

```curl cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_search_settings \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-docs-vector-top5",
    "rag_provider_id": "your-provider-id",
    "settings": {
      "search_kind": "vector_similarity_text_search",
      "rag_provider": "weaviate",
      "collection_name": "SupportDocs",
      "limit": 5,
      "certainty": 0.7,
      "return_properties": ["text", "source"]
    }
  }'
```

### Search settings parameters

**Vector similarity (`vector_similarity_text_search`):**

| Parameter           | Type        | Description                                                   |
| ------------------- | ----------- | ------------------------------------------------------------- |
| `collection_name`   | string      | Weaviate collection to search                                 |
| `limit`             | int         | Max results to return                                         |
| `certainty`         | float (0–1) | Min similarity threshold (mutually exclusive with `distance`) |
| `distance`          | float       | Max distance threshold (mutually exclusive with `certainty`)  |
| `return_properties` | string\[]   | Which object properties to return                             |
| `return_metadata`   | string\[]   | Metadata to return (`distance`, `certainty`, `score`, etc.)   |
| `offset`            | int         | Skip first N results                                          |
| `include_vector`    | bool        | Include embedding vectors in response                         |

**Keyword / BM25 (`keyword_search`):**

| Parameter                   | Type   | Description                                                                 |
| --------------------------- | ------ | --------------------------------------------------------------------------- |
| `collection_name`           | string | Weaviate collection to search                                               |
| `limit`                     | int    | Max results to return                                                       |
| `and_operator`              | bool   | All tokens must match (mutually exclusive with `minimum_match_or_operator`) |
| `minimum_match_or_operator` | int    | Min tokens that must match (mutually exclusive with `and_operator`)        |

**Hybrid (`hybrid_search`):**

| Parameter             | Type        | Description                                                        |
| --------------------- | ----------- | ------------------------------------------------------------------ |
| `collection_name`     | string      | Weaviate collection to search                                      |
| `limit`               | int         | Max results to return                                              |
| `alpha`               | float (0–1) | Balance: `1.0` = pure vector, `0.0` = pure keyword. Default: `0.7` |
| `query_properties`    | string\[]   | Apply keyword search to a subset of properties                     |
| `fusion_type`         | string      | Fusion algorithm (default: Relative Score Fusion)                  |
| `max_vector_distance` | float       | Max threshold for the vector component                             |
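
The constraints in these tables can be checked on the client before a request is sent. The sketch below is a hypothetical helper that only encodes the rules stated above (mutually exclusive parameters and 0 to 1 ranges); it is not Arthur's server-side validation, which may enforce more.

```python
KNOWN_SEARCH_KINDS = {
    "vector_similarity_text_search",
    "keyword_search",
    "hybrid_search",
}


def validate_search_settings(settings):
    """Return a list of problems; an empty list means no conflicts were found."""
    problems = []
    kind = settings.get("search_kind")
    if kind not in KNOWN_SEARCH_KINDS:
        problems.append(f"unknown search_kind: {kind!r}")
    if kind == "vector_similarity_text_search":
        if "certainty" in settings and "distance" in settings:
            problems.append("certainty and distance are mutually exclusive")
        certainty = settings.get("certainty")
        if certainty is not None and not 0 <= certainty <= 1:
            problems.append("certainty must be between 0 and 1")
    elif kind == "keyword_search":
        if settings.get("and_operator") and "minimum_match_or_operator" in settings:
            problems.append(
                "and_operator and minimum_match_or_operator are mutually exclusive"
            )
    elif kind == "hybrid_search":
        alpha = settings.get("alpha")
        if alpha is not None and not 0 <= alpha <= 1:
            problems.append("alpha must be between 0 and 1")
    return problems


ok = validate_search_settings(
    {"search_kind": "hybrid_search", "collection_name": "SupportDocs", "alpha": 0.7}
)
bad = validate_search_settings(
    {
        "search_kind": "vector_similarity_text_search",
        "collection_name": "SupportDocs",
        "certainty": 0.7,
        "distance": 0.3,
    }
)
print(ok, bad)
```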

***

## Step 3 — Test Retrieval with RAG Search Panels

Before running a full experiment, use the **RAG Search Panels** to interactively test your search settings against real queries. This lets you verify that the right chunks are being retrieved before committing to a full dataset run.

### UI

Navigate to **RAG → RAG Experiments**. The page shows up to 5 search panels side by side. Each panel lets you:

1. Select a provider and collection
2. Choose a search method and configure its parameters
3. Enter a query and click **Run** — results appear immediately with metadata (distance, certainty, score)
4. Optionally save the panel configuration as a named Search Settings config

Run the same query across multiple panels simultaneously with **Run All** to compare how different configurations retrieve for the same input.
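
If you copy panel results out for offline comparison, one simple way to quantify how much two configurations agree is set overlap. The helper below is hypothetical and not part of Arthur. It assumes each result is a dict with a `properties` mapping (the shape used for retrieved objects in the per-row results later on this page) and that a `text` property identifies a chunk.

```python
def result_overlap(results_a, results_b, key="text"):
    """Jaccard overlap of two result lists, comparing one property per object."""
    a = {obj["properties"][key] for obj in results_a}
    b = {obj["properties"][key] for obj in results_b}
    if not a and not b:
        return 1.0  # two empty result sets agree trivially
    return len(a & b) / len(a | b)


vector_results = [
    {"properties": {"text": "How to reset a password"}},
    {"properties": {"text": "Account recovery options"}},
]
hybrid_results = [
    {"properties": {"text": "Account recovery options"}},
    {"properties": {"text": "Password policy"}},
]
print(result_overlap(vector_results, hybrid_results))
```

A low overlap on the same query is a hint that the two configurations deserve a full experiment rather than a spot check.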

***

## Step 4 — Create a RAG Notebook (Optional)

A **RAG Notebook** is a saved, reusable experiment template. It stores your RAG configuration choices, dataset selection, and evaluator assignments so you can re-run the same experiment setup without reconfiguring from scratch each time.

Notebooks are optional — you can run experiments directly without one — but they're useful for recurring evaluation setups like nightly regression runs or benchmarks you revisit after each pipeline change.

### UI

Navigate to **RAG → RAG Notebooks** and click **Create Notebook**. Give it a name, then open it to configure:

<Image align="center" src="https://files.readme.io/cbb624ab7e087bbac6bee00dcc5e34864d15947cb5d9a413a1ac5c5c50355ba6-Screenshot_2026-04-23_at_12.11.24.png" />

* Which RAG search configurations to test
* Which dataset and version to use
* Which evaluators to run and how to map their variables

A notebook's configuration can be partially filled in and saved at any time — it only needs to be complete when you click **Run**.

### API

```python Python SDK
# Create a notebook
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_notebooks",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "Support docs weekly benchmark",
        "description": "Runs every Monday against the golden Q&A dataset",
    },
)
response.raise_for_status()
notebook_id = response.json()["id"]

# Save experiment state to the notebook
requests.put(
    f"{ARTHUR_BASE_URL}/api/v1/rag_notebooks/{notebook_id}/state",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "state": {
            "rag_configs": [
                {
                    "type": "saved",
                    "setting_configuration_id": settings_id,
                    "version": settings_version,
                    "query_column": {"dataset_column": "question"},
                }
            ],
            "dataset_ref": {
                "id": "your-dataset-id",
                "name": "support-qa-golden",
                "version": 1,
            },
            "eval_list": [
                {"name": "Context Precision", "version": 1},
                {"name": "Answer Relevance", "version": 1},
            ],
        }
    },
).raise_for_status()  # raise if the state could not be saved
```
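
Because a notebook can be saved while incomplete, a quick client-side check can tell you what is still missing before you run it. This is a sketch under an assumption: it treats the three keys from the state payload above (`rag_configs`, `dataset_ref`, `eval_list`) as the ones that must be filled in. `missing_state_fields` is a hypothetical helper, and Arthur's own completeness rules may differ.

```python
REQUIRED_STATE_KEYS = ("rag_configs", "dataset_ref", "eval_list")


def missing_state_fields(state):
    """Return the state keys that are absent or empty."""
    return [key for key in REQUIRED_STATE_KEYS if not state.get(key)]


draft_state = {
    "rag_configs": [],  # nothing selected yet
    "dataset_ref": {"id": "your-dataset-id", "name": "support-qa-golden", "version": 1},
}
print(missing_state_fields(draft_state))
```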

***

## Step 5 — Run a RAG Experiment

An experiment applies your RAG configurations to every row in your dataset, runs the selected evaluators, and stores per-row and aggregate results.

### UI

From **RAG → RAG Experiments**, click **Create Experiment**. The creation flow has these steps:

<Image align="center" src="https://files.readme.io/dbbbd2aeb42b2e5a8352548d949344e6507bbd62ff4e4773fbf8725444dacf9d-Screenshot_2026-04-23_at_12.02.27.png" />

1. **Name and description**
2. **Dataset** — select a dataset and version; choose which column contains the queries
3. **RAG configurations** — select saved configurations or define inline ones; you can run multiple configurations in a single experiment to compare them
4. **Evaluators** — select which evaluators to run and map their variables to dataset columns or RAG output fields
5. **Review and run**

### API

```python Python SDK
response = requests.post(
    f"{ARTHUR_BASE_URL}/api/v1/tasks/{TASK_ID}/rag_experiments",
    headers={
        "Authorization": f"Bearer {ARTHUR_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "vector-vs-hybrid-comparison",
        "description": "Compare top-5 vector search against top-10 hybrid",
        "dataset_ref": {
            "id": "your-dataset-id",
            "name": "support-qa-golden",
            "version": 1,
        },
        "rag_configs": [
            {
                "type": "saved",
                "setting_configuration_id": settings_id,  # vector settings ID from Step 2
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
            {
                "type": "saved",
                "setting_configuration_id": hybrid_settings_id,
                "version": 1,
                "query_column": {"dataset_column": "question"},
            },
        ],
        "eval_list": [
            {"name": "Context Precision", "version": 1},
            {"name": "Answer Relevance", "version": 1},
        ],
    },
)
response.raise_for_status()
experiment = response.json()
experiment_id = experiment["id"]
print(f"Experiment started: {experiment_id} — status: {experiment['status']}")
```

```javascript JavaScript
const response = await fetch(
  `${ARTHUR_BASE_URL}/api/v1/tasks/${TASK_ID}/rag_experiments`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ARTHUR_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "vector-vs-hybrid-comparison",
      dataset_ref: {
        id: "your-dataset-id",
        name: "support-qa-golden",
        version: 1,
      },
      rag_configs: [
        {
          type: "saved",
          setting_configuration_id: vectorSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
        {
          type: "saved",
          setting_configuration_id: hybridSettingsId,
          version: 1,
          query_column: { dataset_column: "question" },
        },
      ],
      eval_list: [
        { name: "Context Precision", version: 1 },
        { name: "Answer Relevance", version: 1 },
      ],
    }),
  }
);
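if (!response.ok) {
  // Surface HTTP errors instead of reading an error body as an experiment
  throw new Error(`Experiment creation failed with status ${response.status}`);
}
const experiment = await response.json();
console.log(`Experiment started: ${experiment.id} — status: ${experiment.status}`);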
```

```curl cURL
curl -X POST http://localhost:3030/api/v1/tasks/{task_id}/rag_experiments \
  -H "Authorization: Bearer $ARTHUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vector-vs-hybrid-comparison",
    "dataset_ref": {"id": "your-dataset-id", "name": "support-qa-golden", "version": 1},
    "rag_configs": [
      {
        "type": "saved",
        "setting_configuration_id": "vector-settings-id",
        "version": 1,
        "query_column": {"dataset_column": "question"}
      }
    ],
    "eval_list": [
      {"name": "Context Precision", "version": 1}
    ]
  }'
```
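
When you compare more than two configurations, building the request body by hand gets repetitive. The sketch below assembles the same payload shape as the examples above from a list of `(settings_id, version)` pairs. `saved_config` and `build_experiment_payload` are hypothetical helpers, and the IDs are placeholders.

```python
def saved_config(settings_id, version, query_column="question"):
    """One entry of `rag_configs` that points at saved search settings."""
    return {
        "type": "saved",
        "setting_configuration_id": settings_id,
        "version": version,
        "query_column": {"dataset_column": query_column},
    }


def build_experiment_payload(name, dataset_ref, saved_settings, eval_names, eval_version=1):
    """saved_settings is a list of (settings_id, version) pairs to compare."""
    return {
        "name": name,
        "dataset_ref": dataset_ref,
        "rag_configs": [saved_config(sid, ver) for sid, ver in saved_settings],
        "eval_list": [{"name": n, "version": eval_version} for n in eval_names],
    }


payload = build_experiment_payload(
    name="vector-vs-hybrid-comparison",
    dataset_ref={"id": "your-dataset-id", "name": "support-qa-golden", "version": 1},
    saved_settings=[("vector-settings-id", 1), ("hybrid-settings-id", 1)],
    eval_names=["Context Precision", "Answer Relevance"],
)
print(len(payload["rag_configs"]), "configurations,", len(payload["eval_list"]), "evaluators")
```

The resulting dict can be passed as the `json=` argument of the `requests.post` call shown earlier in this step.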

### Poll for completion

```python Python SDK
import time

while True:
    resp = requests.get(
        f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}",
        headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    )
    resp.raise_for_status()  # stop polling on HTTP errors
    status = resp.json()["status"]
    print(f"Status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(5)

result = resp.json()
```

Experiment status values: `queued` → `running` → `completed` or `failed`.
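
For unattended runs, a polling loop usually needs an upper bound so a stuck experiment does not block forever. The sketch below wraps the same status check in a timeout. `wait_for_experiment` is a hypothetical helper, and the status source is injected as a function so the example runs without a server.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_experiment(fetch_status, timeout_s=600.0, interval_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal status or until timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated status sequence; in practice fetch_status would call the GET endpoint above.
statuses = iter(["queued", "running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)
```

In real use, `fetch_status` would wrap the `requests.get` call from the polling example and return the `status` field of the response.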

***

## View Results

### UI

From **RAG → RAG Experiments**, click the experiment you want to inspect.

Once a RAG experiment completes, the results page shows a summary of how each configuration performed across all test cases. At a glance you can see the overall pass rate per evaluator, which configurations passed or failed each test case, and the cost per row. The experiment header shows total duration, number of test cases run, and the dataset used — making it easy to compare runs over time.

<Image align="center" src="https://files.readme.io/3a048460d8d7c5dfa193acedd49c19b0a659fa927789ed75ba1196da389be5a5-Screenshot_2026-04-23_at_12.12.08.png" />

### Aggregate results

The completed experiment response includes a summary of pass/fail counts per configuration per evaluator:

```json
{
  "id": "...",
  "status": "completed",
  "summary_results": {
    "rag_eval_summaries": [
      {
        "rag_config_key": "saved:uuid:1",
        "rag_config_type": "saved",
        "eval_results": [
          {
            "eval_name": "Context Precision",
            "passed_count": 38,
            "failed_count": 12,
            "total_count": 50,
            "error_count": 0
          }
        ]
      }
    ]
  }
}
```
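
The counts in this summary convert directly into pass rates. The sketch below reads the fields shown in the sample response; `pass_rates` is a hypothetical helper. It divides by `total_count` as reported and does not adjust for `error_count`, since how errored rows are counted is not specified here.

```python
def pass_rates(experiment):
    """Map (rag_config_key, eval_name) to passed_count / total_count."""
    rates = {}
    for config in experiment["summary_results"]["rag_eval_summaries"]:
        for result in config["eval_results"]:
            total = result["total_count"]
            rate = result["passed_count"] / total if total else 0.0
            rates[(config["rag_config_key"], result["eval_name"])] = rate
    return rates


# Same shape and numbers as the sample response above.
sample = {
    "status": "completed",
    "summary_results": {
        "rag_eval_summaries": [
            {
                "rag_config_key": "saved:uuid:1",
                "rag_config_type": "saved",
                "eval_results": [
                    {
                        "eval_name": "Context Precision",
                        "passed_count": 38,
                        "failed_count": 12,
                        "total_count": 50,
                        "error_count": 0,
                    }
                ],
            }
        ]
    },
}

for (config_key, eval_name), rate in pass_rates(sample).items():
    print(f"{config_key} / {eval_name}: {rate:.0%}")
```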

### Per-row results

```python Python SDK
resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/test_cases",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)
resp.raise_for_status()

for case in resp.json()["data"]:
    print(f"\nRow: {case['dataset_row_id']} — status: {case['status']}")
    for rag_result in case["rag_results"]:
        config = rag_result["rag_config_key"]
        query = rag_result["query_text"]
        objects = rag_result["output"]["response"]["objects"]
        print(f"  Config {config}: {len(objects)} chunks retrieved for '{query}'")
        for obj in objects[:2]:
            score = obj.get("metadata", {}).get("score")
            print(f"    score={score} — {str(obj['properties'])[:80]}")
```
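
The request above fetches a single page. To walk every test case, you can keep requesting pages until one comes back short. This sketch assumes zero-based page numbers (as in the example) and that a page with fewer than `page_size` items is the last one; if the response carries explicit pagination fields, those would be a better stopping signal. `iter_test_cases` and `fake_fetch` are hypothetical.

```python
def iter_test_cases(fetch_page, page_size=20):
    """Yield items from fetch_page(page, page_size) until a short page is returned."""
    page = 0
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break
        page += 1


# In-memory stand-in for the endpoint so the sketch runs without a server.
ALL_CASES = [{"dataset_row_id": i} for i in range(45)]


def fake_fetch(page, page_size):
    start = page * page_size
    return ALL_CASES[start:start + page_size]


cases = list(iter_test_cases(fake_fetch))
print(len(cases))
```

Against the real API, `fetch_page` would issue the `requests.get` call above with `params={"page": page, "page_size": page_size}` and return the `data` list.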

### Per-configuration results

To get results for a specific RAG configuration only:

```python Python SDK
rag_config_key = "saved:your-settings-id:1"  # format: saved:{id}:{version} or unsaved:{uuid}

resp = requests.get(
    f"{ARTHUR_BASE_URL}/api/v1/rag_experiments/{experiment_id}/rag_configs/{rag_config_key}/results",
    headers={"Authorization": f"Bearer {ARTHUR_API_KEY}"},
    params={"page": 0, "page_size": 20},
)
resp.raise_for_status()
```
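
The key format noted in the comment above (`saved:{id}:{version}` or `unsaved:{uuid}`) is easy to get wrong by hand. The helpers below build and parse it; both are hypothetical and follow only the format stated in that comment.

```python
def build_config_key(config_type, identifier, version=None):
    """Build a rag_config_key such as 'saved:<id>:<version>' or 'unsaved:<uuid>'."""
    if config_type == "saved":
        if version is None:
            raise ValueError("saved configurations need a version")
        return f"saved:{identifier}:{version}"
    if config_type == "unsaved":
        return f"unsaved:{identifier}"
    raise ValueError(f"unknown config type: {config_type!r}")


def parse_config_key(key):
    """Split a rag_config_key back into its parts."""
    config_type, _, rest = key.partition(":")
    if config_type == "saved":
        identifier, _, version = rest.rpartition(":")
        return {"type": "saved", "id": identifier, "version": int(version)}
    if config_type == "unsaved":
        return {"type": "unsaved", "id": rest}
    raise ValueError(f"unrecognized rag_config_key: {key!r}")


key = build_config_key("saved", "your-settings-id", 1)
print(key)
print(parse_config_key(key))
```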

***

## Next Steps

| Goal                                                        | Where to go                                                                                                 |
| ----------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| **Build the evaluators you want to run on RAG experiments** | [LLM Evaluators](https://docs.arthur.ai/docs/llm-evaluators) — create Context Precision, Answer Relevance, and custom judge prompts |
| **Set up test datasets for RAG experiments**                | [Datasets](https://docs.arthur.ai/docs/datasets-engine) — create versioned datasets with your test queries                          |
| **Automate RAG evaluation in CI**                           | [CI/CD Integration](https://docs.arthur.ai/docs/cicd-integration) — trigger experiments on every pipeline change                    |