Prompt Experiments Overview
Prompt Experiments enable systematic testing and comparison of prompts against datasets with automated evaluations. This guide helps you understand how to use experiments for prompt A/B testing, improvement, and quality assurance.
Overview
A Prompt Experiment runs one or more prompts against a dataset, executes evaluations, and provides detailed results for analysis. Each row in your dataset becomes a test case. Experiments run asynchronously and provide real-time progress updates.
Why Use Prompt Experiments?
Prompt Experiments enable systematic, data-driven prompt development and optimization. Here's why they're valuable:
Make Data-Driven Decisions
Instead of guessing which prompt works best, experiments give you concrete metrics. You can:
- Compare prompts objectively: Test multiple prompts on the same inputs and see which performs better
- Identify patterns: Discover which types of inputs cause failures or successes
- Track improvements: Measure how prompt changes affect performance over time
Save Time and Reduce Manual Work
Running prompts manually on test cases is tedious and error-prone. Experiments automate:
- Batch execution: Run prompts on hundreds or thousands of test cases automatically
- Consistent evaluation: Apply the same evaluation criteria across all test cases
Improve Quality Assurance
Before deploying prompts to production, experiments help ensure quality:
- Catch failures early: Identify edge cases and failure modes before users encounter them
- Validate improvements: Confirm that prompt changes actually improve performance
- Document performance: Create a record of how prompts perform on your test data
Enable Systematic Iteration
Experiments support an iterative improvement workflow:
- Clone and modify: Start from existing experiments and make incremental changes
- Compare versions: Test new prompts alongside proven ones to measure improvement
- Track history: See how your prompts have evolved and what worked
- Promote to production: Push the best prompts to production and integrate them into your applications with one click
Core Concepts
1. Prompt
A prompt is a set of messages and instructions sent to a language model to generate outputs. It typically includes system messages (defining the model's role) and user messages (the actual request), and it can contain variables (like user_query or context) that are filled in with values from the dataset or supplied by you. When you run an experiment, each prompt executes against each row of your dataset, with variables replaced by the corresponding column values. Prompts can be:
- Saved Prompts: Versioned prompts stored in Prompt Management. These are reusable across experiments and can be referenced by name and version.
- Unsaved Prompts: Ad-hoc prompts defined directly in the experiment. These are useful for quick testing without creating a saved prompt first.
You can test multiple prompts (both saved and unsaved) in a single experiment to compare their performance.
Example:
System message: "You are a helpful assistant that answers questions concisely."
User message: "Answer this question: {{question}}"
This prompt has one variable, question, which will be filled with values from your dataset.
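To make variable substitution concrete, here is a minimal Python sketch of how a {{question}} placeholder is rendered from a dataset row. The render_prompt helper and the message format are illustrative assumptions, not part of any platform SDK.

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values from a dict of variables."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)

messages = [
    {"role": "system", "content": "You are a helpful assistant that answers questions concisely."},
    {"role": "user", "content": "Answer this question: {{question}}"},
]

row = {"question": "What is 2+2?", "expected_answer": "4"}
rendered = [{**m, "content": render_prompt(m["content"], row)} for m in messages]
# rendered[1]["content"] == "Answer this question: What is 2+2?"
```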
2. Dataset
A dataset contains test data organized in rows and columns. Each row represents a test case, and columns contain the input data. When creating an experiment, you:
- Select a dataset and version
- Optionally filter rows using column name-value pairs (AND logic - all conditions must match)
- Map dataset columns to prompt variables
Example:
A dataset with two rows:
| question | expected_answer |
|---|---|
| "What is 2+2?" | "4" |
| "What is the capital of France?" | "Paris" |
Each row becomes a test case in your experiment.
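As a rough illustration, a dataset can be pictured as a list of rows, and a row filter as a set of AND-combined column/value conditions. The in-memory shape and the filter_rows helper below are hypothetical sketches, not the actual filtering implementation.

```python
# Hypothetical in-memory view of the dataset above; each dict is one row (one test case).
dataset = [
    {"question": "What is 2+2?", "expected_answer": "4"},
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
]

def filter_rows(rows: list[dict], conditions: dict) -> list[dict]:
    """Keep only rows where every column/value pair matches (AND logic)."""
    return [row for row in rows if all(row.get(col) == val for col, val in conditions.items())]

subset = filter_rows(dataset, {"expected_answer": "4"})
# subset contains only the "What is 2+2?" row; an empty conditions dict keeps every row.
```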
3. Eval (Evaluator)
An evaluator is an automated test that scores prompt outputs. Evaluators can check for:
- Quality metrics (e.g., correctness, relevance, completeness)
- Safety checks (e.g., toxicity, bias)
- Custom criteria defined by your team
Each evaluator requires specific input variables, which can come from:
- Dataset columns: Static values from your test data
- Experiment output: Values extracted from prompt outputs
Example:
An evaluator called "Answer Correctness" that checks if the prompt's answer matches the expected answer. It requires:
- response: The prompt's output
- expected_answer: The correct answer
For each test case, it compares the response to the expected answer and returns a pass/fail score.
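A pass/fail evaluator like this can be thought of as a simple comparison function. The answer_correctness sketch below assumes exact-match logic for illustration; real evaluators may use richer scoring criteria.

```python
def answer_correctness(response: str, expected_answer: str) -> dict:
    """Pass if the prompt's output matches the expected answer (exact match, ignoring case and whitespace)."""
    passed = response.strip().lower() == expected_answer.strip().lower()
    return {"evaluator": "Answer Correctness", "passed": passed, "score": 1.0 if passed else 0.0}

answer_correctness("Paris", "Paris")  # -> passed: True
answer_correctness("Lyon", "Paris")   # -> passed: False
```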
4. Mappings
Mappings connect your dataset to prompts and evaluators:
- Prompt Variable Mappings: Map dataset columns to prompt variables. For example, map a user_query column to a query variable in your prompt.
- Eval Variable Mappings: Map dataset columns or prompt outputs to evaluator variables. For example, map the prompt's output content to an evaluator's response variable.
Example:
Using the prompt and dataset examples above:
Prompt Variable Mapping:
- Dataset column question → Prompt variable question
Eval Variable Mappings:
- Experiment output (prompt's response) → Eval variable response
- Dataset column expected_answer → Eval variable expected_answer
When the experiment runs:
- Row 1: question = "What is 2+2?" → Prompt executes → Gets response → Eval compares the response to "4"
- Row 2: question = "What is the capital of France?" → Prompt executes → Gets response → Eval compares the response to "Paris"
5. Results
Experiment results include:
- Summary Statistics: Overall pass rates, costs, and completion status
- Per-Prompt Performance: Evaluation scores for each prompt across all test cases
- Test Case Details: Individual results showing inputs, rendered prompts, outputs, and evaluation scores
- Cost Tracking: Total cost per test case and per prompt
Example:
After running the experiment with the examples above, you might see:
Test Cases:
| Test Case | Input | Rendered Prompt | Output | Eval Result | Cost |
|---|---|---|---|---|---|
| 1 | question = "What is 2+2?" | "Answer this question: What is 2+2?" | "4" | ✅ Pass (response matches expected answer "4") | $0.001 |
| 2 | question = "What is the capital of France?" | "Answer this question: What is the capital of France?" | "Paris" | ✅ Pass (response matches expected answer "Paris") | $0.001 |
Summary:
| Metric | Value |
|---|---|
| Total test cases | 2 |
| Passed | 2 |
| Failed | 0 |
| Pass rate | 100% |
| Total cost | $0.002 |
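These summary figures are simple aggregates over the per-test-case results. The sketch below shows how they could be derived; the result field names are illustrative assumptions, not the actual results schema.

```python
# Hypothetical per-test-case results matching the table above.
results = [
    {"passed": True, "cost": 0.001},
    {"passed": True, "cost": 0.001},
]

passed = sum(r["passed"] for r in results)
summary = {
    "total_test_cases": len(results),
    "passed": passed,
    "failed": len(results) - passed,
    "pass_rate": f"{passed / len(results):.0%}",
    "total_cost": round(sum(r["cost"] for r in results), 3),
}
# -> {'total_test_cases': 2, 'passed': 2, 'failed': 0, 'pass_rate': '100%', 'total_cost': 0.002}
```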
6. Prompt Experiment
A Prompt Experiment brings together all the concepts above into a systematic testing configuration. It defines:
- Prompts to test: One or more prompts (saved or unsaved) to evaluate
- Dataset: The test data to run prompts against
- Row filters (optional): Conditions to test on a subset of dataset rows
- Evaluations: Automated tests to score prompt outputs
- Variable mappings: How dataset columns map to prompt and evaluator variables
When you run an experiment, it creates a test case for each row in your dataset (or filtered subset), executes each prompt with the mapped variables, runs evaluations, and collects results.
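Putting the pieces together, an experiment run is conceptually a loop over prompts and rows. The sketch below reuses the hypothetical render_prompt and resolve_eval_inputs helpers from earlier sections; call_model is a stand-in for the actual model invocation, and none of this reflects the platform's real execution engine.

```python
def call_model(rendered_messages: list[dict]) -> str:
    """Stand-in for the actual model call; replace with your model client of choice."""
    raise NotImplementedError

def run_experiment(prompts, rows, evaluators, prompt_mappings, eval_mappings):
    """Run every prompt against every (filtered) row, then score each output."""
    results = []
    for prompt in prompts:                     # one or more saved or unsaved prompts
        for row in rows:                       # each dataset row is a test case
            # Map dataset columns to prompt variables, then render the messages.
            variables = {var: row[col] for var, col in prompt_mappings.items()}
            rendered = [{**m, "content": render_prompt(m["content"], variables)}
                        for m in prompt["messages"]]
            output = call_model(rendered)
            # Wire dataset columns and the prompt output into the evaluators.
            eval_inputs = resolve_eval_inputs(row, output, eval_mappings)
            scores = [evaluate(**eval_inputs) for evaluate in evaluators]
            results.append({"prompt": prompt["name"], "row": row,
                            "output": output, "scores": scores})
    return results
```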