Prompt Experiments Overview
Prompt Experiments enable systematic testing and comparison of prompts against datasets with automated evaluations. This guide helps you understand how to use experiments for prompt A/B testing, improvement, and quality assurance.
Overview
A Prompt Experiment runs one or more prompts against a dataset, executes evaluations, and provides detailed results for analysis. Each row in your dataset becomes a test case. Experiments run asynchronously and provide real-time progress updates.
Why Use Prompt Experiments?
Prompt Experiments enable systematic, data-driven prompt development and optimization. Here's why they're valuable:
Make Data-Driven Decisions
Instead of guessing which prompt works best, experiments give you concrete metrics. You can:
- Compare prompts objectively: Test multiple prompts on the same inputs and see which performs better
- Identify patterns: Discover which types of inputs cause failures or successes
- Track improvements: Measure how prompt changes affect performance over time
Save Time and Reduce Manual Work
Running prompts manually on test cases is tedious and error-prone. Experiments automate:
- Batch execution: Run prompts on hundreds or thousands of test cases automatically
- Consistent evaluation: Apply the same evaluation criteria across all test cases
Improve Quality Assurance
Before deploying prompts to production, experiments help ensure quality:
- Catch failures early: Identify edge cases and failure modes before users encounter them
- Validate improvements: Confirm that prompt changes actually improve performance
- Document performance: Create a record of how prompts perform on your test data
Enable Systematic Iteration
Experiments support an iterative improvement workflow:
- Clone and modify: Start from existing experiments and make incremental changes
- Compare versions: Test new prompts alongside proven ones to measure improvement
- Track history: See how your prompts have evolved and what worked
- Promote to production: Push the best prompts to production and integrate them into your applications with one click
Core Concepts
1. Prompt
A prompt is a set of messages and instructions sent to a language model to generate outputs. It typically includes system messages (defining the model's role) and user messages (the actual request), and it can contain variables (like user_query or context) that are filled in with values from the dataset or supplied by you. When you run an experiment, each prompt executes against each row of your dataset, with variables replaced by the corresponding column values. Prompts can be:
- Saved Prompts: Versioned prompts stored in Prompt Management. These are reusable across experiments and can be referenced by name and version.
- Unsaved Prompts: Ad-hoc prompts defined directly in the experiment. These are useful for quick testing without creating a saved prompt first.
You can test multiple prompts (both saved and unsaved) in a single experiment to compare their performance.
Example:
System message: "You are a helpful assistant that answers questions concisely."
User message: "Answer this question: {{question}}"
This prompt has one variable, question, which will be filled with values from your dataset.
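To make variable substitution concrete, here is a minimal Python sketch of how a {{question}} placeholder is rendered from a dataset row. The render_prompt helper and the message format are illustrative assumptions, not part of any platform SDK.

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values from a dict of variables."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)

messages = [
    {"role": "system", "content": "You are a helpful assistant that answers questions concisely."},
    {"role": "user", "content": "Answer this question: {{question}}"},
]

row = {"question": "What is 2+2?", "expected_answer": "4"}
rendered = [{**m, "content": render_prompt(m["content"], row)} for m in messages]
# rendered[1]["content"] == "Answer this question: What is 2+2?"
```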
2. Dataset
A dataset contains test data organized in rows and columns. Each row represents a test case, and columns contain the input data. When creating an experiment, you:
- Select a dataset and version
- Optionally filter rows using column name-value pairs (AND logic - all conditions must match)
- Map dataset columns to prompt variables
Example:
A dataset with two rows:
| question | expected_answer |
|---|---|
| "What is 2+2?" | "4" |
| "What is the capital of France?" | "Paris" |
Each row becomes a test case in your experiment.
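As a rough illustration, a dataset can be pictured as a list of rows, and a row filter as a set of AND-combined column/value conditions. The in-memory shape and the filter_rows helper below are hypothetical sketches, not the actual filtering implementation.

```python
# Hypothetical in-memory view of the dataset above; each dict is one row (one test case).
dataset = [
    {"question": "What is 2+2?", "expected_answer": "4"},
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
]

def filter_rows(rows: list[dict], conditions: dict) -> list[dict]:
    """Keep only rows where every column/value pair matches (AND logic)."""
    return [row for row in rows if all(row.get(col) == val for col, val in conditions.items())]

subset = filter_rows(dataset, {"expected_answer": "4"})
# subset contains only the "What is 2+2?" row; an empty conditions dict keeps every row.
```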
3. Eval (Evaluator)
An evaluator is an automated test that scores prompt outputs. Evaluators can check for:
- Quality metrics (e.g., correctness, relevance, completeness)
- Safety checks (e.g., toxicity, bias)
- Custom criteria defined by your team
Each evaluator requires specific input variables, which can come from:
- Dataset columns: Static values from your test data
- Experiment output: Values extracted from prompt outputs
Example:
An evaluator called "Answer Correctness" that checks if the prompt's answer matches the expected answer. It requires:
- response: The prompt's output
- expected_answer: The correct answer
For each test case, it compares the response to the expected answer and returns a pass/fail score.
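A pass/fail evaluator like this can be thought of as a simple comparison function. The answer_correctness sketch below assumes exact-match logic for illustration; real evaluators may use richer scoring criteria.

```python
def answer_correctness(response: str, expected_answer: str) -> dict:
    """Pass if the prompt's output matches the expected answer (exact match, ignoring case and whitespace)."""
    passed = response.strip().lower() == expected_answer.strip().lower()
    return {"evaluator": "Answer Correctness", "passed": passed, "score": 1.0 if passed else 0.0}

answer_correctness("Paris", "Paris")  # -> passed: True
answer_correctness("Lyon", "Paris")   # -> passed: False
```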
4. Mappings
Mappings connect your dataset to prompts and evaluators:
- Prompt Variable Mappings: Map dataset columns to prompt variables. For example, map a user_query column to a query variable in your prompt.
- Eval Variable Mappings: Map dataset columns or prompt outputs to evaluator variables. For example, map the prompt's output content to an evaluator's response variable.
Example:
Using the prompt and dataset examples above:
Prompt Variable Mapping:
- Dataset column question → Prompt variable question
Eval Variable Mappings:
- Experiment output (prompt's response) → Eval variable response
- Dataset column expected_answer → Eval variable expected_answer
When the experiment runs:
- Row 1: question = "What is 2+2?" → Prompt executes → Gets response → Eval compares the response to "4"
- Row 2: question = "What is the capital of France?" → Prompt executes → Gets response → Eval compares the response to "Paris"
5. Results
Experiment results include:
- Summary Statistics: Overall pass rates, costs, and completion status
- Per-Prompt Performance: Evaluation scores for each prompt across all test cases
- Test Case Details: Individual results showing inputs, rendered prompts, outputs, and evaluation scores
- Cost Tracking: Total cost per test case and per prompt
Example:
After running the experiment with the examples above, you might see:
Test Cases:
| Test Case | Input | Rendered Prompt | Output | Eval Result | Cost |
|---|---|---|---|---|---|
| 1 | question = "What is 2+2?" | "Answer this question: What is 2+2?" | "4" | ✅ Pass (response matches expected answer "4") | $0.001 |
| 2 | question = "What is the capital of France?" | "Answer this question: What is the capital of France?" | "Paris" | ✅ Pass (response matches expected answer "Paris") | $0.001 |
Summary:
| Metric | Value |
|---|---|
| Total test cases | 2 |
| Passed | 2 |
| Failed | 0 |
| Pass rate | 100% |
| Total cost | $0.002 |
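These summary figures are simple aggregates over the per-test-case results. The sketch below shows how they could be derived; the result field names are illustrative assumptions, not the actual results schema.

```python
# Hypothetical per-test-case results matching the table above.
results = [
    {"passed": True, "cost": 0.001},
    {"passed": True, "cost": 0.001},
]

passed = sum(r["passed"] for r in results)
summary = {
    "total_test_cases": len(results),
    "passed": passed,
    "failed": len(results) - passed,
    "pass_rate": f"{passed / len(results):.0%}",
    "total_cost": round(sum(r["cost"] for r in results), 3),
}
# -> {'total_test_cases': 2, 'passed': 2, 'failed': 0, 'pass_rate': '100%', 'total_cost': 0.002}
```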
6. Prompt Experiment
A Prompt Experiment brings together all the concepts above into a systematic testing configuration. It defines:
- Prompts to test: One or more prompts (saved or unsaved) to evaluate
- Dataset: The test data to run prompts against
- Row filters (optional): Conditions to test on a subset of dataset rows
- Evaluations: Automated tests to score prompt outputs
- Variable mappings: How dataset columns map to prompt and evaluator variables
When you run an experiment, it creates a test case for each row in your dataset (or filtered subset), executes each prompt with the mapped variables, runs evaluations, and collects results.
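Putting the pieces together, an experiment run is conceptually a loop over prompts and rows. The sketch below reuses the hypothetical render_prompt and resolve_eval_inputs helpers from earlier sections; call_model is a stand-in for the actual model invocation, and none of this reflects the platform's real execution engine.

```python
def call_model(rendered_messages: list[dict]) -> str:
    """Stand-in for the actual model call; replace with your model client of choice."""
    raise NotImplementedError

def run_experiment(prompts, rows, evaluators, prompt_mappings, eval_mappings):
    """Run every prompt against every (filtered) row, then score each output."""
    results = []
    for prompt in prompts:                     # one or more saved or unsaved prompts
        for row in rows:                       # each dataset row is a test case
            # Map dataset columns to prompt variables, then render the messages.
            variables = {var: row[col] for var, col in prompt_mappings.items()}
            rendered = [{**m, "content": render_prompt(m["content"], variables)}
                        for m in prompt["messages"]]
            output = call_model(rendered)
            # Wire dataset columns and the prompt output into the evaluators.
            eval_inputs = resolve_eval_inputs(row, output, eval_mappings)
            scores = [evaluate(**eval_inputs) for evaluate in evaluators]
            results.append({"prompt": prompt["name"], "row": row,
                            "output": output, "scores": scores})
    return results
```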