Run evaluations with the Eval() function, use the braintrust eval CLI command to run multiple evaluations from files, or create experiments in the Braintrust UI for no-code workflows. Integrate with CI/CD to catch regressions automatically.
Run with Eval()
The Eval() function runs an evaluation and creates an experiment:
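For example, a minimal eval with a hard-coded task and the Levenshtein scorer from autoevals (the dataset and task here are placeholders for your own):

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Each row pairs an input with the expected output.
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hello Bar" },
  ],
  // The task maps an input to an output; swap in your real LLM call.
  task: async (input: string) => `Hi ${input}`,
  // Scorers compare output to expected and return a score from 0 to 1.
  scores: [Levenshtein],
});
```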
Eval() automatically:
- Creates an experiment in Braintrust
- Displays a summary in your terminal
- Populates the UI with results
- Returns summary metrics
Run with CLI
Use the braintrust eval command to run evaluations from files:
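For example, assuming the default naming conventions of `*.eval.ts` for TypeScript files and `eval_*.py` for Python files:

```bash
# TypeScript: run a single file, or a directory containing *.eval.ts files
npx braintrust eval basic.eval.ts

# Python: run a single file, or a directory containing eval_*.py files
braintrust eval eval_basic.py
```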
The CLI automatically loads environment variables from .env.development.local, .env.local, .env.development, and .env files.
Pass --watch to re-run evaluations automatically when files change:
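For example (file name illustrative):

```bash
npx braintrust eval --watch basic.eval.ts
```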
Run in UI
Create and run experiments directly in the Braintrust UI without writing code:
- Navigate to Evaluations > Experiments.
- Click + Experiment or use the empty state form.
- Select one or more prompts, workflows, or scorers to evaluate.
- Choose or create a dataset:
  - Select existing dataset: Pick from datasets in your organization
  - Upload CSV/JSON: Import test cases from a file
  - Empty dataset: Create a blank dataset to populate manually later
- Add scorers to measure output quality.
- Click Create to execute the experiment.
UI experiments time out after 15 minutes. For longer-running evaluations, use the SDK or CLI approach.
Run in CI/CD
Integrate evaluations into your CI/CD pipeline to catch regressions automatically.
GitHub Actions
Use the braintrustdata/eval-action to run evaluations on every pull request:
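A sketch of a workflow for a Node project; the action's exact inputs and version tag may differ from what is shown here, so check its README:

```yaml
name: Run evals

on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Run evals
        uses: braintrustdata/eval-action@v1
        with:
          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
          runtime: node
```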

Other CI systems
For other CI systems, run evaluations as a standard command with the BRAINTRUST_API_KEY environment variable set:
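For example, for a TypeScript project whose eval files live in an `evals/` directory (path illustrative):

```bash
# Requires BRAINTRUST_API_KEY to be available from your CI secret store.
npx braintrust eval evals/
```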
Run remotely
Expose evaluations running on remote servers or local machines using dev mode.
Run locally
Run evaluations without sending logs to Braintrust for quick iteration.
Configure experiments
Customize experiment behavior with options.
Run trials
Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
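For example, assuming the `trialCount` option (`trial_count` in Python):

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input: string) => `Hi ${input}`,
  scores: [Levenshtein],
  // Run each test case 5 times; results are bucketed by identical input.
  trialCount: 5,
});
```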
Use hill climbing
Sometimes you don’t have expected outputs and want to use a previous experiment as a baseline instead. Hill climbing enables iterative improvement by comparing new experiments to previous ones, which is especially useful when you lack a pre-existing benchmark. Braintrust supports hill climbing as a first-class concept, allowing you to use a previous experiment’s output field as the expected field for the current experiment. Autoevals includes scorers like Battle and Summary designed specifically for hill climbing.
To enable hill climbing, use BaseExperiment() in the data field:
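A minimal sketch, using the Summary scorer and a placeholder task:

```typescript
import { BaseExperiment, Eval } from "braintrust";
import { Summary } from "autoevals";

Eval("Article summarizer", {
  // Pull data from a previous experiment; its output/expected values
  // become this experiment's expected values.
  data: BaseExperiment(),
  task: async (article: string) => {
    // Placeholder for a real LLM summarization call.
    return article.slice(0, 100);
  },
  scores: [Summary],
});
```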
Braintrust populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it will be used as the expected field for the next experiment.
Use a specific experiment
To use a specific experiment as the base, pass the name field to BaseExperiment():
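For example (experiment name illustrative):

```typescript
import { BaseExperiment, Eval } from "braintrust";
import { Summary } from "autoevals";

Eval("Article summarizer", {
  // Compare against one specific prior experiment by name.
  data: BaseExperiment({ name: "article-summarizer-2024-06-01" }),
  task: async (article: string) => article.slice(0, 100), // placeholder task
  scores: [Summary],
});
```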
Scoring considerations
When hill climbing, use two types of scoring functions:
- Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
- Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
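A sketch combining both styles in a single hill-climbing eval; the criteria and instructions strings are illustrative, and parameter names should be checked against the autoevals docs:

```typescript
import { BaseExperiment, Eval } from "braintrust";
import { Battle, ClosedQA } from "autoevals";

Eval("QA bot", {
  data: BaseExperiment(),
  task: async (question: string) => `Answer: ${question}`, // placeholder task
  scores: [
    // Non-comparative: judges the output against criteria; no expected needed.
    ({ input, output }) =>
      ClosedQA({
        input,
        output,
        criteria: "Answers the question accurately and concisely",
      }),
    // Comparative: pits the new output against the base experiment's output.
    ({ input, output, expected }) =>
      Battle({
        instructions: `Answer the question: ${input}`,
        output,
        expected,
      }),
  ],
});
```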
Create custom reporters
When you run an experiment, Braintrust logs results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems.
Define custom reporters using Reporter(). A reporter has two functions:
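A sketch, assuming the reportEval hook (called once per Eval block) and the reportRun hook (called once at the end with everything reportEval returned); the pass/fail logic here is illustrative:

```typescript
import { Reporter } from "braintrust";

Reporter("ci-reporter", {
  // Called once per Eval block with that evaluator's results.
  reportEval: async (evaluator, result) => {
    const failures = result.results.filter((r) => r.error !== undefined);
    if (failures.length > 0) {
      console.error(`${evaluator.evalName}: ${failures.length} failed rows`);
    }
    // Whatever is returned here is collected and passed to reportRun.
    return failures.length === 0;
  },
  // Called once at the end with everything reportEval returned.
  reportRun: async (evalReports) => {
    // Returning false causes `braintrust eval` to exit with a non-zero code.
    return evalReports.every((passed) => passed);
  },
});
```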
Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.
- If no reporters are defined, the default reporter logs results to the console.
- If you define one reporter, it’s used for all Eval blocks.
- If you define multiple Reporters, specify the reporter name as an optional third argument to Eval().
Include attachments
Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
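A sketch, assuming a local PDF at an illustrative path:

```typescript
import { Attachment, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Invoice extraction", {
  data: () => [
    {
      input: {
        question: "What is the invoice total?",
        // Reference a local file; it is uploaded as an attachment.
        document: new Attachment({
          filename: "invoice.pdf",
          contentType: "application/pdf",
          data: "./fixtures/invoice.pdf",
        }),
      },
      expected: "$42.00",
    },
  ],
  task: async (input) => {
    // Placeholder: a real task would send the attachment to a model.
    return "$42.00";
  },
  scores: [Levenshtein],
});
```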
Use attachment URLs
Obtain a signed URL for the attachment to forward to other services like OpenAI.
Trace your evals
Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.
Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Add custom tracing for details.
Use traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
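A sketch of such a task, with a placeholder in place of a real model call:

```typescript
import { Eval, traced } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Traced eval", {
  data: () => [{ input: "Foo", expected: "Hi Foo" }],
  task: async (input: string) =>
    traced(
      async (span) => {
        // Log the input as soon as the span starts.
        span.log({ input });

        const start = Date.now();
        // Placeholder for a real LLM call.
        const output = `Hi ${input}`;

        // Incrementally log the output and a timing metric.
        span.log({
          output,
          metrics: { duration_seconds: (Date.now() - start) / 1000 },
        });
        return output;
      },
      { name: "generate-greeting" },
    ),
  scores: [Levenshtein],
});
```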
Next steps
- Interpret results from your experiments
- Compare experiments to measure improvements
- Write scorers to measure quality
- Use playgrounds for no-code experimentation