So you built an AI application, curated an initial set of test examples, picked a scoring function, and ran an eval. You got a score—great!

Now what?
We'll walk you through how to think about that score and, more importantly, what steps to take next to continuously improve both your AI application and your evaluation process.

Note: this guide assumes you already have a basic evaluation infrastructure in place. If not, we recommend starting with these two guides first: Structuring evals and Getting started with automated evals.
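To ground the rest of this guide, here's roughly what a minimal eval looks like with the Braintrust SDK. Treat it as a sketch: the project name, data, task, and scorer are placeholders for whatever your application actually does.

```python
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # placeholder project name
    # A tiny, hardcoded dataset; in practice this comes from your curated examples.
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    # The task is your AI application (or the slice of it you're evaluating).
    task=lambda input: "Hi " + input,
    # One or more scoring functions; Levenshtein is a simple string-similarity scorer.
    scores=[Levenshtein],
)
```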
After running an eval, your two goals should be to:

1. Improve your evals so they accurately reflect the quality of your application.
2. Improve your AI application itself.

To figure out where to focus first, start by reviewing 5-10 actual examples from your eval run. For each row, inspect the trace: the input, the output, and how your evaluation function scored that output.

As you review these examples, ask yourself two questions: is the output actually good, and does the score accurately reflect its quality? Every example will fall into one of four categories:

1. Good output, accurate score
2. Good output, inaccurate score
3. Bad output, accurate score
4. Bad output, inaccurate score

Once you've analyzed these examples, you'll start to notice patterns around how your application is performing and whether your scoring is accurate. From there, you can decide whether to refine your evals or make changes to your application.
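If you export the rows you reviewed, a few lines of Python can tally them into the grid above. This is only a sketch: the `output_is_good` flag, the 0.5 score threshold, and the row format are assumptions about how you record your review, not part of any SDK.

```python
from collections import Counter

# Placeholder rows: "score" is what your eval produced; "output_is_good" is
# your own judgment from reading the trace.
reviewed_rows = [
    {"input": "...", "output": "...", "score": 0.9, "output_is_good": True},
    {"input": "...", "output": "...", "score": 0.8, "output_is_good": False},
    {"input": "...", "output": "...", "score": 0.1, "output_is_good": False},
]

def categorize(row, threshold=0.5):
    # The score is "accurate" when it agrees with your judgment:
    # good outputs score high, bad outputs score low.
    scored_high = row["score"] >= threshold
    accurate = scored_high == row["output_is_good"]
    quality = "good" if row["output_is_good"] else "bad"
    return f"{quality} output, {'accurate' if accurate else 'inaccurate'} score"

print(Counter(categorize(row) for row in reviewed_rows))
```

A skewed tally tells you where to focus: lots of inaccurate scores means your evals need work, while lots of accurately-scored bad outputs means your application does.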
The key here is iteration. Don’t overthink it! In almost all cases, it’s worth spending time improving both your evals and your AI application. Move fast, try things out, inspect the results, and roll back updates if necessary.

There are three main ways to improve your evals; for an in-depth guide to each, check out the full blog post.
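One improvement that often pays off quickly is adding a scorer that targets a failure mode you keep noticing during review. In the Braintrust SDK, a scorer can be a plain function that returns a value between 0 and 1; the specific check below is made up for illustration.

```python
def has_no_placeholder_text(input, output, expected=None):
    # Hypothetical check: penalize outputs that leak template artifacts.
    leaked = any(token in output for token in ("{{", "}}", "TODO", "[INSERT"))
    return 0 if leaked else 1
```

You'd then pass it alongside your existing scorers, e.g. `scores=[Levenshtein, has_no_placeholder_text]`.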
If your evaluations are robust, you can confidently iterate on your AI application, since any changes are tested for performance impact and potential regressions each time you re-run your evals.
Here's the general workflow for making improvements:

1. Decide what you're improving and why.
2. Determine which change is most likely to produce that improvement.
3. Make the update to your AI application.
4. Re-run your evals to verify the impact and check for regressions.

Before you make an update, define what you're improving and why; that clarity keeps your iteration targeted and measurable. Start by examining your eval results: overall scores, individual examples, and feedback.
Once you've identified a potential improvement, frame it in one of two ways:

1. Increasing a well-defined score
2. Fixing specific examples of bad outputs

Well-defined scores are tied to specific attributes (e.g. factuality, length of output) or capabilities (e.g. selecting the right tool, writing executable code). By increasing a score, you're clearly indicating that you are improving your application along a particular axis. Start with end-to-end scores—these give you a broad view of how well your system is performing. Then, as needed, zoom in on specific sections where smaller, more focused changes can make a significant impact.
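For instance, you might pair an attribute-level scorer with a capability check so each axis is measured on its own. The sketch below uses `Factuality`, an LLM-as-a-judge scorer from the autoevals library, alongside a made-up length check; the 150-word limit is purely illustrative.

```python
from autoevals import Factuality

def concise_enough(input, output, expected=None):
    # Hypothetical attribute scorer: full credit up to 150 words, scaled down beyond that.
    max_words = 150
    n_words = len(output.split())
    return 1.0 if n_words <= max_words else max_words / n_words

scores = [Factuality(), concise_enough]
```

Because each scorer tracks a single attribute, a change in either number tells you exactly which axis moved.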
Fixing examples of bad outputs is essential when you're responding to customer complaints or when you catch critical failures while inspecting examples. These fixes typically address very visible, user-facing problems, so they're usually high leverage.
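When you find a bad output, it's worth capturing it as a test case before you fix it, so the fix is verified on every future run. A sketch, with a made-up example and dataset shape:

```python
# Your existing eval dataset (placeholder).
eval_cases = [
    {"input": "Reset my password", "expected": "Here's how to reset your password: ..."},
]

# Capture the failing case as a permanent regression test, with the output you
# wanted instead and a note about where it came from.
eval_cases.append(
    {
        "input": "Cancel my subscription",
        "expected": "I've started your cancellation and sent a confirmation email.",
        "metadata": {"source": "customer complaint"},
    }
)
```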
Once you've decided what to improve, the next step is to figure out which update is most likely to produce that improvement. This step is critical because the range of potential tweaks is broad, and choosing the right one can be both fun and challenging.
Here are some common areas we see AI teams change:

- Prompts
- The underlying model, its parameters, or its weights
- Pipeline and orchestration logic

Now, it's time to make the update to your AI application. Whether you’re tweaking a prompt, adjusting model weights, or modifying pipeline logic, the goal is to introduce changes that directly target the performance or behavior you’re trying to improve.
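One pattern that makes these updates cheap to try is keeping the knobs you iterate on in one place, so each change is a small diff you can immediately re-eval. The sketch below assumes an OpenAI-style chat model; the prompt, model name, and temperature are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# The things you'll tweak most often, pulled out of the task logic.
SYSTEM_PROMPT = "You are a concise support assistant. Answer in at most three sentences."
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2

def task(input):
    # This is the function you'd pass to Eval(...) as `task`.
    response = client.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```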
After implementing the change, re-run your evals to verify its impact.
Importantly, you should also closely monitor your overall scores to ensure you are not regressing in other areas. Avoiding unintended regressions is critical to continuous forward progress.
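A simple way to keep an eye on this is to compare per-scorer averages between your baseline run and the new one. The numbers and threshold below are made up; Braintrust's experiment comparison view gives you this at a glance, and the script is just the idea in miniature.

```python
# Illustrative regression check across runs (scores are placeholders).
baseline = {"Factuality": 0.81, "Levenshtein": 0.74, "concise_enough": 0.93}
candidate = {"Factuality": 0.86, "Levenshtein": 0.75, "concise_enough": 0.71}

for scorer, old in baseline.items():
    delta = candidate[scorer] - old
    status = "REGRESSION" if delta < -0.02 else "ok"
    print(f"{scorer:>15}: {old:.2f} -> {candidate[scorer]:.2f} ({delta:+.2f}) {status}")
```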
Did the change move the needle? Great—build on that momentum. If not, you now have more data to make better decisions and try again. The key is in the iteration—every cycle moves your application and evals closer to optimal performance.
We've shared how critical a rapid development loop is to delivering high-quality AI features. After each improvement, re-run your evaluations and go through the workflow again.
Many of our customers cycle through this loop 50+ times a day. This fast-paced iteration is what enables AI teams to stay ahead, ship robust AI features, and quickly react to feedback and issues. To see an example of the workflow in action, check out how Notion develops world-class AI features.
At Braintrust, our goal is to make AI development as iterative and robust as possible. If you're looking to optimize your workflow or scale your AI product development efforts, get in touch!