
At Braintrust, when we chat with engineers building AI applications, one of the most common questions we hear is “How do we get started with automated evaluations?”
In this post, we will discuss the state of evals today and lay out some high-leverage ways to quickly get started with automated evaluations.
Before adopting Braintrust, the AI teams we talk to typically rely on a few common approaches to evals:
While the above approaches are all helpful, we find that all three fall short in important ways. Vibes and manual review do not scale, and general benchmarks are neither application-specific nor easy to customize. As a result, engineering teams struggle to understand product performance, which leads to a very slow dev loop and frustrating behavior like:
Automated evaluations are straightforward to set up and can make an immediate impact on AI development speed. In this section, we will walk through 3 great approaches: LLM evaluators, heuristics, and comparative evals.
LLMs are incredibly useful for evaluating responses out-of-the-box, even with minimal prompting. Anything you can ask a human to evaluate, you can (at least partially) encode into an LLM evaluator. Here are some examples:
The above two methods are great places to start, and we’ve seen customers successfully configure LLMs to score many other subjective characteristics: conciseness, tone, helpfulness, writing quality, and more.
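To make this concrete, here is a minimal sketch of an LLM evaluator for one such characteristic, conciseness, written against the OpenAI Python SDK. The prompt wording, the judge model, and the 1-5 scale are illustrative assumptions rather than a prescribed Braintrust configuration.

```python
# A minimal LLM-as-judge sketch for scoring conciseness (illustrative only).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the prompt, model, and 1-5 scale are example choices, not a fixed recipe.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer for conciseness.
Question: {question}
Answer: {answer}
Rate conciseness on a scale of 1 (very verbose) to 5 (as brief as possible
while still complete). Respond with only the number."""

def conciseness_score(question: str, answer: str) -> float:
    """Ask an LLM judge to grade the answer, then normalize the rating to 0-1."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    rating = int(response.choices[0].message.content.strip())
    return (rating - 1) / 4  # map 1-5 onto 0.0-1.0
```

In practice, asking the judge to explain its reasoning before emitting a rating, or constraining the output format, tends to make scores more consistent.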
Heuristics are a valuable, objective way to score responses. We’ve found that the best heuristics fall into one of two buckets:
Importantly, to make heuristic scoring as valuable as possible, engineering teams should be able to see updated scores after every change, quickly drill down into interesting examples, and add or tweak heuristics.
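As a sketch of what the objective bucket can look like, here are two simple heuristic scorers in plain Python. The specific checks, valid JSON and a length budget, are assumptions about what an application might care about; swap in whatever properties matter for yours.

```python
import json

def valid_json_score(output: str) -> float:
    """Return 1.0 if the model's output parses as JSON, else 0.0 (illustrative check)."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def length_score(output: str, max_chars: int = 500) -> float:
    """Penalize outputs that exceed a length budget, scaling linearly down to 0."""
    if len(output) <= max_chars:
        return 1.0
    return max(0.0, 1.0 - (len(output) - max_chars) / max_chars)
```

Because these checks are deterministic and cheap, they can run on every change, which makes regressions easy to spot.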
Comparative evals compare an updated set of responses against a previous iteration. This is particularly helpful for understanding whether your application is improving as you make changes. Comparative evals also do not require expected responses, so they can be a great option for very subjective tasks. Here are a few examples:
Braintrust natively supports hill climbing, which allows you to iteratively compare new outputs to previous ones.
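A common way to implement this kind of comparison is a pairwise LLM judge that, for each input, picks between the previous response and the new one. The sketch below is an illustrative version of that pattern, not Braintrust's built-in hill-climbing implementation; the prompt, model, and tie handling are all assumptions.

```python
from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """Two AI responses to the same request are shown below.
Request: {input}
Response A: {a}
Response B: {b}
Which response is better overall? Reply with exactly "A", "B", or "TIE"."""

def compare(input_text: str, previous: str, candidate: str) -> float:
    """Return 1.0 if the new output beats the previous one, 0.5 on a tie, 0.0 otherwise."""
    # In practice, consider running the judge twice with A/B swapped and
    # averaging the verdicts to reduce position bias.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": PAIRWISE_PROMPT.format(input=input_text, a=previous, b=candidate),
        }],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return {"A": 0.0, "B": 1.0}.get(verdict, 0.5)
```

Averaging these verdicts across your dataset gives a win rate against the previous iteration, which is often easier to act on than an absolute score for subjective tasks.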
While there is no replacement for human review, setting up basic structure around automated evals lets developers start iterating quickly. The ideal AI dev loop enables teams to immediately understand performance, track experiments over time, identify and drill down into interesting examples, and codify what “good” looks like. This also makes human review time much higher leverage, since you can point reviewers to useful examples and continuously incorporate their scores.
Getting this foundation in place does not require a big time investment up front. A single scoring function with 10-30 examples is enough to enable teams to start iterating. We’ve seen teams start from that foundation and very quickly scale into making 50+ updates per day across their AI applications, evaluation methods, and test data.
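For a sense of how small that foundation can be, the sketch below strings the pieces together: a handful of hand-written examples, one scorer, and a loop that prints an aggregate score after every change. The application stub, example data, and scorer are placeholders, and the whole thing is a plain-Python stand-in for what an evaluation run does.

```python
# A minimal eval run: roughly 10-30 hand-written examples, one scorer, one aggregate number.
# `run_app`, the example data, and the scorer below are all placeholders for this sketch.

examples = [
    {"input": "How do I reset my password?", "expected": "Forgot password"},
    # add 10-30 representative cases here, pulled from logs or written by hand
]

def run_app(question: str) -> str:
    """Your application under test (placeholder)."""
    return "Click the 'Forgot password' link on the login page."

def contains_expected(example: dict, output: str) -> float:
    """Trivial placeholder scorer: does the output mention the expected phrase?"""
    return 1.0 if example["expected"].lower() in output.lower() else 0.0

def evaluate(scorer) -> float:
    """Run the app over every example and report the average score."""
    scores = [scorer(ex, run_app(ex["input"])) for ex in examples]
    return sum(scores) / len(scores)

print(f"average score: {evaluate(contains_expected):.2f}")
```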
At Braintrust, we obsess over making the AI development process as smooth and iterative as possible. Setting up evaluations in Braintrust takes less than an hour and makes a huge difference. If you want to learn more, sign up, check out our docs, or get in touch!