Evaluation metrics turn subjective AI quality into measurable numbers. Without metrics, you rely on manual review and intuition. You can't systematically improve what you can't measure.
This guide covers evaluation metrics for LLMs: what they measure, when to use them, and how to implement them systematically. We'll explore metrics for general LLM outputs, RAG applications, and specialized use cases, with practical implementation examples.
AI outputs are non-deterministic and subjective. The same prompt can produce different responses. Quality depends on context, user intent, and domain-specific requirements. Traditional software testing (checking for exact matches or return codes) doesn't work.
You need metrics that capture quality dimensions relevant to your use case: Is the answer factually correct? Is it relevant to the question? Does it follow instructions? Is it safe and appropriate?
Without metrics, quality evaluation becomes manual and slow. Someone reads each output, judges it subjectively, and records their assessment. This doesn't scale. You can't test thousands of examples. You can't track quality over time. You can't identify which prompt changes improve performance.
Systematic improvement: Metrics make quality measurable. You can test prompt changes and know if they improved factuality, relevance, or coherence. Iterate based on data, not guesswork.
Regression detection: Track metrics across versions. When factuality drops from 85% to 72%, you know something broke. Without metrics, regressions surface as user complaints.
A/B testing: Compare prompt variants quantitatively. Variant A scores 0.83 on relevance, variant B scores 0.91. Deploy B. Without metrics, you can't make data-driven decisions.
Continuous monitoring: Run metrics on production traffic. Detect quality degradation in real-time. Alert when scores drop below thresholds. Respond before users notice problems.
Metrics fall into different categories based on what they measure and how they're implemented.
Task-agnostic metrics apply broadly across use cases:
These work for most LLM applications without customization.
Task-specific metrics measure criteria unique to your application:
Task-specific metrics require domain knowledge to implement correctly.
Code-based metrics use deterministic logic:
Code-based metrics are fast, cheap, and deterministic. Use them whenever possible.
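In practice, a code-based scorer is just a deterministic function that maps an output to a score between 0 and 1. A minimal sketch (the citation-required rule below is only illustrative):
// Code-based scorer: 1 if the output cites at least one URL, 0 otherwise
function citesSourceScorer({ output }: { output: string }): number {
  return /https?:\/\/\S+/.test(output) ? 1 : 0;
}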
LLM-based metrics use language models to judge outputs:
LLM-based metrics (LLM-as-a-judge) handle nuanced, subjective criteria that code can't capture. They cost more and add variability, but enable evaluation of complex quality dimensions.
Reference-based metrics compare output to an expected answer:
These require ground truth data. Use when you have known correct answers.
Reference-free metrics evaluate outputs independently:
Use when there's no single correct answer or when expected outputs aren't available.
Measures whether the output contains accurate, verifiable information. Critical for applications providing factual answers, summarization, or question answering.
How Braintrust measures factuality: Use the Factuality scorer from Braintrust's autoevals library. This LLM-as-a-judge scorer compares the output against the expected answer to determine whether the output's claims are consistent with it.
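A minimal example of the usage pattern (the question and answers are illustrative):
import { Factuality } from "autoevals";

// The judge compares the submitted output to the expected answer for the given input
const result = await Factuality({
  input: "Which country has the highest population?",
  output: "People's Republic of China",
  expected: "China",
});
console.log(result.score); // closer to 1 the more the output agrees with the expected answer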
Use cases:
Limitations: Requires a reliable expected answer (or source context) to compare against. Judge accuracy also depends on the judge model's own knowledge and instruction-following capability.
Evaluates whether the output appropriately addresses the input. A factually correct answer that doesn't address the question scores low on relevance.
How Braintrust measures relevance: Braintrust provides multiple relevance scorers through the autoevals library for different use cases:
import { AnswerRelevancy, ContextRelevancy } from "autoevals";

// Measure answer relevance with Braintrust
const answerScore = await AnswerRelevancy({
  input: "What are the side effects of aspirin?",
  output: "Aspirin can cause stomach upset, bleeding, and allergic reactions",
});

// Measure context relevance for RAG systems
const contextScore = await ContextRelevancy({
  input: "What are the side effects of aspirin?",
  context: "Aspirin is a pain reliever and anti-inflammatory drug...",
});
Use cases:
Coherence measures logical flow and consistency within the text. Ideas should connect naturally. Arguments should follow logically. Pronouns should reference correctly.
Fluency measures grammatical correctness and naturalness. The text should read smoothly without awkward phrasing or errors.
How Braintrust measures coherence and fluency: Create custom LLM-as-a-judge scorers in Braintrust for these subjective qualities:
import { LLMClassifierFromTemplate } from "autoevals";

// Braintrust supports custom coherence scoring
const CoherenceScorer = LLMClassifierFromTemplate({
  name: "Coherence",
  promptTemplate: `Rate the coherence of this text from 1-5:
{{output}}
Does it flow logically? Are ideas connected? Rate 1-5.`,
  choiceScores: { "1": 0.2, "2": 0.4, "3": 0.6, "4": 0.8, "5": 1.0 },
});
Use cases:
Evaluates whether output is safe, appropriate, and free from harmful content. Checks for toxicity, bias, offensive language, and policy violations.
How Braintrust measures safety: Create custom code-based scorers to pattern match against blocked words/phrases, or create LLM-as-a-judge scorers to evaluate against a specific set of safety criteria.
import { LLMClassifierFromTemplate } from "autoevals";

// Option 1: Code-based pattern matching in Braintrust
const blockedTerms = ["spam", "scam", "offensive-term"];
function blockedTermsScorer({ output }: { output: string }): number {
  const lowerOutput = output.toLowerCase();
  const hasBlockedTerm = blockedTerms.some((term) => lowerOutput.includes(term));
  return hasBlockedTerm ? 0 : 1;
}

// Option 2: LLM-as-a-judge for safety in Braintrust
const SafetyScorer = LLMClassifierFromTemplate({
  name: "ContentSafety",
  promptTemplate: `Evaluate if this content is safe and appropriate:
Content: {{output}}
Check for:
- Toxicity or offensive language
- Hate speech or discrimination
- Violence or harmful content
- Inappropriate sexual content
Is this content safe? Respond with 1 for safe, 0 for unsafe.`,
  choiceScores: { "1": 1, "0": 0 },
  useCoT: true,
});
Use cases:
Categories Braintrust checks:
Measures how close the output is to the expected answer in meaning, regardless of exact wording. "The capital of France is Paris" and "Paris is France's capital" score high similarity despite different phrasing.
How Braintrust measures semantic similarity: Use Braintrust's EmbeddingSimilarity scorer from autoevals:
import { EmbeddingSimilarity } from "autoevals";

// Braintrust compares semantic similarity using embeddings
const similarityScore = await EmbeddingSimilarity({
  output: "Paris is France's capital city",
  expected: "The capital of France is Paris",
});
Use cases:
Limitations: Doesn't distinguish between different correct answers with similar embeddings. Can give false positives for semantically similar but factually wrong answers.
Exact match: Binary metric. Does the output exactly match the expected text?
Levenshtein distance: Counts the minimum number of edits (insertions, deletions, substitutions) needed to transform the output into the expected text, normalized to a 0-1 score.
How Braintrust measures exact match and string distance: Use Braintrust's code-based scorers from autoevals for fast, deterministic evaluation:
import { Levenshtein } from "autoevals";

// Braintrust provides exact match and string distance scoring
const exactMatch = (output: string, expected: string) =>
  output === expected ? 1 : 0;

// Braintrust measures edit distance
const distance = await Levenshtein({
  output: "The answer is 42",
  expected: "42",
});
Use cases:
Limitations: Brittle. "The answer is 42" and "42" score poorly despite same semantic content.
RAG (Retrieval-Augmented Generation) systems have unique evaluation requirements. Braintrust provides specialized scorers through autoevals to measure both retrieval quality and generation quality.
Measures whether retrieved context contains relevant information. How much of the retrieved content actually helps answer the query?
How Braintrust measures context precision: Use the ContextPrecision scorer from Braintrust's autoevals library:
import { ContextPrecision } from "autoevals";

// Braintrust evaluates RAG context precision
const precisionScore = await ContextPrecision({
  input: "What are the health benefits of exercise?",
  context: [
    "Doc1: Exercise improves cardiovascular health...",
    "Doc2: The weather today...",
  ],
  expected: "Exercise improves heart health and reduces disease risk",
});
Use cases:
Measures whether the retrieved context contains all information needed to answer the query. Did retrieval miss important documents?
How Braintrust measures context recall: Use the ContextRecall scorer from Braintrust's autoevals library:
import { ContextRecall } from "autoevals";

// Braintrust checks if retrieval captured all necessary information
const recallScore = await ContextRecall({
  input: "What are the health benefits of exercise?",
  context: ["Exercise improves cardiovascular health and mental well-being"],
  expected:
    "Exercise improves heart health, mental well-being, and weight management",
});
Use cases:
Evaluates whether retrieved documents are topically related to the query. Similar to precision but focused on topical relevance rather than direct usefulness.
How Braintrust measures context relevance: Use the ContextRelevancy scorer shown earlier in the Relevance section. Braintrust evaluates whether each retrieved document relates to the query.
Measures whether the generated answer is grounded in the retrieved context. Does the LLM hallucinate information not present in the context?
How Braintrust measures faithfulness: Use the Faithfulness scorer from Braintrust's autoevals library:
import { Faithfulness } from "autoevals";

// Braintrust detects hallucinations in RAG outputs
const faithfulnessScore = await Faithfulness({
  context:
    "Paris is the capital of France. It has a population of 2.1 million.",
  output:
    "Paris, with 2.1 million people, is France's capital and largest city",
});
Use cases:
Combines factual accuracy with completeness. Is the answer both correct and complete?
How Braintrust measures answer correctness: Use the AnswerCorrectness scorer from Braintrust's autoevals library:
import { AnswerCorrectness } from "autoevals";

// Braintrust evaluates both accuracy and completeness
const correctnessScore = await AnswerCorrectness({
  input: "What is the capital of France?",
  output: "Paris",
  expected: "Paris is the capital of France",
});
Use cases:
Checks if output is valid JSON that can be parsed.
How Braintrust measures JSON validity: Use the JSONDiff scorer from Braintrust's autoevals library to compare structured output against an expected value, or create a custom code-based scorer for a simple validity check:
import { JSONDiff } from "autoevals";

// Braintrust validates and compares JSON outputs
const jsonScore = await JSONDiff({
  output: '{"status": "success", "count": 42}',
  expected: '{"status": "success", "count": 42}',
});

// Or use a simple custom validator in Braintrust
function validateJSON({ output }: { output: string }): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}
Use cases:
Evaluates whether generated SQL queries are syntactically valid and semantically correct.
How Braintrust measures SQL correctness: Create custom code-based scorers in Braintrust for multi-level SQL validation:
// Braintrust supports custom SQL validation scorers
// (sqlParser and database are placeholders for your SQL parser and database client)
function validateSQLSyntax({ output }: { output: string }): number {
  // Parse SQL and check syntax
  try {
    sqlParser.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

async function validateSQLExecution({ output }: { output: string }): Promise<number> {
  // Execute SQL and check if it runs without errors
  try {
    await database.query(output);
    return 1;
  } catch {
    return 0;
  }
}
Use cases:
Measures how close a numeric output is to the expected value, normalized to a 0-1 score.
How Braintrust measures numeric difference: Use the NumericDiff scorer from Braintrust's autoevals library:
import { NumericDiff } from "autoevals";

// Braintrust scores numeric accuracy
const numericScore = await NumericDiff({
  output: 42,
  expected: 40,
});
Use cases:
Braintrust provides comprehensive infrastructure for implementing, tracking, and acting on evaluation metrics.
Braintrust uses the autoevals library to provide 25+ pre-built scorers you can use immediately. These scorers cover common LLM evaluation scenarios:
LLM-as-a-judge scorers in Braintrust:
RAG scorers in Braintrust:
Heuristic scorers in Braintrust:
Use Braintrust's built-in metrics out of the box:
import { Eval } from "braintrust";
import { Factuality, ContextRelevancy, AnswerCorrectness } from "autoevals";

// Braintrust runs multiple metrics on every test case
Eval("RAG System", {
  data: () => testCases,
  task: async (input) => await ragPipeline(input),
  scores: [Factuality, ContextRelevancy, AnswerCorrectness],
});
Braintrust supports custom code-based scorers for domain-specific requirements. These scorers are fast, deterministic, and cost nothing to run:
When to use custom code-based scorers in Braintrust:
function validJSONScorer({ output }: { output: string }): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

function lengthConstraintScorer({ output }: { output: string }): number {
  return output.length <= 100 ? 1 : 0;
}
Braintrust enables custom LLM-based scorers using LLMClassifierFromTemplate from autoevals. Create scorers for subjective, domain-specific quality criteria:
When to use custom LLM-as-a-judge scorers in Braintrust:
import { LLMClassifierFromTemplate } from "autoevals";

// Braintrust supports custom LLM-based evaluation
const domainAccuracyScorer = LLMClassifierFromTemplate({
  name: "MedicalAccuracy",
  promptTemplate: `Evaluate if the answer correctly addresses the medical question.
Question: {{input}}
Answer: {{output}}
Expected criteria: {{expected}}
Is the answer medically accurate and complete? Respond with 1 for yes, 0 for no.`,
  choiceScores: { "1": 1, "0": 0 },
  useCoT: true, // Braintrust enables chain-of-thought reasoning
});
Run evals with multiple scorers. Braintrust tracks all metrics for each test case and aggregates results:
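A sketch of what this looks like, mixing a built-in autoevals scorer with a custom code-based scorer (answerQuestion and the dataset are placeholders for your own application code and test data):
import { Eval } from "braintrust";
import { Factuality, Levenshtein } from "autoevals";

// Custom code-based scorer returning a named score
function withinLengthScorer({ output }: { output: string }) {
  return { name: "WithinLength", score: output.length <= 500 ? 1 : 0 };
}

Eval("Support Bot", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Use the reset link on the login page.",
    },
  ],
  task: async (input) => await answerQuestion(input), // answerQuestion is a placeholder
  scores: [Factuality, Levenshtein, withinLengthScorer],
});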
Braintrust's GitHub Action integration runs evaluations automatically on every pull request. View metric changes directly in PR comments:
Factuality: 0.85 → 0.91 (+0.06)
Relevance: 0.78 → 0.82 (+0.04)
Safety: 1.00 → 1.00 (no change)
Braintrust supports quality gates that block merges if metrics degrade below thresholds, preventing quality regressions before code ships.
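Conceptually, a quality gate is just a threshold check on aggregated metrics. A hypothetical sketch of that check (not the built-in GitHub Action configuration):
// Hypothetical quality gate: fail the build when a metric drops below its threshold
const thresholds: Record<string, number> = { Factuality: 0.85, Relevance: 0.75 };

function checkQualityGate(summary: Record<string, number>): void {
  for (const [metric, minimum] of Object.entries(thresholds)) {
    if ((summary[metric] ?? 0) < minimum) {
      console.error(`${metric} score ${summary[metric]} is below threshold ${minimum}`);
      process.exit(1);
    }
  }
}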
Run scorers on production traffic to monitor quality continuously. Online scoring evaluates traces asynchronously in the background after they're logged, with zero impact on request latency.
Configuration: Set up scoring rules at the project level through Braintrust's Configuration page. Each rule specifies:
How it works: When your application logs traces normally, Braintrust automatically applies configured scoring rules in the background. Scores appear as child spans in your logs with evaluation details.
import { initLogger, traced } from "braintrust";

const logger = initLogger({ projectName: "Production" });

async function handleUserQuery(query: string) {
  return traced(async (span) => {
    const response = await llm.generate(query); // llm.generate is a placeholder for your model call
    // Just log your traces normally
    span.log({ input: query, output: response });
    return response;
  });
}
After deployment, configure online scoring rules in the UI. Scoring happens automatically based on your rules without code changes.
Manual scoring: You can also score historical logs retroactively through the UI by selecting logs and applying scorers, useful for testing new evaluation criteria before enabling them as online rules.
Begin with exact match or string distance. Establish baselines. Add sophistication when simple metrics prove insufficient.
No single metric captures all quality dimensions. Use combinations:
LLM-as-a-judge introduces variability. Validate scorers against human judgments. Adjust prompts if scores don't align with human evaluation.
Use chain-of-thought in scorer prompts to understand reasoning. This helps debug score disagreements.
Generic metrics provide broad coverage but miss use-case-specific quality dimensions. Invest in custom metrics for criteria that matter to your application:
Metrics matter most when tracked across iterations. Compare scores before and after prompt changes, model updates, or retrieval modifications. Trend lines reveal gradual quality shifts that point-in-time evaluation misses.
Scoring every production request can be expensive. Configure online scoring rules with sampling rates based on:
Online scoring runs asynchronously with zero latency impact, but costs still accumulate based on volume. Balance coverage with evaluation costs.
Optimizing only for factuality might produce dry, technically correct but unhelpful answers. Balance multiple quality dimensions.
Metrics averaged across your test set can hide systematic failures on specific input types. Analyze score distributions. Identify low-scoring subgroups.
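One way to surface these subgroups, assuming you export per-example results that include a score and an input category (the field names below are illustrative):
// Compute the mean score per input category to spot low-scoring subgroups
type ExampleResult = { category: string; score: number };

function meanScoreByCategory(results: ExampleResult[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const { category, score } of results) {
    totals[category] ??= { sum: 0, count: 0 };
    totals[category].sum += score;
    totals[category].count += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, { sum, count }]) => [cat, sum / count]),
  );
}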
Exact match doesn't work for open-ended questions. Semantic similarity doesn't distinguish between different correct answers. Match metrics to your use case.
Running metrics on 10 examples doesn't reveal patterns. You need hundreds to thousands of diverse test cases to reliably measure quality.
Scorer changes affect metric values. Track scorer versions alongside prompt versions to ensure apples-to-apples comparisons.
Braintrust provides complete infrastructure for implementing and tracking evaluation metrics:
Get started with Braintrust for free with 1 million trace spans included, no credit card required.
What's the difference between metrics and scorers?
In practice, the terms are used interchangeably. Technically, a metric is what you're measuring (factuality, relevance) while a scorer is the implementation that measures it. In Braintrust, scorers are functions that return scores for specific metrics.
Should I use code-based or LLM-based scorers?
Use code-based scorers whenever possible because they're faster, cheaper, and deterministic. Use LLM-based scorers for subjective criteria that code can't capture: tone, creativity, nuanced accuracy. Many applications benefit from both types.
How many evaluation metrics should I track?
Start with 2-3 metrics covering your most critical quality dimensions. Add more as needed, but avoid tracking metrics you won't act on. More metrics mean more complexity in interpretation and higher evaluation costs.
Can I use the same metrics for development and production?
Yes, but sampling strategies differ. In development, run all metrics on your full test suite during evaluation. In production, configure online scoring rules with appropriate sampling rates (1-10% for high-volume applications, higher for low volume). Set up multiple rules with different sampling rates to prioritize inexpensive metrics at higher rates and expensive LLM-based metrics at lower rates.
How do I know if my custom scorer is reliable?
Validate against human judgments. Have humans score 100-200 examples. Compare human scores to your scorer's outputs. Calculate correlation. If alignment is poor, refine the scorer prompt or logic. Iterate until scores match human judgment reasonably well.
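A rough sketch of the correlation step, assuming paired arrays of human and scorer scores for the same examples:
// Pearson correlation between paired human scores and scorer outputs
function pearsonCorrelation(humanScores: number[], scorerScores: number[]): number {
  const n = humanScores.length;
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const meanHuman = mean(humanScores);
  const meanScorer = mean(scorerScores);
  let covariance = 0;
  let varHuman = 0;
  let varScorer = 0;
  for (let i = 0; i < n; i++) {
    const dh = humanScores[i] - meanHuman;
    const ds = scorerScores[i] - meanScorer;
    covariance += dh * ds;
    varHuman += dh * dh;
    varScorer += ds * ds;
  }
  return covariance / Math.sqrt(varHuman * varScorer);
}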