Spring AI RAG Evaluation — Measure and Improve Answer Quality
Building a RAG chatbot is easy. Knowing whether it's actually working well requires systematic evaluation. This tutorial covers using Spring AI's built-in evaluators, building custom evaluation metrics, and setting up automated quality gates to detect regressions before they reach production.
Key RAG Evaluation Metrics
Metric              What It Measures                              Ideal Value
─────────────────────────────────────────────────────────────────────────────
Faithfulness        Is the answer grounded in context?            > 0.80
Answer Relevancy    Does the answer address the question?         > 0.85
Context Recall      Does retrieved context contain the answer?    > 0.75
Context Precision   Are retrieved chunks relevant?                > 0.70
RAGAS (Retrieval Augmented Generation Assessment) is an open-source evaluation framework that reports these four metrics; they can also be combined into a single score.
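One way to fold the four metrics into a single number is a harmonic mean, which drops sharply when any one metric is weak. The sketch below assumes that aggregation (Spring AI does not provide it) and reuses the target values from the table above. `RagScoreSketch` and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RagScoreSketch {

    // Harmonic mean: n / sum(1/x). Returns 0 if any score is 0 or negative,
    // so a single failing metric sinks the combined score.
    static double harmonicMean(double... scores) {
        double reciprocalSum = 0.0;
        for (double s : scores) {
            if (s <= 0.0) {
                return 0.0;
            }
            reciprocalSum += 1.0 / s;
        }
        return scores.length == 0 ? 0.0 : scores.length / reciprocalSum;
    }

    // Target thresholds copied from the metrics table.
    static Map<String, Double> targets() {
        Map<String, Double> t = new LinkedHashMap<>();
        t.put("faithfulness", 0.80);
        t.put("answerRelevancy", 0.85);
        t.put("contextRecall", 0.75);
        t.put("contextPrecision", 0.70);
        return t;
    }

    // True only if every metric strictly exceeds its target; a missing metric counts as 0.
    static boolean meetsTargets(Map<String, Double> measured) {
        return targets().entrySet().stream()
                .allMatch(e -> measured.getOrDefault(e.getKey(), 0.0) > e.getValue());
    }

    public static void main(String[] args) {
        Map<String, Double> measured = Map.of(
                "faithfulness", 0.90, "answerRelevancy", 0.88,
                "contextRecall", 0.80, "contextPrecision", 0.60);
        System.out.println("combined = " + harmonicMean(0.90, 0.88, 0.80, 0.60));
        System.out.println("meets targets = " + meetsTargets(measured));
    }
}
```

Checking each metric against its own target, as `meetsTargets` does, is usually more informative for debugging than the combined number alone, because it tells you whether retrieval or generation is the weak side.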
Spring AI Built-in Evaluators
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
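import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.evaluation.*;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.stereotype.Service;

@Service
public class RagEvaluatorService {

    private final RelevancyEvaluator relevancyEvaluator;
    private final FactCheckingEvaluator factCheckEvaluator;

    // Spring AI provides these evaluators. Both use an AI judge and are
    // constructed from a ChatClient.Builder, not from a built ChatClient.
    public RagEvaluatorService(ChatClient.Builder builder) {
        // Option method names vary by release: newer versions use
        // .model(...) and .temperature(0.0) instead of the with* forms.
        ChatClient.Builder judgeBuilder = builder
            .defaultOptions(OpenAiChatOptions.builder()
                .withModel("gpt-4o")
                .withTemperature(0.0f) // deterministic judging
                .build());
        this.relevancyEvaluator = new RelevancyEvaluator(judgeBuilder);
        this.factCheckEvaluator = new FactCheckingEvaluator(judgeBuilder);
    }

    public EvaluationResult evaluate(String question, String answer, String context) {
        // EvaluationRequest expects the retrieved context as a list of documents,
        // so wrap the context string in a single Document.
        EvaluationRequest evalRequest = new EvaluationRequest(
            question, List.of(new Document(context)), answer);

        // Check whether the answer is relevant to the question
        EvaluationResponse relevancyResult = relevancyEvaluator.evaluate(evalRequest);
        // Check whether the answer is supported by the context
        EvaluationResponse factualResult = factCheckEvaluator.evaluate(evalRequest);

        // The built-in evaluators are primarily pass/fail; check how your
        // Spring AI version populates getScore() before relying on it.
        return new EvaluationResult(
            relevancyResult.isPass(),
            factualResult.isPass(),
            relevancyResult.getScore(),
            factualResult.getScore()
        );
    }
}

public record EvaluationResult(
    boolean isRelevant,
    boolean isFactual,
    double relevancyScore,
    double factualScore
) {
    // A sample passes only if both verdicts are positive and both scores exceed 0.7
    public boolean pass() {
        return isRelevant && isFactual && relevancyScore > 0.7 && factualScore > 0.7;
    }
}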
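Conceptually, an AI-judge evaluator sends the question, the retrieved context, and the answer to a model and parses a verdict from the reply. The sketch below illustrates that idea in plain Java. It is not Spring AI's implementation, and the prompt wording, `Verdict`, and `YesNoJudgeSketch` are all hypothetical.

```java
import java.util.Locale;
import java.util.function.Function;

final class YesNoJudgeSketch {

    // Hypothetical verdict type; not part of Spring AI.
    record Verdict(boolean pass, String rawReply) {}

    // Builds a prompt that asks the judge model for a one-word verdict.
    static String buildPrompt(String question, String context, String answer) {
        return String.join("\n",
                "Decide whether the ANSWER addresses the QUESTION using only the CONTEXT.",
                "Reply with YES or NO and nothing else.",
                "QUESTION: " + question,
                "CONTEXT: " + context,
                "ANSWER: " + answer);
    }

    // The judge is any function from prompt text to reply text, so a real
    // model call or a test stub can be plugged in.
    static Verdict evaluate(Function<String, String> judge,
                            String question, String context, String answer) {
        String reply = judge.apply(buildPrompt(question, context, answer));
        String normalized = reply == null ? "" : reply.trim().toLowerCase(Locale.ROOT);
        // Anything that does not start with "yes" is treated as a failure.
        return new Verdict(normalized.startsWith("yes"), reply);
    }

    public static void main(String[] args) {
        // Stub judge: says YES when the answer mentions the advisor class.
        Function<String, String> stub =
                prompt -> prompt.contains("ANSWER: Use QuestionAnswerAdvisor") ? "YES" : "NO";
        Verdict v = evaluate(stub,
                "How do you implement RAG with Spring AI?",
                "QuestionAnswerAdvisor retrieves documents from a VectorStore.",
                "Use QuestionAnswerAdvisor with a VectorStore");
        System.out.println("pass = " + v.pass());
    }
}
```

Treating every reply other than a clear "yes" as a failure keeps the parser conservative: a rambling or malformed judge reply lowers the pass rate instead of silently inflating it.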
Building an Evaluation Dataset
// Define ground-truth question-answer pairs for evaluation
public record EvalSample(
    String id,
    String question,
    String expectedAnswer,      // what a correct answer should contain
    List<String> requiredFacts  // specific facts the answer must include
) {}

@Component
public class EvalDataset {

    public List<EvalSample> getSpringAiSamples() {
        return List.of(
            new EvalSample(
                "sa-001",
                "How do you configure Spring AI ChatClient?",
                "Use ChatClient.Builder with defaultSystem() and build()",
                List.of("ChatClient.Builder", "defaultSystem", "build()")
            ),
            new EvalSample(
                "sa-002",
                "What is the difference between VectorStore and EmbeddingModel?",
                "EmbeddingModel converts text to vectors. VectorStore stores and searches vectors.",
                List.of("EmbeddingModel", "VectorStore", "embeddings", "similarity search")
            ),
            new EvalSample(
                "sa-003",
                "How do you implement RAG with Spring AI?",
                "Use QuestionAnswerAdvisor with a VectorStore",
                List.of("QuestionAnswerAdvisor", "VectorStore", "SearchRequest")
            )
        );
    }
}
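Hand-written samples tend to drift as a dataset grows, so it can help to sanity-check them before spending judge calls. The sketch below is one possible check, assuming the same `EvalSample` shape as above; the validator class and its rules (unique ids, non-blank question, at least one required fact) are hypothetical choices, not part of the tutorial's pipeline.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EvalDatasetValidatorSketch {

    // Mirrors the EvalSample record defined in the tutorial.
    record EvalSample(String id, String question, String expectedAnswer,
                      List<String> requiredFacts) {}

    // Returns human-readable problems; an empty list means the dataset looks usable.
    static List<String> validate(List<EvalSample> samples) {
        List<String> problems = new ArrayList<>();
        Set<String> seenIds = new HashSet<>();
        for (EvalSample s : samples) {
            if (!seenIds.add(s.id())) {
                problems.add("duplicate id: " + s.id());
            }
            if (s.question() == null || s.question().isBlank()) {
                problems.add(s.id() + ": blank question");
            }
            if (s.requiredFacts() == null || s.requiredFacts().isEmpty()) {
                problems.add(s.id() + ": no required facts");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<EvalSample> samples = List.of(
                new EvalSample("sa-001", "How do you configure Spring AI ChatClient?",
                        "Use ChatClient.Builder with defaultSystem() and build()",
                        List.of("ChatClient.Builder", "defaultSystem", "build()")),
                new EvalSample("sa-001", "", "duplicate id and blank question", List.of()));
        validate(samples).forEach(System.out::println);
    }
}
```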
Automated Evaluation Pipeline
import java.util.ArrayList;
import java.util.List;

import org.springframework.stereotype.Service;

@Service
public class RagEvaluationPipeline {

    // RagChatService is the application's own RAG service (not shown here);
    // answerWithContext() is expected to return the answer plus the retrieved context.
    private final RagChatService ragService;
    private final RagEvaluatorService evaluator;
    private final EvalDataset dataset;

    // Constructor injection initializes the final fields
    public RagEvaluationPipeline(RagChatService ragService,
                                 RagEvaluatorService evaluator,
                                 EvalDataset dataset) {
        this.ragService = ragService;
        this.evaluator = evaluator;
        this.dataset = dataset;
    }

    public EvaluationReport runEvaluation() {
        List<EvalSample> samples = dataset.getSpringAiSamples();
        List<SampleResult> results = new ArrayList<>();

        for (EvalSample sample : samples) {
            System.out.println("Evaluating: " + sample.id());

            // 1. Run RAG to get answer + retrieved context
            RagResult ragResult = ragService.answerWithContext(sample.question());

            // 2. Check required facts (case-insensitive substring match)
            boolean allFactsPresent = sample.requiredFacts().stream()
                .allMatch(fact -> ragResult.answer().toLowerCase()
                    .contains(fact.toLowerCase()));

            // 3. AI judge evaluation
            EvaluationResult evalResult = evaluator.evaluate(
                sample.question(), ragResult.answer(), ragResult.context());

            results.add(new SampleResult(
                sample.id(),
                sample.question(),
                ragResult.answer(),
                allFactsPresent,
                evalResult
            ));
        }
        return buildReport(results);
    }

    private EvaluationReport buildReport(List<SampleResult> results) {
        double avgRelevancy = results.stream()
            .mapToDouble(r -> r.evalResult().relevancyScore())
            .average().orElse(0);
        double avgFactual = results.stream()
            .mapToDouble(r -> r.evalResult().factualScore())
            .average().orElse(0);
        long passed = results.stream().filter(r -> r.evalResult().pass()).count();
        // Guard against an empty dataset, which would otherwise yield NaN
        double passRate = results.isEmpty() ? 0.0 : (double) passed / results.size();
        return new EvaluationReport(results, avgRelevancy, avgFactual, passRate);
    }
}

record RagResult(String answer, String context) {}

record SampleResult(String id, String question, String answer,
                    boolean allFactsPresent, EvaluationResult evalResult) {}

record EvaluationReport(List<SampleResult> results,
                        double avgRelevancy, double avgFactual, double passRate) {}
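The required-facts step in the pipeline is all-or-nothing, which hides how close an answer came. A minimal sketch of a graded variant follows, using the same case-insensitive substring match; `FactCoverageSketch` and its methods are hypothetical helpers, not part of the pipeline above.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class FactCoverageSketch {

    // Facts from the list that do not appear in the answer
    // (case-insensitive substring match, as in the pipeline).
    static List<String> missingFacts(String answer, List<String> requiredFacts) {
        String haystack = answer.toLowerCase(Locale.ROOT);
        return requiredFacts.stream()
                .filter(f -> !haystack.contains(f.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    // Fraction of required facts present, in [0, 1]; 1.0 when no facts are required.
    static double coverage(String answer, List<String> requiredFacts) {
        if (requiredFacts.isEmpty()) {
            return 1.0;
        }
        int missing = missingFacts(answer, requiredFacts).size();
        return (requiredFacts.size() - missing) / (double) requiredFacts.size();
    }

    public static void main(String[] args) {
        String answer = "Use QuestionAnswerAdvisor with a VectorStore";
        List<String> facts = List.of("QuestionAnswerAdvisor", "VectorStore", "SearchRequest");
        System.out.println("coverage = " + coverage(answer, facts));
        System.out.println("missing  = " + missingFacts(answer, facts));
    }
}
```

The example in `main` feeds in the expected answer for sa-003 and still misses `SearchRequest`, which shows how strict substring matching is: a fact list should only name terms that a good answer is certain to spell out.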
Integration Test — Quality Gate
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class RagQualityGateTest {

    @Autowired
    private RagEvaluationPipeline evaluationPipeline;

    @Test
    void ragAnswerQualityMeetsMinimumThreshold() {
        EvaluationReport report = evaluationPipeline.runEvaluation();

        // Print detailed results first so they are visible even when a gate fails
        report.results().forEach(r ->
            System.out.printf("[%s] pass=%s relevancy=%.2f factual=%.2f%n",
                r.id(), r.evalResult().pass(),
                r.evalResult().relevancyScore(),
                r.evalResult().factualScore()));

        // Quality gates: fail CI if these drop
        assertThat(report.passRate()).as("RAG pass rate")
            .isGreaterThanOrEqualTo(0.80);
        assertThat(report.avgRelevancy()).as("Average relevancy score")
            .isGreaterThanOrEqualTo(0.75);
        assertThat(report.avgFactual()).as("Average factual accuracy score")
            .isGreaterThanOrEqualTo(0.75);
    }
}
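Fixed thresholds catch large failures, but a slow slide from 0.92 to 0.81 still passes them. One possible complement, sketched below, is to compare each run against a stored baseline and flag any metric that falls by more than a tolerance. `RegressionGateSketch`, `Metrics`, and the 0.03 tolerance are assumptions for illustration; how the baseline is persisted is left open.

```java
import java.util.ArrayList;
import java.util.List;

final class RegressionGateSketch {

    // Hypothetical summary of one evaluation run.
    record Metrics(double avgRelevancy, double avgFactual, double passRate) {}

    // Names of the metrics that fell more than `tolerance` below the baseline.
    static List<String> regressions(Metrics baseline, Metrics current, double tolerance) {
        List<String> regressed = new ArrayList<>();
        if (current.avgRelevancy() < baseline.avgRelevancy() - tolerance) {
            regressed.add("avgRelevancy");
        }
        if (current.avgFactual() < baseline.avgFactual() - tolerance) {
            regressed.add("avgFactual");
        }
        if (current.passRate() < baseline.passRate() - tolerance) {
            regressed.add("passRate");
        }
        return regressed;
    }

    public static void main(String[] args) {
        Metrics baseline = new Metrics(0.85, 0.84, 1.0);
        Metrics current = new Metrics(0.80, 0.82, 2.0 / 3.0);
        System.out.println("regressed: " + regressions(baseline, current, 0.03));
    }
}
```

In a test, asserting that the returned list is empty gives a failure message that names the regressed metrics directly.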
Output
Running RAG evaluation on 3 samples...
Evaluating: sa-001
Evaluating: sa-002
Evaluating: sa-003
[sa-001] pass=true relevancy=0.92 factual=0.89
[sa-002] pass=true relevancy=0.87 factual=0.85
[sa-003] pass=false relevancy=0.61 factual=0.73
Evaluation Report:
Average Relevancy: 0.80
Average Factual: 0.82
Pass Rate: 66.7% (2/3)
⚠ Quality gate FAILED: pass rate 66.7% below threshold 80%
RAG improvement actions:
sa-003: QuestionAnswerAdvisor not retrieving correct context
→ Check topK setting, similarity threshold, or document chunking for advisor docs
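The summary lines in this output are not produced by the code shown earlier, which only prints the `Evaluating:` and per-sample lines. A small formatter along the following lines could generate them; `ReportSummarySketch` and its parameters are hypothetical, and the wording simply mirrors the sample output.

```java
import java.util.Locale;

final class ReportSummarySketch {

    // Formats the aggregate section of the report. Locale.ROOT keeps the
    // decimal separator a dot regardless of the machine's locale.
    static String summarize(double avgRelevancy, double avgFactual,
                            int passed, int total, double passRateThreshold) {
        double passRate = total == 0 ? 0.0 : (double) passed / total;
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT, "Evaluation Report:%n"));
        sb.append(String.format(Locale.ROOT, "Average Relevancy: %.2f%n", avgRelevancy));
        sb.append(String.format(Locale.ROOT, "Average Factual: %.2f%n", avgFactual));
        sb.append(String.format(Locale.ROOT, "Pass Rate: %.1f%% (%d/%d)%n",
                passRate * 100, passed, total));
        if (passRate < passRateThreshold) {
            sb.append(String.format(Locale.ROOT,
                    "Quality gate FAILED: pass rate %.1f%% below threshold %.0f%%%n",
                    passRate * 100, passRateThreshold * 100));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summarize(0.80, 0.82, 2, 3, 0.80));
    }
}
```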
Key Points
- Use a stronger model (gpt-4o) as the judge even if your RAG system uses a cheaper model — evaluation quality determines how trustworthy your metrics are
- Set temperature=0.0 for evaluation calls: you want deterministic, consistent scoring, not creative answers
- Required-facts checking is fast and deterministic; combine it with AI-based evaluation to reduce the number of expensive judge calls
- Run the evaluation suite as a nightly CI job on production data samples — catch regressions when you change prompt templates, chunking strategy, or models
- A pass rate below 70% usually signals a retrieval problem (wrong chunks), not a generation problem — start debugging by printing the retrieved context
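The point about combining required-facts checking with AI-based evaluation can be sketched as a two-stage check: run the cheap substring test first and call the judge only when it passes. This is an illustrative assumption about how to wire the two together, not the tutorial's pipeline (which always calls the judge); `TwoStageEvalSketch` is a hypothetical name and the judge is a stub.

```java
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

final class TwoStageEvalSketch {

    // Cheap, deterministic stage: case-insensitive substring match on every fact.
    static boolean allFactsPresent(String answer, List<String> facts) {
        String haystack = answer.toLowerCase(Locale.ROOT);
        return facts.stream().allMatch(f -> haystack.contains(f.toLowerCase(Locale.ROOT)));
    }

    // Runs the cheap check first and only invokes the (expensive) judge when it passes.
    static boolean evaluate(String answer, List<String> facts, Predicate<String> judge) {
        if (!allFactsPresent(answer, facts)) {
            return false; // fail fast, no judge call spent
        }
        return judge.test(answer);
    }

    public static void main(String[] args) {
        AtomicInteger judgeCalls = new AtomicInteger();
        // Stub judge that counts how often it is called and always approves.
        Predicate<String> judge = a -> {
            judgeCalls.incrementAndGet();
            return true;
        };
        List<String> facts = List.of("QuestionAnswerAdvisor", "VectorStore");
        evaluate("Use a database", facts, judge);
        evaluate("Use QuestionAnswerAdvisor with a VectorStore", facts, judge);
        System.out.println("judge calls: " + judgeCalls.get());
    }
}
```

The trade-off is that an answer rejected by the cheap stage never gets a judge score, so averages computed over judged samples only will look better than the system really is; tracking the skipped count alongside them avoids that.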