What is RAG? — Retrieval Augmented Generation with Spring AI Explained
Large language models have a knowledge cutoff — they don't know about your company's internal documents, your latest API specs, or anything that happened after their training date. RAG (Retrieval Augmented Generation) solves this by retrieving relevant information from your own data sources and injecting it into the prompt before the LLM answers.
This tutorial explains the RAG concept, its architecture, and how Spring AI implements each step.
The Core Problem RAG Solves
Without RAG:
User: "What is the refund policy for our product?"
LLM: "I don't have access to your company's specific policies..."
↑ LLM has no idea about your documents
With RAG:
User: "What is the refund policy for our product?"
System fetches: [from your docs] "Refunds accepted within 30 days with receipt..."
LLM sees prompt: "Based on this policy: 'Refunds accepted within 30 days...' — answer: What is the refund policy?"
LLM: "Your refund policy allows returns within 30 days if you provide a receipt."
↑ Accurate answer grounded in your actual documents
RAG Architecture — 4 Steps
┌─────────────────────────────────────────────────────────┐
│ INDEXING (one-time) │
│ │
│ Your Docs (PDF/HTML/DB) │
│ ↓ │
│ Split into small chunks (500-1000 tokens each) │
│ ↓ │
│ Embedding Model converts each chunk → float[] vector │
│ ↓ │
│ Store vectors in Vector Database (PGVector, ChromaDB) │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ RETRIEVAL + GENERATION (per query) │
│ │
│ User asks: "What is the refund policy?" │
│ ↓ │
│ Convert question → vector using same Embedding Model │
│ ↓ │
│ Search vector DB for top-K most similar chunks │
│ ↓ │
│ Build prompt: "Context: [chunks] \n\n Question: ..." │
│ ↓ │
│ LLM generates answer grounded in retrieved context │
└─────────────────────────────────────────────────────────┘
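The "Build prompt" step in the diagram can be sketched in plain Java. This is only an illustration of the idea, not Spring AI's actual prompt template: the class name `RagPromptSketch`, the method `buildPrompt`, and the numbered `[1]`, `[2]` chunk labels are all assumptions made for this example.

```java
import java.util.List;

public class RagPromptSketch {

    // Joins the retrieved chunks into a "Context" section and appends the
    // user's question, following the "Context: [chunks] \n\n Question: ..."
    // layout from the diagram. The numbering of chunks is a hypothetical choice.
    static String buildPrompt(List<String> chunks, String question) {
        StringBuilder sb = new StringBuilder("Context:\n");
        for (int i = 0; i < chunks.size(); i++) {
            sb.append("[").append(i + 1).append("] ")
              .append(chunks.get(i)).append("\n");
        }
        sb.append("\nQuestion: ").append(question);
        return sb.toString();
    }
}
```

Calling `buildPrompt` with the refund and shipping chunks plus the question "What is the refund policy?" yields one string in which the LLM sees the retrieved text before the question. Numbering the chunks is one possible way to let the answer point back to its sources.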
Key Terms
Embedding → A float array (e.g. 1536 numbers) representing the semantic
meaning of a piece of text. Similar texts have similar vectors.
Vector Store → A database that stores embeddings and can find the
closest vectors to a query vector (cosine similarity search).
Chunking → Breaking documents into smaller pieces so that each chunk
covers one focused topic — improves retrieval precision.
Top-K Search → Find the K most semantically similar chunks to a query.
Typical K = 3 to 5.
Context Window → The maximum text an LLM can process. RAG fits relevant
chunks inside this window alongside the user's question.
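To make "cosine similarity" and "top-K search" concrete, here is a minimal brute-force sketch in plain Java. It is not how Spring AI or any particular vector database implements search (production stores typically use approximate indexes); the class `TopKSearchSketch` and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopKSearchSketch {

    // Cosine similarity = dot(a, b) / (|a| * |b|).
    // 1.0 means the vectors point the same way, 0.0 means they are orthogonal.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            // Vectors from different embedding models usually differ in size
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indices of the k stored vectors most similar to the query,
    // best match first. Scans every vector, so it only suits small data sets.
    static List<Integer> topK(float[] query, List<float[]> stored, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            indices.add(i);
        }
        indices.sort(Comparator
                .comparingDouble((Integer i) -> cosine(query, stored.get(i)))
                .reversed());
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

The dimension check also shows why indexing and querying need the same embedding model: vectors of different sizes cannot be compared at all, and vectors of the same size from different models would compare without meaning.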
Spring AI RAG Components Map
Step                 Spring AI Class
─────────────────────────────────────────────
Read documents     → PagePdfDocumentReader, TextReader, JsonReader
Split into chunks  → TokenTextSplitter (a TextSplitter implementation)
Create embeddings  → EmbeddingModel
                     (OpenAiEmbeddingModel, OllamaEmbeddingModel)
Store vectors      → VectorStore
                     (PgVectorStore, ChromaVectorStore, SimpleVectorStore)
Search at query    → vectorStore.similaritySearch(SearchRequest)
Build RAG prompt   → QuestionAnswerAdvisor (runs the search and builds the prompt automatically)
Minimal RAG Example — In-Memory Store
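import jakarta.annotation.PostConstruct;
import java.util.List;
import org.springframework.ai.chat.client.ChatClient;
// Package depends on the Spring AI version: this is the 1.0 location;
// earlier milestones used org.springframework.ai.chat.client.advisor
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class SimpleRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    // Assumes a VectorStore bean (for example SimpleVectorStore) is configured
    public SimpleRagService(ChatClient.Builder builder, VectorStore vectorStore) {
        // The advisor runs a similarity search on every call and adds the results to the prompt.
        // Newer releases also offer QuestionAnswerAdvisor.builder(vectorStore).build()
        this.chatClient = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
        this.vectorStore = vectorStore;
    }

    // Load documents once at startup
    @PostConstruct
    public void loadDocuments() {
        List<Document> docs = List.of(
                new Document("Refunds are accepted within 30 days with a valid receipt."),
                new Document("Shipping takes 3-5 business days for standard delivery."),
                new Document("Contact support at support@java9r.com for order issues.")
        );
        // Split into chunks, then embed and store (in memory when the bean is a SimpleVectorStore)
        vectorStore.add(new TokenTextSplitter().apply(docs));
    }

    // Each call auto-retrieves relevant docs and injects them into the prompt
    public String ask(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}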
Output (exact wording will vary between runs and models)
// ask("How do I return a product?")
"You can return a product within 30 days of purchase. Make sure you have a valid receipt."
// ask("How long does shipping take?")
"Standard delivery takes 3-5 business days."
// ask("Who do I contact for order problems?")
"For order issues, you can contact support at support@java9r.com."
When to Use RAG vs Fine-tuning
Use RAG when:
✔ Your data changes frequently (product docs, policies)
✔ You need citations and source traceability
✔ You want to keep your documents out of model training (note that retrieved chunks are still sent to the LLM provider inside each prompt)
✔ Quick setup — no model training required
Use Fine-tuning when:
✔ You want the model to adopt a specific writing style
✔ You have thousands of labeled question-answer pairs
✔ You want domain-specific vocabulary baked into the model
✔ You can invest in training time and cost
Key Points
- RAG = Retrieve relevant context → Augment the prompt → Generate a grounded answer
- The same embedding model must be used for both indexing and querying — vectors must be in the same space
- Chunk size matters: too small loses context, too large reduces precision; 500-1000 tokens is a common starting range
- Spring AI's QuestionAnswerAdvisor handles retrieval and prompt injection automatically
- RAG does not modify the model: it only changes the input, and the model weights stay unchanged
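The chunk-size trade-off from the key points can be explored with a small plain-Java splitter. This is a rough sketch, not Spring AI's TokenTextSplitter: it counts whitespace-separated words instead of model tokens, and the `overlap` parameter (repeating a few words between neighbouring chunks to preserve context) is an assumption added for this example. The class name `WordChunkerSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordChunkerSketch {

    // Splits text into chunks of at most chunkSize words.
    // Consecutive chunks share `overlap` words so that a sentence cut at a
    // boundary still appears with some of its surrounding context.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }
}
```

For instance, `chunk("a b c d e", 3, 1)` produces the two chunks "a b c" and "c d e". Trying different sizes on your own documents is a quick way to see when chunks become too small to carry context or too large to stay focused.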