What is RAG? — Retrieval Augmented Generation with Spring AI Explained

Large language models only know what was in their training data: they have never seen your company's internal documents or your latest API specs, and they know nothing that happened after their training cutoff. RAG (Retrieval Augmented Generation) addresses this by retrieving relevant information from your own data sources and injecting it into the prompt before the LLM answers.

This tutorial explains the RAG concept, its architecture, and how Spring AI implements each step.

The Core Problem RAG Solves

Without RAG:
  User: "What is the refund policy for our product?"
  LLM:  "I don't have access to your company's specific policies..."
                       ↑ LLM has no idea about your documents

With RAG:
  User: "What is the refund policy for our product?"
  System fetches: [from your docs] "Refunds accepted within 30 days with receipt..."
  LLM sees prompt: "Based on this policy: 'Refunds accepted within 30 days...' — answer: What is the refund policy?"
  LLM:  "Your refund policy allows returns within 30 days if you provide a receipt."
                       ↑ Accurate answer grounded in your actual documents
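
The "With RAG" flow above comes down to assembling a string before the model is called. The following is a minimal sketch in plain Java, assuming a hypothetical `PromptAugmenter` helper and an illustrative template; neither is part of Spring AI, and the wording Spring AI actually sends to the model differs.

```java
import java.util.List;

// Hypothetical helper for illustration only; not a Spring AI class.
final class PromptAugmenter {

    // Puts the retrieved chunks into a context section, then appends the user's question.
    static String augment(List<String> retrievedChunks, String question) {
        StringBuilder prompt = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String chunk : retrievedChunks) {
            prompt.append("- ").append(chunk).append('\n');
        }
        prompt.append("\nQuestion: ").append(question);
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.println(augment(
                List.of("Refunds accepted within 30 days with receipt."),
                "What is the refund policy for our product?"));
    }
}
```

The model never sees your document store directly; it only sees whatever text ends up in this assembled prompt.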

RAG Architecture — 4 Steps

┌─────────────────────────────────────────────────────────┐
│                    INDEXING (one-time)                   │
│                                                          │
│  Your Docs (PDF/HTML/DB)                                 │
│       ↓                                                  │
│  Split into small chunks (500-1000 tokens each)          │
│       ↓                                                  │
│  Embedding Model converts each chunk → float[] vector   │
│       ↓                                                  │
│  Store vectors in Vector Database (PGVector, ChromaDB)   │
└─────────────────────────────────────────────────────────┘
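
To make the "split into small chunks" step concrete, here is a rough sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a stand-in for model tokens, and `WordChunker` is a hypothetical name; Spring AI's `TokenTextSplitter` works on real tokens and uses its own splitting rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: words approximate tokens, which is an assumption of this sketch.
final class WordChunker {

    // Returns chunks of at most chunkSize words; consecutive chunks share `overlap` words.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need chunkSize > overlap >= 0");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // this window already reached the end of the text
            }
        }
        return chunks;
    }
}
```

For example, `WordChunker.chunk("a b c d e f g", 3, 1)` yields the three chunks `a b c`, `c d e`, and `e f g`. The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.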

┌─────────────────────────────────────────────────────────┐
│                   RETRIEVAL + GENERATION (per query)     │
│                                                          │
│  User asks: "What is the refund policy?"                 │
│       ↓                                                  │
│  Convert question → vector using same Embedding Model    │
│       ↓                                                  │
│  Search vector DB for top-K most similar chunks          │
│       ↓                                                  │
│  Build prompt: "Context: [chunks] \n\n Question: ..."    │
│       ↓                                                  │
│  LLM generates answer grounded in retrieved context      │
└─────────────────────────────────────────────────────────┘
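
The "search vector DB for top-K most similar chunks" step can be sketched as a cosine-similarity ranking over stored vectors. This is a toy version under stated assumptions: `TinyVectorIndex` is a hypothetical class that scans every entry on each query, and the vectors are supplied by the caller rather than produced by an embedding model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory index for illustration; it scans every entry on each query.
final class TinyVectorIndex {

    static final class Hit {
        final String text;
        final double score;
        Hit(String text, double score) { this.text = text; this.score = score; }
    }

    private final List<String> texts = new ArrayList<>();
    private final List<float[]> vectors = new ArrayList<>();

    void add(String text, float[] vector) {
        texts.add(text);
        vectors.add(vector);
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // treat a zero vector as dissimilar to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every stored vector against the query and returns the k best, highest score first.
    List<Hit> topK(float[] query, int k) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            hits.add(new Hit(texts.get(i), cosine(query, vectors.get(i))));
        }
        hits.sort((x, y) -> Double.compare(y.score, x.score));
        return hits.subList(0, Math.min(Math.max(k, 0), hits.size()));
    }
}
```

The dimension check in `cosine` is the practical reason the question must be embedded with the same model as the documents: vectors from different models are not comparable, and often do not even have the same length.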

Key Terms

Embedding     → A float array (e.g. 1536 numbers) representing the semantic
                meaning of a piece of text. Similar texts have similar vectors.

Vector Store  → A database that stores embeddings and can find the
                closest vectors to a query vector (commonly ranked by cosine similarity).

Chunking      → Breaking documents into smaller pieces so that each chunk
                covers one focused topic — improves retrieval precision.

Top-K Search  → Find the K most semantically similar chunks to a query.
                Typical K = 3 to 5.

Context Window → The maximum number of tokens (prompt plus response) an LLM can
                process in one request. RAG fits relevant chunks inside this
                window alongside the user's question.
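
The context-window constraint can be illustrated with a small budgeting sketch: take the chunks in ranked order and stop once the next one would no longer fit. `ContextBudget` is a hypothetical helper, and counting whitespace-separated words is only a rough stand-in for the model's real tokenizer; a production system would also reserve room for the answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; word counts approximate token counts.
final class ContextBudget {

    // Rough token estimate: number of whitespace-separated words.
    static int approxTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Keeps chunks in ranked order until the next chunk would exceed maxTokens.
    static List<String> fit(List<String> rankedChunks, String question, int maxTokens) {
        int used = approxTokens(question); // the question itself consumes part of the budget
        List<String> kept = new ArrayList<>();
        for (String chunk : rankedChunks) {
            int cost = approxTokens(chunk);
            if (used + cost > maxTokens) {
                break; // stop at the first chunk that does not fit, preserving rank order
            }
            kept.add(chunk);
            used += cost;
        }
        return kept;
    }
}
```

This is one reason K is usually kept small: every extra chunk spends budget that the question and the answer also need.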

Spring AI RAG Components Map

Step                  Spring AI Class
─────────────────────────────────────────────
Read documents     →  PagePdfDocumentReader
                       TextReader
                       JsonReader

Split into chunks  →  TokenTextSplitter
                       (extend TextSplitter for a custom splitting strategy)

Create embeddings  →  EmbeddingModel
                       (OpenAiEmbeddingModel, OllamaEmbeddingModel)

Store vectors      →  VectorStore
                       (PgVectorStore, ChromaVectorStore, SimpleVectorStore)

Search at query    →  vectorStore.similaritySearch(SearchRequest)

Build RAG prompt   →  QuestionAnswerAdvisor (runs the query-time search and prompt augmentation for you)

Minimal RAG Example — In-Memory Store

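import java.util.List;

import jakarta.annotation.PostConstruct;
import org.springframework.ai.chat.client.ChatClient;
// Package as of Spring AI 1.0; earlier milestones used org.springframework.ai.chat.client.advisor
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service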
public class SimpleRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public SimpleRagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
        this.vectorStore = vectorStore;
    }

    // Load documents once at startup
    @PostConstruct
    public void loadDocuments() {
        List<Document> docs = List.of(
            new Document("Refunds are accepted within 30 days with a valid receipt."),
            new Document("Shipping takes 3-5 business days for standard delivery."),
            new Document("Contact support at support@java9r.com for order issues.")
        );

        // Split into chunks; add() embeds each chunk and stores it (in memory if the VectorStore bean is a SimpleVectorStore)
        vectorStore.add(new TokenTextSplitter().apply(docs));
    }

    // Each call automatically retrieves relevant docs and injects them into the prompt
    public String ask(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}

Sample Output (exact wording varies by model and run)

// ask("How do I return a product?")
"You can return a product within 30 days of purchase. Make sure you have a valid receipt."

// ask("How long does shipping take?")
"Standard delivery takes 3-5 business days."

// ask("Who do I contact for order problems?")
"For order issues, you can contact support at support@java9r.com."

When to Use RAG vs Fine-tuning

Use RAG when:
  ✔ Your data changes frequently (product docs, policies)
  ✔ You need citations and source traceability
  ✔ You want private data to stay in your own store instead of being trained into a model (retrieved chunks are still sent to the LLM provider in each prompt)
  ✔ Quick setup — no model training required

Use Fine-tuning when:
  ✔ You want the model to adopt a specific writing style
  ✔ You have thousands of labeled question-answer pairs
  ✔ You want domain-specific vocabulary baked into the model
  ✔ You can invest in training time and cost

Key Points

  • RAG = Retrieve relevant context → Augment the prompt → Generate a grounded answer
  • The same embedding model must be used for both indexing and querying — vectors must be in the same space
  • Chunk size matters: too small loses context, too large reduces precision. 500–1000 tokens is a common starting range
  • Spring AI's QuestionAnswerAdvisor handles retrieval and prompt augmentation automatically
  • RAG does not modify the model — it only changes the input; the model weights stay unchanged