Java9R: Spring AI with Ollama — Run Local LLMs with Spring Boot (No API Key Needed)

Ollama lets you run large language models locally on your own machine: no API key, no internet connection once the models are downloaded, and no data leaving your system. Spring AI has first-class Ollama support, so you can switch an application between OpenAI and Ollama by changing only the starter dependency and the configuration. This makes it a good fit for development, offline use, and privacy-sensitive applications.

Install and Start Ollama

# Download from https://ollama.com and install

# Pull a model (one-time download)
ollama pull llama3.2          # 2GB, fast, good for general tasks
ollama pull codellama         # 4GB, optimized for code generation
ollama pull mistral           # 4GB, strong reasoning
ollama pull nomic-embed-text  # small embedding model for RAG

# Start Ollama server (listens on port 11434 by default; skip this if the desktop app or a system service already started it)
ollama serve

# Verify:
curl http://localhost:11434/api/tags
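
The same check can be made from Java before the application starts sending prompts. The sketch below is a minimal, JDK-only illustration. The class name `OllamaModelCheck` and the helper `hasModel` are hypothetical, and it assumes `/api/tags` returns a JSON object with a `models` array whose entries carry a `name` field such as `llama3.2:latest`.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OllamaModelCheck {

    // Matches every "name":"..." pair in the /api/tags response body.
    private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns true if the JSON lists the model, with or without a tag such as ":latest". */
    public static boolean hasModel(String tagsJson, String model) {
        Matcher m = NAME.matcher(tagsJson);
        while (m.find()) {
            String name = m.group(1);
            if (name.equals(model) || name.startsWith(model + ":")) {
                return true;
            }
        }
        return false;
    }

    /** Fetches the installed-model list from a running Ollama server. */
    public static String fetchTags(String baseUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/tags"))
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the default Ollama address used throughout this article.
        String json = fetchTags("http://localhost:11434");
        System.out.println("llama3.2 installed: " + hasModel(json, "llama3.2"));
    }
}
```

A regular expression is used here only to keep the sketch dependency-free; in a Spring Boot application a JSON library such as Jackson would be the more robust choice.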

Maven Dependency

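<!-- Replace spring-ai-openai-spring-boot-starter with the Ollama starter. -->
<!-- Spring AI 1.0.0 and later publish it as spring-ai-starter-model-ollama; the version is normally managed by the spring-ai-bom. -->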
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

application.properties

# No API key needed!
spring.ai.ollama.base-url=http://localhost:11434

# Chat model
spring.ai.ollama.chat.options.model=llama3.2
spring.ai.ollama.chat.options.temperature=0.7
# Max tokens to generate (in .properties files a comment must sit on its own line)
spring.ai.ollama.chat.options.num-predict=500

# Embedding model (for RAG)
spring.ai.ollama.embedding.options.model=nomic-embed-text
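
One detail of the `.properties` format is worth checking when editing this file: a `#` starts a comment only at the beginning of a line, so text that follows a value on the same line becomes part of the value. The sketch below demonstrates this with plain `java.util.Properties`. Spring Boot's own loader is assumed to follow the same rule, and the class name `PropertiesCommentDemo` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesCommentDemo {

    /** Loads a properties snippet and returns the raw value stored for the key. */
    public static String valueOf(String snippet, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(snippet));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String key = "spring.ai.ollama.chat.options.num-predict";

        // Trailing text is not a comment: prints [500   # max tokens]
        System.out.println("[" + valueOf(key + "=500   # max tokens\n", key) + "]");

        // Comment on its own line: prints [500]
        System.out.println("[" + valueOf("# max tokens\n" + key + "=500\n", key) + "]");
    }
}
```

With the inline form, binding the value to an integer option would likely fail at startup, so keeping comments on their own lines is the safer habit.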

Service — Identical Code to OpenAI Version

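import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service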
public class LocalAiService {

    private final ChatClient chatClient;

    // Exactly the same code as OpenAI — Spring AI handles the difference
    public LocalAiService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("You are a helpful Java programming assistant.")
                .build();
    }

    public String ask(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }

    public String generateCode(String requirement) {
        return chatClient.prompt()
                .system("You are an expert Java developer. Output only code, no explanations.")
                .user(requirement)
                .call()
                .content();
    }
}
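
The service never mentions Ollama because the starter translates each call into a request against the server's REST API. As a rough illustration of what a call like `ask(...)` amounts to, the sketch below builds a non-streaming request body for Ollama's `/api/chat` endpoint using only the JDK. The class `OllamaChatBody` and its helpers are hypothetical, and the payload Spring AI actually sends may carry additional fields such as model options.

```java
public class OllamaChatBody {

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Builds a chat request with one system message and one user message. */
    public static String chatBody(String model, String system, String user) {
        return "{\"model\":\"" + escape(model) + "\","
             + "\"messages\":["
             + "{\"role\":\"system\",\"content\":\"" + escape(system) + "\"},"
             + "{\"role\":\"user\",\"content\":\"" + escape(user) + "\"}],"
             + "\"stream\":false}";
    }

    public static void main(String[] args) {
        System.out.println(chatBody(
                "llama3.2",
                "You are a helpful Java programming assistant.",
                "What is a Java record?"));
    }
}
```

Switching to the OpenAI starter changes the endpoint and payload format, while the `ChatClient` calls in the service stay as they are.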

Controller

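import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController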
@RequestMapping("/local-ai")
public class LocalAiController {

    private final LocalAiService aiService;

    public LocalAiController(LocalAiService aiService) {
        this.aiService = aiService;
    }

    @GetMapping("/ask")
    public String ask(@RequestParam String q) {
        return aiService.ask(q);
    }

    @PostMapping("/code")
    public String generateCode(@RequestBody String requirement) {
        return aiService.generateCode(requirement);
    }
}
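
To exercise these endpoints from Java rather than a browser or curl, the question text has to be URL-encoded first. The following is a minimal sketch using the JDK's HTTP client. It assumes the application runs on Spring Boot's default port 8080, and the class name `LocalAiClient` is hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class LocalAiClient {

    /** Builds the GET URI for /local-ai/ask with the question encoded as the q parameter. */
    public static URI askUri(String baseUrl, String question) {
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/local-ai/ask?q=" + q);
    }

    /** Builds the POST request for /local-ai/code with the requirement as a plain-text body. */
    public static HttpRequest codeRequest(String baseUrl, String requirement) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/local-ai/code"))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(requirement))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String baseUrl = "http://localhost:8080"; // assumed default Spring Boot port
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest ask = HttpRequest.newBuilder(askUri(baseUrl, "What is a Java record?"))
                .GET()
                .build();
        System.out.println(client.send(ask, HttpResponse.BodyHandlers.ofString()).body());

        HttpRequest code = codeRequest(baseUrl, "Write a method that reverses a string.");
        System.out.println(client.send(code, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```

Local models can take several seconds to answer, so a real client would usually set a request timeout as well.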

Output

GET /local-ai/ask?q=What is a Java record?

A Java record (introduced in Java 16) is an immutable data class that
automatically generates constructor, getters, equals(), hashCode(), and
toString() methods. Example:

public record Point(int x, int y) {}

Point p = new Point(3, 4);
System.out.println(p.x()); // 3

RAG with Ollama (Fully Local — No Cloud)

@Configuration
public class LocalRagConfig {

    @Bean
    public VectorStore vectorStore(EmbeddingModel embeddingModel) {
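        // In-memory store; embeddings come from the auto-configured Ollama EmbeddingModel (nomic-embed-text).
        // Spring AI 1.0 uses the builder; older milestones used new SimpleVectorStore(embeddingModel).
        return SimpleVectorStore.builder(embeddingModel).build();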
    }
}

@Service
public class LocalRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public LocalRagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.chatClient  = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .defaultSystem("Answer only from the provided context. Be concise.")
                .build();
    }

    public void addDocument(String content) {
        vectorStore.add(new TokenTextSplitter().apply(
                List.of(new Document(content))));
    }

    public String ask(String question) {
        return chatClient.prompt().user(question).call().content();
    }
}
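
Retrieval in this setup comes down to comparing embedding vectors: the question is embedded, and the stored chunks whose vectors lie closest to it are passed to the model as context. The sketch below shows cosine similarity, a measure commonly used for this kind of search, on toy 3-dimensional vectors. Real `nomic-embed-text` vectors have 768 dimensions, and the class `CosineSketch` with its toy data is purely illustrative, not Spring AI's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CosineSketch {

    /** Cosine similarity: dot(a, b) / (|a| * |b|). Vectors must be non-zero and of equal length. */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the key of the stored vector most similar to the query, or null if the store is empty. */
    public static String nearest(Map<String, float[]> store, float[] query) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> entry : store.entrySet()) {
            double score = cosine(entry.getValue(), query);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy "embeddings" standing in for document chunks.
        Map<String, float[]> store = new LinkedHashMap<>();
        store.put("records", new float[] {0.9f, 0.1f, 0.0f});
        store.put("streams", new float[] {0.1f, 0.9f, 0.1f});

        float[] question = {0.8f, 0.2f, 0.0f};
        System.out.println(nearest(store, question)); // prints "records"
    }
}
```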

Code Generation with CodeLlama

# Switch to CodeLlama for code tasks:
# application.properties
spring.ai.ollama.chat.options.model=codellama

# Then:
POST /local-ai/code
Body: Write a Spring Boot REST controller for CRUD operations on a User entity
      with JPA repository. Include validation annotations.

Multi-Model Configuration

@Configuration
public class OllamaConfig {

    @Bean("generalChatClient")
    public ChatClient generalClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OllamaOptions.builder().model("llama3.2").build())
                .build();
    }

    @Bean("codeChatClient")
    public ChatClient codeClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OllamaOptions.builder().model("codellama").build())
                .build();
    }
}

@Service
public class MultiModelService {

    private final ChatClient generalClient;
    private final ChatClient codeClient;

    public MultiModelService(
            @Qualifier("generalChatClient") ChatClient generalClient,
            @Qualifier("codeChatClient")    ChatClient codeClient) {
        this.generalClient = generalClient;
        this.codeClient    = codeClient;
    }

    public String explain(String concept) {
        return generalClient.prompt().user("Explain " + concept).call().content();
    }

    public String generate(String requirement) {
        return codeClient.prompt().user(requirement).call().content();
    }
}

Performance Comparison

Model             | Size  | Speed (M1 Mac, approx.) | Quality      | Use Case
──────────────────────────────────────────────────────────────────────────────────────────────
llama3.2          | 2GB   | ~30 tok/s               | Good         | General chat
codellama         | 4GB   | ~20 tok/s               | Good at code | Code generation
mistral           | 4GB   | ~20 tok/s               | Strong       | Reasoning, analysis
llama3.1:70b      | 40GB  | ~5 tok/s                | Near GPT-4   | Complex tasks (needs GPU)
nomic-embed-text  | 274MB | Very fast               | Good         | Embeddings for RAG
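
These throughput figures translate directly into response latency. As a back-of-the-envelope sketch that ignores model load time and prompt processing and treats the table's speeds as rough estimates, generation time is output tokens divided by tokens per second. The class `LatencyEstimate` is hypothetical.

```java
import java.util.Locale;

public class LatencyEstimate {

    /** Estimated seconds to generate the given number of tokens at a steady rate. */
    public static double seconds(int outputTokens, double tokensPerSecond) {
        if (tokensPerSecond <= 0) {
            throw new IllegalArgumentException("tokensPerSecond must be positive");
        }
        return outputTokens / tokensPerSecond;
    }

    public static void main(String[] args) {
        int numPredict = 500; // the num-predict value from application.properties

        // Speeds taken from the comparison table above.
        System.out.println(String.format(Locale.ROOT, "llama3.2:     %.1f s", seconds(numPredict, 30))); // 16.7 s
        System.out.println(String.format(Locale.ROOT, "codellama:    %.1f s", seconds(numPredict, 20))); // 25.0 s
        System.out.println(String.format(Locale.ROOT, "llama3.1:70b: %.1f s", seconds(numPredict, 5)));  // 100.0 s
    }
}
```

For a full 500-token answer the gap between the small and the 70B model is roughly 17 seconds versus 100 seconds, which is why streaming the response matters more as model size grows.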

Key Points

Your service code is identical for OpenAI and Ollama — only the starter dependency and properties change
Use nomic-embed-text for embeddings — it's small, fast, and produces 768-dimension vectors
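Fast inference with 70B+ models needs a capable GPU; smaller models (3B-7B) run reasonably on CPU
For production private deployments, Ollama can run in Docker: docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama (the named volume keeps downloaded models when the container is recreated)
Streaming is fully supported with Ollama: use .stream().content() the same way as with OpenAI; it returns a Flux<String> of text chunks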
