Spring AI with Ollama — Run Local LLMs with Spring Boot (No API Key Needed)
Ollama lets you run large language models locally on your own machine: no API key, no internet connection once a model has been downloaded, and no data leaving your system. Spring AI has first-class Ollama support, so you can switch an application between OpenAI and Ollama by changing only the starter dependency and the configuration. This makes it a good fit for development, offline use, and privacy-sensitive applications.
Install and Start Ollama
# Download from https://ollama.com and install
# Pull a model (one-time download)
ollama pull llama3.2 # 2GB, fast, good for general tasks
ollama pull codellama # 4GB, optimized for code generation
ollama pull mistral # 4GB, strong reasoning
ollama pull nomic-embed-text # small embedding model for RAG
# Start the Ollama server (port 11434 by default); skip this if the desktop app is already running it
ollama serve
# Verify:
curl http://localhost:11434/api/tags
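The same check can be done from Java before the application starts sending prompts. The following is a minimal sketch using only the JDK's built-in HTTP client; the class name OllamaHealthCheck and the timeout values are illustrative choices, not part of Spring AI or Ollama.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: checks whether a local Ollama server answers on /api/tags. */
final class OllamaHealthCheck {

    /** Builds the /api/tags URI, tolerating a trailing slash on the base URL. */
    static URI tagsUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(trimmed + "/api/tags");
    }

    /** Returns true if the server responds with HTTP 200, false on any I/O failure. */
    static boolean isUp(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)) // assumed timeout, tune as needed
                .build();
        HttpRequest request = HttpRequest.newBuilder(tagsUri(baseUrl))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false; // connection refused, timeout, and similar failures
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:11434";
        System.out.println(tagsUri(baseUrl) + " reachable: " + isUp(baseUrl));
    }
}
```

A check like this could run at startup so that a missing server produces a clear message instead of a failed chat call later.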
Maven Dependency
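<!-- Replace spring-ai-openai-spring-boot-starter with the Ollama starter. -->
<!-- On Spring AI 1.0 GA and later the artifactId is spring-ai-starter-model-ollama. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>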
application.properties
# No API key needed!
spring.ai.ollama.base-url=http://localhost:11434
# Chat model
spring.ai.ollama.chat.options.model=llama3.2
spring.ai.ollama.chat.options.temperature=0.7
# Maximum number of tokens to generate (a trailing "# ..." on the same line would become part of the value)
spring.ai.ollama.chat.options.num-predict=500
# Embedding model (for RAG)
spring.ai.ollama.embedding.options.model=nomic-embed-text
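To make these settings less abstract, here is a rough sketch of the kind of JSON body Ollama's /api/chat endpoint accepts, built by hand with the JDK only. It assumes the chat properties above correspond to the model, options.temperature, and options.num_predict fields of Ollama's REST API; the class OllamaChatRequest is hypothetical and is not how Spring AI serializes requests internally.

```java
import java.util.Locale;

/** Hypothetical helper: builds the JSON body of a non-streaming /api/chat request. */
final class OllamaChatRequest {

    /** Escapes backslashes, quotes, and control characters for a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\') {
                sb.append("\\\\");
            } else if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\n') {
                sb.append("\\n");
            } else if (c == '\r') {
                sb.append("\\r");
            } else if (c == '\t') {
                sb.append("\\t");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** model, temperature, and numPredict mirror the three chat properties above. */
    static String body(String model, double temperature, int numPredict, String userMessage) {
        return String.format(Locale.ROOT,
                "{\"model\":\"%s\",\"stream\":false,"
                        + "\"messages\":[{\"role\":\"user\",\"content\":\"%s\"}],"
                        + "\"options\":{\"temperature\":%s,\"num_predict\":%d}}",
                escape(model), escape(userMessage), Double.toString(temperature), numPredict);
    }

    public static void main(String[] args) {
        System.out.println(body("llama3.2", 0.7, 500, "What is a Java record?"));
    }
}
```

With the starter on the classpath you never write this body yourself; the sketch only shows where the configured values end up.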
Service — Identical Code to OpenAI Version
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class LocalAiService {

    private final ChatClient chatClient;

    // Exactly the same code as with OpenAI: Spring AI auto-configures the builder for Ollama
    public LocalAiService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("You are a helpful Java programming assistant.")
                .build();
    }

    public String ask(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }

    public String generateCode(String requirement) {
        // The per-request system prompt overrides the default one set in the constructor
        return chatClient.prompt()
                .system("You are an expert Java developer. Output only code, no explanations.")
                .user(requirement)
                .call()
                .content();
    }
}
Controller
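import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/local-ai")
public class LocalAiController {

    private final LocalAiService aiService;

    public LocalAiController(LocalAiService aiService) {
        this.aiService = aiService;
    }

    // GET /local-ai/ask?q=...
    @GetMapping("/ask")
    public String ask(@RequestParam String q) {
        return aiService.ask(q);
    }

    // POST /local-ai/code with the requirement as a plain-text body
    @PostMapping("/code")
    public String generateCode(@RequestBody String requirement) {
        return aiService.generateCode(requirement);
    }
}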
Output
GET /local-ai/ask?q=What is a Java record?
A Java record (introduced in Java 16) is an immutable data class that
automatically generates constructor, getters, equals(), hashCode(), and
toString() methods. Example:
public record Point(int x, int y) {}
Point p = new Point(3, 4);
System.out.println(p.x()); // 3
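To call these endpoints from another Java program rather than a browser or curl, the requests could be built as in the sketch below. It assumes the application listens on http://localhost:8080 (the Spring Boot default); the class LocalAiClient is a hypothetical name, and only JDK classes are used.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical client-side helper for the /local-ai endpoints shown above. */
final class LocalAiClient {

    /** Builds GET /local-ai/ask?q=..., percent-encoding the question. */
    static HttpRequest askRequest(String baseUrl, String question) {
        // URLEncoder produces form encoding ("+" for spaces); use %20 instead for a URL
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8).replace("+", "%20");
        return HttpRequest.newBuilder(URI.create(baseUrl + "/local-ai/ask?q=" + q))
                .GET()
                .build();
    }

    /** Builds POST /local-ai/code with the requirement as a plain-text body. */
    static HttpRequest codeRequest(String baseUrl, String requirement) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/local-ai/code"))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(requirement, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:8080"; // assumed default port
        System.out.println(askRequest(baseUrl, "What is a Java record?").uri());
        System.out.println(codeRequest(baseUrl, "Write a hello world class").uri());
    }
}
```

Either request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Local models can take several seconds to answer, so a generous request timeout is worth setting.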
RAG with Ollama (Fully Local — No Cloud)
@Configuration
public class LocalRagConfig {

    @Bean
    public VectorStore vectorStore(EmbeddingModel embeddingModel) {
        // SimpleVectorStore is in-memory; embeddings come from Ollama's nomic-embed-text.
        // On Spring AI 1.0+, use SimpleVectorStore.builder(embeddingModel).build() instead.
        return new SimpleVectorStore(embeddingModel);
    }
}

@Service
public class LocalRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public LocalRagService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.chatClient = builder
                // Retrieves similar chunks from the store and adds them to the prompt
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .defaultSystem("Answer only from the provided context. Be concise.")
                .build();
    }

    public void addDocument(String content) {
        // Split the text into token-sized chunks, then embed and store each chunk
        vectorStore.add(new TokenTextSplitter().apply(
                List.of(new Document(content))));
    }

    public String ask(String question) {
        return chatClient.prompt().user(question).call().content();
    }
}
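The retrieval step behind this setup is a nearest-neighbour search over embedding vectors. The sketch below illustrates the idea with cosine similarity over an in-memory list. TinyVectorIndex is a hypothetical teaching class, not SimpleVectorStore's actual implementation, and the two-dimensional vectors stand in for the 768-dimension embeddings a real model would produce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical in-memory index: stores (text, vector) pairs and ranks by cosine similarity. */
final class TinyVectorIndex {

    /** One stored chunk: its text and its embedding vector. */
    private static final class Entry {
        final String text;
        final double[] vector;

        Entry(String text, double[] vector) {
            this.text = text;
            this.vector = vector;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double[] vector) {
        entries.add(new Entry(text, vector.clone()));
    }

    /** Cosine similarity in [-1, 1]; returns 0 if either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the texts of the k entries most similar to the query, best first. */
    List<String> topK(double[] query, int k) {
        return entries.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(query, e.vector)).reversed())
                .limit(k)
                .map(e -> e.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TinyVectorIndex index = new TinyVectorIndex();
        index.add("records", new double[] {1.0, 0.0});
        index.add("streams", new double[] {0.0, 1.0});
        index.add("sealed classes", new double[] {0.7, 0.7});
        System.out.println(index.topK(new double[] {0.9, 0.1}, 2));
    }
}
```

In the RAG service, the chunks returned by a search like this are what the advisor places into the prompt as context.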
Code Generation with CodeLlama
# Switch to CodeLlama for code tasks:
# application.properties
spring.ai.ollama.chat.options.model=codellama
# Then:
POST /local-ai/code
Body: Write a Spring Boot REST controller for CRUD operations on a User entity
with JPA repository. Include validation annotations.
Multi-Model Configuration
@Configuration
public class OllamaConfig {

    @Bean("generalChatClient")
    public ChatClient generalClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OllamaOptions.builder().model("llama3.2").build())
                .build();
    }

    @Bean("codeChatClient")
    public ChatClient codeClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OllamaOptions.builder().model("codellama").build())
                .build();
    }
}

@Service
public class MultiModelService {

    private final ChatClient generalClient;
    private final ChatClient codeClient;

    public MultiModelService(
            @Qualifier("generalChatClient") ChatClient generalClient,
            @Qualifier("codeChatClient") ChatClient codeClient) {
        this.generalClient = generalClient;
        this.codeClient = codeClient;
    }

    public String explain(String concept) {
        return generalClient.prompt().user("Explain " + concept).call().content();
    }

    public String generate(String requirement) {
        return codeClient.prompt().user(requirement).call().content();
    }
}
Performance Comparison
Model            | Size  | Speed (M1 Mac) | Quality      | Use Case
─────────────────────────────────────────────────────────────────────────────────
llama3.2         | 2GB   | ~30 tok/s      | Good         | General chat
codellama        | 4GB   | ~20 tok/s      | Good at code | Code generation
mistral          | 4GB   | ~20 tok/s      | Strong       | Reasoning, analysis
llama3.1:70b     | 40GB  | ~5 tok/s       | Near GPT-4   | Complex tasks (needs GPU)
nomic-embed-text | 274MB | Very fast      | Good         | Embeddings for RAG
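The throughput figures translate directly into response latency. As a back-of-the-envelope sketch, the snippet below divides a token budget by the approximate tokens-per-second values from the table, using the 500-token num-predict limit configured earlier. It ignores model load and prompt processing time, so real latencies will be somewhat higher; GenerationTimeEstimate is a hypothetical name.

```java
/** Hypothetical helper: estimates generation time from a token budget and throughput. */
final class GenerationTimeEstimate {

    /** Seconds needed to emit the given number of tokens at a steady rate. */
    static double seconds(int tokens, double tokensPerSecond) {
        if (tokensPerSecond <= 0) {
            throw new IllegalArgumentException("tokensPerSecond must be positive");
        }
        return tokens / tokensPerSecond;
    }

    public static void main(String[] args) {
        int budget = 500; // matches num-predict=500 from application.properties
        String[] models = {"llama3.2", "codellama", "mistral", "llama3.1:70b"};
        double[] tokPerSec = {30, 20, 20, 5}; // approximate figures from the table
        for (int i = 0; i < models.length; i++) {
            System.out.printf("%-13s ~%.0f s for %d tokens%n",
                    models[i], seconds(budget, tokPerSec[i]), budget);
        }
    }
}
```

At these rates a full 500-token answer takes roughly 17 seconds on llama3.2 and about 100 seconds on the 70B model, which is one reason streaming matters for interactive use.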
Key Points
- Your service code is identical for OpenAI and Ollama; only the starter dependency and properties change
- Use nomic-embed-text for embeddings: it's small, fast, and produces 768-dimension vectors
- Ollama requires a GPU for fast inference with 70B+ models; smaller models (3B-7B) run reasonably on CPU
- For production private deployments, Ollama can run in Docker: docker run -p 11434:11434 ollama/ollama
- Streaming is fully supported with Ollama: use .stream().content() the same way as with OpenAI