Spring AI Multimodal — Image Analysis and Vision with GPT-4o and Claude
Modern LLMs can process both text and images in the same request. Spring AI supports multimodal input, letting you send images alongside text to vision models like GPT-4o and Claude Opus. This enables use cases like UI screenshot analysis, document image extraction, diagram explanation, and visual code review.
Models with Vision Support
Model              Provider   Vision  Context
────────────────────────────────────────────────────
gpt-4o             OpenAI     Yes     128k tokens
gpt-4o-mini        OpenAI     Yes     128k tokens
claude-opus-4-5    Anthropic  Yes     200k tokens
claude-sonnet-4-5  Anthropic  Yes     200k tokens
llava              Ollama     Yes     4k tokens (local)
Maven Dependency
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
application.properties
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
Analyze Image from URL
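import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Media;   // package varies by Spring AI version
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeTypeUtils;

@Service
public class VisionService {

    private final ChatClient chatClient;

    public VisionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // new URL(...) throws MalformedURLException for an invalid URL, so the method declares it
    public String analyzeImageUrl(String imageUrl, String question) throws MalformedURLException {
        // The MIME type is fixed to JPEG here; change it if the URL points to another format
        UserMessage userMessage = new UserMessage(
                question,
                List.of(new Media(MimeTypeUtils.IMAGE_JPEG, new URL(imageUrl)))
        );
        return chatClient.prompt()
                .messages(userMessage)
                .call()
                .content();
    }
}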
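The analyzeImageUrl method above labels every image as JPEG, whatever the URL points to. A minimal sketch of choosing the MIME type from the URL's file extension instead is shown here. The class name ImageMimeTypes, the extension table, and the fallback to image/jpeg are assumptions made for illustration; none of them come from Spring AI.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Spring AI): maps a URL's file extension to an image MIME type.
final class ImageMimeTypes {

    private static final String FALLBACK = "image/jpeg"; // assumed default, matching the original code

    private static final Map<String, String> BY_EXTENSION = Map.of(
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "png", "image/png",
            "gif", "image/gif",
            "webp", "image/webp");

    private ImageMimeTypes() {
    }

    // Returns the MIME type implied by the path's extension, or the fallback when it is unknown.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static String fromUrl(String imageUrl) {
        String path = URI.create(imageUrl).getPath(); // excludes query string and fragment
        if (path == null) {
            return FALLBACK;
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return FALLBACK;
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }
}
```

In the service, the result could be passed through MimeType.valueOf(...) in place of MimeTypeUtils.IMAGE_JPEG. An extension is only a hint, so a stricter variant would read the Content-Type header of the remote resource.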
Analyze Uploaded Image File
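// Belongs in VisionService; needs java.io.IOException, org.springframework.core.io.ByteArrayResource,
// org.springframework.util.MimeType and org.springframework.web.multipart.MultipartFile
public String analyzeImageFile(MultipartFile imageFile, String question) throws IOException {
    // Read image bytes
    byte[] imageBytes = imageFile.getBytes();
    String mimeType = imageFile.getContentType(); // e.g. "image/png"; null if the client sent none
    if (mimeType == null) {
        throw new IllegalArgumentException("Uploaded file has no content type");
    }
    UserMessage userMessage = new UserMessage(
            question,
            List.of(new Media(
                    MimeType.valueOf(mimeType),
                    new ByteArrayResource(imageBytes)
            ))
    );
    return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
}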
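The upload method above forwards whatever content type and payload size the client sends. A minimal sketch of a pre-flight check that could run before the model call is shown here. The class name ImageUploadValidator, the allowed type list, and the 5 MB cap are assumptions chosen for illustration; adjust them to what your provider accepts.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-flight check for uploads; the allowed types and the size cap are assumptions.
final class ImageUploadValidator {

    static final long MAX_BYTES = 5L * 1024 * 1024; // assumed 5 MB limit
    static final Set<String> ALLOWED_TYPES =
            Set.of("image/jpeg", "image/png", "image/gif", "image/webp");

    private ImageUploadValidator() {
    }

    // Returns the normalized MIME type, or throws IllegalArgumentException
    // when the declared type or the size is not acceptable.
    static String validate(String contentType, long sizeBytes) {
        if (contentType == null || contentType.isBlank()) {
            throw new IllegalArgumentException("Missing Content-Type");
        }
        // Drop parameters such as "; charset=..." and normalize case
        String normalized = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!ALLOWED_TYPES.contains(normalized)) {
            throw new IllegalArgumentException("Unsupported image type: " + normalized);
        }
        if (sizeBytes <= 0 || sizeBytes > MAX_BYTES) {
            throw new IllegalArgumentException("Image size out of range: " + sizeBytes + " bytes");
        }
        return normalized;
    }
}
```

With a MultipartFile, this could be called as ImageUploadValidator.validate(imageFile.getContentType(), imageFile.getSize()). The Content-Type header is supplied by the client, so a stricter check would also inspect the first bytes of the file.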
REST Controller for Image Analysis
@RestController
@RequestMapping("/vision")
public class VisionController {

    private final VisionService visionService;

    public VisionController(VisionService visionService) {
        this.visionService = visionService;
    }

    @PostMapping("/analyze")
    public String analyzeUpload(
            @RequestParam("image") MultipartFile image,
            @RequestParam("question") String question) throws IOException {
        return visionService.analyzeImageFile(image, question);
    }

    @PostMapping("/url")
    public String analyzeUrl(
            @RequestParam String url,
            @RequestParam String question) throws MalformedURLException {
        return visionService.analyzeImageUrl(url, question);
    }
}
Practical Use Cases with Examples
// 1. UI Screenshot Analysis
String uiReview = visionService.analyzeImageFile(screenshot,
        "Describe the UI layout. Identify any usability issues. " +
        "Is the button hierarchy clear? Are there accessibility concerns?");
// Output:
// The screenshot shows a dashboard with a top navigation bar and three data cards.
// Issue: The 'Delete' button is red and prominently placed next to 'Save': high risk
// of accidental deletion. Consider adding a confirmation dialog.
// Accessibility: No alt text visible on the chart images.

// 2. Diagram to Code
String entityCode = visionService.analyzeImageFile(erDiagram,
        "Convert this ER diagram to JPA entity classes with proper annotations.");
// Output:
// @Entity public class Customer {
//     @Id @GeneratedValue Long id;
//     String name, email;
//     @OneToMany List<Order> orders;
// }
// ...

// 3. Error Screenshot Debugging
String diagnosis = visionService.analyzeImageFile(errorScreenshot,
        "This is a Spring Boot error page. What caused this exception and how can it be fixed?");
// Output:
// The error is: java.lang.NullPointerException at UserService.java:42
// Cause: 'user' is null; you're calling user.getName() without a null check.
// Fix: Add @NotNull validation or check: if (user != null) before accessing it.
Structured Output from Image
public record InvoiceData(
        String invoiceNumber,
        String date,
        String vendorName,
        double totalAmount,
        String currency
) {}

public InvoiceData extractInvoiceData(MultipartFile invoiceImage) throws IOException {
    byte[] bytes = invoiceImage.getBytes();
    // Assumes JPEG uploads; use invoiceImage.getContentType() to support other formats
    UserMessage message = new UserMessage(
            "Extract invoice data from this image. Return structured data.",
            List.of(new Media(MimeType.valueOf("image/jpeg"), new ByteArrayResource(bytes)))
    );
    return chatClient.prompt()
            .messages(message)
            .call()
            .entity(InvoiceData.class); // structured output works with multimodal!
}
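Vision models can misread digits or fill in fields that are not on the page, so typed output is worth a sanity check before it reaches downstream code. The following is a minimal sketch under stated assumptions: the InvoiceData record is repeated so the example compiles on its own, and the class name InvoiceDataChecks and its three rules are hypothetical, not Spring AI behavior.

```java
import java.util.ArrayList;
import java.util.Currency;
import java.util.List;
import java.util.Locale;

// Mirrors the InvoiceData record from this section so the sketch is self-contained.
record InvoiceData(String invoiceNumber, String date, String vendorName,
                   double totalAmount, String currency) {
}

// Hypothetical post-extraction checks; the specific rules are assumptions.
final class InvoiceDataChecks {

    private InvoiceDataChecks() {
    }

    // Returns human-readable problems; an empty list means the data looks plausible.
    static List<String> problems(InvoiceData data) {
        List<String> problems = new ArrayList<>();
        if (data.invoiceNumber() == null || data.invoiceNumber().isBlank()) {
            problems.add("invoiceNumber is missing");
        }
        double total = data.totalAmount();
        if (Double.isNaN(total) || Double.isInfinite(total) || total < 0) {
            problems.add("totalAmount is not a non-negative number");
        }
        if (!isIsoCurrency(data.currency())) {
            problems.add("currency is not an ISO 4217 code");
        }
        return problems;
    }

    // Currency.getInstance rejects anything that is not a supported ISO 4217 code.
    private static boolean isIsoCurrency(String code) {
        if (code == null) {
            return false;
        }
        try {
            Currency.getInstance(code.trim().toUpperCase(Locale.ROOT));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

A caller could reject the extraction, or ask the model again, whenever InvoiceDataChecks.problems(result) is not empty.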
Output
// analyzeImageUrl(architectureDiagram, "Explain this system architecture")
This diagram shows a microservices architecture with:
1. API Gateway (entry point) routing to 3 services
2. User Service → PostgreSQL database
3. Order Service → MySQL + Redis cache
4. Notification Service → message queue (RabbitMQ)
5. All services report to a centralized logging system (ELK stack)
The design follows CQRS pattern — read and write paths are separated in the Order Service.
Potential concern: No circuit breaker shown between API Gateway and services.
Local Vision with Ollama (LLaVA)
# application-local.properties
spring.ai.ollama.chat.options.model=llava
# Download the model first: ollama pull llava
// Same service code: Spring AI handles the provider difference, provided the Ollama starter
// (spring-ai-ollama-spring-boot-starter) is on the classpath
Key Points
- Pass images as Media objects inside a UserMessage; Spring AI serializes them the way each provider expects (as base64 data or as a URL, depending on the provider and on how the image was supplied)
- Structured output (.entity(MyClass.class)) works with multimodal inputs, so you can extract typed data from images
- Image resolution affects token usage: GPT-4o charges roughly 170 tokens per 512x512 tile, so large images cost more
- For privacy-sensitive images, use LLaVA via Ollama — images never leave your server
- Always validate image MIME type and size on the server before passing to the AI model
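The per-tile cost mentioned above can be turned into a rough estimate. This sketch assumes the high-detail tiling scheme OpenAI has published for GPT-4o: fit the image inside 2048x2048, shrink it so the shortest side is at most 768 px, then charge 170 tokens per 512 px tile plus a flat 85 tokens. The class name VisionTokenEstimator is hypothetical, and the constants should be checked against the provider's current documentation before being used for billing.

```java
// Rough estimate of GPT-4o image input tokens in high-detail mode.
// All constants are assumptions taken from OpenAI's published tiling description.
final class VisionTokenEstimator {

    static final int BASE_TOKENS = 85;      // assumed flat cost per image
    static final int TOKENS_PER_TILE = 170; // assumed cost per 512 px tile

    private VisionTokenEstimator() {
    }

    static int estimateHighDetail(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("Image dimensions must be positive");
        }
        double w = width;
        double h = height;
        // Step 1: fit inside a 2048 x 2048 square, keeping the aspect ratio (never upscale)
        double fit = Math.min(1.0, 2048.0 / Math.max(w, h));
        w *= fit;
        h *= fit;
        // Step 2: shrink so the shortest side is at most 768 px (never upscale)
        double shrink = Math.min(1.0, 768.0 / Math.min(w, h));
        w *= shrink;
        h *= shrink;
        // Step 3: count the 512 px tiles needed to cover the scaled image
        int tiles = (int) (Math.ceil(w / 512.0) * Math.ceil(h / 512.0));
        return BASE_TOKENS + TOKENS_PER_TILE * tiles;
    }
}
```

Under these assumptions a 1024x1024 image scales to 768x768, needs 4 tiles, and costs 85 + 4 x 170 = 765 tokens, while a 512x512 image needs a single tile and costs 255. Downscaling images before upload is therefore a direct way to reduce cost.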