Java9R: Spring AI Multimodal — Image Analysis and Vision with GPT-4o and Claude

Modern LLMs can process both text and images in the same request. Spring AI supports multimodal input, letting you send images alongside text to vision models like GPT-4o and Claude Opus. This enables use cases like UI screenshot analysis, document image extraction, diagram explanation, and visual code review.

Models with Vision Support

Model               Provider    Vision   Context
────────────────────────────────────────────────────
gpt-4o              OpenAI      Yes      128k tokens
gpt-4o-mini         OpenAI      Yes      128k tokens
claude-opus-4-5     Anthropic   Yes      200k tokens
claude-sonnet-4-5   Anthropic   Yes      200k tokens
llava               Ollama      Yes      4k tokens (local)

Maven Dependency

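<!-- Starter name used by Spring AI milestone releases (1.0.0 GA renamed it to
     spring-ai-starter-model-openai). The version is assumed to be managed by the Spring AI BOM. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>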

application.properties

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o

Analyze Image from URL

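import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.UserMessage;
// Media has moved between Spring AI releases; import it from the package your version uses.
import org.springframework.ai.chat.messages.Media;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeTypeUtils;

@Service
public class VisionService {

    private final ChatClient chatClient;

    // Spring Boot auto-configures a ChatClient.Builder for the configured model.
    public VisionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // new URL(...) throws the checked MalformedURLException, so the method declares it.
    public String analyzeImageUrl(String imageUrl, String question) throws MalformedURLException {
        // The MIME type is fixed to JPEG here; adjust it if the URL points to another format.
        UserMessage userMessage = new UserMessage(
                question,
                List.of(new Media(MimeTypeUtils.IMAGE_JPEG, new URL(imageUrl)))
        );

        return chatClient.prompt()
                .messages(userMessage)
                .call()
                .content();
    }
}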
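
Because analyzeImageUrl hard-codes IMAGE_JPEG, a URL that points to a PNG or WebP file would be labeled with the wrong type. The following is a minimal sketch of one way to guess the type from the file extension. ImageMimeTypes and fromUrl are hypothetical names that are not part of Spring AI, and the list of extensions is an assumption.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: guesses an image MIME type from the file extension in a URL. */
final class ImageMimeTypes {

    private ImageMimeTypes() {}

    static String fromUrl(String imageUrl) {
        // Look only at the path so query strings such as "?w=200" are ignored.
        // URI.create throws IllegalArgumentException for syntactically invalid input.
        String path = URI.create(imageUrl).getPath();
        if (path == null) {
            return "image/jpeg";
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".png"))  return "image/png";
        if (lower.endsWith(".gif"))  return "image/gif";
        if (lower.endsWith(".webp")) return "image/webp";
        // Fall back to JPEG, the type the service method uses by default.
        return "image/jpeg";
    }
}
```

The result could then be passed to the Media constructor through MimeType.valueOf(ImageMimeTypes.fromUrl(imageUrl)) in place of the fixed MimeTypeUtils.IMAGE_JPEG. An extension is only a hint, so a stricter service might inspect the response's Content-Type header instead.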

Analyze Uploaded Image File

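// Add this method to VisionService. It also needs imports for IOException, MimeType,
// ByteArrayResource, and MultipartFile.
public String analyzeImageFile(MultipartFile imageFile, String question) throws IOException {
    // Read the image bytes and the client-declared content type, e.g. "image/png"
    byte[] imageBytes = imageFile.getBytes();
    String mimeType   = imageFile.getContentType();
    if (mimeType == null) {
        // getContentType() may return null, and MimeType.valueOf(null) would fail
        throw new IllegalArgumentException("Upload has no content type");
    }

    UserMessage userMessage = new UserMessage(
            question,
            List.of(new Media(
                    MimeType.valueOf(mimeType),
                    new ByteArrayResource(imageBytes)
            ))
    );

    return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
}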
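
The content type and size reported by an upload come from the client, so it is worth checking them before any bytes are sent to a model. The sketch below shows one possible pre-flight check. ImageUploadValidator, the allow-list, and the 5 MB cap are all assumptions chosen for illustration, not values taken from Spring AI or any provider.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-flight check for uploaded images; the limits are assumptions. */
final class ImageUploadValidator {

    // Assumed allow-list; check your provider's documentation for the formats it accepts.
    private static final Set<String> ALLOWED_TYPES =
            Set.of("image/jpeg", "image/png", "image/gif", "image/webp");

    // Assumed cap of 5 MB, chosen only for illustration.
    static final long MAX_BYTES = 5L * 1024 * 1024;

    private ImageUploadValidator() {}

    /** Throws IllegalArgumentException if the upload should not be sent to the model. */
    static void validate(String contentType, long sizeInBytes) {
        if (contentType == null
                || !ALLOWED_TYPES.contains(contentType.toLowerCase(Locale.ROOT))) {
            throw new IllegalArgumentException("Unsupported image type: " + contentType);
        }
        if (sizeInBytes <= 0 || sizeInBytes > MAX_BYTES) {
            throw new IllegalArgumentException("Image size out of range: " + sizeInBytes);
        }
    }
}
```

In analyzeImageFile this could be called as ImageUploadValidator.validate(imageFile.getContentType(), imageFile.getSize()) before the bytes are read.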

REST Controller for Image Analysis

@RestController
@RequestMapping("/vision")
public class VisionController {

    private final VisionService visionService;

    public VisionController(VisionService visionService) {
        this.visionService = visionService;
    }

    @PostMapping("/analyze")
    public String analyzeUpload(
            @RequestParam("image") MultipartFile image,
            @RequestParam("question") String question) throws IOException {
        return visionService.analyzeImageFile(image, question);
    }

    @PostMapping("/url")
    public String analyzeUrl(
            @RequestParam String url,
            @RequestParam String question) throws MalformedURLException {
        return visionService.analyzeImageUrl(url, question);
    }
}

Practical Use Cases with Examples

// 1. UI Screenshot Analysis
String uiReview = visionService.analyzeImageFile(screenshot,
        "Describe the UI layout. Identify any usability issues. " +
        "Is the button hierarchy clear? Are there accessibility concerns?");

// Output:
// The screenshot shows a dashboard with a top navigation bar and three data cards.
// Issue: The 'Delete' button is red and prominently placed next to 'Save' — high risk
// of accidental deletion. Consider adding a confirmation dialog.
// Accessibility: No alt text visible on the chart images.

// 2. Diagram to Code
String entityCode = visionService.analyzeImageFile(erDiagram,
        "Convert this ER diagram to JPA entity classes with proper annotations.");

// Output:
// @Entity public class Customer {
//     @Id @GeneratedValue Long id;
//     String name, email;
//     @OneToMany List<Order> orders;
// }
// ...

// 3. Error Screenshot Debugging
String diagnosis = visionService.analyzeImageFile(errorScreenshot,
        "This is a Spring Boot error page. What caused this exception, and how can it be fixed?");

// Output:
// The error is: java.lang.NullPointerException at UserService.java:42
// Cause: 'user' is null — you're calling user.getName() without null check.
// Fix: Add @NotNull validation or check: if (user != null) before accessing it.

Structured Output from Image

public record InvoiceData(
        String  invoiceNumber,
        String  date,
        String  vendorName,
        double  totalAmount,
        String  currency
) {}

public InvoiceData extractInvoiceData(MultipartFile invoiceImage) throws IOException {
    byte[] bytes = invoiceImage.getBytes();
    // Use the upload's own content type; fall back to JPEG if the client sent none
    String contentType = invoiceImage.getContentType() != null
            ? invoiceImage.getContentType() : "image/jpeg";

    UserMessage message = new UserMessage(
            "Extract invoice data from this image. Return structured data.",
            List.of(new Media(MimeType.valueOf(contentType), new ByteArrayResource(bytes)))
    );

    return chatClient.prompt()
            .messages(message)
            .call()
            .entity(InvoiceData.class);   // structured output also works with multimodal input
}
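
A record returned by .entity(...) is only as reliable as the model's reading of the image, so a few sanity checks before the data is stored can help. The sketch below is one possible approach. InvoiceDataChecks is a hypothetical helper, the record is repeated only to keep the example self-contained, and the checks assume the prompt asks for ISO dates (yyyy-MM-dd) and ISO 4217 currency codes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Currency;
import java.util.List;

// Same shape as the record above, repeated so the example compiles on its own.
record InvoiceData(String invoiceNumber, String date, String vendorName,
                   double totalAmount, String currency) {}

/** Hypothetical sanity checks for model-extracted invoice data. */
final class InvoiceDataChecks {

    private InvoiceDataChecks() {}

    /** Returns a list of human-readable problems; an empty list means no issue was found. */
    static List<String> problems(InvoiceData data) {
        List<String> problems = new ArrayList<>();
        if (data.invoiceNumber() == null || data.invoiceNumber().isBlank()) {
            problems.add("missing invoice number");
        }
        try {
            LocalDate.parse(data.date());            // expects yyyy-MM-dd
        } catch (DateTimeParseException | NullPointerException e) {
            problems.add("date is not in yyyy-MM-dd format");
        }
        if (data.totalAmount() < 0) {
            problems.add("negative total amount");
        }
        try {
            Currency.getInstance(data.currency());   // expects an ISO 4217 code such as "USD"
        } catch (IllegalArgumentException | NullPointerException e) {
            problems.add("currency is not an ISO 4217 code");
        }
        return problems;
    }
}
```

A caller might log the returned problems and route the invoice to manual review whenever the list is not empty.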

Output

// analyzeImageUrl(architectureDiagram, "Explain this system architecture")

This diagram shows a microservices architecture with:
1. API Gateway (entry point) routing to 3 services
2. User Service → PostgreSQL database
3. Order Service → MySQL + Redis cache
4. Notification Service → message queue (RabbitMQ)
5. All services report to a centralized logging system (ELK stack)

The design follows CQRS pattern — read and write paths are separated in the Order Service.
Potential concern: No circuit breaker shown between API Gateway and services.

Local Vision with Ollama (LLaVA)

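# application-local.properties
# Download the model first: ollama pull llava
spring.ai.ollama.chat.options.model=llava

// Same service code. Spring AI handles the provider difference, assuming the
// Spring AI Ollama starter is on the classpath in place of the OpenAI starter.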

Key Points

Pass images as Media objects inside a UserMessage. Spring AI converts them to each provider's request format (for example, base64-encoding raw image bytes)
Structured output (.entity(MyClass.class)) works with multimodal inputs, so you can extract typed data from images
Image resolution affects token usage: GPT-4o charges roughly 170 tokens per 512x512 tile, so large images cost more
For privacy-sensitive images, use LLaVA via a locally hosted Ollama instance, so images never leave your server
Always validate image MIME type and size on the server before passing the image to the AI model
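
The tile-based cost mentioned above can be estimated before an image is sent. The sketch below assumes the high-detail rules OpenAI has published for GPT-4o: fit the image within 2048x2048, shrink it so the shortest side is at most 768 px, then charge 170 tokens per 512 px tile plus a fixed 85-token base. The base cost and the resizing steps are assumptions that do not come from this article, ImageTokenEstimator is a hypothetical name, and actual billing may differ.

```java
/** Hypothetical estimator for GPT-4o image input tokens in high-detail mode. */
final class ImageTokenEstimator {

    private ImageTokenEstimator() {}

    static int estimateTokens(int width, int height) {
        double w = width;
        double h = height;

        // Step 1: fit within 2048 x 2048 while keeping the aspect ratio.
        double fit = Math.min(1.0, 2048.0 / Math.max(w, h));
        w *= fit;
        h *= fit;

        // Step 2: shrink so the shortest side is at most 768 px (small images are not enlarged).
        double shrink = Math.min(1.0, 768.0 / Math.min(w, h));
        w *= shrink;
        h *= shrink;

        // Step 3: count the 512 px tiles needed to cover the resized image.
        int tiles = (int) Math.ceil(w / 512.0) * (int) Math.ceil(h / 512.0);

        // 170 tokens per tile plus an assumed fixed base of 85 tokens.
        return 85 + 170 * tiles;
    }
}
```

Under these assumptions a 1024x1024 image is resized to 768x768, covers four tiles, and is estimated at 765 tokens, while a 512x512 image covers one tile and is estimated at 255 tokens.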
