Spring AI Multimodal — Image Analysis and Vision with GPT-4o and Claude

Modern LLMs can process both text and images in the same request. Spring AI supports multimodal input, letting you send images alongside text to vision models like GPT-4o and Claude Opus. This enables use cases like UI screenshot analysis, document image extraction, diagram explanation, and visual code review.

Models with Vision Support

Model              Provider    Vision   Context
────────────────────────────────────────────────────
gpt-4o             OpenAI      Yes      128k tokens
gpt-4o-mini        OpenAI      Yes      128k tokens
claude-opus-4-5    Anthropic   Yes      200k tokens
claude-sonnet-4-5  Anthropic   Yes      200k tokens
llava              Ollama      Yes      4k tokens (local)

Maven Dependency

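<!-- No <version> here: it is expected to come from the spring-ai-bom in dependencyManagement. -->
<!-- In Spring AI 1.0 GA this starter was renamed to spring-ai-starter-model-openai. -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>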

application.properties

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o

Analyze Image from URL

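import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.UserMessage;
// Media's package depends on the Spring AI version (org.springframework.ai.content.Media in 1.0 GA)
import org.springframework.ai.chat.messages.Media;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeTypeUtils;

@Service
public class VisionService {

    private final ChatClient chatClient;

    public VisionService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // new URL(...) throws MalformedURLException for an invalid URL, so the method declares it
    public String analyzeImageUrl(String imageUrl, String question) throws MalformedURLException {
        UserMessage userMessage = new UserMessage(
                question,
                // The MIME type is hard-coded here; pass the image's real type if it is not a JPEG
                List.of(new Media(MimeTypeUtils.IMAGE_JPEG, new URL(imageUrl)))
        );

        return chatClient.prompt()
                .messages(userMessage)
                .call()
                .content();
    }
}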
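
The example above labels every URL as JPEG. One way to avoid that is to derive the type from the file extension in the URL path. This is a sketch using only the JDK: ImageUrlMimeTypes and its extension table are hypothetical, and an extension is only a hint, since the server may return something else.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Spring AI: guesses an image MIME type from a URL's file extension.
final class ImageUrlMimeTypes {

    // Illustrative table; extend it to match the formats your model accepts.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "png", "image/png",
            "gif", "image/gif",
            "webp", "image/webp");

    // Returns the guessed MIME type, or empty if the path has no known image extension.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static Optional<String> fromUrl(String imageUrl) {
        String path = URI.create(imageUrl).getPath(); // path only, query string excluded
        if (path == null) {
            return Optional.empty();
        }
        int dot = path.lastIndexOf('.');
        // Ignore dots that belong to a directory name rather than the file name
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return Optional.empty();
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(extension));
    }
}
```

In analyzeImageUrl, the result could replace MimeTypeUtils.IMAGE_JPEG, for example via MimeType.valueOf(...), with a fallback or an error when the Optional is empty.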

Analyze Uploaded Image File

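// Add to VisionService. Extra imports: java.io.IOException, org.springframework.core.io.ByteArrayResource,
// org.springframework.util.MimeType, org.springframework.web.multipart.MultipartFile
public String analyzeImageFile(MultipartFile imageFile, String question) throws IOException {
    // Read image bytes
    byte[] imageBytes = imageFile.getBytes();
    // The content type is supplied by the client and may be null, so reject uploads without one
    String mimeType = imageFile.getContentType(); // e.g. "image/png"
    if (mimeType == null) {
        throw new IllegalArgumentException("Uploaded file has no content type");
    }

    UserMessage userMessage = new UserMessage(
            question,
            List.of(new Media(
                    MimeType.valueOf(mimeType),
                    new ByteArrayResource(imageBytes)
            ))
    );

    return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
}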
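
The content type reported by MultipartFile comes from the client, so it is only a claim. A common safeguard is to check the file's leading bytes against known image signatures and to cap the upload size before calling the model. The sketch below uses only the JDK. ImageSniffer and the 5 MB limit are hypothetical choices, and it recognizes just PNG, JPEG, GIF, and WebP.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of Spring AI: identifies common image formats by their magic bytes.
final class ImageSniffer {

    // Assumed upload cap for illustration; pick a limit that fits your provider and budget.
    static final int MAX_BYTES = 5 * 1024 * 1024;

    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    // Returns the detected MIME type, or empty if the bytes match none of the known signatures.
    static Optional<String> sniffMimeType(byte[] data) {
        if (startsWith(data, 0, PNG)) return Optional.of("image/png");
        if (startsWith(data, 0, JPEG)) return Optional.of("image/jpeg");
        if (startsWith(data, 0, ascii("GIF8"))) return Optional.of("image/gif");
        // WebP is a RIFF container: "RIFF", a 4-byte length, then "WEBP"
        if (startsWith(data, 0, ascii("RIFF")) && startsWith(data, 8, ascii("WEBP"))) {
            return Optional.of("image/webp");
        }
        return Optional.empty();
    }

    // Rejects empty, oversized, or unrecognized uploads; returns the detected MIME type otherwise.
    static String requireSupportedImage(byte[] data) {
        if (data.length == 0 || data.length > MAX_BYTES) {
            throw new IllegalArgumentException("Image is empty or larger than " + MAX_BYTES + " bytes");
        }
        return sniffMimeType(data)
                .orElseThrow(() -> new IllegalArgumentException("Unsupported image format"));
    }

    private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (data.length < offset + prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) return false;
        }
        return true;
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }
}
```

In analyzeImageFile, the value returned by requireSupportedImage(imageBytes) could be passed to MimeType.valueOf(...) in place of the client-supplied content type.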

REST Controller for Image Analysis

@RestController
@RequestMapping("/vision")
public class VisionController {

    private final VisionService visionService;

    public VisionController(VisionService visionService) {
        this.visionService = visionService;
    }

    @PostMapping("/analyze")
    public String analyzeUpload(
            @RequestParam("image") MultipartFile image,
            @RequestParam("question") String question) throws IOException {
        return visionService.analyzeImageFile(image, question);
    }

    @PostMapping("/url")
    public String analyzeUrl(
            @RequestParam String url,
            @RequestParam String question) throws MalformedURLException {
        return visionService.analyzeImageUrl(url, question);
    }
}

Practical Use Cases with Examples

// 1. UI Screenshot Analysis
String result = visionService.analyzeImageFile(screenshot,
        "Describe the UI layout. Identify any usability issues. " +
        "Is the button hierarchy clear? Are there accessibility concerns?");

// Example output (illustrative; real responses vary):
// The screenshot shows a dashboard with a top navigation bar and three data cards.
// Issue: The 'Delete' button is red and prominently placed next to 'Save' — high risk
// of accidental deletion. Consider adding a confirmation dialog.
// Accessibility: No alt text visible on the chart images.

// 2. Diagram to Code
String result = visionService.analyzeImageFile(erDiagram,
        "Convert this ER diagram to JPA entity classes with proper annotations.");

// Example output (illustrative; real responses vary):
// @Entity public class Customer {
//     @Id @GeneratedValue Long id;
//     String name, email;
//     @OneToMany List<Order> orders;
// }
// ...

// 3. Error Screenshot Debugging
String result = visionService.analyzeImageFile(errorScreenshot,
        "This is a Spring Boot error page. What caused this exception and how do I fix it?");

// Example output (illustrative; real responses vary):
// The error is: java.lang.NullPointerException at UserService.java:42
// Cause: 'user' is null — you're calling user.getName() without null check.
// Fix: Add @NotNull validation or check: if (user != null) before accessing it.

Structured Output from Image

public record InvoiceData(
        String  invoiceNumber,
        String  date,
        String  vendorName,
        double  totalAmount,
        String  currency
) {}

public InvoiceData extractInvoiceData(MultipartFile invoiceImage) throws IOException {
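    byte[] bytes = invoiceImage.getBytes();
    // Use the upload's own content type instead of assuming JPEG; fall back to JPEG if the client sent none
    String contentType = invoiceImage.getContentType();
    MimeType mimeType = contentType != null ? MimeType.valueOf(contentType) : MimeTypeUtils.IMAGE_JPEG;

    UserMessage message = new UserMessage(
            "Extract invoice data from this image. Return structured data.",
            List.of(new Media(mimeType, new ByteArrayResource(bytes)))
    );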

    return chatClient.prompt()
            .messages(message)
            .call()
            .entity(InvoiceData.class);   // structured output works with multimodal!
}
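
Values returned by .entity(...) are still model output, so it can help to sanity-check them before use. The sketch below is one possible check, not part of Spring AI. The InvoiceChecks class is hypothetical, the record is repeated so the example compiles on its own, and it assumes the model was asked to return the date as yyyy-MM-dd and the currency as an ISO 4217 code (the prompt above requests neither).

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Currency;
import java.util.List;

// Same shape as the record above, repeated so this example is self-contained.
record InvoiceData(String invoiceNumber, String date, String vendorName,
                   double totalAmount, String currency) {}

// Hypothetical helper: collects plausibility problems in extracted invoice data.
final class InvoiceChecks {

    // Returns human-readable problems; an empty list means the data looks plausible.
    static List<String> problems(InvoiceData invoice) {
        List<String> problems = new ArrayList<>();
        if (invoice.invoiceNumber() == null || invoice.invoiceNumber().isBlank()) {
            problems.add("invoiceNumber is missing");
        }
        if (!isIsoDate(invoice.date())) {
            problems.add("date is not in yyyy-MM-dd format");
        }
        // Written as a negated comparison so that NaN is rejected as well
        if (!(invoice.totalAmount() >= 0)) {
            problems.add("totalAmount is negative or not a number");
        }
        if (!isIsoCurrency(invoice.currency())) {
            problems.add("currency is not an ISO 4217 code");
        }
        return problems;
    }

    private static boolean isIsoDate(String value) {
        if (value == null) return false;
        try {
            LocalDate.parse(value); // default parser expects ISO-8601, e.g. 2024-03-15
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    private static boolean isIsoCurrency(String code) {
        if (code == null) return false;
        try {
            Currency.getInstance(code); // throws for codes the JDK does not know
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

A caller might log the problems and route the invoice to manual review instead of trusting the extracted values.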

Example Output

// analyzeImageUrl(architectureDiagram, "Explain this system architecture")

This diagram shows a microservices architecture with:
1. API Gateway (entry point) routing to 3 services
2. User Service → PostgreSQL database
3. Order Service → MySQL + Redis cache
4. Notification Service → message queue (RabbitMQ)
5. All services report to a centralized logging system (ELK stack)

The design follows CQRS pattern — read and write paths are separated in the Order Service.
Potential concern: No circuit breaker shown between API Gateway and services.

Local Vision with Ollama (LLaVA)

# application-local.properties
# Requires the Ollama starter (spring-ai-ollama-spring-boot-starter) on the classpath
spring.ai.ollama.chat.options.model=llava
# ollama pull llava   ← download the model first

// Same service code — Spring AI handles the provider difference

Key Points

  • Pass images as Media objects inside a UserMessage. Spring AI converts them to each provider's request format (for example, raw image bytes are sent base64-encoded, while OpenAI can also take an image URL)
  • Structured output (.entity(MyClass.class)) works with multimodal inputs — extract typed data from images
  • Image resolution affects token usage. In high-detail mode, OpenAI documents a GPT-4o cost of 85 base tokens plus 170 tokens per 512x512 tile, so large images cost more
  • For privacy-sensitive images, use LLaVA via Ollama — images never leave your server
  • Always validate image MIME type and size on the server before passing to the AI model
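
To make the token cost concrete, here is a rough estimator based on the tile-counting rule OpenAI documents for GPT-4o in high-detail mode: the image is scaled to fit within 2048x2048, its shortest side is reduced to 768 px, and the cost is 85 base tokens plus 170 per 512x512 tile. The class and method names are hypothetical, other models and providers count differently, and the sketch assumes images are only ever scaled down. Treat the result as an estimate, not a billing figure.

```java
// Hypothetical helper, not part of Spring AI: estimates GPT-4o high-detail image tokens.
final class VisionTokenEstimator {

    private static final int BASE_TOKENS = 85;
    private static final int TOKENS_PER_TILE = 170;
    private static final int TILE = 512;

    static int estimateHighDetailTokens(int width, int height) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("Image dimensions must be positive");
        }
        double w = width;
        double h = height;

        // Step 1: shrink to fit inside a 2048x2048 square, keeping the aspect ratio
        double longest = Math.max(w, h);
        if (longest > 2048) {
            double scale = 2048 / longest;
            w *= scale;
            h *= scale;
        }

        // Step 2: shrink so the shortest side is at most 768 px (assumption: never scale up)
        double shortest = Math.min(w, h);
        if (shortest > 768) {
            double scale = 768 / shortest;
            w *= scale;
            h *= scale;
        }

        // Step 3: count the 512 px tiles needed to cover the scaled image
        long tilesX = tiles(Math.round(w));
        long tilesY = tiles(Math.round(h));
        return (int) (BASE_TOKENS + TOKENS_PER_TILE * tilesX * tilesY);
    }

    // Ceiling division by the tile size, with at least one tile per axis
    private static long tiles(long pixels) {
        return Math.max(1, (pixels + TILE - 1) / TILE);
    }
}
```

For example, a 1024x1024 image is scaled to 768x768, which covers four tiles: 85 + 4 x 170 = 765 tokens. Resizing uploads before sending them is therefore a cheap way to control cost.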