Java9R: Spring AI Document Intelligence — Extract Structured Data from PDFs and Documents

Document intelligence goes beyond RAG: instead of answering questions about documents, you extract structured data into Java records and feed those records into your business logic. Spring AI covers both halves of that pipeline. Its document readers pull text out of files, and its structured output converters map the model's response onto typed objects.

Use Cases

Document Type         Extract To                      Business Value
──────────────────────────────────────────────────────────────────────────
Invoice PDF           Invoice (amount, vendor)        Automated AP processing
Resume/CV             CandidateProfile                HR screening automation
Bank Statement        List<Transaction>               Financial analysis
Contract PDF          ContractTerms (dates, parties)  Legal review automation
Support Ticket email  TicketData (priority, category) Auto-routing
Medical Form PDF      PatientRecord                   EMR data entry

Maven Dependencies

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>

The dependencies omit <version> because they assume the Spring AI BOM (spring-ai-bom) is imported under dependencyManagement. The starter artifact ID shown is the pre-1.0 milestone name; Spring AI 1.0 renamed it to spring-ai-starter-model-openai.

Structured Extraction Records

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;

// Invoice extraction target
public record Invoice(
        String invoiceNumber,
        String vendorName,
        String vendorEmail,
        LocalDate invoiceDate,
        LocalDate dueDate,
        BigDecimal subtotal,
        BigDecimal taxAmount,
        BigDecimal totalAmount,
        String currency,
        List<LineItem> lineItems
) {}

public record LineItem(
        String description,
        Integer quantity,  // boxed so an absent value stays null instead of becoming 0
        BigDecimal unitPrice,
        BigDecimal totalPrice
) {}

// Resume extraction target
public record CandidateProfile(
        String fullName,
        String email,
        String phone,
        Integer yearsOfExperience,  // boxed so an absent value stays null instead of becoming 0
        List<String> skills,
        String currentRole,
        String currentCompany,
        List<String> certifications,
        String highestEducation
) {}

Document Extraction Service

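import java.io.IOException;
import java.util.stream.Collectors;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

@Service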
public class DocumentIntelligenceService {

    private final ChatClient chatClient;

    public DocumentIntelligenceService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("""
                    You are a precise document data extractor.
                    Extract only what is explicitly stated in the document.
                    Use null for fields not found in the document.
                    Never infer or assume values not present in the text.
                    """)
                .defaultOptions(OpenAiChatOptions.builder()
                        .withModel("gpt-4o")
                        .withTemperature(0.0f)  // minimize sampling variation (not a hard determinism guarantee)
                        .build())
                .build();
    }

    // Extract from PDF file
    public Invoice extractInvoice(MultipartFile pdfFile) throws IOException {
        String documentText = readPdf(pdfFile.getResource());

        return chatClient.prompt()
                .user("Extract all invoice data from this document:\n\n" + documentText)
                .call()
                .entity(Invoice.class);
    }

    // Extract from any document format (using Tika)
    public CandidateProfile extractResume(MultipartFile file) throws IOException {
        String documentText = readWithTika(file.getResource());

        return chatClient.prompt()
                .user("Extract the candidate profile from this resume:\n\n" + documentText)
                .call()
                .entity(CandidateProfile.class);
    }

    private String readPdf(Resource resource) {
        PagePdfDocumentReader reader = new PagePdfDocumentReader(resource,
                PdfDocumentReaderConfig.builder()
                        .withPagesPerDocument(1000)  // group up to 1000 pages into a single Document
                        .build());

        return reader.get().stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n"));
    }

    private String readWithTika(Resource resource) {
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        return reader.get().stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n"));
    }
}

Batch Document Processing

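import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
public class BatchDocumentProcessor {

    private final DocumentIntelligenceService extractor;
    private final InvoiceRepository           invoiceRepo;

    // Constructor injection; the final fields must be assigned here
    public BatchDocumentProcessor(DocumentIntelligenceService extractor,
                                  InvoiceRepository invoiceRepo) {
        this.extractor = extractor;
        this.invoiceRepo = invoiceRepo;
    }

    // Process multiple invoices from a directory
    public BatchResult processInvoiceDirectory(Path directory) throws IOException {
        List<Path> pdfFiles;
        // Files.list keeps the directory open until the stream is closed
        try (Stream<Path> paths = Files.list(directory)) {
            pdfFiles = paths
                    .filter(p -> p.toString().endsWith(".pdf"))
                    .toList();
        }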

        int success = 0, failed = 0;

        for (Path pdfPath : pdfFiles) {
            try {
                // Load as Spring Resource
                Resource resource = new FileSystemResource(pdfPath);

                // Extract invoice data; toMockMultipart is a helper (not shown)
                // that wraps the Resource in a MultipartFile
                Invoice invoice = extractor.extractInvoice(
                        toMockMultipart(resource));

                // Validate extraction
                if (invoice.invoiceNumber() == null || invoice.totalAmount() == null) {
                    System.out.println("Incomplete extraction: " + pdfPath.getFileName());
                    failed++;
                    continue;
                }

                // Save to database; toEntity is a mapping helper (not shown)
                invoiceRepo.save(toEntity(invoice));
                System.out.println("Processed: " + invoice.invoiceNumber());
                success++;

            } catch (Exception e) {
                System.out.printf("Failed to process %s: %s%n",
                        pdfPath.getFileName(), e.getMessage());
                failed++;
            }
        }

        return new BatchResult(success, failed, pdfFiles.size());
    }
}

record BatchResult(int success, int failed, int total) {}
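
The null check in the batch loop catches missing fields but not values that were misread. A minimal sketch of an additional consistency check follows, assuming that an invoice's subtotal plus tax should equal its total. The names InvoiceTotals and InvoiceChecks and the 0.01 tolerance are hypothetical, chosen for illustration, and are not part of Spring AI.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, trimmed-down view of the Invoice record: only the money fields.
record InvoiceTotals(BigDecimal subtotal, BigDecimal taxAmount, BigDecimal totalAmount) {}

final class InvoiceChecks {

    // Assumed rounding tolerance of one minor currency unit.
    private static final BigDecimal TOLERANCE = new BigDecimal("0.01");

    private InvoiceChecks() {}

    /** Returns a list of problems; an empty list means no inconsistency was found. */
    static List<String> check(InvoiceTotals t) {
        List<String> problems = new ArrayList<>();
        if (t.totalAmount() == null) {
            problems.add("totalAmount is missing");
            return problems;
        }
        if (t.subtotal() == null || t.taxAmount() == null) {
            // Cannot cross-check without both parts; not treated as an error here.
            return problems;
        }
        BigDecimal expected = t.subtotal().add(t.taxAmount());
        BigDecimal diff = expected.subtract(t.totalAmount()).abs();
        if (diff.compareTo(TOLERANCE) > 0) {
            problems.add("subtotal + taxAmount = " + expected.toPlainString()
                    + " but totalAmount = " + t.totalAmount().toPlainString());
        }
        return problems;
    }
}
```

A non-empty result could be counted as a failed extraction in the batch loop, or used to send the invoice to manual review instead of saving it.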

Confidence Scoring

public record ExtractionWithConfidence<T>(
        T data,
        double confidence,
        List<String> missingFields,
        String warningMessage
) {}

// Ask AI to assess its own extraction confidence
public ExtractionWithConfidence<Invoice> extractWithConfidence(String documentText) {
    // Step 1: Extract the invoice
    Invoice invoice = chatClient.prompt()
            .user("Extract invoice data:\n" + documentText)
            .call()
            .entity(Invoice.class);

    // Step 2: Self-assess confidence
    String confidenceCheck = chatClient.prompt()
            .user("""
                  You extracted this invoice: %s
                  From this document: %s (first 500 chars)

                  Rate your extraction confidence 0.0-1.0.
                  List any fields you were uncertain about.
                  Format: {"confidence": 0.85, "uncertainFields": ["dueDate", "taxAmount"]}
                  """.formatted(invoice, documentText.substring(0, Math.min(500, documentText.length()))))
            .call()
            .content();

    // Parse confidence JSON
    double confidence = parseConfidence(confidenceCheck);
    List<String> uncertain = parseUncertainFields(confidenceCheck);

    return new ExtractionWithConfidence<>(invoice, confidence, uncertain,
            confidence < 0.7 ? "Low confidence — human review recommended" : null);
}
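
The method above calls parseConfidence and parseUncertainFields without showing them. One possible sketch follows. It uses only java.util.regex so that it runs without extra dependencies, and it assumes the model replies with the flat JSON shape requested in the prompt. The class name ConfidenceParser is hypothetical. In a Spring application, Jackson's ObjectMapper or a second .entity(...) call would likely be more robust than regular expressions. A model's self-reported confidence is also not a calibrated probability, so treat it as a rough triage signal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConfidenceParser {

    // Matches "confidence": 0.85 (also accepts 1 or .5)
    private static final Pattern CONFIDENCE =
            Pattern.compile("\"confidence\"\\s*:\\s*([0-9]*\\.?[0-9]+)");
    // Captures the raw contents of the "uncertainFields" array
    private static final Pattern UNCERTAIN =
            Pattern.compile("\"uncertainFields\"\\s*:\\s*\\[(.*?)\\]", Pattern.DOTALL);
    // Matches each quoted string inside that array
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    private ConfidenceParser() {}

    /** Returns the reported confidence clamped to [0, 1], or 0.0 if none is found. */
    static double parseConfidence(String reply) {
        Matcher m = CONFIDENCE.matcher(reply);
        if (!m.find()) {
            return 0.0; // an unparseable reply is treated as lowest confidence
        }
        double value = Double.parseDouble(m.group(1));
        return Math.max(0.0, Math.min(1.0, value));
    }

    /** Returns the listed field names, or an empty list if the array is absent. */
    static List<String> parseUncertainFields(String reply) {
        List<String> fields = new ArrayList<>();
        Matcher array = UNCERTAIN.matcher(reply);
        if (array.find()) {
            Matcher item = QUOTED.matcher(array.group(1));
            while (item.find()) {
                fields.add(item.group(1));
            }
        }
        return fields;
    }
}
```

Returning 0.0 for an unparseable reply is a deliberate choice in this sketch: it pushes the document below the 0.7 threshold and into human review rather than letting it pass silently.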

Output

// extractInvoice(invoice.pdf)
Invoice[
  invoiceNumber=INV-2026-00234,
  vendorName=Tech Solutions Pvt Ltd,
  vendorEmail=billing@techsolutions.com,
  invoiceDate=2026-06-01,
  dueDate=2026-06-30,
  subtotal=45000.00,
  taxAmount=8100.00,
  totalAmount=53100.00,
  currency=INR,
  lineItems=[
    LineItem[description=Spring Boot Training, quantity=1, unitPrice=45000.00, totalPrice=45000.00]
  ]
]

// Batch processing 10 invoices:
Processed: INV-2026-00234
Processed: INV-2026-00235
Incomplete extraction: damaged-invoice.pdf
BatchResult[success=9, failed=1, total=10]

Key Points

Use temperature=0.0 for extraction: it minimizes sampling variation so repeated runs give near-identical output, although the API does not guarantee fully deterministic results
GPT-4o generally handles complex documents (multi-page invoices, nested line items) better than GPT-4o-mini; compare both on your own documents before deciding whether the quality difference justifies the cost for business-critical data
Always validate required fields after extraction and route to human review when confidence is low or required fields are null
Tika reads more than a thousand file formats (PDF, DOCX, Excel, email), so it can serve as a universal reader in place of format-specific handling code
For high-volume document processing (1000+/day), run extraction asynchronously, for example via Kafka, so that 5-15 second AI calls do not block HTTP threads
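
The advice on validating required fields can be made generic. A minimal sketch follows, assuming the extraction targets are Java records (Java 16+). It uses reflection to list every component that came back null, so the result could populate missingFields or trigger human review. MissingFields, SampleInvoice and the required-field lists are hypothetical names for illustration. Primitive components such as int are never reported, because they cannot hold null.

```java
import java.lang.reflect.RecordComponent;
import java.util.ArrayList;
import java.util.List;

final class MissingFields {

    private MissingFields() {}

    /** Lists the names of record components whose value is null, in declaration order. */
    static List<String> of(Record record) {
        List<String> missing = new ArrayList<>();
        for (RecordComponent c : record.getClass().getRecordComponents()) {
            try {
                if (c.getAccessor().invoke(record) == null) {
                    missing.add(c.getName());
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + c.getName(), e);
            }
        }
        return missing;
    }

    /** True when at least one of the given required fields is null. */
    static boolean needsReview(Record record, List<String> requiredFields) {
        List<String> missing = of(record);
        return requiredFields.stream().anyMatch(missing::contains);
    }
}

// Hypothetical, shortened stand-in for the Invoice record.
record SampleInvoice(String invoiceNumber, String vendorName, String currency) {}
```

For example, MissingFields.needsReview(invoice, List.of("invoiceNumber", "totalAmount")) would express the same rule as the hand-written null check in the batch processor, without repeating it for each record type.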
