Spring AI Document Intelligence — Extract Structured Data from PDFs and Documents
Document intelligence goes beyond RAG: instead of answering questions about documents, you extract structured data from them, map it onto Java records, and feed those records into your business logic. Spring AI's document readers and structured output converters cover the two halves of that pipeline, reading text out of files and turning the model's reply into typed objects.
Use Cases
Document Type          Extract To                         Business Value
──────────────────────────────────────────────────────────────────────────────
Invoice PDF            InvoiceRecord (amount, vendor)     Automated AP processing
Resume/CV              CandidateProfile                   HR screening automation
Bank Statement         List<Transaction>                  Financial analysis
Contract PDF           ContractTerms (dates, parties)     Legal review automation
Support Ticket email   TicketData (priority, category)    Auto-routing
Medical Form PDF       PatientRecord                      EMR data entry
Maven Dependencies
<!-- No versions are given here, which assumes the Spring AI BOM (spring-ai-bom)
     is imported in your dependencyManagement section -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
Structured Extraction Records
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;

// Invoice extraction target
public record Invoice(
    String invoiceNumber,
    String vendorName,
    String vendorEmail,
    LocalDate invoiceDate,
    LocalDate dueDate,
    BigDecimal subtotal,
    BigDecimal taxAmount,
    BigDecimal totalAmount,
    String currency,
    List<LineItem> lineItems
) {}

public record LineItem(
    String description,
    Integer quantity,       // Integer, not int: a value missing from the document stays null
    BigDecimal unitPrice,
    BigDecimal totalPrice
) {}

// Resume extraction target
public record CandidateProfile(
    String fullName,
    String email,
    String phone,
    Integer yearsOfExperience,   // nullable for the same reason as quantity
    List<String> skills,
    String currentRole,
    String currentCompany,
    List<String> certifications,
    String highestEducation
) {}
Document Extraction Service
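import java.io.IOException;
import java.util.stream.Collectors;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

@Service
public class DocumentIntelligenceService {

    private final ChatClient chatClient;

    public DocumentIntelligenceService(ChatClient.Builder builder) {
        // The withModel/withTemperature names match the milestone-era starter listed above;
        // later Spring AI releases rename these builder methods (model(...), temperature(...)).
        this.chatClient = builder
            .defaultSystem("""
                You are a precise document data extractor.
                Extract only what is explicitly stated in the document.
                Use null for fields not found in the document.
                Never infer or assume values not present in the text.
                """)
            .defaultOptions(OpenAiChatOptions.builder()
                .withModel("gpt-4o")
                .withTemperature(0.0f) // minimize sampling variation between runs
                .build())
            .build();
    }

    // Extract from PDF file
    public Invoice extractInvoice(MultipartFile pdfFile) throws IOException {
        String documentText = readPdf(pdfFile.getResource());
        return chatClient.prompt()
            .user("Extract all invoice data from this document:\n\n" + documentText)
            .call()
            .entity(Invoice.class); // maps the model's JSON reply onto the record
    }

    // Extract from any document format (using Tika)
    public CandidateProfile extractResume(MultipartFile file) throws IOException {
        String documentText = readWithTika(file.getResource());
        return chatClient.prompt()
            .user("Extract the candidate profile from this resume:\n\n" + documentText)
            .call()
            .entity(CandidateProfile.class);
    }

    private String readPdf(Resource resource) {
        PagePdfDocumentReader reader = new PagePdfDocumentReader(resource,
            PdfDocumentReaderConfig.builder()
                .withPagesPerDocument(1000) // groups up to 1000 pages, so most PDFs become one document
                .build());
        // Join the text of every Document the reader produced
        return reader.get().stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n"));
    }

    private String readWithTika(Resource resource) {
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        return reader.get().stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n"));
    }
}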
Batch Document Processing
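import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
public class BatchDocumentProcessor {

    private final DocumentIntelligenceService extractor;
    private final InvoiceRepository invoiceRepo;

    // Constructor injection: the final fields must be assigned somewhere
    public BatchDocumentProcessor(DocumentIntelligenceService extractor,
                                  InvoiceRepository invoiceRepo) {
        this.extractor = extractor;
        this.invoiceRepo = invoiceRepo;
    }

    // Process multiple invoices from a directory
    public BatchResult processInvoiceDirectory(Path directory) throws IOException {
        List<Path> pdfFiles;
        // Files.list keeps a directory handle open until the stream is closed
        try (Stream<Path> entries = Files.list(directory)) {
            pdfFiles = entries
                .filter(p -> p.toString().endsWith(".pdf"))
                .toList();
        }

        int success = 0, failed = 0;
        for (Path pdfPath : pdfFiles) {
            try {
                // Load as Spring Resource
                Resource resource = new FileSystemResource(pdfPath);

                // Extract invoice data (toMockMultipart is an application helper
                // that wraps the Resource as a MultipartFile)
                Invoice invoice = extractor.extractInvoice(
                    toMockMultipart(resource));

                // Validate extraction: both fields are required downstream
                if (invoice.invoiceNumber() == null || invoice.totalAmount() == null) {
                    System.out.println("Incomplete extraction: " + pdfPath.getFileName());
                    failed++;
                    continue;
                }

                // Save to database (toEntity maps the record to your JPA entity)
                invoiceRepo.save(toEntity(invoice));
                System.out.println("Processed: " + invoice.invoiceNumber());
                success++;
            } catch (Exception e) {
                // One bad file should not abort the whole batch
                System.out.printf("Failed to process %s: %s%n",
                    pdfPath.getFileName(), e.getMessage());
                failed++;
            }
        }
        return new BatchResult(success, failed, pdfFiles.size());
    }
}

record BatchResult(int success, int failed, int total) {}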
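The null check in the batch loop catches missing fields, but an invoice can be fully populated and still wrong. One way to extend the validation step is to cross-check the extracted amounts against each other. The sketch below is an illustration rather than part of Spring AI: the class name `InvoiceChecks`, the `validate` method, and the 0.01 rounding tolerance are assumptions, and the records are trimmed-down copies of the ones above so the example compiles on its own.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class InvoiceChecks {

    // Trimmed-down copies of the article's records (only the fields checked here).
    record LineItem(String description, Integer quantity, BigDecimal unitPrice, BigDecimal totalPrice) {}
    record Invoice(String invoiceNumber, BigDecimal subtotal, BigDecimal taxAmount,
                   BigDecimal totalAmount, List<LineItem> lineItems) {}

    // Assumed tolerance for rounding differences between printed amounts.
    static final BigDecimal TOLERANCE = new BigDecimal("0.01");

    /** Returns human-readable problems; an empty list means every check passed. */
    static List<String> validate(Invoice inv) {
        List<String> problems = new ArrayList<>();

        // Required fields, same rule as the batch loop
        if (inv.invoiceNumber() == null || inv.invoiceNumber().isBlank()) {
            problems.add("invoiceNumber is missing");
        }
        if (inv.totalAmount() == null) {
            problems.add("totalAmount is missing");
        }

        // subtotal + tax should equal total when all three were extracted
        if (inv.subtotal() != null && inv.taxAmount() != null && inv.totalAmount() != null) {
            BigDecimal expected = inv.subtotal().add(inv.taxAmount());
            if (differs(expected, inv.totalAmount())) {
                problems.add("subtotal + taxAmount = " + expected
                        + " but totalAmount = " + inv.totalAmount());
            }
        }

        // Line item totals should add up to the subtotal
        if (inv.subtotal() != null && inv.lineItems() != null && !inv.lineItems().isEmpty()) {
            BigDecimal sum = BigDecimal.ZERO;
            boolean allPresent = true;
            for (LineItem item : inv.lineItems()) {
                if (item.totalPrice() == null) {
                    allPresent = false;   // cannot compare sums with a missing price
                    break;
                }
                sum = sum.add(item.totalPrice());
            }
            if (allPresent && differs(sum, inv.subtotal())) {
                problems.add("line items sum to " + sum + " but subtotal = " + inv.subtotal());
            }
        }
        return problems;
    }

    private static boolean differs(BigDecimal a, BigDecimal b) {
        return a.subtract(b).abs().compareTo(TOLERANCE) > 0;
    }
}
```

In the batch loop, a non-empty result from `validate` could be treated the same way as a null required field: count the file as failed and log the problems so a person can review it.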
Confidence Scoring
public record ExtractionWithConfidence<T>(
    T data,
    double confidence,
    List<String> uncertainFields,
    String warningMessage
) {}

// Ask the model to assess its own extraction. The score is self-reported,
// so treat it as a rough triage signal rather than a calibrated probability.
public ExtractionWithConfidence<Invoice> extractWithConfidence(String documentText) {
    // Step 1: Extract the invoice
    Invoice invoice = chatClient.prompt()
        .user("Extract invoice data:\n" + documentText)
        .call()
        .entity(Invoice.class);

    // Step 2: Self-assess confidence against the first 500 characters of the source
    String excerpt = documentText.substring(0, Math.min(500, documentText.length()));
    String confidenceCheck = chatClient.prompt()
        .user("""
            You extracted this invoice: %s
            From this document (first 500 chars): %s
            Rate your extraction confidence 0.0-1.0.
            List any fields you were uncertain about.
            Format: {"confidence": 0.85, "uncertainFields": ["dueDate", "taxAmount"]}
            """.formatted(invoice, excerpt))
        .call()
        .content();

    // Parse confidence JSON (parseConfidence and parseUncertainFields are
    // helper methods that read the two keys out of the reply)
    double confidence = parseConfidence(confidenceCheck);
    List<String> uncertain = parseUncertainFields(confidenceCheck);

    return new ExtractionWithConfidence<>(invoice, confidence, uncertain,
        confidence < 0.7 ? "Low confidence: human review recommended" : null);
}
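The method above calls `parseConfidence` and `parseUncertainFields` without showing them. A minimal sketch of the two helpers follows, assuming the model replies roughly in the requested `{"confidence": ..., "uncertainFields": [...]}` shape. It uses regular expressions only to stay dependency-free; a JSON library such as Jackson would be the more robust choice in a real service. The class name `ConfidenceReplyParser` and the decision to fall back to 0.0 are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConfidenceReplyParser {

    // Matches "confidence": 0.85 (also 1, .5, 1.0)
    private static final Pattern CONFIDENCE =
            Pattern.compile("\"confidence\"\\s*:\\s*([0-9]*\\.?[0-9]+)");
    // Captures everything between the brackets of "uncertainFields": [ ... ]
    private static final Pattern UNCERTAIN =
            Pattern.compile("\"uncertainFields\"\\s*:\\s*\\[(.*?)\\]", Pattern.DOTALL);
    // Matches each quoted field name inside the captured array body
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Returns the confidence clamped to [0, 1], or 0.0 when none can be found. */
    static double parseConfidence(String reply) {
        if (reply == null) {
            return 0.0;
        }
        Matcher m = CONFIDENCE.matcher(reply);
        if (!m.find()) {
            return 0.0;   // unparseable reply: treat as lowest confidence
        }
        double value = Double.parseDouble(m.group(1));
        return Math.max(0.0, Math.min(1.0, value));
    }

    /** Returns the listed field names, or an empty list when the key is absent. */
    static List<String> parseUncertainFields(String reply) {
        List<String> fields = new ArrayList<>();
        if (reply == null) {
            return fields;
        }
        Matcher array = UNCERTAIN.matcher(reply);
        if (array.find()) {
            Matcher item = QUOTED.matcher(array.group(1));
            while (item.find()) {
                fields.add(item.group(1));
            }
        }
        return fields;
    }
}
```

Defaulting to 0.0 means a reply that cannot be parsed falls below the 0.7 threshold and is flagged for human review instead of passing silently. Searching for the keys rather than parsing the whole string also tolerates replies where the model wraps the JSON in extra text or a code fence.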
Output
// extractInvoice(invoice.pdf)
Invoice[
    invoiceNumber=INV-2026-00234,
    vendorName=Tech Solutions Pvt Ltd,
    vendorEmail=billing@techsolutions.com,
    invoiceDate=2026-06-01,
    dueDate=2026-06-30,
    subtotal=45000.00,
    taxAmount=8100.00,
    totalAmount=53100.00,
    currency=INR,
    lineItems=[
        LineItem[description=Spring Boot Training, quantity=1, unitPrice=45000.00, totalPrice=45000.00]
    ]
]

// Batch processing 10 invoices:
Processed: INV-2026-00234
Processed: INV-2026-00235
Incomplete extraction: damaged-invoice.pdf
BatchResult[success=9, failed=1, total=10]
Key Points
- Use temperature=0.0 for extraction: you want output that is as reproducible as the model allows, not creative variation
- GPT-4o is generally more reliable than GPT-4o-mini at structured extraction from complex documents; for business-critical data the quality difference usually justifies the cost
- Always validate required fields after extraction and route to human review when confidence is low or required fields are null
- Tika can read 1000+ file formats (PDF, DOCX, Excel, email) — use it as the universal reader and avoid format-specific handling code
- For high-volume document processing (1000+/day), use async extraction via Kafka to avoid blocking HTTP threads for 5-15 second AI calls
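The last point is about not holding a request thread for the duration of a model call. The article recommends Kafka for that; the sketch below is not a Kafka implementation but a simpler in-process stand-in that shows the same submit-now, fetch-later shape using a fixed thread pool. The class `AsyncExtractionQueue`, its method names, and the `Function<String, T>` extractor parameter are all assumptions made for illustration. Unlike a Kafka topic, queued jobs here are lost if the application restarts.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class AsyncExtractionQueue<T> implements AutoCloseable {

    private final ExecutorService workers;
    private final Function<String, T> extractor;   // stand-in for the blocking AI call
    private final Map<String, CompletableFuture<T>> jobs = new ConcurrentHashMap<>();

    AsyncExtractionQueue(int workerThreads, Function<String, T> extractor) {
        // Daemon threads so an unclosed queue never keeps the JVM alive
        this.workers = Executors.newFixedThreadPool(workerThreads, runnable -> {
            Thread t = new Thread(runnable, "extraction-worker");
            t.setDaemon(true);
            return t;
        });
        this.extractor = extractor;
    }

    /** Queues the document and returns a job id immediately, without waiting for the model. */
    String submit(String documentText) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, CompletableFuture.supplyAsync(() -> extractor.apply(documentText), workers));
        return jobId;
    }

    /** Returns the future for a job, or null if the id is unknown. */
    CompletableFuture<T> result(String jobId) {
        return jobs.get(jobId);
    }

    @Override
    public void close() {
        workers.shutdown();   // lets queued jobs finish, accepts no new ones
    }
}
```

An upload endpoint could call `submit`, return the job id to the client right away, and expose a second endpoint that checks `result(jobId).isDone()`. The pool size caps how many model calls run at once, which also helps stay under provider rate limits.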