Spring AI Enterprise Architecture — Production-Grade AI Platform Design
Building a single AI chatbot is straightforward. Building an enterprise AI platform that serves dozens of teams, handles millions of requests, maintains compliance, and evolves without disruption requires deliberate architectural decisions. This tutorial presents a reference architecture for enterprise-scale Spring AI deployments.
Enterprise AI Platform Architecture
                    ┌─────────────────────────────────┐
                    │        API Gateway Layer        │
                    │ (Rate limiting, Auth, Routing)  │
                    └──────────────┬──────────────────┘
                                   │
         ┌─────────────────────────┼──────────────────────────┐
         │                         │                          │
┌────────▼────────┐    ┌───────────▼─────────┐     ┌──────────▼────────┐
│ AI Orchestrator │    │ Knowledge Service   │     │ Model Registry    │
│ (routing,       │    │ (RAG + embeddings)  │     │ (model versions,  │
│  fallback,      │    │                     │     │  A/B testing)     │
│  cost tracking) │    │                     │     │                   │
└────────┬────────┘    └────────┬────────────┘     └──────────┬────────┘
         │                      │                             │
┌────────▼────────────────────────────────────────────────────▼────────┐
│                      Provider Abstraction Layer                      │
│          Spring AI ChatClient, VectorStore, EmbeddingModel          │
└────────┬──────────────┬──────────────┬──────────────┬────────────────┘
         │              │              │              │
 ┌───────▼──┐   ┌───────▼──┐    ┌──────▼────┐     ┌───▼──────┐
 │ OpenAI   │   │ Anthropic│    │ Gemini    │     │ Ollama   │
 │ gpt-4o   │   │ claude   │    │ 1.5-pro   │     │ (local)  │
 └──────────┘   └──────────┘    └───────────┘     └──────────┘
Core Principles
1. Provider Agnosticism
   Use Spring AI VectorStore/ChatClient interfaces everywhere
   → Swapping OpenAI for Anthropic becomes a configuration change
     (starter dependency and properties), not a code change
2. Defense in Depth
   Input validation → guardrails → output validation
   → No single point of failure in AI safety
3. Observable by Default
   Every AI call emits Micrometer metrics + structured logs
   → Can answer: cost/day, p99 latency, error rate per feature
4. Fail Gracefully
   Circuit breakers + fallback models + cached responses
   → AI outage = degraded experience, not application failure
5. Cost-Aware Routing
   Task complexity → model tier → tracked per team
   → No bill surprises, cross-charge teams for their AI usage
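Principle 4 can be made concrete with a small fallback chain. The sketch below is plain Java and deliberately independent of Spring AI: `FallbackChain` and `ModelCall` are hypothetical names, each tier is just a function that may throw, and the cache is a simple in-memory map. It omits circuit breaking, which a library such as Resilience4j would typically provide.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Illustrative only: FallbackChain and ModelCall are hypothetical, not Spring AI types. */
final class FallbackChain {

    /** One model tier: a label plus a function that may throw when the provider fails. */
    record ModelCall(String name, Function<String, String> call) {}

    static final String UNAVAILABLE = "AI features are temporarily unavailable.";

    private final List<ModelCall> tiers;                         // ordered, primary first
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    FallbackChain(List<ModelCall> tiers) {
        this.tiers = List.copyOf(tiers);
    }

    /** Tries each tier in order, then the last cached answer, then a static message. */
    String ask(String prompt) {
        for (ModelCall tier : tiers) {
            try {
                String answer = tier.call().apply(prompt);
                cache.put(prompt, answer);                       // remember the last good answer
                return answer;
            } catch (RuntimeException e) {
                // Provider failed: move on to the next tier instead of failing the request.
            }
        }
        return Optional.ofNullable(cache.get(prompt)).orElse(UNAVAILABLE);
    }
}
```

With a failing primary and a healthy secondary, `ask` returns the secondary's answer; with every tier down and nothing cached, the caller still receives a degraded but valid response rather than an exception.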
AI Orchestrator Service
import java.util.UUID;
import org.springframework.stereotype.Service;

// LlmGateway, KnowledgeService, GuardrailService, CostTracker, and AuditLogger are
// application-defined collaborators, injected through the constructor.
@Service
public class AiOrchestrator {

    private final LlmGateway gateway;
    private final KnowledgeService knowledge;
    private final GuardrailService guardrails;
    private final CostTracker costTracker;
    private final AuditLogger auditLogger;

    public AiOrchestrator(LlmGateway gateway, KnowledgeService knowledge, GuardrailService guardrails,
                          CostTracker costTracker, AuditLogger auditLogger) {
        this.gateway = gateway;
        this.knowledge = knowledge;
        this.guardrails = guardrails;
        this.costTracker = costTracker;
        this.auditLogger = auditLogger;
    }

    public OrchestrationResult process(AiRequest request) {
        String requestId = UUID.randomUUID().toString();
        long start = System.currentTimeMillis();
        try {
            // 1. Input validation
            guardrails.validateInput(request.userMessage());

            // 2. Route to appropriate model tier
            String route = selectRoute(request);

            // 3. Retrieve context if RAG is needed (always scoped to the caller's tenant)
            String context = "";
            if (request.useRag()) {
                context = knowledge.retrieveForTenant(request.tenantId(), request.userMessage());
            }

            // 4. Build the final prompt
            String systemPrompt = buildSystemPrompt(request, context);

            // 5. Call the appropriate model
            GatewayResponse response = gateway.route(route, systemPrompt, request.userMessage());

            // 6. Validate output
            guardrails.validateOutput(request.userMessage(), response.content());

            // 7. Track cost and log
            costTracker.record(request.tenantId(), request.userId(),
                    response.provider(), response.model(), response.tokensUsed());
            auditLogger.log(requestId, request, response, System.currentTimeMillis() - start);

            return OrchestrationResult.success(response.content(), requestId);
        } catch (GuardrailException e) {
            auditLogger.logBlocked(requestId, request, e.getMessage());
            return OrchestrationResult.blocked(e.getMessage(), requestId);
        } catch (Exception e) {
            // Never leak provider errors to the caller; the details go to the audit log
            auditLogger.logError(requestId, request, e);
            return OrchestrationResult.error("Service temporarily unavailable", requestId);
        }
    }

    private String selectRoute(AiRequest request) {
        return switch (request.taskType()) {
            case "classify", "extract_keywords" -> "fast";        // cheap model
            case "code_review", "security_audit" -> "code";       // powerful model
            case "legal_review", "compliance_check" -> "quality"; // best model
            default -> request.complexity() > 0.7 ? "quality" : "fast";
        };
    }

    // The original leaves this method undefined; this minimal body is an assumption.
    private String buildSystemPrompt(AiRequest request, String context) {
        String base = "You are a helpful assistant.";
        return context.isBlank() ? base : base + "\n\nUse the following context:\n" + context;
    }
}
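The orchestrator calls `costTracker.record(...)` but the tracker itself is not shown. A minimal in-memory sketch follows, assuming the same argument order as the call above and a `long` token count. `InMemoryCostTracker`, the `provider:model` price key, and the price table are all hypothetical; a production version would persist usage and load current prices from configuration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

/** Illustrative only: accumulates estimated spend per tenant in memory. */
final class InMemoryCostTracker {

    // Hypothetical prices in USD per 1,000 tokens, keyed by "provider:model".
    private final Map<String, Double> usdPer1kTokens;
    private final Map<String, DoubleAdder> costByTenant = new ConcurrentHashMap<>();

    InMemoryCostTracker(Map<String, Double> usdPer1kTokens) {
        this.usdPer1kTokens = Map.copyOf(usdPer1kTokens);
    }

    /** userId is accepted to match the orchestrator's call; this sketch aggregates by tenant only. */
    void record(String tenantId, String userId, String provider, String model, long tokensUsed) {
        Double price = usdPer1kTokens.get(provider + ":" + model);
        if (price == null) {
            throw new IllegalArgumentException("No price configured for " + provider + ":" + model);
        }
        double cost = tokensUsed / 1000.0 * price;
        // DoubleAdder keeps concurrent updates cheap under contention.
        costByTenant.computeIfAbsent(tenantId, id -> new DoubleAdder()).add(cost);
    }

    /** Returns the accumulated estimated spend for a tenant, or 0.0 if none was recorded. */
    double totalUsd(String tenantId) {
        DoubleAdder adder = costByTenant.get(tenantId);
        return adder == null ? 0.0 : adder.sum();
    }
}
```

Per-tenant totals like these are what make the cross-charging described under Cost-Aware Routing possible.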
Model Registry — A/B Testing
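import jakarta.annotation.PostConstruct;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Service;

@Service
public class ModelRegistry {

    private final Map<String, ModelConfig> models = new ConcurrentHashMap<>();

    @PostConstruct
    public void init() {
        // Register model variants for A/B testing. Shares for one feature should sum to 1.0.
        // The feature name "chat" is an assumed example value.
        models.put("chat-v1", new ModelConfig("chat", "openai", "gpt-4o-mini", 0.9, "stable"));
        models.put("chat-v2", new ModelConfig("chat", "anthropic", "claude-haiku-4-5", 0.1, "canary"));
    }

    // Traffic splitting: 90% stable, 10% canary
    public ModelConfig selectModel(String feature, String userId) {
        // Sort so bucket boundaries do not depend on map iteration order
        List<ModelConfig> variants = models.values().stream()
                .filter(m -> m.feature().equals(feature))
                .sorted(Comparator.comparing(ModelConfig::variant))
                .toList();
        if (variants.isEmpty()) {
            throw new IllegalArgumentException("No models registered for feature: " + feature);
        }
        // Hash the user ID into [0, 1) so a given user always lands in the same bucket
        double bucket = Math.floorMod(userId.hashCode(), 100) / 100.0;
        double cumulative = 0;
        for (ModelConfig variant : variants) {
            cumulative += variant.trafficShare();
            if (bucket < cumulative) {
                return variant;
            }
        }
        return variants.get(0); // fallback if the shares sum to less than 1.0
    }
}

record ModelConfig(String feature, String provider, String model, double trafficShare, String variant) {}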
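The hashing step in `selectModel` can be isolated and tested on its own. The sketch below is one possible variation, not part of the registry above: `TrafficSplitter` is a hypothetical helper, and mixing an experiment name into the hash is an added assumption so that the same users do not land in the canary group of every experiment.

```java
/** Illustrative only: deterministic percentage bucketing for A/B assignment. */
final class TrafficSplitter {

    private TrafficSplitter() {}

    /** Maps an (experiment, user) pair to a stable bucket in [0, 100). */
    static int bucket(String experiment, String userId) {
        // floorMod never returns a negative value, even when hashCode() is negative.
        return Math.floorMod((experiment + ":" + userId).hashCode(), 100);
    }

    /** Users whose bucket falls below canaryPercent get "canary"; everyone else gets "stable". */
    static String variantFor(String experiment, String userId, int canaryPercent) {
        return bucket(experiment, userId) < canaryPercent ? "canary" : "stable";
    }
}
```

Because `String.hashCode()` is specified by the Java platform, the assignment stays the same across restarts and across instances, which is what keeps a user in one variant for the length of a test.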
Knowledge Service — Multi-Tenant RAG
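import java.util.List;
import java.util.stream.Collectors;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

// Note: SearchRequest.query(...).withTopK(...) and Document.getContent() are the pre-1.0
// milestone API. Spring AI 1.0+ uses SearchRequest.builder() and Document.getText().
@Service
public class KnowledgeService {

    private final VectorStore vectorStore;

    public KnowledgeService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    // Unscoped search: only safe for a shared corpus that holds no tenant-specific documents
    public String retrieveContext(String query) {
        List<Document> docs = vectorStore.similaritySearch(
                SearchRequest.query(query)
                        .withTopK(5)
                        .withSimilarityThreshold(0.7));
        return docs.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n---\n\n"));
    }

    // Tenant-isolated document ingestion
    public void ingestTenantDocuments(String tenantId, List<Document> docs) {
        docs.forEach(doc -> doc.getMetadata().put("tenantId", tenantId));
        vectorStore.add(docs);
    }

    // Tenant-scoped retrieval. tenantId is concatenated into the filter, so it must come
    // from the authenticated context, never from user-supplied text.
    public String retrieveForTenant(String tenantId, String query) {
        List<Document> docs = vectorStore.similaritySearch(
                SearchRequest.query(query)
                        .withTopK(5)
                        .withFilterExpression("tenantId == '" + tenantId + "'"));
        return docs.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n"));
    }
}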
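The tenant-scoped filter above is built by string concatenation, so a tenant ID containing a quote could change the meaning of the filter and break isolation. One way to guard against that is to validate the ID before it reaches the expression. The sketch below is plain Java; `TenantFilters` is a hypothetical helper, and the allowed ID pattern is an assumption that should be adjusted to the real tenant ID format.

```java
import java.util.regex.Pattern;

/** Illustrative only: builds the tenant filter string after validating the tenant ID. */
final class TenantFilters {

    // Assumption: tenant IDs are short slugs of letters, digits, '_' and '-'.
    private static final Pattern TENANT_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private TenantFilters() {}

    /** Returns a filter such as tenantId == 'acme-1', or throws if the ID looks unsafe. */
    static String tenantExpression(String tenantId) {
        if (tenantId == null || !TENANT_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID");
        }
        return "tenantId == '" + tenantId + "'";
    }
}
```

In `retrieveForTenant`, the concatenated string could then be replaced with `TenantFilters.tenantExpression(tenantId)`, so a malformed ID fails fast instead of widening the search.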
Operational Dashboard Metrics
Metric                              Alert Threshold
────────────────────────────────────────────────────────────
gen_ai.requests.per_minute          > 1000     (capacity warning)
gen_ai.cost.usd.per_hour            > $50      (budget alert)
gen_ai.error.rate                   > 5%       (provider issue)
gen_ai.p99.latency.ms               > 10000ms  (performance)
guardrail.blocks.per_minute         > 100      (abuse detection)
vector_store.search.latency.p99     > 500ms    (index issue)
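The thresholds in the table can be expressed as data and checked against a snapshot of current metric values. The sketch below is plain Java and only illustrates the comparison logic; `AlertRules` and `Rule` are hypothetical names, the snapshot map stands in for whatever the metrics backend returns, and in practice such rules usually live in the monitoring system rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: evaluates the dashboard thresholds against a metrics snapshot. */
final class AlertRules {

    record Rule(String metric, double threshold, String reason) {}

    // Thresholds copied from the table above. The error rate is in percent, latencies in ms.
    static final List<Rule> RULES = List.of(
            new Rule("gen_ai.requests.per_minute", 1000, "capacity warning"),
            new Rule("gen_ai.cost.usd.per_hour", 50, "budget alert"),
            new Rule("gen_ai.error.rate", 5, "provider issue"),
            new Rule("gen_ai.p99.latency.ms", 10000, "performance"),
            new Rule("guardrail.blocks.per_minute", 100, "abuse detection"),
            new Rule("vector_store.search.latency.p99", 500, "index issue"));

    private AlertRules() {}

    /** Returns one entry per rule whose metric is present and strictly above its threshold. */
    static List<String> firing(Map<String, Double> snapshot) {
        List<String> alerts = new ArrayList<>();
        for (Rule rule : RULES) {
            Double value = snapshot.get(rule.metric());
            if (value != null && value > rule.threshold()) {
                alerts.add(rule.metric() + " (" + rule.reason() + ")");
            }
        }
        return alerts;
    }
}
```

Metrics missing from the snapshot are skipped rather than treated as zero, so a gap in collection does not hide behind a falsely healthy reading.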
Key Points
- The AI orchestrator is the single entry point for all AI calls — it enforces consistent security, cost tracking, and observability regardless of which team or feature is calling
- Multi-tenant RAG requires filtering by tenantId metadata on every search: never let one tenant's documents contaminate another tenant's context
- Model A/B testing via user ID hashing ensures the same user always gets the same variant within a test period, which is critical for measuring quality differences fairly
- Cost cross-charging by team drives AI efficiency — when teams see their own bills, they optimize prompt lengths and choose appropriate model tiers
- The most important operational metric is guardrail.blocks.per_minute: a sudden spike means either an abuse campaign or a legitimate feature hitting guardrail rules it shouldn't