Java SpringAI

Spring AI Enterprise Architecture — Production-Grade AI Platform Design

Building a single AI chatbot is straightforward. Building an enterprise AI platform that serves dozens of teams, handles millions of requests, maintains compliance, and evolves without disruption requires deliberate architectural decisions. This tutorial presents a reference architecture for enterprise-scale Spring AI deployments.

Enterprise AI Platform Architecture

                         ┌─────────────────────────────────┐
                         │        API Gateway Layer         │
                         │  (Rate limiting, Auth, Routing)  │
                         └──────────────┬──────────────────┘
                                        │
              ┌─────────────────────────┼──────────────────────────┐
              │                         │                          │
     ┌────────▼────────┐    ┌───────────▼─────────┐    ┌──────────▼────────┐
     │  AI Orchestrator │    │   Knowledge Service  │    │  Model Registry   │
     │  (routing,       │    │   (RAG + embeddings) │    │  (model versions, │
     │  fallback,       │    │                     │    │   A/B testing)    │
     │  cost tracking)  │    │                     │    │                   │
     └────────┬────────┘    └────────┬────────────┘    └──────────┬────────┘
              │                      │                             │
     ┌────────▼────────────────────────────────────────────────────▼────────┐
     │                        Provider Abstraction Layer                     │
     │              Spring AI ChatClient, VectorStore, EmbeddingModel        │
     └────────┬──────────────┬──────────────┬──────────────┬────────────────┘
              │              │              │              │
      ┌───────▼──┐   ┌───────▼──┐   ┌──────▼────┐   ┌───▼──────┐
      │  OpenAI   │   │ Anthropic│   │  Gemini   │   │  Ollama  │
      │ gpt-4o   │   │ claude   │   │ 1.5-pro   │   │ (local)  │
      └──────────┘   └──────────┘   └───────────┘   └──────────┘

Core Principles

1. Provider Agnosticism
   Use Spring AI's ChatClient, VectorStore, and EmbeddingModel interfaces everywhere
   → Swapping OpenAI for Anthropic becomes mostly a dependency and configuration change, not a code rewrite

2. Defense in Depth
   Input validation → guardrails → output validation
   → No single point of failure in AI safety

3. Observable by Default
   Every AI call emits Micrometer metrics + structured logs
   → Can answer: cost/day, p99 latency, error rate per feature

4. Fail Gracefully
   Circuit breakers + fallback models + cached responses
   → AI outage = degraded experience, not application failure

5. Cost-Aware Routing
   Task complexity → model tier → tracked per team
   → No bill surprises, cross-charge teams for their AI usage
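
The "Fail Gracefully" principle can be sketched in plain Java: try each model in order and, if all of them fail, return the last good answer seen for that prompt. The names `FallbackChain` and `ModelCall` are hypothetical and are not Spring AI types; a production setup would more likely put a circuit-breaker library in front of each provider.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: names and structure are illustrative, not Spring AI APIs.
final class FallbackChain {

    // One named model call; the function may throw when the provider fails.
    record ModelCall(String name, Function<String, String> call) {}

    private final List<ModelCall> calls;
    private final Map<String, String> lastGoodAnswers = new ConcurrentHashMap<>();

    FallbackChain(List<ModelCall> calls) {
        this.calls = List.copyOf(calls);
    }

    // Tries each model in order; if every model fails, returns a cached answer when one exists.
    Optional<String> ask(String prompt) {
        for (ModelCall model : calls) {
            try {
                String answer = model.call().apply(prompt);
                lastGoodAnswers.put(prompt, answer);
                return Optional.of(answer);
            } catch (RuntimeException e) {
                // Provider failed: fall through to the next model in the chain.
            }
        }
        return Optional.ofNullable(lastGoodAnswers.get(prompt));
    }
}
```

An empty `Optional` is the signal for the caller to show a degraded experience (for example a static help message) instead of propagating the provider error.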

AI Orchestrator Service

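import java.util.UUID;
import org.springframework.stereotype.Service;

// LlmGateway, KnowledgeService, GuardrailService, CostTracker, AuditLogger, AiRequest,
// GatewayResponse and OrchestrationResult are application-defined types, not Spring AI classes.
@Service
public class AiOrchestrator {

    private final LlmGateway           gateway;
    private final KnowledgeService     knowledge;
    private final GuardrailService     guardrails;
    private final CostTracker          costTracker;
    private final AuditLogger          auditLogger;

    // Constructor injection: Spring supplies the collaborators
    public AiOrchestrator(LlmGateway gateway, KnowledgeService knowledge, GuardrailService guardrails,
                          CostTracker costTracker, AuditLogger auditLogger) {
        this.gateway = gateway;
        this.knowledge = knowledge;
        this.guardrails = guardrails;
        this.costTracker = costTracker;
        this.auditLogger = auditLogger;
    }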

    public OrchestrationResult process(AiRequest request) {
        String requestId = UUID.randomUUID().toString();
        long start = System.currentTimeMillis();

        try {
            // 1. Input validation
            guardrails.validateInput(request.userMessage());

            // 2. Route to appropriate model tier
            String route = selectRoute(request);

            // 3. Retrieve context if RAG is needed (always scoped to the caller's tenant)
            String context = "";
            if (request.useRag()) {
                context = knowledge.retrieveForTenant(request.tenantId(), request.userMessage());
            }

            // 4. Build the final prompt
            String systemPrompt = buildSystemPrompt(request, context);

            // 5. Call the appropriate model
            GatewayResponse response = gateway.route(route, systemPrompt, request.userMessage());

            // 6. Validate output
            guardrails.validateOutput(request.userMessage(), response.content());

            // 7. Track cost and log
            costTracker.record(request.tenantId(), request.userId(),
                    response.provider(), response.model(), response.tokensUsed());
            auditLogger.log(requestId, request, response, System.currentTimeMillis() - start);

            return OrchestrationResult.success(response.content(), requestId);

        } catch (GuardrailException e) {
            auditLogger.logBlocked(requestId, request, e.getMessage());
            return OrchestrationResult.blocked(e.getMessage(), requestId);
        } catch (Exception e) {
            auditLogger.logError(requestId, request, e);
            return OrchestrationResult.error("Service temporarily unavailable", requestId);
        }
    }

    private String selectRoute(AiRequest request) {
        return switch (request.taskType()) {
            case "classify", "extract_keywords" -> "fast";       // cheap model
            case "code_review", "security_audit" -> "code";      // powerful model
            case "legal_review", "compliance_check" -> "quality"; // best model
            default -> request.complexity() > 0.7 ? "quality" : "fast";
        };
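    }

    // Minimal placeholder (assumption): the tutorial does not show this helper
    private String buildSystemPrompt(AiRequest request, String context) {
        String base = "You are an assistant handling task type: " + request.taskType() + ".";
        return context.isBlank() ? base : base + "\n\nUse only this context:\n" + context;
    }
}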
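
The orchestrator calls `costTracker.record(...)` without showing what sits behind it. The following is a minimal in-memory sketch of such a tracker, assuming a flat price per 1,000 tokens per model. The class name `InMemoryCostTracker` and the prices passed to it are hypothetical placeholders, not real provider rates; a real deployment would probably persist the totals and distinguish input from output tokens.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory cost tracker; prices are supplied by the caller and are placeholders.
final class InMemoryCostTracker {

    // Assumed price in USD per 1,000 tokens, keyed by model name.
    private final Map<String, BigDecimal> usdPer1kTokens;
    private final Map<String, BigDecimal> usdByTenant = new ConcurrentHashMap<>();

    InMemoryCostTracker(Map<String, BigDecimal> usdPer1kTokens) {
        this.usdPer1kTokens = Map.copyOf(usdPer1kTokens);
    }

    // Same parameter order as the orchestrator's call; userId and provider are accepted but unused here.
    void record(String tenantId, String userId, String provider, String model, long tokensUsed) {
        BigDecimal price = usdPer1kTokens.get(model);
        if (price == null) {
            throw new IllegalArgumentException("No price configured for model: " + model);
        }
        BigDecimal cost = price.multiply(BigDecimal.valueOf(tokensUsed))
                .divide(BigDecimal.valueOf(1000), 6, RoundingMode.HALF_UP);
        // merge is atomic per key on ConcurrentHashMap, so concurrent requests do not lose updates.
        usdByTenant.merge(tenantId, cost, BigDecimal::add);
    }

    // Total spend recorded for a tenant, or zero if nothing has been recorded.
    BigDecimal totalFor(String tenantId) {
        return usdByTenant.getOrDefault(tenantId, BigDecimal.ZERO);
    }
}
```

`BigDecimal` is used instead of `double` so that many small charges add up without floating-point drift, which matters once the totals are used for cross-charging.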

Model Registry — A/B Testing

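import jakarta.annotation.PostConstruct;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Service;

@Service
public class ModelRegistry {

    private final Map<String, ModelConfig> models = new ConcurrentHashMap<>();

    @PostConstruct
    public void init() {
        // Register model variants for A/B testing: 90% stable, 10% canary
        models.put("chat-v1", new ModelConfig("chat", "openai", "gpt-4o-mini", 0.9, "stable"));
        models.put("chat-v2", new ModelConfig("chat", "anthropic", "claude-haiku-4-5", 0.1, "canary"));
    }

    // Deterministic traffic splitting: the same userId always lands in the same bucket
    public ModelConfig selectModel(String feature, String userId) {
        // ConcurrentHashMap has no defined iteration order, so sort by key for stable ranges
        List<ModelConfig> variants = models.entrySet().stream()
                .sorted(Map.Entry.comparingByKey())
                .map(Map.Entry::getValue)
                .filter(m -> m.feature().equals(feature))
                .toList();

        if (variants.isEmpty()) {
            throw new IllegalArgumentException("No model registered for feature: " + feature);
        }

        // floorMod keeps the bucket in [0, 100) even for negative hash codes
        double point = Math.floorMod(userId.hashCode(), 100) / 100.0;
        double cumulative = 0;

        for (ModelConfig variant : variants) {
            cumulative += variant.trafficShare();
            if (point < cumulative) {
                return variant;
            }
        }

        return variants.get(0);  // shares sum to less than 1.0: fall back to the first variant
    }
}

record ModelConfig(String feature, String provider, String model, double trafficShare, String variant) {}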
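
The hash-based split can be isolated into a small, framework-free sketch that makes the bucketing easy to test. `TrafficSplitter` and its methods are hypothetical names; the 0.9/0.1 shares in the usage note follow the "90% stable, 10% canary" split mentioned in the registry comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of deterministic traffic splitting; variant names and shares are illustrative.
final class TrafficSplitter {

    // Insertion-ordered so that the cumulative ranges are identical on every call.
    private final Map<String, Double> shareByVariant = new LinkedHashMap<>();

    TrafficSplitter add(String variant, double share) {
        shareByVariant.put(variant, share);
        return this;
    }

    // Maps a user ID to a bucket in [0, 100); floorMod keeps negative hash codes in range.
    static int bucketOf(String userId) {
        return Math.floorMod(userId.hashCode(), 100);
    }

    // Walks the cumulative shares and returns the variant whose range contains the user's bucket.
    String variantFor(String userId) {
        double point = bucketOf(userId) / 100.0;
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> entry : shareByVariant.entrySet()) {
            cumulative += entry.getValue();
            last = entry.getKey();
            if (point < cumulative) {
                return last;
            }
        }
        if (last == null) {
            throw new IllegalStateException("No variants registered");
        }
        return last; // shares summed to less than 1.0: fall back to the last variant
    }
}
```

With `new TrafficSplitter().add("stable", 0.9).add("canary", 0.1)`, buckets 0 to 89 map to the stable variant and buckets 90 to 99 to the canary. Note that changing the shares mid-experiment moves some users between variants, so shares are best kept fixed for the duration of a test.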

Knowledge Service — Multi-Tenant RAG

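import java.util.List;
import java.util.stream.Collectors;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

// Note: this uses the fluent SearchRequest API from early Spring AI milestones; later releases
// moved to SearchRequest.builder() and Document.getText(), so adjust to the version in use.
@Service
public class KnowledgeService {

    private final VectorStore vectorStore;

    public KnowledgeService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }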

    public String retrieveContext(String query) {
        List<Document> docs = vectorStore.similaritySearch(
                SearchRequest.query(query)
                        .withTopK(5)
                        .withSimilarityThreshold(0.7));

        return docs.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n---\n\n"));
    }

    // Tenant-isolated document ingestion
    public void ingestTenantDocuments(String tenantId, List<Document> docs) {
        docs.forEach(doc -> doc.getMetadata().put("tenantId", tenantId));
        vectorStore.add(docs);
    }

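    // Tenant-scoped retrieval
    public String retrieveForTenant(String tenantId, String query) {
        // Reject IDs that could alter the filter expression (the allowed format is an assumption)
        if (tenantId == null || !tenantId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Invalid tenant ID");
        }
        List<Document> docs = vectorStore.similaritySearch(
                SearchRequest.query(query)
                        .withTopK(5)
                        .withFilterExpression("tenantId == '" + tenantId + "'"));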

        return docs.stream().map(Document::getContent)
                .collect(Collectors.joining("\n\n"));
    }
}
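
Because the tenant filter is built by string concatenation, a tenant ID containing a quote could change the meaning of the expression and leak another tenant's documents. One way to guard against that is to validate the ID before it reaches the filter. The helper below is a minimal sketch; `TenantFilters` is a hypothetical name, and the allowed character set and the 64-character limit are assumptions that should be adapted to the real ID scheme.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the permitted tenant ID format is an assumption.
final class TenantFilters {

    private static final Pattern SAFE_TENANT_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private TenantFilters() {}

    // Returns a filter string such as tenantId == 'acme', rejecting IDs that could alter the expression.
    static String tenantEquals(String tenantId) {
        if (tenantId == null || !SAFE_TENANT_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID");
        }
        return "tenantId == '" + tenantId + "'";
    }
}
```

Keeping the check in one place means every retrieval path builds its tenant filter the same way, which is easier to audit than validation scattered across services.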

Operational Dashboard Metrics

Metric                              Alert Threshold
────────────────────────────────────────────────────────────
gen_ai.requests.per_minute          > 1000 (capacity warning)
gen_ai.cost.usd.per_hour            > $50 (budget alert)
gen_ai.error.rate                   > 5% (provider issue)
gen_ai.p99.latency.ms               > 10000ms (performance)
guardrail.blocks.per_minute         > 100 (abuse detection)
vector_store.search.latency.p99     > 500ms (index issue)
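
The thresholds in the table above can be expressed as data and checked against a snapshot of current values. This is a sketch only: `AlertEvaluator` and `AlertRule` are hypothetical names, the error rate is assumed to be stored as a fraction (5% as 0.05), and in practice such rules usually live in the monitoring system rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: evaluates sampled metric values against the thresholds from the table.
final class AlertEvaluator {

    record AlertRule(String metric, double threshold, String reason) {}

    static final List<AlertRule> RULES = List.of(
            new AlertRule("gen_ai.requests.per_minute", 1000, "capacity warning"),
            new AlertRule("gen_ai.cost.usd.per_hour", 50, "budget alert"),
            new AlertRule("gen_ai.error.rate", 0.05, "provider issue"),
            new AlertRule("gen_ai.p99.latency.ms", 10_000, "performance"),
            new AlertRule("guardrail.blocks.per_minute", 100, "abuse detection"),
            new AlertRule("vector_store.search.latency.p99", 500, "index issue"));

    private AlertEvaluator() {}

    // Returns the rules whose threshold is exceeded; metrics missing from the snapshot are skipped.
    static List<AlertRule> firing(Map<String, Double> snapshot) {
        List<AlertRule> result = new ArrayList<>();
        for (AlertRule rule : RULES) {
            Double value = snapshot.get(rule.metric());
            if (value != null && value > rule.threshold()) {
                result.add(rule);
            }
        }
        return result;
    }
}
```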

Key Points

  • The AI orchestrator is the single entry point for all AI calls — it enforces consistent security, cost tracking, and observability regardless of which team or feature is calling
  • Multi-tenant RAG requires filtering by tenantId metadata on every search — never let one tenant's documents contaminate another tenant's context
  • Model A/B testing via user ID hashing keeps each user on the same variant for as long as the traffic shares stay unchanged, which is critical for measuring quality differences fairly
  • Cost cross-charging by team drives AI efficiency — when teams see their own bills, they optimize prompt lengths and choose appropriate model tiers
  • One of the most useful operational metrics is guardrail.blocks.per_minute: a sudden spike means either an abuse campaign or a legitimate feature tripping guardrail rules it should not