Java SpringAI

Spring AI Production Checklist — Deploy AI Applications with Confidence

Moving a Spring AI application from development to production requires attention to reliability, security, cost, observability, and performance. This checklist covers every concern you should verify before going live — from API key rotation to prompt versioning to graceful degradation.

1. Configuration and Secrets

✔ Never hard-code API keys in source code or application.properties
  BAD:  spring.ai.openai.api-key=sk-proj-abc123...
  GOOD: spring.ai.openai.api-key=${OPENAI_API_KEY}

✔ Use a secrets manager in production:
  - AWS Secrets Manager + Spring Cloud AWS
  - HashiCorp Vault + Spring Cloud Vault
  - Kubernetes Secrets mounted as env vars

✔ Rotate API keys regularly (at least every 90 days)

✔ Use different API keys for development, staging, and production
  → Separate billing, separate rate limits, separate audit logs

✔ Set spending limits in your AI provider dashboard
  → OpenAI: Platform → Settings → Billing → Usage limits
  → Anthropic: Console → Settings → Usage limits

2. Reliability

# application.properties — production retry configuration
spring.ai.retry.max-attempts=5
spring.ai.retry.on-http-codes=429,503,502
spring.ai.retry.exclude-on-http-codes=400,401,403,404
spring.ai.retry.backoff.initial-interval=2000
spring.ai.retry.backoff.multiplier=2.5
spring.ai.retry.backoff.max-interval=30000

✔ Circuit breaker configured (Resilience4j):
  failure-rate-threshold: 50%
  wait-duration-in-open-state: 30s

✔ Fallback model configured:
  Primary: GPT-4o (OpenAI)
  Fallback: Ollama/llama3.2 (local)

✔ Request timeout set (prevents thread starvation):
  read timeout of 60s configured on the HTTP client that Spring AI uses for provider calls

✔ Health indicator implemented (/actuator/health shows AI status)
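
The fallback item above can be sketched in plain Java. This is a minimal illustration, not Spring AI's API: the `FallbackChat` class and the two `Function` parameters are hypothetical stand-ins for whatever calls your primary and local models.

```java
import java.util.function.Function;

/** Tries the primary model call first and degrades to a fallback call on failure. */
class FallbackChat {
    private final Function<String, String> primary;   // e.g. a call to the cloud model
    private final Function<String, String> fallback;  // e.g. a call to a local model

    FallbackChat(Function<String, String> primary, Function<String, String> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    String ask(String prompt) {
        try {
            return primary.apply(prompt);
        } catch (RuntimeException e) {
            // Primary failed (timeout, 5xx, retries exhausted): answer from the fallback instead.
            return fallback.apply(prompt);
        }
    }

    public static void main(String[] args) {
        FallbackChat chat = new FallbackChat(
                p -> { throw new IllegalStateException("primary unavailable"); },
                p -> "local answer to: " + p);
        System.out.println(chat.ask("ping")); // prints "local answer to: ping"
    }
}
```

In a real application the catch block is also a natural place to increment an error metric, so that silent degradation still shows up on a dashboard.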

3. Security

✔ All AI endpoints are authenticated (Spring Security + JWT or OAuth2)
✔ Input validation:
    - Max prompt length enforced (4000 chars)
    - Prompt injection patterns blocked
    - XSS/HTML sanitized before inclusion in prompts
✔ PII detection runs before any prompt sent to cloud AI provider
✔ RAG documents filtered by user access level
✔ AI responses not trusted as safe HTML (always escape before rendering)
✔ Audit log of all AI calls (user, timestamp, input hash, output hash)
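
Two of the items above, input validation and hashed audit logging, fit in a few lines of plain Java. The sketch below is illustrative only: `PromptGuard` and its deny-list phrases are hypothetical, and a phrase list like this catches only the most obvious injection attempts.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.Locale;

class PromptGuard {
    static final int MAX_PROMPT_CHARS = 4000; // limit taken from the checklist

    // Hypothetical deny-list; a real deployment needs a broader, regularly reviewed set.
    static final List<String> SUSPICIOUS_PHRASES = List.of(
            "ignore previous instructions",
            "reveal your system prompt");

    /** True if the prompt is non-blank, within the length limit, and matches no deny-list phrase. */
    static boolean isAllowed(String prompt) {
        if (prompt == null || prompt.isBlank() || prompt.length() > MAX_PROMPT_CHARS) {
            return false;
        }
        String lower = prompt.toLowerCase(Locale.ROOT);
        return SUSPICIOUS_PHRASES.stream().noneMatch(lower::contains);
    }

    /** SHA-256 hex digest, so the audit log can correlate calls without storing raw text. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("What is Spring AI?")); // prints "true"
        System.out.println(sha256Hex("What is Spring AI?").length()); // prints "64"
    }
}
```

Keep in mind that hashes of short or guessable inputs can be reversed by brute force, so a hash is a correlation aid rather than an anonymization guarantee.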

4. Cost Control

✔ Token usage tracked per user and per feature
✔ Per-user rate limits enforced (Redis counter)
✔ Daily budget alerts configured (email/Slack at 80% of limit)
✔ Model selection optimized:
    - Classification → gpt-4o-mini or claude-haiku (10-50x cheaper)
    - Generation     → gpt-4o or claude-sonnet
    - Private data   → Ollama (no per-token fee; you pay for your own hardware)
✔ Response caching for repeated questions (Redis, 1h TTL)
✔ Streaming used for long responses (better UX, same cost)
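
To make the per-user rate limit concrete, here is a minimal in-memory fixed-window counter. It is only a sketch of the idea: `FixedWindowRateLimiter` is a hypothetical class, and in a multi-instance deployment the counter would live in Redis (an INCR on a per-user, per-window key with an EXPIRE) rather than in a local map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Counts requests per user within fixed time windows. */
class FixedWindowRateLimiter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    // Old windows are never evicted in this sketch; Redis would expire them automatically.
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    FixedWindowRateLimiter(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request is allowed; false means the caller should answer with HTTP 429. */
    boolean tryAcquire(String userId, long nowMillis) {
        long window = nowMillis / windowMillis;
        String key = userId + ":" + window;
        int count = counters.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        return count <= maxRequestsPerWindow;
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(2, 60_000);
        long now = System.currentTimeMillis();
        System.out.println(limiter.tryAcquire("alice", now));
        System.out.println(limiter.tryAcquire("alice", now));
        System.out.println(limiter.tryAcquire("alice", now)); // third call in the same window prints "false"
    }
}
```

Passing the clock in as a parameter keeps the limiter easy to unit-test; a fixed window does allow short bursts around window boundaries, which a sliding window or token bucket would smooth out.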

// Cost estimate tracking:
@Scheduled(cron = "0 0 8 * * *")  // daily 8am report
public void dailyCostReport() {
    double yesterdayCost = repository.sumCostForDay(LocalDate.now().minusDays(1));
    log.info("Yesterday AI cost: ${}", String.format("%.4f", yesterdayCost));
}
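
The report above sums stored per-call costs. One way those per-call values might be computed from token counts is sketched below; `TokenCost` is a hypothetical helper, and the prices are parameters because real prices vary by provider and model and change over time.

```java
/** Estimates the cost of a single AI call from its token usage. */
class TokenCost {
    /**
     * @param inputUsdPerMillion  price in USD per one million input tokens (supplied by the caller)
     * @param outputUsdPerMillion price in USD per one million output tokens (supplied by the caller)
     */
    static double estimateUsd(long inputTokens, long outputTokens,
                              double inputUsdPerMillion, double outputUsdPerMillion) {
        return inputTokens / 1_000_000.0 * inputUsdPerMillion
             + outputTokens / 1_000_000.0 * outputUsdPerMillion;
    }

    public static void main(String[] args) {
        // The prices here are placeholders for illustration, not real provider prices.
        double cost = estimateUsd(1_200, 350, 2.0, 10.0);
        System.out.println(String.format("%.6f", cost));
    }
}
```

For billing-grade accounting, `BigDecimal` would avoid floating-point rounding; `double` is usually adequate for dashboards and alerts.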

5. Observability

✔ Micrometer metrics exported to Prometheus/Grafana:
    gen_ai.client.operation.duration (p50, p95, p99)
    gen_ai.client.token.usage (input + output)
    app.ai.calls.errors (error rate)

✔ Distributed tracing enabled (OpenTelemetry → Zipkin/Jaeger)
✔ Log level set correctly:
    production:   WARN for org.springframework.ai (no prompt logging)
    development:  DEBUG for org.springframework.ai (full prompt/response logging)
✔ Alerts defined:
    - p95 latency > 5s → page on-call
    - error rate > 5% → alert
    - daily cost > 80% of budget → notify

# application-prod.properties
# Comments must be on their own line: .properties files treat a trailing "# ..." as part of the value.
logging.level.org.springframework.ai=WARN
# never log prompts in prod
spring.ai.chat.observations.include-prompt=false
# never log responses in prod
spring.ai.chat.observations.include-completion=false
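
In practice the p95 latency rule is evaluated by Prometheus or Grafana from histogram data, but it helps to see what the rule means. The sketch below computes a nearest-rank percentile over raw samples and applies the 5s paging threshold from the checklist; `LatencyAlerts` is a hypothetical name.

```java
import java.util.Arrays;

class LatencyAlerts {
    static final long P95_PAGE_THRESHOLD_MILLIS = 5_000; // "p95 latency > 5s" from the checklist

    /** Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it. */
    static long percentile(long[] samplesMillis, int p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be in 1..100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // integer ceil(p / 100 * n)
        return sorted[rank - 1];
    }

    /** True if the on-call engineer should be paged for this window of samples. */
    static boolean shouldPage(long[] samplesMillis) {
        return percentile(samplesMillis, 95) > P95_PAGE_THRESHOLD_MILLIS;
    }

    public static void main(String[] args) {
        long[] window = {800, 950, 1200, 700, 6400, 900, 1100, 1000, 850, 990};
        System.out.println(percentile(window, 95)); // prints "6400"
        System.out.println(shouldPage(window));     // prints "true"
    }
}
```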

6. Performance

✔ Cache common responses (60-80% of AI calls for FAQ-style questions can be cached)
✔ Async processing for non-blocking endpoints:
    - Use @Async + CompletableFuture for background AI tasks
    - WebFlux reactive pipeline for streaming
✔ Connection pool tuned for AI calls (long-running HTTP connections)
✔ VectorStore index type: HNSW (not exact search) for large datasets
✔ Embedding batch size: process 100+ documents per batch, not one at a time
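
Response caching can be illustrated with a small in-memory cache that expires entries after a TTL. This is a sketch under simplifying assumptions: `TtlResponseCache` is hypothetical, the key normalization is deliberately naive, and a shared store such as Redis would replace the local map in production.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class TtlResponseCache {
    private record Entry(String value, long expiresAtMillis) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlResponseCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached answer if it is still fresh; otherwise calls the model and stores the result. */
    String get(String question, long nowMillis, Function<String, String> model) {
        // Naive normalization: identical questions that differ only in case or outer whitespace share a key.
        String key = question.trim().toLowerCase(Locale.ROOT);
        Entry cached = entries.get(key);
        if (cached != null && cached.expiresAtMillis() > nowMillis) {
            return cached.value();
        }
        String answer = model.apply(question);
        entries.put(key, new Entry(answer, nowMillis + ttlMillis));
        return answer;
    }

    public static void main(String[] args) {
        TtlResponseCache cache = new TtlResponseCache(3_600_000); // 1h TTL
        Function<String, String> slowModel = q -> "answer to: " + q;
        long now = System.currentTimeMillis();
        System.out.println(cache.get("What is RAG?", now, slowModel));
        System.out.println(cache.get("what is rag?", now, slowModel)); // served from cache
    }
}
```

Caching only makes sense for answers that do not depend on the user or on conversation history; anything personalized needs the user or context in the key.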

// Batch embedding example:
List<String> texts = List.of("text1", "text2", "text3"); // up to 2048 inputs per request
EmbeddingResponse response = embeddingModel.embedForResponse(texts);
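
When the document set is larger than one request allows, it has to be split first. A minimal sketch of that splitting step, using only the JDK (`Batches` is a hypothetical helper name):

```java
import java.util.ArrayList;
import java.util.List;

class Batches {
    /** Splits items into consecutive batches of at most batchSize elements, preserving order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(List.copyOf(items.subList(start, end))); // copy so batches outlive the source list
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "b", "c", "d", "e");
        System.out.println(partition(docs, 2)); // prints "[[a, b], [c, d], [e]]"
    }
}
```

Each resulting batch can then be passed to the embedding call in turn, with one request per batch instead of one per document.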

7. Prompt Management

✔ System prompts stored in external files (not hard-coded Java strings):
    src/main/resources/prompts/chat-system.st
    src/main/resources/prompts/rag-system.st
    src/main/resources/prompts/classifier.st

✔ Prompt versioning tracked (git history of prompt files)
✔ Prompt changes reviewed like code changes (PR review)
✔ A/B testing framework for prompt variations
✔ Prompt regression test suite (evaluation tests run on prompt changes)

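// Externalized prompt example:
@Value("classpath:prompts/chat-system.st")
private Resource systemPromptResource;

// Read the file after injection (for example in a @PostConstruct method);
// getContentAsString throws IOException, so handle or declare it.
String systemPrompt = systemPromptResource.getContentAsString(StandardCharsets.UTF_8);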
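
For the A/B testing item, the core requirement is that a given user always sees the same prompt variant. A minimal sketch of deterministic assignment follows; `PromptExperiment` is hypothetical, and the `chat-system-b.st` file name is an assumption made for illustration.

```java
class PromptExperiment {
    /**
     * Deterministically assigns a user to variant "A" or "B".
     * percentB is the share of users (0..100) routed to variant B.
     */
    static String variantFor(String userId, int percentB) {
        if (percentB < 0 || percentB > 100) {
            throw new IllegalArgumentException("percentB must be in 0..100");
        }
        int bucket = Math.floorMod(userId.hashCode(), 100); // stable bucket in 0..99
        return bucket < percentB ? "B" : "A";
    }

    /** Maps a variant to a prompt file; the "-b" file name is a placeholder. */
    static String promptPathFor(String variant) {
        return "B".equals(variant) ? "prompts/chat-system-b.st" : "prompts/chat-system.st";
    }

    public static void main(String[] args) {
        String variant = variantFor("user-42", 10);
        System.out.println(variant + " -> " + promptPathFor(variant));
    }
}
```

Logging the assigned variant alongside quality and cost metrics is what makes the experiment measurable afterwards.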

8. Data and Privacy

✔ Documented which data is sent to which AI provider
✔ Users informed in privacy policy (AI processing disclosure)
✔ Option for users to opt out of cloud AI (use local model instead)
✔ No PII in vector store metadata that external users can access
✔ GDPR: ability to delete user data including AI conversation history
✔ Data residency: use region-specific API endpoints if required
    # default endpoint (US)
    spring.ai.openai.base-url=https://api.openai.com
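
One way to limit what reaches a cloud provider at all is to redact obvious PII before the prompt leaves your service. The sketch below handles only email addresses with a simple regular expression; `PiiRedactor` is hypothetical, and real PII detection needs far more than this (names, phone numbers, IDs, addresses).

```java
import java.util.regex.Pattern;

class PiiRedactor {
    // Deliberately simple email pattern; it is not a full RFC 5322 matcher.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Replaces every email address in the text with a fixed placeholder. */
    static String redactEmails(String text) {
        return EMAIL.matcher(text).replaceAll("[EMAIL]");
    }

    /** True if the text appears to contain at least one email address. */
    static boolean containsEmail(String text) {
        return EMAIL.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(redactEmails("Contact jane.doe@example.com today"));
        // prints "Contact [EMAIL] today"
    }
}
```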

9. Deployment

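# Dockerfile for Spring AI app
FROM eclipse-temurin:21-jre-alpine
COPY target/app.jar app.jar
EXPOSE 8080
# Do not pass the API key as a JVM argument: exec-form ENTRYPOINT performs no shell
# variable expansion, and command-line arguments are visible in process listings.
# Spring resolves ${OPENAI_API_KEY} in application.properties from the container environment.
ENTRYPOINT ["java", "-Xmx512m", "-jar", "app.jar"]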

# Kubernetes deployment with secret
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: ai-secrets
        key: openai-api-key

10. Post-Deployment Verification

✔ Health endpoint responds: GET /actuator/health → overall status UP, with the AI health component reporting UP
✔ Metrics flowing to Prometheus: GET /actuator/prometheus → gen_ai_* metrics visible
✔ Smoke test: POST /ai/ask → valid AI response returned
✔ Cache working: same question twice → second response in <10ms
✔ Rate limiting working: first request over the per-minute limit (e.g. the 61st with a limit of 60) → 429 returned
✔ Fallback working: disable OpenAI key → Ollama fallback responds
✔ No API keys in logs: grep 'sk-' logs/*.log → no matches
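
The key-leak check can also run as an automated test instead of a manual grep. A minimal sketch follows; `SecretScan` is hypothetical, and the pattern assumes keys look like "sk-" followed by at least eight key-like characters, which is slightly stricter than grepping for 'sk-' alone.

```java
import java.util.List;
import java.util.regex.Pattern;

class SecretScan {
    // Assumed key shape: "sk-" followed by 8 or more letters, digits, underscores, or hyphens.
    private static final Pattern KEY_LIKE = Pattern.compile("sk-[A-Za-z0-9_-]{8,}");

    /** Returns the log lines that appear to contain an API key. */
    static List<String> linesWithKeys(List<String> logLines) {
        return logLines.stream()
                .filter(line -> KEY_LIKE.matcher(line).find())
                .toList();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "INFO  application started",
                "DEBUG using key sk-proj-abc12345",
                "WARN  task-list is empty");
        System.out.println(linesWithKeys(lines).size()); // prints "1"
    }
}
```

Reading the real log files (for example with `Files.readAllLines`) and failing the deployment pipeline when the result is non-empty turns this checklist item into a gate.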

Key Points

  • Never log prompts in production — they may contain user PII that ends up in log aggregators
  • Set spending limits in your AI provider dashboard as a hard backstop — application-level rate limiting can fail
  • Test your fallback model before going live — an untested Ollama fallback that fails defeats the purpose
  • Prompt files should be versioned in git and reviewed in PRs — prompt changes affect behavior as much as code changes
  • Run your full evaluation test suite after every prompt change and before every production deployment