Java9R: Spring AI Production Checklist — Deploy AI Applications with Confidence

Moving a Spring AI application from development to production requires attention to reliability, security, cost, observability, and performance. This checklist covers every concern you should verify before going live — from API key rotation to prompt versioning to graceful degradation.

1. Configuration and Secrets

✔ Never hard-code API keys in source code or application.properties
  BAD:  spring.ai.openai.api-key=sk-proj-abc123...
  GOOD: spring.ai.openai.api-key=${OPENAI_API_KEY}

✔ Use a secrets manager in production:
  - AWS Secrets Manager + Spring Cloud AWS
  - HashiCorp Vault + Spring Cloud Vault
  - Kubernetes Secrets mounted as env vars

✔ Rotate API keys regularly (at least every 90 days)

✔ Use different API keys for development, staging, and production
  → Separate billing, separate rate limits, separate audit logs

✔ Set spending limits in your AI provider dashboard
  → OpenAI: Platform → Settings → Billing → Usage limits
  → Anthropic: Console → Settings → Usage limits

2. Reliability

# application.properties — production retry configuration
spring.ai.retry.max-attempts=5
spring.ai.retry.on-http-codes=429,503,502
spring.ai.retry.exclude-on-http-codes=400,401,403,404
spring.ai.retry.backoff.initial-interval=2000
spring.ai.retry.backoff.multiplier=2.5
spring.ai.retry.backoff.max-interval=30000
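
To see what these retry settings mean for worst-case latency, the waits can be worked out by hand. The sketch below assumes plain exponential backoff without jitter (each wait is the previous one times the multiplier, capped at the maximum); `BackoffSchedule` is a hypothetical helper, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes the waits an exponential backoff policy produces between attempts. */
final class BackoffSchedule {

    /**
     * @param maxAttempts total attempts, including the first call
     * @return wait in milliseconds before each retry (maxAttempts - 1 entries)
     */
    static List<Long> delaysMillis(int maxAttempts, long initialMillis,
                                   double multiplier, long maxMillis) {
        List<Long> delays = new ArrayList<>();
        double next = initialMillis;
        for (int retry = 1; retry < maxAttempts; retry++) {
            // Cap every wait at the configured maximum interval.
            delays.add(Math.min(Math.round(next), maxMillis));
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        // Values from the properties above: 5 attempts, 2000 ms, x2.5, cap 30000 ms.
        System.out.println(delaysMillis(5, 2000, 2.5, 30000));
        // prints [2000, 5000, 12500, 30000]
    }
}
```

Under that assumption a request that fails on every attempt spends about 49.5 seconds waiting between attempts, on top of the time taken by the five calls themselves, so the request timeout and any upstream gateway timeout should be chosen with that in mind.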

✔ Circuit breaker configured (Resilience4j):
  failure-rate-threshold: 50%
  wait-duration-in-open-state: 30s

✔ Fallback model configured:
  Primary: GPT-4o (OpenAI)
  Fallback: Ollama/llama3.2 (local)

✔ Request timeout set (prevents thread starvation):
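  spring.ai.openai.chat.options.timeout: 60s
  (verify that this key exists in your Spring AI version; if it does not, set a 60s read timeout on the underlying RestClient/WebClient instead)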

✔ Health indicator implemented (/actuator/health shows AI status)
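
The primary/fallback item above can be sketched independently of any provider. This is a minimal illustration, assuming each model call is wrapped as a `Function<String, String>`; `FallbackChat` is a hypothetical class name, and a production version would usually sit behind the circuit breaker rather than replace it.

```java
import java.util.function.Function;

/** Tries a primary model call and falls back to a secondary one on failure. */
final class FallbackChat {
    private final Function<String, String> primary;
    private final Function<String, String> fallback;

    FallbackChat(Function<String, String> primary, Function<String, String> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    /** Returns the primary answer, or the fallback answer if the primary throws. */
    String ask(String prompt) {
        try {
            return primary.apply(prompt);
        } catch (RuntimeException primaryFailure) {
            // In a real service, log the failure and increment an error metric here.
            return fallback.apply(prompt);
        }
    }
}
```

Each function would wrap one configured model (for example the OpenAI client as primary and the local Ollama client as fallback). If the fallback also throws, the exception propagates to the caller, which is usually the desired behavior for a last-resort path.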

3. Security

✔ All AI endpoints are authenticated (Spring Security + JWT or OAuth2)
✔ Input validation:
    - Max prompt length enforced (4000 chars)
    - Prompt injection patterns blocked
    - XSS/HTML sanitized before inclusion in prompts
✔ PII detection runs before any prompt sent to cloud AI provider
✔ RAG documents filtered by user access level
✔ AI responses not trusted as safe HTML (always escape before rendering)
✔ Audit log of all AI calls (user, timestamp, input hash, output hash)
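
The input-validation and output-escaping items above might look like the following sketch. The 4000-character limit comes from the checklist; the deny-list phrases and the class name `PromptInputValidator` are hypothetical, and a phrase list like this catches only the crudest injection attempts, so treat it as one layer among several.

```java
import java.util.List;
import java.util.Locale;

/** Minimal pre-flight checks for user-supplied prompt text. */
final class PromptInputValidator {
    static final int MAX_PROMPT_CHARS = 4000; // limit from the checklist

    // Hypothetical deny-list; real injection attempts are far more varied.
    private static final List<String> SUSPICIOUS_PHRASES = List.of(
            "ignore previous instructions",
            "ignore all previous instructions",
            "reveal your system prompt");

    /** Returns the trimmed input, or throws if it should be rejected. */
    static String validate(String input) {
        if (input == null || input.isBlank()) {
            throw new IllegalArgumentException("Prompt must not be empty");
        }
        String trimmed = input.strip();
        if (trimmed.length() > MAX_PROMPT_CHARS) {
            throw new IllegalArgumentException("Prompt exceeds " + MAX_PROMPT_CHARS + " characters");
        }
        String lower = trimmed.toLowerCase(Locale.ROOT);
        for (String phrase : SUSPICIOUS_PHRASES) {
            if (lower.contains(phrase)) {
                throw new IllegalArgumentException("Prompt rejected by injection filter");
            }
        }
        return trimmed;
    }

    /** Escapes the characters that matter when model output is placed into HTML. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&' -> out.append("&amp;");
                case '<' -> out.append("&lt;");
                case '>' -> out.append("&gt;");
                case '"' -> out.append("&quot;");
                case '\'' -> out.append("&#39;");
                default -> out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling `validate` before building the prompt and `escapeHtml` on the model's answer before rendering keeps the two concerns separate: the first protects the model call, the second protects the browser.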

4. Cost Control

✔ Token usage tracked per user and per feature
✔ Per-user rate limits enforced (Redis counter)
✔ Daily budget alerts configured (email/Slack at 80% of limit)
✔ Model selection optimized:
    - Classification → gpt-4o-mini or claude-haiku (10-50x cheaper)
    - Generation     → gpt-4o or claude-sonnet
    - Private data   → Ollama (local; no per-token fee, but you pay for the hardware)
✔ Response caching for repeated questions (Redis, 1h TTL)
✔ Streaming used for long responses (better UX, same cost)
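
The per-user rate limit above is essentially a counter per user per time window. The sketch below keeps the counters in memory so it runs on its own; in production the same logic would typically live in Redis (an `INCR` on a per-user key plus an `EXPIRE` for the window). `PerUserRateLimiter` and its injected clock are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Fixed-window request counter per user; an in-memory stand-in for a Redis counter. */
final class PerUserRateLimiter {
    private record Window(long index, int count) {}

    private final int limitPerWindow;
    private final long windowMillis;
    private final LongSupplier clockMillis;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    PerUserRateLimiter(int limitPerWindow, long windowMillis, LongSupplier clockMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns true if the call is allowed; false means the caller should answer HTTP 429. */
    boolean tryAcquire(String userId) {
        long currentIndex = clockMillis.getAsLong() / windowMillis;
        // compute() is atomic per key, so concurrent calls for one user are counted correctly.
        Window updated = windows.compute(userId, (id, w) ->
                (w == null || w.index() != currentIndex)
                        ? new Window(currentIndex, 1)
                        : new Window(currentIndex, w.count() + 1));
        return updated.count() <= limitPerWindow;
    }
}
```

A limiter for 60 requests per minute would be created as `new PerUserRateLimiter(60, 60_000, System::currentTimeMillis)`. This in-memory version never evicts idle users and does not share state across instances, which is why a shared store is the usual choice once more than one replica is running.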

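// Cost estimate tracking (scheduling requires @EnableScheduling on a configuration class):
@Scheduled(cron = "0 0 8 * * *")  // daily 8am report, server time zone
public void dailyCostReport() {
    // repository.sumCostForDay is the application's own query over stored usage records
    double yesterdayCost = repository.sumCostForDay(LocalDate.now().minusDays(1));
    log.info("Yesterday AI cost: ${}", String.format("%.4f", yesterdayCost));
}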
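
The per-call cost that feeds a report like this is simple arithmetic on token counts. The following sketch assumes prices are quoted in USD per million tokens; `TokenCostEstimator` is a hypothetical helper, and the prices in the usage note are placeholders, not real provider prices.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Estimates the cost of one call from its input and output token counts. */
final class TokenCostEstimator {
    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    private final BigDecimal inputUsdPerMillion;
    private final BigDecimal outputUsdPerMillion;

    /** Prices are passed as decimal strings to avoid floating-point rounding. */
    TokenCostEstimator(String inputUsdPerMillion, String outputUsdPerMillion) {
        this.inputUsdPerMillion = new BigDecimal(inputUsdPerMillion);
        this.outputUsdPerMillion = new BigDecimal(outputUsdPerMillion);
    }

    /** cost = (inputTokens * inputPrice + outputTokens * outputPrice) / 1,000,000 */
    BigDecimal estimateUsd(long inputTokens, long outputTokens) {
        BigDecimal input = inputUsdPerMillion.multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal output = outputUsdPerMillion.multiply(BigDecimal.valueOf(outputTokens));
        return input.add(output).divide(ONE_MILLION, 6, RoundingMode.HALF_UP);
    }
}
```

With placeholder prices of 2.00 and 8.00 USD per million tokens, a call with 1500 input and 500 output tokens comes to (3000 + 4000) / 1,000,000 = 0.007 USD. Storing that figure per call, tagged with user and feature, is what makes the per-user and per-feature tracking above possible.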

5. Observability

✔ Micrometer metrics exported to Prometheus/Grafana:
    gen_ai.client.operation (timer; p50, p95, p99 latency; check the exact metric name exposed by your Spring AI version)
    gen_ai.client.token.usage (input + output)
    app.ai.calls.errors (error rate)

✔ Distributed tracing enabled (OpenTelemetry → Zipkin/Jaeger)
✔ Log level set correctly:
    production:   WARN for org.springframework.ai (no prompt logging)
    development:  DEBUG for org.springframework.ai (full prompt/response logging)
✔ Alerts defined:
    - p95 latency > 5s → page on-call
    - error rate > 5% → alert
    - daily cost > 80% of budget → notify

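# application-prod.properties
logging.level.org.springframework.ai=WARN
# Never log prompts or responses in prod. Keep comments on their own line:
# in a .properties file a trailing "# ..." becomes part of the value.
# Some Spring AI releases name these keys log-prompt / log-completion; check your version.
spring.ai.chat.observations.include-prompt=false
spring.ai.chat.observations.include-completion=false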
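
With prompt and completion text kept out of the logs, the audit log from section 3 can still record who called what by storing hashes instead. A minimal sketch, assuming SHA-256 over the UTF-8 text; `AiAuditRecord` is a hypothetical type, not a Spring AI class.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;

/** One audit entry per AI call: identifies the call without storing its content. */
record AiAuditRecord(String userId, Instant timestamp, String inputHash, String outputHash) {

    /** Builds an entry from the raw prompt and response, keeping only their hashes. */
    static AiAuditRecord of(String userId, Instant timestamp, String prompt, String response) {
        return new AiAuditRecord(userId, timestamp, sha256Hex(prompt), sha256Hex(response));
    }

    /** Lower-case hex SHA-256 of the UTF-8 bytes of the text. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

A plain hash of a short or guessable prompt can be reversed by trying candidates, so where that matters a keyed hash (HMAC with a secret held outside the log store) is the safer variant of the same idea.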

6. Performance

✔ Cache common responses (60-80% of AI calls for FAQ-style questions can be cached)
✔ Async processing for non-blocking endpoints:
    - Use @Async + CompletableFuture for background AI tasks
    - WebFlux reactive pipeline for streaming
✔ Connection pool tuned for AI calls (long-running HTTP connections)
✔ VectorStore index type: HNSW (not exact search) for large datasets
✔ Embedding batch size: process 100+ documents per batch, not one at a time
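
The response-caching item can be illustrated with a small time-to-live cache keyed by a normalized question. This is an in-memory sketch so it runs standalone; `TtlResponseCache`, its injected clock, and the normalization rule (trim, lower-case, collapse whitespace) are assumptions, and a shared store such as Redis would play this role across replicas.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Caches answers by normalized question for a fixed time-to-live. */
final class TtlResponseCache {
    private record Entry(String answer, long expiresAtMillis) {}

    private final long ttlMillis;
    private final LongSupplier clockMillis;
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    TtlResponseCache(long ttlMillis, LongSupplier clockMillis) {
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns a cached answer if one is still fresh; otherwise calls the model and stores the result. */
    String getOrCompute(String question, Function<String, String> model) {
        // Normalize so trivially different spellings of a question share one entry.
        String key = question.strip().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        long now = clockMillis.getAsLong();
        Entry cached = entries.get(key);
        if (cached != null && cached.expiresAtMillis() > now) {
            return cached.answer();
        }
        String answer = model.apply(question);
        entries.put(key, new Entry(answer, now + ttlMillis));
        return answer;
    }
}
```

Two concurrent misses for the same question will both call the model here; that is usually acceptable for a cache whose purpose is cost reduction, but it is worth knowing when reading hit-rate numbers. Answers that depend on the user or on access-controlled RAG documents should include that context in the key or skip the cache.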

// Batch embedding example: one request carries many texts
EmbeddingResponse response = embeddingModel.embedForResponse(
    List.of("text1", "text2", "text3" /* ... more texts, up to 2048 */));
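
When the document set is larger than one request can carry, it has to be split into batches first. The helper below is a plain-Java sketch of that step; `Batches.partition` is a hypothetical utility, and the batch size of 100 is simply the figure from the checklist.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a list into consecutive batches of at most batchSize elements. */
final class Batches {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(List.copyOf(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch returned by `Batches.partition(texts, 100)` can then be passed to `embeddingModel.embedForResponse(batch)`, turning 250 single-document calls into three requests.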

7. Prompt Management

✔ System prompts stored in external files (not hard-coded Java strings):
    src/main/resources/prompts/chat-system.st
    src/main/resources/prompts/rag-system.st
    src/main/resources/prompts/classifier.st

✔ Prompt versioning tracked (git history of prompt files)
✔ Prompt changes reviewed like code changes (PR review)
✔ A/B testing framework for prompt variations
✔ Prompt regression test suite (evaluation tests run on prompt changes)

// Externalized prompt example:
@Value("classpath:prompts/chat-system.st")
private Resource systemPromptResource;

String systemPrompt = systemPromptResource.getContentAsString(StandardCharsets.UTF_8);

8. Data and Privacy

✔ Documented which data is sent to which AI provider
✔ Users informed in privacy policy (AI processing disclosure)
✔ Option for users to opt out of cloud AI (use local model instead)
✔ No PII in vector store metadata that external users can access
✔ GDPR: ability to delete user data including AI conversation history
✔ Data residency: use region-specific API endpoints if required
    # default (US); keep this comment on its own line, since a trailing "# ..." would become part of the URL
    spring.ai.openai.base-url=https://api.openai.com

9. Deployment

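# Dockerfile for Spring AI app
FROM eclipse-temurin:21-jre-alpine
COPY target/app.jar app.jar
EXPOSE 8080
# The key is not passed as a -D argument: the exec form performs no shell
# expansion, and an expanded key would be visible in process listings.
# Spring resolves ${OPENAI_API_KEY} in application.properties from the
# container environment at startup.
ENTRYPOINT ["java", "-Xmx512m", "-jar", "app.jar"]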

# Kubernetes deployment with secret
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: ai-secrets
        key: openai-api-key

10. Post-Deployment Verification

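✔ Health endpoint responds: GET /actuator/health → overall status UP and the AI component UP (with component details enabled and a custom indicator named "ai": {"status":"UP","components":{"ai":{"status":"UP"}}})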
✔ Metrics flowing to Prometheus: GET /actuator/prometheus → gen_ai_* metrics visible
✔ Smoke test: POST /ai/ask → valid AI response returned
✔ Cache working: same question twice → second response in <10ms
✔ Rate limiting working: first request over the per-minute limit → 429 returned (with a limit of 60 per minute, that is the 61st request)
✔ Fallback working: disable OpenAI key → Ollama fallback responds
✔ No API keys in logs: grep 'sk-' logs/*.log → no matches
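
The grep check above finds leaked keys after the fact; the same pattern can also be applied before a line is written. The sketch below assumes keys look like `sk-` followed by at least eight letters, digits, underscores, or hyphens, which is an illustrative guess rather than a documented key format; `SecretRedactor` is a hypothetical helper.

```java
import java.util.regex.Pattern;

/** Detects and masks API-key-like tokens in log lines. */
final class SecretRedactor {
    // Assumed key shape: "sk-" followed by at least 8 key characters.
    private static final Pattern API_KEY = Pattern.compile("sk-[A-Za-z0-9_-]{8,}");

    /** True if the line contains something that looks like an API key. */
    static boolean containsKey(String line) {
        return API_KEY.matcher(line).find();
    }

    /** Replaces every key-like token with a fixed marker. */
    static String redact(String line) {
        return API_KEY.matcher(line).replaceAll("sk-***REDACTED***");
    }
}
```

`containsKey` is suited to a post-deployment scan of log files, while `redact` could be called from a logging filter. The minimum length keeps ordinary words such as "task-list" from matching.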

Key Points

Never log prompts in production — they may contain user PII that ends up in log aggregators
Set spending limits in your AI provider dashboard as a hard backstop — application-level rate limiting can fail
Test your fallback model before going live — an untested Ollama fallback that fails defeats the purpose
Prompt files should be versioned in git and reviewed in PRs — prompt changes affect behavior as much as code changes
Run your full evaluation test suite after every prompt change and before every production deployment
