Spring AI Production Checklist — Deploy AI Applications with Confidence
Moving a Spring AI application from development to production requires attention to reliability, security, cost, observability, and performance. This checklist covers every concern you should verify before going live — from API key rotation to prompt versioning to graceful degradation.
1. Configuration and Secrets
✔ Never hard-code API keys in source code or application.properties
BAD: spring.ai.openai.api-key=sk-proj-abc123...
GOOD: spring.ai.openai.api-key=${OPENAI_API_KEY}
✔ Use a secrets manager in production:
- AWS Secrets Manager + Spring Cloud AWS
- HashiCorp Vault + Spring Cloud Vault
- Kubernetes Secrets mounted as env vars
✔ Rotate API keys regularly (at least every 90 days)
✔ Use different API keys for development, staging, and production
→ Separate billing, separate rate limits, separate audit logs
✔ Set spending limits in your AI provider dashboard
→ OpenAI: Platform → Settings → Billing → Usage limits
→ Anthropic: Console → Settings → Usage limits
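A startup check can catch a missing or unresolved key before the first request is served. The sketch below is plain Java with no Spring dependency; the class name `ApiKeyCheck` and its methods `requireKey` and `mask` are hypothetical, and the environment is passed in as a map so the check is easy to test.

```java
import java.util.Map;

/** Hypothetical startup helper: fail fast when a required API key is absent. */
final class ApiKeyCheck {

    private ApiKeyCheck() {}

    /** Returns the trimmed key, or throws if it is missing, blank, or an unresolved ${...} placeholder. */
    static String requireKey(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing required secret: " + name);
        }
        String trimmed = value.trim();
        if (trimmed.startsWith("${") && trimmed.endsWith("}")) {
            throw new IllegalStateException("Unresolved placeholder for secret: " + name);
        }
        return trimmed;
    }

    /** Masks a key for log output, keeping only the last four characters. */
    static String mask(String key) {
        if (key.length() <= 4) {
            return "****";
        }
        return "****" + key.substring(key.length() - 4);
    }

    public static void main(String[] args) {
        // Demo value only; in a real application pass System.getenv() instead.
        Map<String, String> env = Map.of("OPENAI_API_KEY", "demo-key-1234");
        System.out.println("Key loaded: " + mask(requireKey(env, "OPENAI_API_KEY"))); // prints Key loaded: ****1234
    }
}
```

Logging only the masked form keeps the check compatible with the "no API keys in logs" rule.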
2. Reliability
# application.properties — production retry configuration
spring.ai.retry.max-attempts=5
spring.ai.retry.on-http-codes=429,503,502
spring.ai.retry.exclude-on-http-codes=400,401,403,404
spring.ai.retry.backoff.initial-interval=2000
spring.ai.retry.backoff.multiplier=2
spring.ai.retry.backoff.max-interval=30000
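To see what a backoff configuration means in wall-clock time, it helps to print the waits it would produce. The sketch below assumes plain exponential backoff with a cap and no jitter; `BackoffSchedule` is a hypothetical helper, not part of Spring AI, and the real retry policy may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: computes the waits an exponential backoff with a cap would produce. */
final class BackoffSchedule {

    private BackoffSchedule() {}

    /**
     * @param initialMs   wait before the first retry
     * @param multiplier  factor applied after each retry
     * @param maxMs       upper bound for any single wait
     * @param maxAttempts total attempts, including the first call
     */
    static List<Long> waitsMs(long initialMs, double multiplier, long maxMs, int maxAttempts) {
        List<Long> waits = new ArrayList<>();
        double next = initialMs;
        // N attempts means at most N - 1 waits between them.
        for (int retry = 1; retry < maxAttempts; retry++) {
            waits.add(Math.min(Math.round(next), maxMs));
            next *= multiplier;
        }
        return waits;
    }

    public static void main(String[] args) {
        // 5 attempts, 2000 ms initial wait, doubling each time, capped at 30000 ms
        System.out.println(waitsMs(2000, 2.0, 30000, 5)); // prints [2000, 4000, 8000, 16000]
    }
}
```

In this example a request that fails on every attempt holds its caller for about 30 seconds of waiting plus the time of five calls, which is worth comparing against your request timeout.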
✔ Circuit breaker configured (Resilience4j):
failure-rate-threshold: 50%
wait-duration-in-open-state: 30s
✔ Fallback model configured:
Primary: GPT-4o (OpenAI)
Fallback: Ollama/llama3.2 (local)
✔ Request timeout set (prevents thread starvation):
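60s read timeout on the HTTP client behind the chat model (for example via a RestClientCustomizer bean; the OpenAI chat options do not appear to include a timeout setting, so verify against your Spring AI version)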
✔ Health indicator implemented (/actuator/health shows AI status)
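The circuit breaker and fallback items above can be sketched together in plain Java. This is a simplified count-based breaker, not Resilience4j's implementation: `SimpleCircuitBreaker` is a hypothetical class, it has no half-open state, and the clock is injected so the behavior can be tested without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified count-based circuit breaker with a fallback. Not Resilience4j's implementation. */
final class SimpleCircuitBreaker {

    private final int windowSize;          // calls evaluated per window
    private final double failureThreshold; // 0.5 means open at 50% failures
    private final long openMillis;         // time to stay open, e.g. 30_000
    private final LongSupplier clock;      // injected so tests can control time

    private int calls;
    private int failures;
    private long openedAt = -1;            // -1 means the breaker is closed

    SimpleCircuitBreaker(int windowSize, double failureThreshold, long openMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while calls should skip the primary model. */
    synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        if (clock.getAsLong() - openedAt >= openMillis) {
            // Wait elapsed: close and start a fresh window (no half-open state in this sketch).
            openedAt = -1;
            calls = 0;
            failures = 0;
            return false;
        }
        return true;
    }

    /** Calls the primary unless the breaker is open; any failure is answered by the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failed) {
        calls++;
        if (failed) {
            failures++;
        }
        if (calls >= windowSize) {
            // Evaluate the finished window, then start a new one.
            if ((double) failures / calls >= failureThreshold) {
                openedAt = clock.getAsLong();
            }
            calls = 0;
            failures = 0;
        }
    }
}
```

A call site might look like `breaker.call(() -> primaryModel.ask(q), () -> localModel.ask(q))`, where `primaryModel` and `localModel` are placeholders for your own OpenAI and Ollama clients.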
3. Security
✔ All AI endpoints are authenticated (Spring Security + JWT or OAuth2)
✔ Input validation:
- Max prompt length enforced (4000 chars)
- Prompt injection patterns blocked
- XSS/HTML sanitized before inclusion in prompts
✔ PII detection runs before any prompt sent to cloud AI provider
✔ RAG documents filtered by user access level
✔ AI responses not trusted as safe HTML (always escape before rendering)
✔ Audit log of all AI calls (user, timestamp, input hash, output hash)
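Two of the items above, input validation and hashed audit entries, are easy to prototype in plain Java. In this sketch `PromptGuard` is a hypothetical class and the deny-list patterns are illustrative examples only; pattern matching on its own is not a complete defense against prompt injection.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical pre-flight checks for user prompts, plus a hash helper for audit logs. */
final class PromptGuard {

    static final int MAX_PROMPT_CHARS = 4000;

    // Illustrative deny-list; matched against the lower-cased prompt.
    private static final List<Pattern> INJECTION_PATTERNS = List.of(
            Pattern.compile("ignore (all )?(previous|prior) instructions"),
            Pattern.compile("reveal (your|the) system prompt"));

    private PromptGuard() {}

    /** Throws IllegalArgumentException if the prompt is empty, too long, or matches a deny-list pattern. */
    static void validate(String prompt) {
        if (prompt == null || prompt.isBlank()) {
            throw new IllegalArgumentException("Prompt is empty");
        }
        if (prompt.length() > MAX_PROMPT_CHARS) {
            throw new IllegalArgumentException("Prompt exceeds " + MAX_PROMPT_CHARS + " characters");
        }
        String lower = prompt.toLowerCase(Locale.ROOT);
        for (Pattern pattern : INJECTION_PATTERNS) {
            if (pattern.matcher(lower).find()) {
                throw new IllegalArgumentException("Prompt rejected by injection filter");
            }
        }
    }

    /** SHA-256 of the text as lowercase hex, so the audit log stores a hash instead of the raw content. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Storing `sha256Hex(input)` and `sha256Hex(output)` lets you prove what was sent and received without keeping user text in the audit trail.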
4. Cost Control
✔ Token usage tracked per user and per feature
✔ Per-user rate limits enforced (Redis counter)
✔ Daily budget alerts configured (email/Slack at 80% of limit)
✔ Model selection optimized:
- Classification → gpt-4o-mini or claude-haiku (10-50x cheaper)
- Generation → gpt-4o or claude-sonnet
- Private data → Ollama (no per-token fee; you pay for the hardware it runs on)
✔ Response caching for repeated questions (Redis, 1h TTL)
✔ Streaming used for long responses (better UX, same cost)
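The per-user rate limit can be prototyped as a fixed-window counter. The sketch below keeps counters in memory as a stand-in for Redis; `FixedWindowRateLimiter` is a hypothetical class, and with Redis the same key would typically be incremented atomically and given a TTL of one window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** In-memory stand-in for a Redis counter: fixed-window, per-user request limit. */
final class FixedWindowRateLimiter {

    private final int limit;          // allowed requests per window
    private final long windowMillis;  // window length, e.g. 60_000 for one minute
    private final LongSupplier clock; // injected so tests can control time
    private final Map<String, Integer> counters = new ConcurrentHashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true if the request is allowed, false if it should be answered with HTTP 429. */
    boolean tryAcquire(String userId) {
        long window = clock.getAsLong() / windowMillis;
        // One counter per user and window. Old windows are never evicted in this sketch;
        // a Redis TTL would take care of that.
        String key = userId + ":" + window;
        int count = counters.merge(key, 1, Integer::sum);
        return count <= limit;
    }
}
```

With a limit of 60 per minute, the first 60 calls in a window return true and the 61st returns false.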
// Cost estimate tracking:
@Scheduled(cron = "0 0 8 * * *") // daily 8am report
public void dailyCostReport() {
    double yesterdayCost = repository.sumCostForDay(LocalDate.now().minusDays(1));
    log.info("Yesterday AI cost: ${}", String.format("%.4f", yesterdayCost));
}
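The per-call cost that feeds a report like this can be estimated from token counts. A minimal sketch, assuming prices are quoted in USD per million tokens; `TokenCost` is a hypothetical helper and the prices in the usage note are placeholders, not current provider pricing.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical helper: estimates the cost of one call from its token counts. */
final class TokenCost {

    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    private TokenCost() {}

    /** Prices are USD per one million tokens; the result is rounded to six decimal places. */
    static BigDecimal estimateUsd(long inputTokens, long outputTokens,
                                  BigDecimal inputUsdPerMillion, BigDecimal outputUsdPerMillion) {
        BigDecimal inputCost = inputUsdPerMillion.multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal outputCost = outputUsdPerMillion.multiply(BigDecimal.valueOf(outputTokens));
        return inputCost.add(outputCost).divide(ONE_MILLION, 6, RoundingMode.HALF_UP);
    }
}
```

For example, with placeholder prices of 2.50 (input) and 10.00 (output) per million tokens, a call with 1200 input tokens and 300 output tokens is estimated at 0.006 USD. BigDecimal avoids the rounding drift that accumulates when many small double values are summed.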
5. Observability
✔ Micrometer metrics exported to Prometheus/Grafana:
gen_ai.client.operation.duration (p50, p95, p99)
gen_ai.client.token.usage (input + output)
app.ai.calls.errors (error rate)
✔ Distributed tracing enabled (OpenTelemetry → Zipkin/Jaeger)
✔ Log level set correctly:
production: WARN for org.springframework.ai (no prompt logging)
development: DEBUG for org.springframework.ai (full prompt/response logging)
✔ Alerts defined:
- p95 latency > 5s → page on-call
- error rate > 5% → alert
- daily cost > 80% of budget → notify
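# application-prod.properties
logging.level.org.springframework.ai=WARN
# Never log prompts or responses in prod. Keep comments on their own line:
# in a .properties file a trailing "# ..." becomes part of the value.
# Spring AI 1.0+ names these log-prompt and log-completion; verify against your version.
spring.ai.chat.observations.include-prompt=false
spring.ai.chat.observations.include-completion=false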
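The alert thresholds listed above are normally evaluated by your monitoring stack, but writing them out makes the rules unambiguous. A minimal sketch in plain Java, assuming a nearest-rank percentile; `AlertRules` and its method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical, in-process version of the three alert rules. */
final class AlertRules {

    private AlertRules() {}

    /** Nearest-rank percentile of the samples; p is in the range (0, 100]. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** p95 latency above 5 seconds: page on-call. */
    static boolean shouldPage(long[] latenciesMs) {
        return percentile(latenciesMs, 95) > 5_000;
    }

    /** Error rate above 5%: alert. */
    static boolean errorRateAlert(long errors, long total) {
        return total > 0 && (double) errors / total > 0.05;
    }

    /** Daily cost above 80% of budget: notify. */
    static boolean budgetNotice(double spentUsd, double dailyBudgetUsd) {
        return spentUsd > 0.8 * dailyBudgetUsd;
    }
}
```

Note that a single slow call among twenty does not move p95 under nearest-rank, which is why p95 is less noisy than a maximum-latency alert.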
6. Performance
✔ Cache common responses (60-80% of AI calls for FAQ-style questions can be cached)
✔ Async processing for non-blocking endpoints:
- Use @Async + CompletableFuture for background AI tasks
- WebFlux reactive pipeline for streaming
✔ Connection pool tuned for AI calls (long-running HTTP connections)
✔ VectorStore index type: HNSW (not exact search) for large datasets
✔ Embedding batch size: process 100+ documents per batch, not one at a time
// Batch embedding example:
// One request for many texts (OpenAI accepts up to 2048 inputs per request)
List<String> texts = List.of("text1", "text2", "text3");
EmbeddingResponse response = embeddingModel.embedForResponse(texts);
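When a document set is larger than one request allows, it has to be split into batches first. A minimal sketch of that step in plain Java; `Batches.partition` is a hypothetical helper, and the batch size you pass should match your provider's per-request limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a list into consecutive batches of at most batchSize items. */
final class Batches {

    private Batches() {}

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy so each batch is independent of the source list.
            batches.add(List.copyOf(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each resulting batch can then be passed to a single `embedForResponse` call instead of issuing one call per document.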
7. Prompt Management
✔ System prompts stored in external files (not hard-coded Java strings):
src/main/resources/prompts/chat-system.st
src/main/resources/prompts/rag-system.st
src/main/resources/prompts/classifier.st
✔ Prompt versioning tracked (git history of prompt files)
✔ Prompt changes reviewed like code changes (PR review)
✔ A/B testing framework for prompt variations
✔ Prompt regression test suite (evaluation tests run on prompt changes)
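// Externalized prompt example:
@Value("classpath:prompts/chat-system.st")
private Resource systemPromptResource;

// Read after injection (for example in a @PostConstruct method); getContentAsString throws IOException
String systemPrompt = systemPromptResource.getContentAsString(StandardCharsets.UTF_8);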
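For A/B testing of prompt variations, each user should see the same variant on every request. One common approach is deterministic bucketing by hash, sketched below; `PromptVariants` is a hypothetical helper and the experiment name is whatever label you give the test.

```java
/** Hypothetical helper: stable assignment of users to prompt variant "A" or "B". */
final class PromptVariants {

    private PromptVariants() {}

    /**
     * @param userId     stable user identifier
     * @param experiment experiment label, so different experiments bucket users independently
     * @param percentB   share of users (0 to 100) who should receive variant B
     */
    static String assign(String userId, String experiment, int percentB) {
        // String.hashCode is specified by the language, so the bucket is stable across restarts.
        int bucket = Math.floorMod((experiment + ":" + userId).hashCode(), 100);
        return bucket < percentB ? "B" : "A";
    }
}
```

Recording the assigned variant next to each logged call makes it possible to compare quality and cost per variant later.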
8. Data and Privacy
✔ Documented which data is sent to which AI provider
✔ Users informed in privacy policy (AI processing disclosure)
✔ Option for users to opt out of cloud AI (use local model instead)
✔ No PII in vector store metadata that external users can access
✔ GDPR: ability to delete user data including AI conversation history
✔ Data residency: use region-specific API endpoints if required
# default (US)
spring.ai.openai.base-url=https://api.openai.com
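The opt-out item above implies a routing decision before every call. A minimal sketch, assuming two providers and a deliberately simple PII check that only looks for email addresses; `ProviderRouter` and its `Provider` enum are hypothetical, and a real PII detector would cover far more than this.

```java
import java.util.regex.Pattern;

/** Hypothetical router: keeps opted-out users and prompts with obvious PII on the local model. */
final class ProviderRouter {

    enum Provider { CLOUD, LOCAL }

    // Illustrative PII check: email addresses only.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    private ProviderRouter() {}

    static Provider route(boolean userOptedOutOfCloud, String prompt) {
        if (userOptedOutOfCloud || EMAIL.matcher(prompt).find()) {
            return Provider.LOCAL;
        }
        return Provider.CLOUD;
    }
}
```

Routing to the local model when in doubt is the conservative default: a false positive costs some answer quality, while a false negative sends personal data to a third party.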
9. Deployment
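# Dockerfile for Spring AI app
FROM eclipse-temurin:21-jre-alpine
COPY target/app.jar app.jar
EXPOSE 8080
# Exec-form ENTRYPOINT does no shell expansion, and -D flags are visible in process listings,
# so the key is not passed here; Spring resolves ${OPENAI_API_KEY} from the container env at startup.
ENTRYPOINT ["java", "-Xmx512m", "-jar", "app.jar"]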
# Kubernetes deployment with secret
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: ai-secrets
key: openai-api-key
10. Post-Deployment Verification
✔ Health endpoint responds: GET /actuator/health → {"status":"UP","components":{"ai":{"status":"UP"}}} (requires management.endpoint.health.show-components=always)
✔ Metrics flowing to Prometheus: GET /actuator/prometheus → gen_ai_* metrics visible
✔ Smoke test: POST /ai/ask → valid AI response returned
✔ Cache working: same question twice → second response in <10ms
✔ Rate limiting working: first request over the per-minute limit (for example the 61st at 60/min) → 429 returned
✔ Fallback working: disable OpenAI key → Ollama fallback responds
✔ No API keys in logs: grep 'sk-' logs/*.log → no matches
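The log check can also run as an automated test with a stricter pattern than a plain `grep 'sk-'`, which also matches harmless text such as "task-runner". The sketch below assumes keys look like `sk-` followed by at least 20 key characters; that shape and the class name `KeyLeakScan` are assumptions, so adjust the pattern to your provider's format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical log scanner: reports lines that appear to contain an API key. */
final class KeyLeakScan {

    // Assumed key shape: "sk-" followed by 20 or more letters, digits, underscores, or hyphens.
    private static final Pattern KEY = Pattern.compile("sk-[A-Za-z0-9_-]{20,}");

    private KeyLeakScan() {}

    /** Returns the 1-based numbers of the lines that match the key pattern. */
    static List<Integer> leakingLines(List<String> logLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            if (KEY.matcher(logLines.get(i)).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }
}
```

Feeding it the lines of each log file (for example via `Files.readAllLines`) and failing the check on any non-empty result gives the same signal as the grep, with fewer false positives.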
Key Points
- Never log prompts in production — they may contain user PII that ends up in log aggregators
- Set spending limits in your AI provider dashboard as a hard backstop — application-level rate limiting can fail
- Test your fallback model before going live — an untested Ollama fallback that fails defeats the purpose
- Prompt files should be versioned in git and reviewed in PRs — prompt changes affect behavior as much as code changes
- Run your full evaluation test suite after every prompt change and before every production deployment