Spring AI Retry, Rate Limiting, and Resilience — Handle AI Service Failures Gracefully
AI API calls fail — rate limits hit, timeouts occur, models return errors. A production Spring AI application must handle these failures gracefully with retries, fallbacks, circuit breakers, and rate limiting. This tutorial implements all four resilience patterns using Spring AI's built-in retry support and Resilience4j.
Types of AI Failures
HTTP 429 (Too Many Requests) → Rate limit hit, retry after delay
HTTP 503 (Service Unavailable) → Provider outage, try fallback model
HTTP 408 (Timeout) → Slow response, retry with backoff
HTTP 400 (Bad Request) → Invalid prompt, do not retry
Connection refused → Network issue, retry with backoff
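The mapping above can be written down as a small lookup so that the retry decision lives in one place. This is a minimal sketch in plain Java: `FailurePolicy`, its `Action` enum, and the fail-fast default for unlisted codes are hypothetical names and choices made for illustration, not part of Spring AI or Resilience4j.

```java
import java.net.ConnectException;
import java.util.Map;

/** Hypothetical helper: maps a failure to the action suggested in the list above. */
final class FailurePolicy {

    enum Action { RETRY_WITH_BACKOFF, TRY_FALLBACK, FAIL_FAST }

    private static final Map<Integer, Action> BY_STATUS = Map.of(
            429, Action.RETRY_WITH_BACKOFF, // rate limit hit
            408, Action.RETRY_WITH_BACKOFF, // slow response
            503, Action.TRY_FALLBACK,       // provider outage
            400, Action.FAIL_FAST);         // invalid prompt, retrying cannot help

    /** Unlisted status codes fail fast here; that default is an assumption. */
    static Action forStatus(int status) {
        return BY_STATUS.getOrDefault(status, Action.FAIL_FAST);
    }

    /** A refused connection carries no status code; the list treats it as retryable. */
    static Action forException(Throwable t) {
        return (t instanceof ConnectException)
                ? Action.RETRY_WITH_BACKOFF
                : Action.FAIL_FAST;
    }
}
```

For example, `FailurePolicy.forStatus(503)` yields `TRY_FALLBACK`, which is the case the circuit breaker section handles with a local model.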
Spring AI Built-in Retry
# application.properties — configure Spring AI's built-in retry
spring.ai.retry.max-attempts=5
spring.ai.retry.on-http-codes=429,503
spring.ai.retry.exclude-on-http-codes=400,401,403
spring.ai.retry.backoff.initial-interval=1000
spring.ai.retry.backoff.multiplier=2.0
spring.ai.retry.backoff.max-interval=30000
// Spring AI automatically retries on 429/503; no extra code needed
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class ResilientAiService {

    private final ChatClient chatClient;

    public ResilientAiService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String ask(String question) {
        // If OpenAI returns 429, Spring AI retries automatically with
        // exponential backoff: 1s, 2s, 4s, 8s (5 attempts, so at most 4 waits)
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}
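To see what the backoff properties imply, the wait sequence can be computed by hand. The sketch below assumes that `max-attempts` counts the first call (as Spring Retry's `maxAttempts` does), so five attempts leave room for four waits. `BackoffSchedule` is a hypothetical helper for illustration, not a Spring AI class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the waits produced by exponential backoff. */
final class BackoffSchedule {

    /**
     * Returns the wait in milliseconds before each retry.
     * maxAttempts includes the initial call, so there are maxAttempts - 1 waits.
     */
    static List<Long> delays(int maxAttempts, long initialMs, double multiplier, long maxMs) {
        List<Long> waits = new ArrayList<>();
        double next = initialMs;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add((long) Math.min(next, maxMs)); // cap each wait at max-interval
            next *= multiplier;
        }
        return waits;
    }
}
```

With the values configured above, `BackoffSchedule.delays(5, 1000, 2.0, 30000)` gives `[1000, 2000, 4000, 8000]`, so a request that fails every time is abandoned after roughly 15 seconds of waiting plus the time spent in the calls themselves.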
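Resilience4j Circuit Breaker
<!-- pom.xml -->
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<!-- Needed for the Resilience4j annotations to take effect -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>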
# application.properties
resilience4j.circuitbreaker.instances.ai-service.failure-rate-threshold=50
resilience4j.circuitbreaker.instances.ai-service.slow-call-rate-threshold=80
resilience4j.circuitbreaker.instances.ai-service.slow-call-duration-threshold=10s
resilience4j.circuitbreaker.instances.ai-service.sliding-window-size=10
resilience4j.circuitbreaker.instances.ai-service.wait-duration-in-open-state=30s
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;

@Service
public class CircuitBreakerAiService {

    private final ChatClient primaryClient;  // OpenAI
    private final ChatClient fallbackClient; // Ollama (local)

    // Assumes the application defines a ChatClient.Builder bean named "ollamaBuilder"
    public CircuitBreakerAiService(ChatClient.Builder builder,
            @Qualifier("ollamaBuilder") ChatClient.Builder ollamaBuilder) {
        this.primaryClient = builder.build();
        this.fallbackClient = ollamaBuilder.build();
    }

    // With Resilience4j's default aspect order, Retry wraps CircuitBreaker, so the
    // fallback below handles a failure before @Retry sees it
    @CircuitBreaker(name = "ai-service", fallbackMethod = "fallbackToLocalModel")
    @Retry(name = "ai-service")
    public String ask(String question) {
        return primaryClient.prompt()
                .user(question)
                .call()
                .content();
    }

    // Called on any failure of ask(), including calls rejected while the circuit
    // is open; switches to the local Ollama model
    public String fallbackToLocalModel(String question, Exception e) {
        System.out.println("Primary AI unavailable (" + e.getMessage() + "), using local model");
        return fallbackClient.prompt()
                .user(question)
                .call()
                .content();
    }
}
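The properties above (sliding window size, failure-rate threshold, wait duration in the open state) are easier to reason about with the state machine in front of you. The following is a deliberately simplified sketch of a count-based circuit breaker in plain Java. `SimpleCircuitBreaker` is a hypothetical class written for illustration. It ignores slow-call tracking and allows only one trial call when half-open, so it is not a description of how Resilience4j is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified, hypothetical circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // percent, e.g. 50.0
    private final long openWaitMs;
    private final LongSupplier clock;          // injectable so tests can control time
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long openWaitMs, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openWaitMs = openWaitMs;
        this.clock = clock;
    }

    /** Returns the current state, moving OPEN to HALF_OPEN once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openWaitMs) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the primary call unless the circuit is open; failures go to the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // short-circuit: the primary is not invoked
        }
        try {
            T result = primary.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            // A single trial call decides whether to close or re-open
            window.clear();
            if (failed) { trip(); } else { state = State.CLOSED; }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if (failures * 100.0 / windowSize >= failureRateThreshold) {
                trip();
            }
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }
}
```

With a window of 10 and a threshold of 50, five failures among the last ten calls open the circuit, and every call for the next 30 seconds goes straight to the fallback without touching the primary provider.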
Rate Limiting — Protect API Budget
# application.properties
resilience4j.ratelimiter.instances.ai-service.limit-for-period=20
resilience4j.ratelimiter.instances.ai-service.limit-refresh-period=1m
resilience4j.ratelimiter.instances.ai-service.timeout-duration=5s
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class RateLimitedAiService {

    private final ChatClient chatClient;

    public RateLimitedAiService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @RateLimiter(name = "ai-service", fallbackMethod = "rateLimitFallback")
    public String ask(String question) {
        return chatClient.prompt().user(question).call().content();
    }

    public String rateLimitFallback(String question, Throwable t) {
        return "Our AI service is currently busy. Please try again in a moment.";
    }
}
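The two main settings, `limit-for-period` and `limit-refresh-period`, describe a budget of permits that is refilled at a fixed interval. A rough plain-Java sketch of that idea follows. `FixedWindowLimiter` is a hypothetical class for illustration. Unlike the configuration above, which lets a caller wait up to `timeout-duration` for a permit, this sketch rejects immediately when the budget is spent.

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window limiter: at most limitForPeriod calls per refresh period. */
final class FixedWindowLimiter {

    private final int limitForPeriod;
    private final long refreshPeriodMs;
    private final LongSupplier clock; // injectable so tests can control time
    private long windowStart;
    private int used;

    FixedWindowLimiter(int limitForPeriod, long refreshPeriodMs, LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMs = refreshPeriodMs;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true and consumes a permit if one is left in the current window. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= refreshPeriodMs) {
            windowStart = now; // a new period begins, so the budget is restored
            used = 0;
        }
        if (used >= limitForPeriod) {
            return false;
        }
        used++;
        return true;
    }
}
```

A caller would check `tryAcquire()` before invoking the model and return the "currently busy" message when it yields `false`, which is the role `rateLimitFallback` plays above.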
Per-User Rate Limiting with Redis
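import java.time.Duration;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class PerUserRateLimitService {

    private final ChatClient chatClient;
    // StringRedisTemplate stores counters as plain strings, which INCR requires
    private final StringRedisTemplate redisTemplate;
    private static final int MAX_CALLS_PER_HOUR = 50;

    public PerUserRateLimitService(ChatClient.Builder builder,
            StringRedisTemplate redisTemplate) {
        this.chatClient = builder.build();
        this.redisTemplate = redisTemplate;
    }

    public String ask(String userId, String question) {
        String key = "ai:ratelimit:" + userId;
        // Increment first: INCR is atomic, so concurrent requests from the same
        // user cannot both pass a separate read-then-write check
        Long count = redisTemplate.opsForValue().increment(key);
        if (count != null && count == 1L) {
            // First call in this window: start the one-hour expiry
            redisTemplate.expire(key, Duration.ofHours(1));
        }
        if (count != null && count > MAX_CALLS_PER_HOUR) {
            // RateLimitExceededException is an application-defined exception
            throw new RateLimitExceededException(
                    "Rate limit exceeded. Max " + MAX_CALLS_PER_HOUR + " calls per hour.");
        }
        return chatClient.prompt().user(question).call().content();
    }
}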
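The per-user counting logic can be tried without a Redis instance by keeping the counters in memory. This sketch uses `ConcurrentHashMap.compute` so that the increment and the window reset happen atomically per user. `PerUserQuota` is a hypothetical class, and because its state lives in one JVM it only approximates what a shared Redis counter provides across several application instances.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical in-memory stand-in for a per-user counter with a fixed expiry window. */
final class PerUserQuota {

    private record Window(long startMs, int count) {}

    private final int maxCalls;
    private final long windowMs;
    private final LongSupplier clock; // injectable so tests can control time
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    PerUserQuota(int maxCalls, long windowMs, LongSupplier clock) {
        this.maxCalls = maxCalls;
        this.windowMs = windowMs;
        this.clock = clock;
    }

    /** Counts the call, then reports whether the user is still within the limit. */
    boolean tryConsume(String userId) {
        long now = clock.getAsLong();
        // compute() runs atomically per key: increment, or start a fresh window
        Window w = windows.compute(userId, (id, old) ->
                (old == null || now - old.startMs() >= windowMs)
                        ? new Window(now, 1)
                        : new Window(old.startMs(), old.count() + 1));
        return w.count() <= maxCalls;
    }
}
```

Each user has an independent window, so one user exhausting a quota of 50 calls per hour does not affect anyone else.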
Timeout Configuration
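# application.properties
# 30-second timeout per call (verify this property name against your Spring AI version)
spring.ai.openai.chat.options.timeout=30s
# Or configure timeouts at the HTTP client level (the code below uses Apache HttpClient 5):
import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.util.Timeout;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

@Configuration
public class AiHttpConfig {

    @Bean
    public RestClient.Builder restClientBuilder() {
        return RestClient.builder()
                .requestFactory(new HttpComponentsClientHttpRequestFactory(
                        HttpClients.custom()
                                .setDefaultRequestConfig(RequestConfig.custom()
                                        // Time allowed to establish the connection
                                        .setConnectTimeout(Timeout.ofSeconds(5))
                                        // Time allowed to wait for the response
                                        .setResponseTimeout(Timeout.ofSeconds(30))
                                        .build())
                                .build()));
    }
}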
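Independently of HTTP-level settings, a caller can also put a deadline around the whole AI call. The sketch below uses `CompletableFuture.completeOnTimeout` from the JDK (Java 9 and later). `Deadline.callWithTimeout` is a hypothetical helper, and it only stops waiting: the underlying task keeps running in the background, so it complements HTTP client timeouts instead of replacing them.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper that bounds how long a caller waits for an AI response. */
final class Deadline {

    /**
     * Runs aiCall on the common pool and returns its result, or the fallback
     * value if no result arrives within the timeout. The task itself is not
     * cancelled when the timeout fires.
     */
    static String callWithTimeout(Supplier<String> aiCall, Duration timeout, String fallback) {
        return CompletableFuture.supplyAsync(aiCall)
                .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
                .join();
    }
}
```

In a service method this could wrap the `chatClient` call, for example `Deadline.callWithTimeout(() -> chatClient.prompt().user(question).call().content(), Duration.ofSeconds(30), "Please try again later.")`.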
Resilience Output Log
# Normal operation:
User asks question → OpenAI responds in 1.2s ✓
# Rate limit hit (429):
AI call attempt 1 → 429 Too Many Requests
Retrying in 1s...
AI call attempt 2 → 429 Too Many Requests
Retrying in 2s...
AI call attempt 3 → 200 OK ✓
# Circuit open (50% failure rate exceeded):
Primary AI unavailable (Connection refused), using local model
Ollama responds in 3.8s ✓
# User rate limit:
RateLimitExceededException: Rate limit exceeded. Max 50 calls per hour.
Key Points
- Set spring.ai.retry.max-attempts=5 and spring.ai.retry.on-http-codes=429,503; this handles the most common AI service errors with zero extra code
- Never retry on 400 (bad request) or 401 (invalid API key); they won't succeed on retry
- A local Ollama model is a practical circuit breaker fallback: it keeps working when the cloud provider is unreachable and has zero marginal cost per call
- Track per-user call counts in Redis to implement fair-use limits and prevent a single user from consuming the entire API budget
- Set an explicit timeout (30 seconds in this tutorial) and raise it if you expect large outputs, which can take 60+ seconds; without a timeout, a stalled call can block its thread indefinitely