Spring AI Response Caching — Cache AI Answers with Redis and Spring Cache
AI API calls are expensive and slow. Caching is one of the most impactful optimizations you can apply: if two users ask the same question, the second answer should come from cache in milliseconds, not from the LLM in seconds. This tutorial implements both exact-match caching and semantic caching (near-duplicate detection by vector similarity) for Spring AI using Redis.
Two Types of AI Caching
Exact Match Cache:
"What is Spring Boot?" → cached → same question gets instant response
"What is spring boot?" → different string → cache miss unless the key is normalized (even though the meaning is the same)
Semantic Cache (Smart):
"What is Spring Boot?" → cached response
"What is Spring Boot framework?" → vector similarity ≥ 0.95 → CACHE HIT ← same response returned
"What is Quarkus?" → vector similarity 0.3 → cache miss → new AI call
Maven Dependencies
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
application.properties
spring.ai.openai.api-key=${OPENAI_API_KEY}
# Redis
spring.data.redis.host=localhost
spring.data.redis.port=6379
# Spring Cache
spring.cache.type=redis
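# 1 hour TTL in milliseconds (keep comments on their own line; a trailing "# ..." becomes part of the value)
# Used only by the auto-configured cache manager; a custom RedisCacheManager bean sets its own TTLs
spring.cache.redis.time-to-live=3600000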
Simple Exact-Match Cache
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CachedAiService {

    private final ChatClient chatClient;

    public CachedAiService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Cache the result under the lowercased, trimmed question string
    @Cacheable(value = "ai-responses", key = "#question.toLowerCase().trim()")
    public String ask(String question) {
        // This body runs only on a cache miss; on a hit Spring returns the cached value directly
        System.out.println("Cache MISS — calling AI for: " + question);
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }

    // Clear the cache for a specific question (same key expression as ask)
    @CacheEvict(value = "ai-responses", key = "#question.toLowerCase().trim()")
    public void evict(String question) {
        System.out.println("Evicted cache for: " + question);
    }

    // Clear entire cache (admin use)
    @CacheEvict(value = "ai-responses", allEntries = true)
    public void clearAll() {
        System.out.println("AI response cache cleared");
    }
}
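The key expression above only lowercases and trims, so "What is  Spring Boot?" (two spaces) would still miss. A minimal sketch of a slightly stricter normalizer follows; the class name `CacheKeys` and its method are hypothetical helpers, not part of Spring.

```java
import java.util.Locale;

// Hypothetical helper for building exact-match cache keys.
final class CacheKeys {

    private CacheKeys() {
    }

    // Trims, collapses runs of whitespace to one space, and lowercases
    // with a fixed locale so keys do not depend on the server's default locale.
    static String normalize(String question) {
        if (question == null) {
            throw new IllegalArgumentException("question must not be null");
        }
        return question.strip()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  What is   Spring Boot? "));
    }
}
```

One way to use it would be a SpEL type reference in the annotation, for example key = "T(com.example.CacheKeys).normalize(#question)", where the package name is an assumption. The same expression would need to be used on both @Cacheable and @CacheEvict so that eviction targets the same key.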
Cache Configuration
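import java.time.Duration;

import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.serializer.GenericJackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.RedisSerializationContext;
import org.springframework.data.redis.serializer.StringRedisSerializer;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        // Default for any cache not listed below: 1h TTL, string keys, JSON values
        RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofHours(1))
                .serializeKeysWith(RedisSerializationContext.SerializationPair
                        .fromSerializer(new StringRedisSerializer()))
                .serializeValuesWith(RedisSerializationContext.SerializationPair
                        .fromSerializer(new GenericJackson2JsonRedisSerializer()));
        // entryTtl returns a new configuration, so the default above is not modified
        return RedisCacheManager.builder(factory)
                .cacheDefaults(config)
                .withCacheConfiguration("ai-responses",
                        config.entryTtl(Duration.ofHours(6))) // 6h for AI responses
                .withCacheConfiguration("ai-embeddings",
                        config.entryTtl(Duration.ofDays(7))) // 7d for embeddings
                .build();
    }
}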
Semantic Cache — Cache by Meaning, Not Exact String
import java.time.Instant;
import java.util.List;
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class SemanticCacheService {

    private final ChatClient chatClient;
    // Not called directly below; the VectorStore embeds questions with its own configured model
    private final EmbeddingModel embeddingModel;
    private final VectorStore cacheStore; // Redis or SimpleVectorStore as cache
    private static final double SIMILARITY_THRESHOLD = 0.95;

    public SemanticCacheService(ChatClient.Builder builder,
                                EmbeddingModel embeddingModel,
                                VectorStore cacheStore) {
        this.chatClient = builder.build();
        this.embeddingModel = embeddingModel;
        this.cacheStore = cacheStore;
    }

    public String ask(String question) {
        // 1. Search cache for semantically similar question
        // (SearchRequest.query/withTopK is the pre-1.0 milestone API; Spring AI 1.0 uses SearchRequest.builder())
        List<Document> cached = cacheStore.similaritySearch(
                SearchRequest.query(question)
                        .withTopK(1)
                        .withSimilarityThreshold(SIMILARITY_THRESHOLD)
                        .withFilterExpression("type == 'cache-entry'")
        );
        if (!cached.isEmpty()) {
            System.out.println("Semantic cache HIT for: " + question);
            return (String) cached.get(0).getMetadata().get("answer");
        }

        // 2. Cache miss — call AI
        System.out.println("Cache MISS — calling AI for: " + question);
        String answer = chatClient.prompt().user(question).call().content();

        // 3. Store in cache: the question is the embedded content, the answer travels as metadata
        Document cacheEntry = new Document(question, Map.of(
                "type", "cache-entry",
                "answer", answer,
                "created_at", Instant.now().toString()
        ));
        cacheStore.add(List.of(cacheEntry));
        return answer;
    }
}
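The similarity score compared against SIMILARITY_THRESHOLD is typically the cosine similarity between the embedding of the new question and the embedding of a cached one. The sketch below shows that calculation in plain Java, assuming embeddings are float arrays. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the class name is hypothetical.

```java
// Hypothetical demo of the hit/miss decision a semantic cache makes.
final class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean the
    // vectors point in almost the same direction.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat a zero vector as dissimilar to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A cached entry counts as a hit when similarity reaches the threshold.
    static boolean isCacheHit(float[] query, float[] cached, double threshold) {
        return cosine(query, cached) >= threshold;
    }

    public static void main(String[] args) {
        float[] cached = {0.9f, 0.1f, 0.0f};      // made-up vector for a cached question
        float[] rephrased = {0.88f, 0.12f, 0.01f}; // made-up vector for a close rephrasing
        float[] unrelated = {0.0f, 0.2f, 0.95f};   // made-up vector for a different topic

        System.out.println(isCacheHit(rephrased, cached, 0.95)); // true
        System.out.println(isCacheHit(unrelated, cached, 0.95)); // false
    }
}
```

A lower threshold raises the hit rate but increases the chance of returning an answer to a question that only looks similar, so the value is worth tuning against real queries rather than taking 0.95 as given.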
Cache-Aside Pattern for RAG Results
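import java.nio.charset.StandardCharsets;
import java.time.Duration;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.DigestUtils;

@Service
public class CachedRagService {

    private final ChatClient chatClient;
    private final RedisTemplate<String, String> redis;
    private static final Duration TTL = Duration.ofMinutes(30);

    public CachedRagService(ChatClient.Builder builder,
                            VectorStore vectorStore,
                            RedisTemplate<String, String> redis) {
        this.redis = redis;
        this.chatClient = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
    }

    public String ask(String question) {
        // Hash with an explicit charset so the key is the same on every platform
        String cacheKey = "rag:" + DigestUtils.md5DigestAsHex(
                question.getBytes(StandardCharsets.UTF_8));

        // Check cache
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // Call AI (with RAG)
        String answer = chatClient.prompt().user(question).call().content();

        // Store in cache; Redis drops the entry after TTL
        redis.opsForValue().set(cacheKey, answer, TTL);
        return answer;
    }
}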
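The key above depends only on the question, so answers cached before a knowledge base update stay reachable until their TTL expires. One possible approach, sketched below, is to fold a knowledge base version into the key so that bumping the version makes old entries unreachable. The class `RagCacheKeys`, the version string, and the choice of SHA-256 over MD5 are assumptions for illustration, not part of the service above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper that builds versioned cache keys for RAG answers.
final class RagCacheKeys {

    private RagCacheKeys() {
    }

    // Key layout: rag:<knowledgeBaseVersion>:<sha256 of the normalized question>
    static String key(String knowledgeBaseVersion, String question) {
        String normalized = question.strip().toLowerCase(Locale.ROOT);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return "rag:" + knowledgeBaseVersion + ":" + hex;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same question, different knowledge base versions: different keys
        System.out.println(key("v1", "What is Spring AI?"));
        System.out.println(key("v2", "What is Spring AI?"));
    }
}
```

With this layout, re-ingesting documents would only require changing the version value (for example a timestamp or a counter kept in Redis); the orphaned entries then age out through the existing TTL.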
Performance Results
First call (cache miss):
"What is Spring AI?" → AI call → 1,240ms → stored in cache
Second call (exact match):
"What is Spring AI?" → Redis cache hit → 3ms → 99.8% faster
Third call (semantic match, 97% similarity):
"Tell me about Spring AI" → semantic cache hit → 12ms → 99% faster
Fourth call (different topic):
"What is Kubernetes?" → cache miss → 980ms → AI called
Key Points
- Use exact-match caching (@Cacheable) for FAQ-style questions where users ask the same thing verbatim
- Semantic caching is more powerful but adds embedding cost per request; use it for varied phrasings of common questions
- Set TTL based on how often the underlying data changes: 1h for dynamic answers, 7d for static reference material
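- Cache RAG answers under a hash of the question: retrieval returns the same documents for the same question until the knowledge base changes, so a cached answer stays valid until then (the LLM's wording itself can still vary between calls)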
- Always invalidate cache when your knowledge base documents are updated — stale RAG answers are worse than no cache
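The TTL guidance applies to the semantic cache as well. SemanticCacheService stores a created_at timestamp in each entry's metadata but never reads it, so its entries do not expire on their own. A minimal sketch of a freshness check follows, assuming the metadata layout used in that service; the class name and the six-hour limit are illustrative choices.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Map;

// Hypothetical helper: decides whether a semantic cache entry is still usable.
final class CacheEntryFreshness {

    private CacheEntryFreshness() {
    }

    // Returns true when the entry's created_at is at most maxAge before now.
    // Entries with a missing or unparseable timestamp are treated as stale.
    static boolean isFresh(Map<String, ?> metadata, Duration maxAge, Instant now) {
        Object createdAt = metadata.get("created_at");
        if (!(createdAt instanceof String)) {
            return false;
        }
        try {
            Instant created = Instant.parse((String) createdAt);
            return Duration.between(created, now).compareTo(maxAge) <= 0;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T12:00:00Z");
        Duration maxAge = Duration.ofHours(6);
        System.out.println(isFresh(Map.of("created_at", "2024-01-01T11:00:00Z"), maxAge, now)); // true
        System.out.println(isFresh(Map.of("created_at", "2023-12-31T00:00:00Z"), maxAge, now)); // false
    }
}
```

In the semantic cache's ask method, a hit whose metadata fails this check could be treated as a miss, with the stale document removed from the vector store before the new answer is added.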