Java9R: Spring AI Response Caching — Cache AI Answers with Redis and Spring Cache

AI API calls are expensive and slow. Caching is one of the most impactful optimizations you can apply: if two users ask the same question, the second answer should come from cache in milliseconds, not from the LLM in seconds. This tutorial implements two caching strategies for Spring AI using Redis: exact-match caching and semantic caching (near-duplicate detection).

Two Types of AI Caching

Exact Match Cache:
  "What is Spring Boot?" → cached → same question gets instant response
  "Explain Spring Boot" → different string → cache miss (even though same meaning)

Semantic Cache (Smart):
  "What is Spring Boot?" → cached response
  "What is Spring Boot framework?" → vector similarity ≥ 0.95 → CACHE HIT ← same response returned
  "What is Quarkus?" → vector similarity 0.3 → cache miss → new AI call

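The similarity scores in the semantic example are typically cosine similarities between embedding vectors. As a minimal sketch (the class name, method names, and the plain float[] representation are assumptions for illustration, and a real vector store may normalize its scores differently), the hit-or-miss decision could look like this:

```java
/** Cosine similarity between two embedding vectors, compared against a hit threshold. */
final class SimilarityCheck {

    // Assumed threshold, matching the 0.95 used in the example above.
    static final double THRESHOLD = 0.95;

    /** Returns the cosine of the angle between a and b, or 0.0 if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** A cached entry counts as a hit only when similarity reaches the threshold. */
    static boolean isCacheHit(float[] queryEmbedding, float[] cachedEmbedding) {
        return cosine(queryEmbedding, cachedEmbedding) >= THRESHOLD;
    }
}
```

A high threshold keeps false hits rare at the price of more misses; lowering it trades accuracy for hit rate.
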
Maven Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

application.properties

spring.ai.openai.api-key=${OPENAI_API_KEY}

# Redis
spring.data.redis.host=localhost
spring.data.redis.port=6379

# Spring Cache
spring.cache.type=redis
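# 1 hour TTL in milliseconds (keep comments on their own line: in a .properties
# file, text after the value becomes part of the value)
spring.cache.redis.time-to-live=3600000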

Simple Exact-Match Cache

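import org.springframework.ai.chat.client.ChatClient;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;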

@Service
public class CachedAiService {

    private final ChatClient chatClient;

    public CachedAiService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Cache result by the exact question string
    @Cacheable(value = "ai-responses", key = "#question.toLowerCase().trim()")
    public String ask(String question) {
        System.out.println("Cache MISS — calling AI for: " + question);
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }

    // Clear the cache for a specific question
    @CacheEvict(value = "ai-responses", key = "#question.toLowerCase().trim()")
    public void evict(String question) {
        System.out.println("Evicted cache for: " + question);
    }

    // Clear entire cache (admin use)
    @CacheEvict(value = "ai-responses", allEntries = true)
    public void clearAll() {
        System.out.println("AI response cache cleared");
    }
}
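
The cache key above only lower-cases and trims, so "What is  Spring Boot?" (two spaces) would still miss. A slightly stronger normalization, sketched here under the assumption that collapsing whitespace is acceptable for your questions (CacheKeys and normalize are hypothetical names, not part of Spring), might be:

```java
import java.util.Locale;

/** Hypothetical helper that builds a normalized exact-match cache key. */
final class CacheKeys {

    /** Trims, lower-cases independently of the JVM locale, and collapses runs of whitespace. */
    static String normalize(String question) {
        if (question == null) {
            throw new IllegalArgumentException("question must not be null");
        }
        return question.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
    }
}
```

If the helper lived in a package such as com.example (an assumption), the annotation key could reference it through a SpEL type expression, for instance key = "T(com.example.CacheKeys).normalize(#question)".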

Cache Configuration

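import java.time.Duration;

import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.serializer.GenericJackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.RedisSerializationContext;
import org.springframework.data.redis.serializer.StringRedisSerializer;

// Defining this bean replaces Spring Boot's auto-configured cache manager,
// so the TTLs set here apply instead of spring.cache.redis.time-to-live.
@Configuration
@EnableCaching
public class CacheConfig {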

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofHours(1))
                .serializeKeysWith(RedisSerializationContext.SerializationPair
                        .fromSerializer(new StringRedisSerializer()))
                .serializeValuesWith(RedisSerializationContext.SerializationPair
                        .fromSerializer(new GenericJackson2JsonRedisSerializer()));

        return RedisCacheManager.builder(factory)
                .cacheDefaults(config)
                .withCacheConfiguration("ai-responses",
                        config.entryTtl(Duration.ofHours(6)))    // 6h for AI responses
                .withCacheConfiguration("ai-embeddings",
                        config.entryTtl(Duration.ofDays(7)))     // 7d for embeddings
                .build();
    }
}

Semantic Cache — Cache by Meaning, Not Exact String

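import java.time.Instant;
import java.util.List;
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service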
public class SemanticCacheService {

    private final ChatClient     chatClient;
    private final EmbeddingModel embeddingModel;
    private final VectorStore    cacheStore;   // Redis or SimpleVectorStore as cache

    private static final double SIMILARITY_THRESHOLD = 0.95;

    public SemanticCacheService(ChatClient.Builder builder,
                                 EmbeddingModel embeddingModel,
                                 VectorStore cacheStore) {
        this.chatClient     = builder.build();
        this.embeddingModel = embeddingModel;
        this.cacheStore     = cacheStore;
    }

    public String ask(String question) {
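        // 1. Search cache for semantically similar question.
        //    SearchRequest.query(...).withTopK(...) is the older fluent API; newer
        //    Spring AI releases build the same request via SearchRequest.builder().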
        List<Document> cached = cacheStore.similaritySearch(
                SearchRequest.query(question)
                        .withTopK(1)
                        .withSimilarityThreshold(SIMILARITY_THRESHOLD)
                        .withFilterExpression("type == 'cache-entry'")
        );

        if (!cached.isEmpty()) {
            System.out.println("Semantic cache HIT for: " + question);
            return (String) cached.get(0).getMetadata().get("answer");
        }

        // 2. Cache miss — call AI
        System.out.println("Cache MISS — calling AI for: " + question);
        String answer = chatClient.prompt().user(question).call().content();

        // 3. Store in cache with metadata
        Document cacheEntry = new Document(question, Map.of(
                "type",       "cache-entry",
                "answer",     answer,
                "created_at", Instant.now().toString()
        ));
        cacheStore.add(List.of(cacheEntry));

        return answer;
    }
}
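
The service above stores a created_at timestamp with each entry but never reads it, so semantic cache entries never expire. One way to use it, sketched with plain java.time (CacheEntryAge and isExpired are hypothetical names, and treating an unreadable timestamp as expired is an assumption), is to check the entry's age before returning a hit:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical age check for cache entries that carry an ISO-8601 created_at value. */
final class CacheEntryAge {

    /**
     * Returns true when the entry is older than maxAge at the given instant,
     * or when the timestamp is missing or cannot be parsed.
     */
    static boolean isExpired(String createdAt, Duration maxAge, Instant now) {
        if (createdAt == null) {
            return true;
        }
        try {
            Instant created = Instant.parse(createdAt);
            return created.plus(maxAge).isBefore(now);
        } catch (DateTimeParseException e) {
            return true; // unreadable timestamp: safer to call the model again
        }
    }
}
```

In ask(), a hit whose metadata fails this check could be treated as a miss and overwritten with a fresh answer.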

Cache-Aside Pattern for RAG Results

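import java.nio.charset.StandardCharsets;
import java.time.Duration;

import org.springframework.ai.chat.client.ChatClient;
// The advisor's package may differ between Spring AI versions
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.DigestUtils;

@Service
public class CachedRagService {

    private final ChatClient  chatClient;
    private final RedisTemplate<String, String> redis;

    private static final Duration TTL = Duration.ofMinutes(30);

    public CachedRagService(ChatClient.Builder builder,
                             VectorStore vectorStore,
                             RedisTemplate<String, String> redis) {
        this.redis = redis;
        this.chatClient = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
    }

    public String ask(String question) {
        // Hash the UTF-8 bytes so the key does not depend on the platform default charset
        String cacheKey = "rag:" + DigestUtils.md5DigestAsHex(question.getBytes(StandardCharsets.UTF_8));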

        // Check cache
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // Call AI (with RAG)
        String answer = chatClient.prompt().user(question).call().content();

        // Store in cache
        redis.opsForValue().set(cacheKey, answer, TTL);

        return answer;
    }
}
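
The cache-aside flow itself (check, load on miss, store) does not depend on Redis or Spring AI. The sketch below reproduces it with an in-memory map standing in for Redis and a Function standing in for the AI call; both stand-ins, the class name, and the choice of SHA-256 instead of MD5 are assumptions made so the example runs on its own:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** In-memory illustration of the cache-aside pattern. */
final class CacheAsideSketch {

    private final Map<String, String> store = new ConcurrentHashMap<>(); // stand-in for Redis
    private final Function<String, String> loader;                       // stand-in for the AI call

    CacheAsideSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Builds a fixed-length key: "rag:" followed by the hex SHA-256 of the UTF-8 question. */
    static String key(String question) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(question.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("rag:");
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    String ask(String question) {
        String cacheKey = key(question);
        String cached = store.get(cacheKey);       // 1. check cache
        if (cached != null) {
            return cached;
        }
        String answer = loader.apply(question);    // 2. miss: call the expensive source
        store.put(cacheKey, answer);               // 3. store for next time
        return answer;
    }
}
```

Two concurrent misses for the same question can both reach the loader here, as they can in the Redis version; that is usually acceptable for a cache, since the only cost is one duplicate call.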

Performance Results

First call  (cache miss):
  "What is Spring AI?" → AI call → 1,240ms → stored in cache

Second call (exact match):
  "What is Spring AI?" → Redis cache hit → 3ms → 99.8% faster

Third call  (semantic match, 97% similarity):
  "Tell me about Spring AI" → semantic cache hit → 12ms → 99% faster

Fourth call (different topic):
  "What is Kubernetes?" → cache miss → 980ms → AI called

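The "% faster" figures above are the share of the cache-miss latency that a hit saves: (miss latency - hit latency) / miss latency. A small sketch of that arithmetic (Speedup and percentFaster are illustrative names):

```java
/** Reproduces the "% faster" arithmetic from the results above. */
final class Speedup {

    /** Percentage of latency saved relative to the uncached baseline. */
    static double percentFaster(double baselineMs, double cachedMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baselineMs must be positive");
        }
        return (baselineMs - cachedMs) / baselineMs * 100.0;
    }
}
```

With the numbers above, percentFaster(1240, 3) is about 99.76 and percentFaster(1240, 12) is about 99.03, which round to the 99.8% and 99% shown.
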
Key Points

Use exact-match caching (@Cacheable) for FAQ-style questions where users ask the same thing verbatim
Semantic caching is more powerful but adds embedding cost per request — use it for varied phrasings of common questions
Set TTL based on how often the underlying data changes: 1h for dynamic answers, 7d for static reference material
Cache RAG answers under a hash of the question: for the same documents and the same question, retrieval returns the same context, so a cached answer stays valid until the documents change
Always invalidate cache when your knowledge base documents are updated — stale RAG answers are worse than no cache
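
One possible way to invalidate on knowledge base updates without deleting keys one by one is to put a version number in the key and bump it after each re-index. This is only a sketch: KnowledgeBaseVersion is a hypothetical class, and with several application instances the counter would have to live in a shared place such as Redis rather than in memory.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical version counter that is baked into every RAG cache key. */
final class KnowledgeBaseVersion {

    private final AtomicLong version = new AtomicLong(1);

    /** Call after documents are added, changed, or removed; returns the new version. */
    long bump() {
        return version.incrementAndGet();
    }

    /** Keys built before a bump are never read again and simply age out through their TTL. */
    String keyFor(String questionHash) {
        return "rag:v" + version.get() + ":" + questionHash;
    }
}
```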
