Caching

Understand the three-tier caching system and how to control cache behavior.

Overview

Universal AI API includes a built-in three-tier caching system that reduces latency and cost by serving previously computed responses when possible. Caching happens transparently at the edge — no configuration is required to benefit from it.

The three cache tiers are checked in order:

  1. L1: Exact Match (KV) — checks for an identical request that was previously cached
  2. L2: Semantic Similarity (Vectorize) — finds a cached response to a semantically similar (but not identical) request
  3. L3: Prefix Match — matches requests that share a common prompt prefix

If any tier produces a hit, the cached response is returned immediately without calling the upstream provider. This can reduce response times from seconds to single-digit milliseconds.

How It Works

L1: Exact Match Cache

The fastest tier. A hash of the full request (model, messages, parameters) is used as a key in Cloudflare KV. If the exact same request was made before, the cached response is returned.

  • Hit rate: High for repeated identical queries (e.g., system prompts, template-based requests)
  • Latency: Sub-millisecond at the edge
  • TTL: 1 hour by default
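The exact key-derivation scheme is internal to the service, but the idea behind L1 can be sketched as a canonical-JSON hash of the request. Everything below (the function name, SHA-256, sorted-key serialization) is an illustrative assumption, not the documented implementation:

```python
import hashlib
import json

def l1_cache_key(model: str, messages: list, **params) -> str:
    """Derive a deterministic cache key from the full request.

    Serializes the request with sorted keys so that logically identical
    requests always hash to the same value. Illustrative sketch only;
    the real derivation is internal to the service.
    """
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical requests produce identical keys; any parameter change misses.
a = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
b = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
c = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=1)
```

This is also why the best practice of normalizing prompts matters: any byte-level difference in the request produces a different key and an L1 miss.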

L2: Semantic Similarity Cache

When L1 misses, the input is embedded and compared against previously cached request embeddings using Cloudflare Vectorize. If a semantically similar request is found above the similarity threshold (default: 0.95), the cached response is returned.

  • Hit rate: Catches paraphrased versions of the same question
  • Latency: ~5-10ms (embedding lookup + vector search)
  • Threshold: 0.95 cosine similarity (configurable)

Example: These two requests would produce a semantic cache hit:

  • "What is the capital of France?"
  • "Tell me the capital city of France"
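The L2 hit condition can be sketched as a cosine-similarity check against the 0.95 default threshold. The vectors below are toy stand-ins for real embeddings, and the function names are illustrative:

```python
import math

SIMILARITY_THRESHOLD = 0.95  # default from the docs; configurable

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def is_semantic_hit(query_vec, cached_vec, threshold=SIMILARITY_THRESHOLD):
    """An L2 hit requires similarity at or above the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

# Toy 3-dimensional vectors standing in for real embeddings.
close = is_semantic_hit([1.0, 0.0, 0.1], [1.0, 0.0, 0.12])  # near-identical
far = is_semantic_hit([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])     # orthogonal
```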

L3: Prefix Match Cache

For long conversations, this tier checks whether the current prompt shares a prefix with a previously cached conversation. If a match is found and only the last message differs, it can partially reuse cached context.

  • Hit rate: Useful for multi-turn conversations
  • Latency: ~2-5ms
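The core of prefix matching is finding how many leading messages two conversations share. The service's actual logic is internal; this sketch simply compares message objects from the start of each list:

```python
def shared_message_prefix(current: list, cached: list) -> int:
    """Count how many leading messages two conversations have in common.

    Illustrative only: compares role/content dicts positionally from the
    start of each message list and stops at the first difference.
    """
    n = 0
    for a, b in zip(current, cached):
        if a != b:
            break
        n += 1
    return n

cached = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this report."},
]
current = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate this report."},
]
prefix_len = shared_message_prefix(current, cached)  # only the system prompt matches
```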

Cache Control

Response Headers

Every response includes cache status headers:

Header          Values       Description
X-Cache         HIT, MISS    Whether the response was served from cache
X-Cache-Tier    L1, L2, L3   Which cache tier produced the hit
X-Cache-TTL     integer      Remaining TTL in seconds
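A client can read these headers to track its own hit rate. A small sketch, assuming response headers arrive as a plain dict (real HTTP clients may need case-insensitive lookup):

```python
def cache_status(headers: dict) -> dict:
    """Summarize the cache headers of one response into a small record.

    Helper for client-side logging; the field names in the returned dict
    are illustrative, only the header names come from the docs.
    """
    return {
        "hit": headers.get("X-Cache") == "HIT",
        "tier": headers.get("X-Cache-Tier"),  # L1, L2, or L3 on a hit
        "ttl_remaining": (
            int(headers["X-Cache-TTL"]) if "X-Cache-TTL" in headers else None
        ),
    }

status = cache_status({"X-Cache": "HIT", "X-Cache-Tier": "L2", "X-Cache-TTL": "1800"})
```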

Disabling Cache

To bypass the cache and force a fresh response from the upstream provider, set the X-Cache-Control header:

curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-Control: no-cache" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What time is it?"}]
  }'

Cache Modes

Header Value    Behavior
(not set)       Normal caching: check all tiers, cache the response
no-cache        Skip the cache lookup, but still cache the response for future requests
no-store        Skip the cache lookup AND do not cache the response
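The three modes reduce to two independent flags, lookup and store. A sketch of that mapping (the fallback for unrecognized values is an assumption, not documented behavior):

```python
def cache_mode(header_value):
    """Map an X-Cache-Control value to a (do_lookup, do_store) pair.

    Mirrors the table above: no header means normal caching, "no-cache"
    skips the lookup but still stores, "no-store" skips both.
    """
    if header_value == "no-cache":
        return (False, True)
    if header_value == "no-store":
        return (False, False)
    # No header set; unknown values falling back to normal caching is
    # an assumption here, not documented behavior.
    return (True, True)

modes = {v: cache_mode(v) for v in (None, "no-cache", "no-store")}
```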

Setting Custom TTL

Override the default TTL for a specific request:

curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-TTL: 3600" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain quantum computing."}]
  }'

The X-Cache-TTL value is in seconds. Maximum: 86400 (24 hours).
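Putting the default and the maximum together, TTL resolution might look like the sketch below. How the service actually treats out-of-range or malformed header values is not documented; clamping to the maximum and falling back to the default are assumptions:

```python
DEFAULT_TTL = 3600   # 1 hour, the documented default
MAX_TTL = 86400      # 24 hours, the documented maximum

def effective_ttl(header_value):
    """Resolve the TTL for a request from an optional X-Cache-TTL header.

    Clamping out-of-range values and falling back to the default on
    malformed input are assumptions made for this sketch.
    """
    if header_value is None:
        return DEFAULT_TTL
    try:
        ttl = int(header_value)
    except ValueError:
        return DEFAULT_TTL
    return max(1, min(ttl, MAX_TTL))

ttl = effective_ttl("172800")  # above the maximum, so clamped to 86400
```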

What Gets Cached

  • Chat completions — non-streaming responses are cached by default
  • Embeddings — cached since the same input always produces the same vector
  • Transcriptions — cached by file hash

Not cached:

  • Streaming responses (SSE)
  • Image generation (non-deterministic)
  • Text-to-speech (stored in R2 instead)
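The two lists above amount to a simple cacheability predicate. The endpoint labels below are illustrative shorthand, not actual API paths:

```python
def is_cacheable(endpoint: str, streaming: bool = False) -> bool:
    """Whether a response type is cached, per the lists above.

    `endpoint` values are illustrative labels ("chat", "embeddings",
    "transcriptions", "images", "speech"), not real API paths.
    """
    if streaming:
        return False  # streaming (SSE) responses are never cached
    return endpoint in {"chat", "embeddings", "transcriptions"}

cached_chat = is_cacheable("chat")                      # non-streaming chat: cached
streamed_chat = is_cacheable("chat", streaming=True)    # SSE: not cached
image = is_cacheable("images")                          # non-deterministic: not cached
```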

Cost Savings

Cache hits avoid upstream API calls entirely, which means:

  • Zero provider cost — you are not billed for the upstream token usage
  • Lower latency — responses are served from the edge in milliseconds
  • Higher throughput — cache hits do not count against provider rate limits

The usage field in cached responses reflects the original token counts, and the X-Cache: HIT header indicates no provider call was made.

Best Practices

  • Use deterministic parameters — set temperature: 0 for requests where you want consistent, cacheable results
  • Standardize prompts — normalize whitespace and formatting so identical queries produce identical cache keys
  • Leverage system prompts — shared system prompts across users benefit from prefix caching
  • Monitor cache performance — check the X-Cache response header to track your hit rate
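The "standardize prompts" practice can be applied client-side before sending a request. A minimal sketch (this helper is hypothetical, not part of the API):

```python
def normalize_prompt(text: str) -> str:
    """Collapse runs of whitespace so cosmetically different prompts
    produce identical cache keys.

    Client-side helper: apply before sending so "What is  X?" and
    "What is X?" hash to the same L1 key.
    """
    return " ".join(text.split())

same = (
    normalize_prompt("What   is\nthe capital of France?")
    == normalize_prompt("What is the capital of France?")
)
```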