Caching
Understand the three-tier caching system and how to control cache behavior.
Overview
Universal AI API includes a built-in three-tier caching system that reduces latency and cost by serving previously computed responses when possible. Caching happens transparently at the edge — no configuration is required to benefit from it.
The three cache tiers are checked in order:
- L1: Exact Match (KV) — checks for an identical request that was previously cached
- L2: Semantic Similarity (Vectorize) — finds a cached response to a semantically similar (but not identical) request
- L3: Prefix Match — matches requests that share a common prompt prefix
If any tier produces a hit, the cached response is returned immediately without calling the upstream provider. This can reduce response times from seconds to single-digit milliseconds.
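The tier cascade can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation; the tier lookup functions here are stand-ins.

```python
def cache_lookup(request, l1, l2, l3):
    """Check the three cache tiers in order; return (tier_name, response)
    on the first hit, or None if every tier misses."""
    for name, tier in (("L1", l1), ("L2", l2), ("L3", l3)):
        response = tier(request)
        if response is not None:
            return name, response
    return None

# L2 hits here, so L3 is never consulted and no provider call is made.
result = cache_lookup(
    "What is the capital of France?",
    lambda r: None,                     # L1: exact-match miss
    lambda r: "Paris is the capital.",  # L2: semantic hit
    lambda r: None,                     # L3: not reached
)
```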
How It Works
L1: Exact Match Cache
The fastest tier. A hash of the full request (model, messages, parameters) is used as a key in Cloudflare KV. If the exact same request was made before, the cached response is returned.
- Hit rate: High for repeated identical queries (e.g., system prompts, template-based requests)
- Latency: Sub-millisecond at the edge
- TTL: 1 hour by default
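The docs do not specify the exact hashing scheme, but an exact-match key can be sketched as a digest over a canonical serialization of the request, so that logically identical requests always map to the same KV key:

```python
import hashlib
import json

def exact_match_key(model, messages, **params):
    """Illustrative cache key: canonical JSON (sorted keys, no extra
    whitespace) hashed with SHA-256. Any difference in model, messages,
    or parameters yields a different key."""
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = exact_match_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
k2 = exact_match_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
k3 = exact_match_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=1)
```

Note that because parameters are part of the key, changing `temperature` alone is enough to miss the L1 tier.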
L2: Semantic Similarity Cache
When L1 misses, the input is embedded and compared against previously cached request embeddings using Cloudflare Vectorize. If a semantically similar request is found above the similarity threshold (default: 0.95), the cached response is returned.
- Hit rate: Catches paraphrased versions of the same question
- Latency: ~5-10ms (embedding lookup + vector search)
- Threshold: 0.95 cosine similarity (configurable)
Example: These two requests would produce a semantic cache hit:
- "What is the capital of France?"
- "Tell me the capital city of France"
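The threshold check amounts to a cosine-similarity comparison between the query embedding and cached request embeddings. A toy sketch (the helper names and the brute-force scan are illustrative; Vectorize performs the actual nearest-neighbor search):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_hit(query_vec, cached_vecs, threshold=0.95):
    """Return the index of the best cached match at or above the
    threshold, or None if nothing is similar enough."""
    best_i, best_sim = None, threshold
    for i, v in enumerate(cached_vecs):
        sim = cosine_similarity(query_vec, v)
        if sim >= best_sim:
            best_i, best_sim = i, sim
    return best_i
```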
L3: Prefix Match Cache
For long conversations, this tier checks whether the current prompt shares a prefix with a previously cached conversation. If a match is found and only the last message differs, it can partially reuse cached context.
- Hit rate: Useful for multi-turn conversations
- Latency: ~2-5ms
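The "only the last message differs" condition can be sketched as a comparison of message prefixes (helper names are illustrative, not the gateway's API):

```python
def shared_prefix_len(current, cached):
    """Number of leading messages the two conversations have in common."""
    n = 0
    for a, b in zip(current, cached):
        if a != b:
            break
        n += 1
    return n

def is_prefix_hit(current, cached):
    """A prefix hit: everything except the final message matches a cached
    conversation, so the shared context can be reused."""
    return shared_prefix_len(current, cached) >= len(current) - 1

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
cached = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name the capital of Spain."},
]
```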
Cache Control
Response Headers
Every response includes cache status headers:
| Header | Values | Description |
|---|---|---|
| X-Cache | HIT, MISS | Whether the response was served from cache |
| X-Cache-Tier | L1, L2, L3 | Which cache tier produced the hit |
| X-Cache-TTL | integer | Remaining TTL in seconds |
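A client can read these headers to log its hit rate. A small sketch of interpreting them (the `cache_status` helper is illustrative; the header names are the ones documented above):

```python
def cache_status(headers):
    """Summarize the cache headers on a response."""
    if headers.get("X-Cache") == "HIT":
        tier = headers.get("X-Cache-Tier", "?")
        ttl = headers.get("X-Cache-TTL", "?")
        return f"HIT at {tier} ({ttl}s TTL remaining)"
    return "MISS"
```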
Disabling Cache
To bypass the cache and force a fresh response from the upstream provider, set the `X-Cache-Control` header:
```sh
curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-Control: no-cache" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What time is it?"}]
  }'
```
Cache Modes
| Header Value | Behavior |
|---|---|
| (not set) | Normal caching — check all tiers, cache the response |
| no-cache | Skip cache lookup, but still cache the response for future requests |
| no-store | Skip cache lookup AND do not cache the response |
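The three modes reduce to two independent flags: whether to look up the cache, and whether to store the response. A table-driven sketch (the helper is illustrative):

```python
# (lookup, store) flags implied by each X-Cache-Control value.
CACHE_MODES = {
    None: (True, True),         # not set: check all tiers, cache the response
    "no-cache": (False, True),  # skip lookup, still cache the result
    "no-store": (False, False), # skip lookup and do not cache
}

def cache_flags(mode=None):
    """Return the (lookup, store) behavior for a given header value."""
    return CACHE_MODES[mode]
```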
Setting Custom TTL
Override the default TTL for a specific request:
```sh
curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-TTL: 3600" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain quantum computing."}]
  }'
```
The `X-Cache-TTL` value is in seconds. Maximum: 86400 (24 hours).
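When building the header programmatically, it can help to clamp to the documented maximum on the client side. This is an assumption on my part: the docs state the maximum but not how the gateway treats larger values.

```python
MAX_CACHE_TTL = 86_400  # documented maximum: 24 hours

def cache_ttl_header(seconds):
    """Build the X-Cache-TTL request header, clamping to the documented
    maximum (client-side clamping is an assumption, not gateway behavior)."""
    if seconds < 1:
        raise ValueError("TTL must be at least 1 second")
    return {"X-Cache-TTL": str(min(int(seconds), MAX_CACHE_TTL))}
```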
What Gets Cached
- Chat completions — non-streaming responses are cached by default
- Embeddings — cached since the same input always produces the same vector
- Transcriptions — cached by file hash
Not cached:
- Streaming responses (SSE)
- Image generation (non-deterministic)
- Text-to-speech (stored in R2 instead)
Cost Savings
Cache hits avoid upstream API calls entirely, which means:
- Zero provider cost — you are not billed for the upstream token usage
- Lower latency — responses are served from the edge in milliseconds
- Higher throughput — cache hits do not count against provider rate limits
The `usage` field in cached responses reflects the original token counts, and the `X-Cache: HIT` header indicates that no provider call was made.
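Because hits cost nothing upstream, expected provider spend scales with the miss fraction. A back-of-the-envelope helper (name and formula are my own, not part of the API):

```python
def expected_upstream_cost(cost_per_miss, hit_rate):
    """Expected provider cost per request: cache hits are free upstream,
    so only the miss fraction (1 - hit_rate) is billed."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cost_per_miss * (1.0 - hit_rate)
```

For example, a 40% hit rate cuts expected upstream cost per request by 40%.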
Best Practices
- Use deterministic parameters — set `temperature: 0` for requests where you want consistent, cacheable results
- Standardize prompts — normalize whitespace and formatting so identical queries produce identical cache keys
- Leverage system prompts — shared system prompts across users benefit from prefix caching
- Monitor cache performance — check the `X-Cache` response header to track your hit rate
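The prompt-standardization step can be as simple as collapsing whitespace before sending the request (the helper name is illustrative):

```python
def normalize_prompt(text):
    """Collapse runs of whitespace so formatting differences don't
    produce distinct exact-match cache keys."""
    return " ".join(text.split())
```

Applying this consistently on the client side means `"What  is\nthe capital?"` and `"What is the capital?"` hash to the same L1 key.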