Caching

Understand the three-tier caching system and how to control cache behavior.

Overview

Universal AI API includes a built-in three-tier caching system that reduces latency and cost by serving previously computed responses when possible. Caching happens transparently at the edge — no configuration is required to benefit from it.

The three cache tiers are checked in order:

  1. L1: Exact Match (KV) — checks for an identical request that was previously cached
  2. L2: Semantic Similarity (Vectorize) — finds a cached response to a semantically similar (but not identical) request
  3. L3: Prefix Match — matches requests that share a common prompt prefix

If any tier produces a hit, the cached response is returned immediately without calling the upstream provider. This can reduce response times from seconds to single-digit milliseconds.

How It Works

L1: Exact Match Cache

The fastest tier. A hash of the full request (model, messages, parameters) is used as a key in Cloudflare KV. If the exact same request was made before, the cached response is returned.

  • Hit rate: High for repeated identical queries (e.g., system prompts, template-based requests)
  • Latency: Sub-millisecond at the edge
  • TTL: 1 hour by default
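The exact key-derivation scheme is internal to the service, but the idea behind L1 can be sketched as a canonical-JSON hash of the request. Everything below (the function name, SHA-256, sorted-key serialization) is an illustrative assumption, not the documented implementation:

```python
import hashlib
import json

def l1_cache_key(model: str, messages: list, **params) -> str:
    """Derive a deterministic cache key from the full request.

    Serializes the request with sorted keys so that logically identical
    requests always hash to the same value. Illustrative sketch only;
    the real derivation is internal to the service.
    """
    payload = {"model": model, "messages": messages, "params": params}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical requests produce identical keys; any parameter change misses.
a = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
b = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
c = l1_cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=1)
```

This is also why the best practice of normalizing prompts matters: any byte-level difference in the request produces a different key and an L1 miss.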

L2: Semantic Similarity Cache

When L1 misses, the input is embedded and compared against previously cached request embeddings using Cloudflare Vectorize. If a semantically similar request is found above the similarity threshold (default: 0.95), the cached response is returned.

  • Hit rate: Catches paraphrased versions of the same question
  • Latency: ~5-10ms (embedding lookup + vector search)
  • Threshold: 0.95 cosine similarity (configurable)

Example: These two requests would produce a semantic cache hit:

  • "What is the capital of France?"
  • "Tell me the capital city of France"
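The L2 hit condition can be sketched as a cosine-similarity check against the 0.95 default threshold. The vectors below are toy stand-ins for real embeddings, and the function names are illustrative:

```python
import math

SIMILARITY_THRESHOLD = 0.95  # default from the docs; configurable

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def is_semantic_hit(query_vec, cached_vec, threshold=SIMILARITY_THRESHOLD):
    """An L2 hit requires similarity at or above the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

# Toy 3-dimensional vectors standing in for real embeddings.
close = is_semantic_hit([1.0, 0.0, 0.1], [1.0, 0.0, 0.12])  # near-identical
far = is_semantic_hit([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])     # orthogonal
```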

L3: Prefix Match Cache

For long conversations, this tier checks whether the current prompt shares a prefix with a previously cached conversation. If a match is found and only the last message differs, it can partially reuse cached context.

  • Hit rate: Useful for multi-turn conversations
  • Latency: ~2-5ms
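The core of prefix matching is finding how many leading messages two conversations share. The service's actual logic is internal; this sketch simply compares message objects from the start of each list:

```python
def shared_message_prefix(current: list, cached: list) -> int:
    """Count how many leading messages two conversations have in common.

    Illustrative only: compares role/content dicts positionally from the
    start of each message list and stops at the first difference.
    """
    n = 0
    for a, b in zip(current, cached):
        if a != b:
            break
        n += 1
    return n

cached = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this report."},
]
current = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate this report."},
]
prefix_len = shared_message_prefix(current, cached)  # only the system prompt matches
```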

Cache Control

Response Headers

Every response includes cache status headers:

Header          Values       Description
X-Cache         HIT, MISS    Whether the response was served from cache
X-Cache-Tier    L1, L2, L3   Which cache tier produced the hit
X-Cache-TTL     integer      Remaining TTL in seconds
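A client can read these headers to track its own hit rate. A small sketch, assuming response headers arrive as a plain dict (real HTTP clients may need case-insensitive lookup):

```python
def cache_status(headers: dict) -> dict:
    """Summarize the cache headers of one response into a small record.

    Helper for client-side logging; the field names in the returned dict
    are illustrative, only the header names come from the docs.
    """
    return {
        "hit": headers.get("X-Cache") == "HIT",
        "tier": headers.get("X-Cache-Tier"),  # L1, L2, or L3 on a hit
        "ttl_remaining": (
            int(headers["X-Cache-TTL"]) if "X-Cache-TTL" in headers else None
        ),
    }

status = cache_status({"X-Cache": "HIT", "X-Cache-Tier": "L2", "X-Cache-TTL": "1800"})
```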

Disabling Cache

To bypass the cache and force a fresh response from the upstream provider, set the X-Cache-Control header:

curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-Control: no-cache" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What time is it?"}]
  }'

Cache Modes

Header Value    Behavior
(not set)       Normal caching: check all tiers, cache the response
no-cache        Skip the cache lookup, but still cache the response for future requests
no-store        Skip the cache lookup AND do not cache the response
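The three modes reduce to two independent flags, lookup and store. A sketch of that mapping (the fallback for unrecognized values is an assumption, not documented behavior):

```python
def cache_mode(header_value):
    """Map an X-Cache-Control value to a (do_lookup, do_store) pair.

    Mirrors the table above: no header means normal caching, "no-cache"
    skips the lookup but still stores, "no-store" skips both.
    """
    if header_value == "no-cache":
        return (False, True)
    if header_value == "no-store":
        return (False, False)
    # No header set; unknown values falling back to normal caching is
    # an assumption here, not documented behavior.
    return (True, True)

modes = {v: cache_mode(v) for v in (None, "no-cache", "no-store")}
```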

Setting Custom TTL

Override the default TTL for a specific request:

curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Cache-TTL: 3600" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain quantum computing."}]
  }'

The X-Cache-TTL value is in seconds. Maximum: 86400 (24 hours).
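Putting the default and the maximum together, TTL resolution might look like the sketch below. How the service actually treats out-of-range or malformed header values is not documented; clamping to the maximum and falling back to the default are assumptions:

```python
DEFAULT_TTL = 3600   # 1 hour, the documented default
MAX_TTL = 86400      # 24 hours, the documented maximum

def effective_ttl(header_value):
    """Resolve the TTL for a request from an optional X-Cache-TTL header.

    Clamping out-of-range values and falling back to the default on
    malformed input are assumptions made for this sketch.
    """
    if header_value is None:
        return DEFAULT_TTL
    try:
        ttl = int(header_value)
    except ValueError:
        return DEFAULT_TTL
    return max(1, min(ttl, MAX_TTL))

ttl = effective_ttl("172800")  # above the maximum, so clamped to 86400
```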

What Gets Cached

  • Chat completions — non-streaming responses are cached by default
  • Embeddings — cached since the same input always produces the same vector
  • Transcriptions — cached by file hash

Not cached:

  • Streaming responses (SSE)
  • Image generation (non-deterministic)
  • Text-to-speech (stored in R2 instead)
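The two lists above amount to a simple cacheability predicate. The endpoint labels below are illustrative shorthand, not actual API paths:

```python
def is_cacheable(endpoint: str, streaming: bool = False) -> bool:
    """Whether a response type is cached, per the lists above.

    `endpoint` values are illustrative labels ("chat", "embeddings",
    "transcriptions", "images", "speech"), not real API paths.
    """
    if streaming:
        return False  # streaming (SSE) responses are never cached
    return endpoint in {"chat", "embeddings", "transcriptions"}

cached_chat = is_cacheable("chat")                      # non-streaming chat: cached
streamed_chat = is_cacheable("chat", streaming=True)    # SSE: not cached
image = is_cacheable("images")                          # non-deterministic: not cached
```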

Cost Savings

Cache hits avoid upstream API calls entirely, which means:

  • Zero provider cost — you are not billed for the upstream token usage
  • Lower latency — responses are served from the edge in milliseconds
  • Higher throughput — cache hits do not count against provider rate limits

The usage field in cached responses reflects the original token counts, and the X-Cache: HIT header indicates no provider call was made.

Best Practices

  • Use deterministic parameters — set temperature: 0 for requests where you want consistent, cacheable results
  • Standardize prompts — normalize whitespace and formatting so identical queries produce identical cache keys
  • Leverage system prompts — shared system prompts across users benefit from prefix caching
  • Monitor cache performance — check the X-Cache response header to track your hit rate
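The "standardize prompts" practice can be applied client-side before sending a request. A minimal sketch (this helper is hypothetical, not part of the API):

```python
def normalize_prompt(text: str) -> str:
    """Collapse runs of whitespace so cosmetically different prompts
    produce identical cache keys.

    Client-side helper: apply before sending so "What is  X?" and
    "What is X?" hash to the same L1 key.
    """
    return " ".join(text.split())

same = (
    normalize_prompt("What   is\nthe capital of France?")
    == normalize_prompt("What is the capital of France?")
)
```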