Smart Routing
Automatic model selection, complexity classification, and provider failover.
Overview
Universal AI API includes a smart routing engine that automatically selects the best model for each request. When you omit the model parameter from a chat completions request, the router analyzes your prompt and selects a model based on complexity, cost, speed, and provider availability.
You can also use routing hints to guide model selection without specifying an exact model.
How It Works
1. Complexity Classification
When a request arrives without a model parameter, the routing engine first classifies the query complexity using a lightweight classifier (Cloudflare Workers AI Granite Micro). The query is assigned one of three complexity levels:
| Level | Description | Example |
|---|---|---|
| Simple | Factual lookups, short answers, translations | "What is the capital of Japan?" |
| Moderate | Explanations, summaries, basic code generation | "Explain how TCP/IP works" |
| Complex | Multi-step reasoning, long-form content, advanced code | "Design a microservices architecture for an e-commerce platform" |
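The classification itself is done by a small LLM, but the decision boundary can be illustrated with a heuristic sketch. The marker keywords and word-count thresholds below are made-up assumptions for illustration, not the actual classifier logic:

```python
# Hypothetical stand-in for the complexity classifier. The real router uses a
# lightweight LLM (Granite Micro); these keywords and thresholds are illustrative.

def classify_complexity(prompt: str) -> str:
    """Return "simple", "moderate", or "complex" for a prompt."""
    text = prompt.lower()
    complex_markers = ("design", "architecture", "multi-step", "step by step")
    moderate_markers = ("explain", "summarize", "write a function")
    if any(m in text for m in complex_markers) or len(prompt.split()) > 120:
        return "complex"
    if any(m in text for m in moderate_markers) or len(prompt.split()) > 15:
        return "moderate"
    return "simple"

print(classify_complexity("What is the capital of Japan?"))  # simple
print(classify_complexity("Explain how TCP/IP works"))       # moderate
```

The real classifier handles paraphrases and languages that keyword matching cannot, which is why a model is used instead of rules like these.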
2. Model Scoring
Based on the classified complexity, the router scores available models on three dimensions:
- Cost — price per token for the request
- Speed — expected latency (time to first token + generation speed)
- Quality — model capability level for the given task
Each dimension is weighted according to the routing mode (see below).
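A minimal sketch of how weighted scoring could work. The candidate models, normalized scores, and weight values are invented for illustration; the router's actual inputs and weights are internal:

```python
# Illustrative weighted scoring: each model gets a normalized 0-1 score per
# dimension (higher is better), combined with mode-specific weights.

MODE_WEIGHTS = {
    "balanced": {"cost": 1 / 3, "speed": 1 / 3, "quality": 1 / 3},
    "cost":     {"cost": 0.6, "speed": 0.2, "quality": 0.2},
    "speed":    {"cost": 0.2, "speed": 0.6, "quality": 0.2},
    "quality":  {"cost": 0.2, "speed": 0.2, "quality": 0.6},
}

def score_model(model: dict, mode: str) -> float:
    """Weighted sum of the model's cost/speed/quality scores."""
    w = MODE_WEIGHTS[mode]
    return sum(w[dim] * model[dim] for dim in ("cost", "speed", "quality"))

# Made-up candidates with made-up normalized scores:
candidates = [
    {"id": "cf/llama-3.3-8b", "cost": 0.9, "speed": 0.8, "quality": 0.4},
    {"id": "gpt-4o",          "cost": 0.3, "speed": 0.5, "quality": 0.9},
]
best = max(candidates, key=lambda m: score_model(m, "cost"))
print(best["id"])  # cf/llama-3.3-8b wins in cost mode
```

In quality mode the same candidates would rank the other way, since the quality weight dominates.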
3. Provider Health Check
Before routing to a model, the engine checks the health status of the provider. If a provider is experiencing elevated error rates or latency, it is deprioritized. This happens automatically based on real-time monitoring.
4. Fallback
If the selected provider returns an error (5xx, timeout, rate limit), the router automatically retries with the next-best model from a different provider. Up to 2 fallback attempts are made before returning an error to the caller.
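The retry behavior described above can be sketched as follows. The `call_model` function and the ranked model list are hypothetical placeholders, and `ProviderError` stands in for 5xx responses, timeouts, and rate limits:

```python
class ProviderError(Exception):
    """Stands in for 5xx responses, timeouts, and rate-limit errors."""

def complete_with_fallback(ranked_models, call_model, max_fallbacks=2):
    """Try the best-scored model, then up to 2 fallbacks from other providers."""
    attempts = ranked_models[: max_fallbacks + 1]  # primary + 2 fallbacks
    last_error = None
    for model_id in attempts:
        try:
            return call_model(model_id)
        except ProviderError as err:
            last_error = err  # provider failed; try the next-best model
    raise last_error  # all attempts exhausted; surface the error to the caller

def flaky_call(model_id):
    # Simulates an outage at one provider.
    if model_id.startswith("groq/"):
        raise ProviderError("503 from provider")
    return f"ok via {model_id}"

print(complete_with_fallback(["groq/llama-3.3-70b", "together/llama-3.3-70b"], flaky_call))
# ok via together/llama-3.3-70b
```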
Routing Modes
Control how the router prioritizes models by setting the X-Routing-Mode header:
curl https://api.universal-ai.dev/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Routing-Mode: cost" \
-d '{
"messages": [{"role": "user", "content": "Summarize this article..."}]
  }'

| Mode | Header Value | Behavior |
|---|---|---|
| Balanced | balanced (default) | Equal weight to cost, speed, and quality |
| Cost | cost | Minimize cost — prefers smaller, cheaper models |
| Speed | speed | Minimize latency — prefers faster providers (Groq, Cloudflare Workers AI) |
| Quality | quality | Maximize output quality — prefers the most capable models |
Typical Model Selection by Mode
| Complexity | Cost Mode | Speed Mode | Quality Mode |
|---|---|---|---|
| Simple | cf/llama-3.3-8b | groq/llama-3.3-8b | gpt-4o-mini |
| Moderate | cf/llama-3.3-70b | groq/llama-3.3-70b | gpt-4o |
| Complex | together/llama-3.3-70b | groq/llama-3.3-70b | anthropic/claude-sonnet-4-20250514 |
Using Routing with a Specific Model
When you specify a model, the smart router is bypassed. However, provider failover still applies. If the specified model's provider is unavailable, the router looks for the same model on an alternative provider:
# If Groq is down, this may fall back to the same Llama model on Together AI
curl https://api.universal-ai.dev/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "groq/llama-3.3-70b",
"messages": [{"role": "user", "content": "Hello!"}]
  }'

To disable fallback and fail immediately if the specified provider is unavailable, set:
-H "X-No-Fallback: true"Provider Health Monitoring
The routing engine continuously monitors provider health using:
- Error rate — percentage of 5xx responses in a rolling 5-minute window
- Latency P95 — 95th percentile response time over the last 5 minutes
- Availability — whether the provider is responding at all
Providers are classified into health states:
| State | Criteria | Effect |
|---|---|---|
| Healthy | Error rate < 1%, latency normal | Full priority in routing |
| Degraded | Error rate 1-10% or elevated latency | Deprioritized but still available |
| Unhealthy | Error rate > 10% or not responding | Excluded from routing |
Health checks run at the edge, so routing decisions account for regional provider performance. A provider may be healthy in one region and degraded in another.
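The health-state rules in the table map directly to a small decision function. The error-rate thresholds come from the table; the exact latency criterion is internal, so the P95 threshold below is a hypothetical placeholder:

```python
def health_state(error_rate: float, p95_latency_s: float, responding: bool,
                 latency_threshold_s: float = 10.0) -> str:
    """Classify a provider per the thresholds in the table above."""
    if not responding or error_rate > 0.10:
        return "unhealthy"   # excluded from routing
    if error_rate >= 0.01 or p95_latency_s > latency_threshold_s:
        return "degraded"    # deprioritized but still available
    return "healthy"         # full priority in routing

print(health_state(0.002, 1.2, True))  # healthy
print(health_state(0.05, 1.2, True))   # degraded
print(health_state(0.25, 1.2, True))   # unhealthy
```

Because the checks run at the edge, this classification would be evaluated per region, with each region holding its own rolling 5-minute window.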
Examples
Auto-route for cost efficiency
from openai import OpenAI
client = OpenAI(
api_key="YOUR_API_KEY",
base_url="https://api.universal-ai.dev/v1",
default_headers={"X-Routing-Mode": "cost"}
)
# No model specified — the router picks the cheapest suitable model
response = client.chat.completions.create(
messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
# The response tells you which model was selected
print(f"Model used: {response.model}")
print(response.choices[0].message.content)

Auto-route for quality
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "YOUR_API_KEY",
baseURL: "https://api.universal-ai.dev/v1",
defaultHeaders: { "X-Routing-Mode": "quality" },
});
const response = await client.chat.completions.create({
messages: [
{
role: "user",
content: "Write a detailed analysis of the economic impacts of AI automation.",
},
],
});
// For complex queries in quality mode, this will typically use GPT-4o or Claude
console.log(`Model used: ${response.model}`);
console.log(response.choices[0].message.content);

Response Headers
When smart routing is used, additional headers are included in the response:
| Header | Description |
|---|---|
| X-Routed-Model | The model ID that was selected by the router |
| X-Routed-Provider | The provider that served the request |
| X-Complexity | The classified complexity level (simple, moderate, complex) |
| X-Routing-Mode | The routing mode that was applied |
| X-Fallback-Used | true if the response came from a fallback provider |
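One way to read these headers from Python is the OpenAI SDK's `with_raw_response` wrapper, which exposes the HTTP headers alongside the parsed completion. This is a sketch using the SDK's documented raw-response API; the key and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.universal-ai.dev/v1")

# No model specified, so smart routing applies and the headers are populated.
raw = client.chat.completions.with_raw_response.create(
    messages=[{"role": "user", "content": "Hello!"}],
)
print("Routed model: ", raw.headers.get("X-Routed-Model"))
print("Complexity:   ", raw.headers.get("X-Complexity"))
print("Fallback used:", raw.headers.get("X-Fallback-Used"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content)
```

With curl, the same headers are visible by adding `-i` to any of the requests shown earlier.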