Smart Routing
Automatic model selection, complexity classification, and provider failover.
Overview
Universal AI API includes a smart routing engine that automatically selects the best model for each request. When you omit the model parameter from a chat completions request, the router analyzes your prompt and selects a model based on complexity, cost, speed, and provider availability.
You can also use routing hints to guide model selection without specifying an exact model.
How It Works
1. Complexity Classification
When a request arrives without a model parameter, the routing engine first classifies the query complexity using a lightweight classifier (Cloudflare Workers AI Granite Micro). The query is assigned one of three complexity levels:
| Level | Description | Example |
|---|---|---|
| Simple | Factual lookups, short answers, translations | "What is the capital of Japan?" |
| Moderate | Explanations, summaries, basic code generation | "Explain how TCP/IP works" |
| Complex | Multi-step reasoning, long-form content, advanced code | "Design a microservices architecture for an e-commerce platform" |
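The classification itself is done by a small LLM, but the decision boundary can be illustrated with a heuristic sketch. The marker keywords and word-count thresholds below are made-up assumptions for illustration, not the actual classifier logic:

```python
# Hypothetical stand-in for the complexity classifier. The real router uses a
# lightweight LLM (Granite Micro); these keywords and thresholds are illustrative.

def classify_complexity(prompt: str) -> str:
    """Return "simple", "moderate", or "complex" for a prompt."""
    text = prompt.lower()
    complex_markers = ("design", "architecture", "multi-step", "step by step")
    moderate_markers = ("explain", "summarize", "write a function")
    if any(m in text for m in complex_markers) or len(prompt.split()) > 120:
        return "complex"
    if any(m in text for m in moderate_markers) or len(prompt.split()) > 15:
        return "moderate"
    return "simple"

print(classify_complexity("What is the capital of Japan?"))  # simple
print(classify_complexity("Explain how TCP/IP works"))       # moderate
```

The real classifier handles paraphrases and languages that keyword matching cannot, which is why a model is used instead of rules like these.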
2. Model Scoring
Based on the classified complexity, the router scores available models on three dimensions:
- Cost — price per token for the request
- Speed — expected latency (time to first token + generation speed)
- Quality — model capability level for the given task
Each dimension is weighted according to the routing mode (see below).
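A minimal sketch of how weighted scoring could work. The candidate models, normalized scores, and weight values are invented for illustration; the router's actual inputs and weights are internal:

```python
# Illustrative weighted scoring: each model gets a normalized 0-1 score per
# dimension (higher is better), combined with mode-specific weights.

MODE_WEIGHTS = {
    "balanced": {"cost": 1 / 3, "speed": 1 / 3, "quality": 1 / 3},
    "cost":     {"cost": 0.6, "speed": 0.2, "quality": 0.2},
    "speed":    {"cost": 0.2, "speed": 0.6, "quality": 0.2},
    "quality":  {"cost": 0.2, "speed": 0.2, "quality": 0.6},
}

def score_model(model: dict, mode: str) -> float:
    """Weighted sum of the model's cost/speed/quality scores."""
    w = MODE_WEIGHTS[mode]
    return sum(w[dim] * model[dim] for dim in ("cost", "speed", "quality"))

# Made-up candidates with made-up normalized scores:
candidates = [
    {"id": "cf/llama-3.3-8b", "cost": 0.9, "speed": 0.8, "quality": 0.4},
    {"id": "gpt-4o",          "cost": 0.3, "speed": 0.5, "quality": 0.9},
]
best = max(candidates, key=lambda m: score_model(m, "cost"))
print(best["id"])  # cf/llama-3.3-8b wins in cost mode
```

In quality mode the same candidates would rank the other way, since the quality weight dominates.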
3. Provider Health Check
Before routing to a model, the engine checks the health status of the provider. If a provider is experiencing elevated error rates or latency, it is deprioritized. This happens automatically based on real-time monitoring.
4. Fallback
If the selected provider returns an error (5xx, timeout, rate limit), the router automatically retries with the next-best model from a different provider. Up to 2 fallback attempts are made before returning an error to the caller.
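The retry behavior described above can be sketched as follows. The `call_model` function and the ranked model list are hypothetical placeholders, and `ProviderError` stands in for 5xx responses, timeouts, and rate limits:

```python
class ProviderError(Exception):
    """Stands in for 5xx responses, timeouts, and rate-limit errors."""

def complete_with_fallback(ranked_models, call_model, max_fallbacks=2):
    """Try the best-scored model, then up to 2 fallbacks from other providers."""
    attempts = ranked_models[: max_fallbacks + 1]  # primary + 2 fallbacks
    last_error = None
    for model_id in attempts:
        try:
            return call_model(model_id)
        except ProviderError as err:
            last_error = err  # provider failed; try the next-best model
    raise last_error  # all attempts exhausted; surface the error to the caller

def flaky_call(model_id):
    # Simulates an outage at one provider.
    if model_id.startswith("groq/"):
        raise ProviderError("503 from provider")
    return f"ok via {model_id}"

print(complete_with_fallback(["groq/llama-3.3-70b", "together/llama-3.3-70b"], flaky_call))
# ok via together/llama-3.3-70b
```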
Routing Modes
Control how the router prioritizes models by setting the X-Routing-Mode header:
curl https://api.universal-ai.dev/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Routing-Mode: cost" \
-d '{
"messages": [{"role": "user", "content": "Summarize this article..."}]
  }'

| Mode | Header Value | Behavior |
|---|---|---|
| Balanced | balanced (default) | Equal weight to cost, speed, and quality |
| Cost | cost | Minimize cost — prefers smaller, cheaper models |
| Speed | speed | Minimize latency — prefers faster providers (Groq, Cloudflare Workers AI) |
| Quality | quality | Maximize output quality — prefers the most capable models |
Typical Model Selection by Mode
| Complexity | Cost Mode | Speed Mode | Quality Mode |
|---|---|---|---|
| Simple | cf/llama-3.3-8b | groq/llama-3.3-8b | gpt-4o-mini |
| Moderate | cf/llama-3.3-70b | groq/llama-3.3-70b | gpt-4o |
| Complex | together/llama-3.3-70b | groq/llama-3.3-70b | anthropic/claude-sonnet-4-20250514 |
Using Routing with a Specific Model
When you specify a model, the smart router is bypassed. However, provider failover still applies. If the specified model's provider is unavailable, the router looks for the same model on an alternative provider:
# If Groq is down, this may fall back to the same Llama model on Together AI
curl https://api.universal-ai.dev/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "groq/llama-3.3-70b",
"messages": [{"role": "user", "content": "Hello!"}]
  }'

To disable fallback and fail immediately if the specified provider is unavailable, set:
-H "X-No-Fallback: true"Provider Health Monitoring
The routing engine continuously monitors provider health using:
- Error rate — percentage of 5xx responses in a rolling 5-minute window
- Latency P95 — 95th percentile response time over the last 5 minutes
- Availability — whether the provider is responding at all
Providers are classified into health states:
| State | Criteria | Effect |
|---|---|---|
| Healthy | Error rate < 1%, latency normal | Full priority in routing |
| Degraded | Error rate 1-10% or elevated latency | Deprioritized but still available |
| Unhealthy | Error rate > 10% or not responding | Excluded from routing |
Health checks run at the edge, so routing decisions account for regional provider performance. A provider may be healthy in one region and degraded in another.
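The health-state rules in the table map directly to a small decision function. The error-rate thresholds come from the table; the exact latency criterion is internal, so the P95 threshold below is a hypothetical placeholder:

```python
def health_state(error_rate: float, p95_latency_s: float, responding: bool,
                 latency_threshold_s: float = 10.0) -> str:
    """Classify a provider per the thresholds in the table above."""
    if not responding or error_rate > 0.10:
        return "unhealthy"   # excluded from routing
    if error_rate >= 0.01 or p95_latency_s > latency_threshold_s:
        return "degraded"    # deprioritized but still available
    return "healthy"         # full priority in routing

print(health_state(0.002, 1.2, True))  # healthy
print(health_state(0.05, 1.2, True))   # degraded
print(health_state(0.25, 1.2, True))   # unhealthy
```

Because the checks run at the edge, this classification would be evaluated per region, with each region holding its own rolling 5-minute window.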
Examples
Auto-route for cost efficiency
from openai import OpenAI
client = OpenAI(
api_key="YOUR_API_KEY",
base_url="https://api.universal-ai.dev/v1",
default_headers={"X-Routing-Mode": "cost"}
)
# No model specified — the router picks the cheapest suitable model
response = client.chat.completions.create(
messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
# The response tells you which model was selected
print(f"Model used: {response.model}")
print(response.choices[0].message.content)

Auto-route for quality
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "YOUR_API_KEY",
baseURL: "https://api.universal-ai.dev/v1",
defaultHeaders: { "X-Routing-Mode": "quality" },
});
const response = await client.chat.completions.create({
messages: [
{
role: "user",
content: "Write a detailed analysis of the economic impacts of AI automation.",
},
],
});
// For complex queries in quality mode, this will typically use GPT-4o or Claude
console.log(`Model used: ${response.model}`);
console.log(response.choices[0].message.content);

Response Headers
When smart routing is used, additional headers are included in the response:
| Header | Description |
|---|---|
| X-Routed-Model | The model ID that was selected by the router |
| X-Routed-Provider | The provider that served the request |
| X-Complexity | The classified complexity level (simple, moderate, complex) |
| X-Routing-Mode | The routing mode that was applied |
| X-Fallback-Used | true if the response came from a fallback provider |
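One way to read these headers from Python is the OpenAI SDK's `with_raw_response` wrapper, which exposes the HTTP headers alongside the parsed completion. This is a sketch using the SDK's documented raw-response API; the key and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.universal-ai.dev/v1")

# No model specified, so smart routing applies and the headers are populated.
raw = client.chat.completions.with_raw_response.create(
    messages=[{"role": "user", "content": "Hello!"}],
)
print("Routed model: ", raw.headers.get("X-Routed-Model"))
print("Complexity:   ", raw.headers.get("X-Complexity"))
print("Fallback used:", raw.headers.get("X-Fallback-Used"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content)
```

With curl, the same headers are visible by adding `-i` to any of the requests shown earlier.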