Smart Routing

Automatic model selection, complexity classification, and provider failover.

Overview

Universal AI API includes a smart routing engine that automatically selects the best model for each request. When you omit the model parameter from a chat completions request, the router analyzes your prompt and selects a model based on complexity, cost, speed, and provider availability.

You can also use routing hints to guide model selection without specifying an exact model.

How It Works

1. Complexity Classification

When a request arrives without a model parameter, the routing engine first classifies the query complexity using a lightweight classifier (Cloudflare Workers AI Granite Micro). The query is assigned one of three complexity levels:

| Level    | Description                                            | Example |
|----------|--------------------------------------------------------|---------|
| Simple   | Factual lookups, short answers, translations           | "What is the capital of Japan?" |
| Moderate | Explanations, summaries, basic code generation         | "Explain how TCP/IP works" |
| Complex  | Multi-step reasoning, long-form content, advanced code | "Design a microservices architecture for an e-commerce platform" |
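In production this classification is done by a model (Granite Micro), not by rules. Purely to illustrate the three buckets, here is a toy keyword-and-length heuristic — the cue words and thresholds below are invented for the example and are not the actual classification logic:

```python
# Toy illustration of the three complexity buckets. The real router uses a
# lightweight model for this; the heuristic below is illustrative only.
def classify_complexity(prompt: str) -> str:
    words = prompt.split()
    complex_cues = {"design", "architecture", "analyze", "implement", "prove"}
    if any(w.lower().strip(".,") in complex_cues for w in words) or len(words) > 60:
        return "complex"
    if len(words) > 12 or prompt.lower().startswith(("explain", "summarize")):
        return "moderate"
    return "simple"

print(classify_complexity("What is the capital of Japan?"))  # simple
print(classify_complexity("Explain how TCP/IP works"))       # moderate
print(classify_complexity("Design a microservices architecture for an e-commerce platform"))  # complex
```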

2. Model Scoring

Based on the classified complexity, the router scores available models on three dimensions:

  • Cost — price per token for the request
  • Speed — expected latency (time to first token plus generation time)
  • Quality — model capability level for the given task

Each dimension is weighted according to the routing mode (see below).
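The weighted scoring can be sketched as a dot product between mode weights and per-model dimension scores. The weights, candidate models, and 0–1 scores below are illustrative assumptions, not the service's actual numbers:

```python
# Sketch of mode-weighted model scoring. Higher dimension scores are better
# (cheaper, faster, more capable). All numbers here are invented for the example.
WEIGHTS = {
    "balanced": {"cost": 1/3, "speed": 1/3, "quality": 1/3},
    "cost":     {"cost": 0.6, "speed": 0.2, "quality": 0.2},
    "speed":    {"cost": 0.2, "speed": 0.6, "quality": 0.2},
    "quality":  {"cost": 0.2, "speed": 0.2, "quality": 0.6},
}

CANDIDATES = {
    "cf/llama-3.3-8b":    {"cost": 0.95, "speed": 0.85, "quality": 0.40},
    "groq/llama-3.3-70b": {"cost": 0.70, "speed": 0.95, "quality": 0.60},
    "gpt-4o":             {"cost": 0.30, "speed": 0.50, "quality": 0.95},
}

def best_model(mode: str) -> str:
    w = WEIGHTS[mode]
    return max(CANDIDATES, key=lambda m: sum(w[d] * CANDIDATES[m][d] for d in w))

print(best_model("cost"))     # cf/llama-3.3-8b
print(best_model("speed"))    # groq/llama-3.3-70b
print(best_model("quality"))  # gpt-4o
```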

3. Provider Health Check

Before routing to a model, the engine checks the health status of the provider. If a provider is experiencing elevated error rates or latency, it is deprioritized. This happens automatically based on real-time monitoring.

4. Fallback

If the selected provider returns an error (5xx, timeout, rate limit), the router automatically retries with the next-best model from a different provider. Up to 2 fallback attempts are made before returning an error to the caller.
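The fallback behavior amounts to a bounded retry loop over the ranked candidates. A minimal sketch, with stubbed provider calls (the `flaky_send` stub and error type are assumptions for the example):

```python
# Sketch of fallback: try the best-ranked model, and on a retryable failure
# (5xx, timeout, rate limit) move to the next provider's model, for at most
# 2 fallback attempts before surfacing the error.
class RetryableError(Exception):
    pass

def call_with_fallback(ranked_models, send, max_fallbacks=2):
    last_error = None
    for model in ranked_models[: max_fallbacks + 1]:  # primary + 2 fallbacks
        try:
            return model, send(model)
        except RetryableError as e:
            last_error = e  # try the next-best model from another provider
    raise last_error

def flaky_send(model):  # stub: pretend Groq is returning 503s
    if model.startswith("groq/"):
        raise RetryableError("503 from provider")
    return "ok"

model, result = call_with_fallback(
    ["groq/llama-3.3-70b", "together/llama-3.3-70b", "gpt-4o"], flaky_send
)
print(model, result)  # together/llama-3.3-70b ok
```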

Routing Modes

Control how the router prioritizes models by setting the X-Routing-Mode header:

curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Routing-Mode: cost" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize this article..."}]
  }'
| Mode     | Header Value        | Behavior |
|----------|---------------------|----------|
| Balanced | balanced (default)  | Equal weight to cost, speed, and quality |
| Cost     | cost                | Minimize cost — prefers smaller, cheaper models |
| Speed    | speed               | Minimize latency — prefers faster providers (Groq, Cloudflare Workers AI) |
| Quality  | quality             | Maximize output quality — prefers the most capable models |

Typical Model Selection by Mode

| Complexity | Cost Mode              | Speed Mode         | Quality Mode |
|------------|------------------------|--------------------|--------------|
| Simple     | cf/llama-3.3-8b        | groq/llama-3.3-8b  | gpt-4o-mini |
| Moderate   | cf/llama-3.3-70b       | groq/llama-3.3-70b | gpt-4o |
| Complex    | together/llama-3.3-70b | groq/llama-3.3-70b | anthropic/claude-sonnet-4-20250514 |

Using Routing with a Specific Model

When you specify a model, the smart router is bypassed. However, provider failover still applies. If the specified model's provider is unavailable, the router looks for the same model on an alternative provider:

# If Groq is down, this may fall back to the same Llama model on Together AI
curl https://api.universal-ai.dev/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "groq/llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
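Same-model failover can be pictured as a catalog lookup: resolve the requested `provider/model` ID, and if that provider is unhealthy, find another provider offering the same base model. The catalog and health check below are illustrative assumptions:

```python
# Sketch of same-model provider failover. The catalog contents are invented
# for the example; the real router maintains its own model/provider mapping.
CATALOG = {
    "llama-3.3-70b": [
        "groq/llama-3.3-70b",
        "together/llama-3.3-70b",
        "cf/llama-3.3-70b",
    ],
}

def resolve(model_id: str, is_healthy) -> str:
    provider, _, base = model_id.partition("/")
    if is_healthy(provider):
        return model_id
    for alt in CATALOG.get(base, []):
        alt_provider = alt.split("/", 1)[0]
        if alt_provider != provider and is_healthy(alt_provider):
            return alt
    raise RuntimeError(f"no healthy provider offers {base}")

down = {"groq"}  # pretend Groq is unavailable
print(resolve("groq/llama-3.3-70b", lambda p: p not in down))
# together/llama-3.3-70b
```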

To disable fallback and fail immediately if the specified provider is unavailable, set:

-H "X-No-Fallback: true"

Provider Health Monitoring

The routing engine continuously monitors provider health using:

  • Error rate — percentage of 5xx responses in a rolling 5-minute window
  • Latency P95 — 95th percentile response time over the last 5 minutes
  • Availability — whether the provider is responding at all

Providers are classified into health states:

| State     | Criteria                             | Effect |
|-----------|--------------------------------------|--------|
| Healthy   | Error rate < 1%, latency normal      | Full priority in routing |
| Degraded  | Error rate 1-10% or elevated latency | Deprioritized but still available |
| Unhealthy | Error rate > 10% or not responding   | Excluded from routing |
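The state transitions above map directly to a small decision function. A sketch, where the concrete "elevated latency" threshold is an assumption (the criteria only say "elevated"):

```python
# Sketch of the health classification over a rolling 5-minute window.
# error_rate is a fraction (0.01 == 1%); the 5000 ms latency threshold is
# an illustrative assumption, not a documented value.
def health_state(error_rate: float, p95_latency_ms: float, responding: bool,
                 elevated_latency_ms: float = 5000) -> str:
    if not responding or error_rate > 0.10:
        return "unhealthy"   # excluded from routing
    if error_rate >= 0.01 or p95_latency_ms > elevated_latency_ms:
        return "degraded"    # deprioritized but still available
    return "healthy"         # full priority

print(health_state(0.002, 800, True))   # healthy
print(health_state(0.05, 800, True))    # degraded
print(health_state(0.20, 800, True))    # unhealthy
```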

Health checks run at the edge, so routing decisions account for regional provider performance. A provider may be healthy in one region and degraded in another.

Examples

Auto-route for cost efficiency

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.universal-ai.dev/v1",
    default_headers={"X-Routing-Mode": "cost"}
)

# No model specified — the router picks the cheapest suitable model
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)

# The response tells you which model was selected
print(f"Model used: {response.model}")
print(response.choices[0].message.content)

Auto-route for quality

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.universal-ai.dev/v1",
  defaultHeaders: { "X-Routing-Mode": "quality" },
});

const response = await client.chat.completions.create({
  messages: [
    {
      role: "user",
      content: "Write a detailed analysis of the economic impacts of AI automation.",
    },
  ],
});

// For complex queries in quality mode, this will typically use GPT-4o or Claude
console.log(`Model used: ${response.model}`);
console.log(response.choices[0].message.content);

Response Headers

When smart routing is used, additional headers are included in the response:

| Header             | Description |
|--------------------|-------------|
| X-Routed-Model     | The model ID that was selected by the router |
| X-Routed-Provider  | The provider that served the request |
| X-Complexity       | The classified complexity level (simple, moderate, complex) |
| X-Routing-Mode     | The routing mode that was applied |
| X-Fallback-Used    | true if the response came from a fallback provider |
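These headers can be read from any HTTP client that exposes response headers as a mapping (for example, `requests`' `response.headers`). A small helper sketch — the `routing_info` function and its field names are illustrative, not part of the API:

```python
# Sketch: collect the routing headers listed above into a plain dict.
# Works with any mapping-like headers object; the sample values below are
# invented for demonstration.
def routing_info(headers) -> dict:
    keys = {
        "model": "X-Routed-Model",
        "provider": "X-Routed-Provider",
        "complexity": "X-Complexity",
        "mode": "X-Routing-Mode",
        "fallback_used": "X-Fallback-Used",
    }
    info = {field: headers.get(header) for field, header in keys.items()}
    info["fallback_used"] = info["fallback_used"] == "true"
    return info

sample = {
    "X-Routed-Model": "cf/llama-3.3-8b",
    "X-Routed-Provider": "cloudflare",
    "X-Complexity": "simple",
    "X-Routing-Mode": "cost",
    "X-Fallback-Used": "false",
}
print(routing_info(sample))
```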