Audio (STT & TTS)

Transcribe audio to text and generate speech from text.

Overview

The audio endpoints support two operations:

  • Speech-to-Text (STT) — transcribe audio files into text
  • Text-to-Speech (TTS) — convert text into spoken audio

Both endpoints follow the OpenAI audio API format, so existing OpenAI SDKs work unchanged.


Speech-to-Text (Transcription)

Transcribe audio into text using the transcriptions endpoint.

POST https://api.universal-ai.dev/v1/audio/transcriptions

Request

This endpoint accepts multipart/form-data with the audio file and parameters.

| Parameter       | Type   | Required | Description |
|-----------------|--------|----------|-------------|
| file            | file   | Yes      | The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac. |
| model           | string | Yes      | The transcription model to use (e.g., whisper-1, cf/whisper). |
| language        | string | No       | ISO 639-1 language code (e.g., en, es, fr). Improves accuracy when specified. |
| prompt          | string | No       | Optional context or spelling guidance for the transcription. |
| response_format | string | No       | Output format: json (default), text, srt, verbose_json, or vtt. |
| temperature     | number | No       | Sampling temperature between 0 and 1. Default: 0. |

Example Request

curl https://api.universal-ai.dev/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F file=@recording.mp3 \
  -F model=whisper-1 \
  -F language=en

Response

{
  "text": "Hello, this is a test recording for the Universal AI API transcription service."
}

With response_format: "verbose_json":

{
  "task": "transcribe",
  "language": "english",
  "duration": 5.42,
  "text": "Hello, this is a test recording for the Universal AI API transcription service.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.8,
      "text": "Hello, this is a test recording"
    },
    {
      "id": 1,
      "start": 2.8,
      "end": 5.42,
      "text": " for the Universal AI API transcription service."
    }
  ]
}
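The start/end timestamps in the verbose_json segments are enough to build subtitle files yourself when you would rather post-process JSON than request srt directly. A minimal sketch (segments_to_srt is our own helper name; the segment fields match the response above):

```python
def segments_to_srt(segments):
    """Render verbose_json-style segments as an SRT subtitle string."""
    def ts(seconds):
        # SRT timestamps look like 00:00:02,800
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```

Feeding the two segments above through this helper yields two numbered SRT cues covering 0.0–2.8 s and 2.8–5.42 s.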

SDK Examples

Python:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.universal-ai.dev/v1"
)

with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )

print(transcript.text)

JavaScript / TypeScript:

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.universal-ai.dev/v1",
});

const transcript = await client.audio.transcriptions.create({
  model: "whisper-1",
  file: fs.createReadStream("recording.mp3"),
  language: "en",
});

console.log(transcript.text);

Text-to-Speech

Generate spoken audio from text using the speech endpoint.

POST https://api.universal-ai.dev/v1/audio/speech

Request

| Parameter       | Type   | Required | Description |
|-----------------|--------|----------|-------------|
| model           | string | Yes      | The TTS model to use (e.g., tts-1, tts-1-hd). |
| input           | string | Yes      | The text to convert to speech. Maximum 4,096 characters. |
| voice           | string | Yes      | The voice to use: alloy, echo, fable, onyx, nova, or shimmer. |
| response_format | string | No       | Audio format: mp3 (default), opus, aac, flac, or wav. |
| speed           | number | No       | Playback speed from 0.25 to 4.0. Default: 1.0. |

Example Request

curl https://api.universal-ai.dev/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1-hd",
    "input": "Welcome to the Universal AI API. This is a text-to-speech demonstration.",
    "voice": "nova"
  }' \
  --output speech.mp3

The response body is the raw audio file in the requested format.
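Because the request body is plain JSON, it can be built and validated client-side before sending, whether or not you use an SDK. A sketch under the limits from the parameter table above (build_speech_request is our own helper name, not part of the API):

```python
def build_speech_request(text, model="tts-1", voice="alloy",
                         response_format="mp3", speed=1.0):
    """Build and sanity-check the JSON body for POST /v1/audio/speech."""
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if len(text) > 4096:
        raise ValueError("input is limited to 4,096 characters")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```

POST the returned dict as JSON with any HTTP client and write the raw response bytes to a file (the body is the audio itself, not a JSON envelope).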

SDK Examples

Python:

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.universal-ai.dev/v1"
)

response = client.audio.speech.create(
    model="tts-1-hd",
    input="Welcome to the Universal AI API.",
    voice="nova"
)

Path("output.mp3").write_bytes(response.content)

JavaScript / TypeScript:

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.universal-ai.dev/v1",
});

const response = await client.audio.speech.create({
  model: "tts-1-hd",
  input: "Welcome to the Universal AI API.",
  voice: "nova",
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

Supported Models

Transcription Models

| Model ID           | Provider   | Description |
|--------------------|------------|-------------|
| whisper-1          | OpenAI     | Whisper large-v2; supports 50+ languages |
| cf/whisper         | Cloudflare | Whisper on Workers AI (low latency, no egress cost) |
| cf/whisper-tiny-en | Cloudflare | Whisper Tiny, English-only (fastest) |
| deepgram/nova-2    | Deepgram   | Nova-2; high accuracy, real-time capable |

Text-to-Speech Models

| Model ID                          | Provider   | Description |
|-----------------------------------|------------|-------------|
| tts-1                             | OpenAI     | Standard quality, low latency |
| tts-1-hd                          | OpenAI     | High definition, richer audio |
| elevenlabs/eleven_multilingual_v2 | ElevenLabs | Multilingual, expressive voices |

Supported Audio Formats

Input (transcription): mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac

Output (speech): mp3, opus, aac, flac, wav

The maximum input file size for transcription is 25 MB.
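Both limits can be checked client-side before uploading, which avoids a round trip for files the endpoint would reject. A sketch under two assumptions: validate_audio_upload is our own helper name, and we treat the 25 MB limit as 25 * 1024 * 1024 bytes (the exact boundary is enforced server-side):

```python
from pathlib import Path

SUPPORTED_INPUT_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a",
                           "wav", "webm", "ogg", "flac"}
# Assumption: 25 MB interpreted as a binary megabyte.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024

def validate_audio_upload(path):
    """Raise ValueError if the file would be rejected by the transcription endpoint."""
    p = Path(path)
    ext = p.suffix.lstrip(".").lower()
    if ext not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported audio format: {ext!r}")
    if p.stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 25 MB transcription limit")
    return p
```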