v2.23.0•June 21, 2025

Introduced Multimodal Audio Generation & Tracking

This release introduces a major new capability: multimodal audio generation. The API has been significantly enhanced to handle audio inputs and outputs, along with a more granular system for tracking usage and costs across different modalities.

New Features

The Chat Completions API now supports audio generation for capable models (e.g., gpt-4o-audio-preview).
A new ChatMessageContentPartInputAudio schema has been added to the API payload to support audio data within chat messages.
A new AudioResponseData schema structures audio output in standardized API responses.
Model definitions now support a nested pricing object, allowing for separate pricing for different modalities like text and audio within a single model.

Improvements

The database schema and all usage tracking components (ModelUsageTracker, UserUsageUpdater, GlobalUsageUpdater) have been updated to separately track text tokens, audio tokens, and their respective costs.
The cost calculation logic for API usage has been refactored to handle multimodal responses, accurately calculating costs based on the new nested pricing structures.
Provider payload adapters and response transformers have been enhanced to support new audio-related parameters and data structures.
Streaming is now explicitly blocked for audio-capable models to prevent unsupported API requests and provide clearer error feedback.