v2.23.0

Introduced Multimodal Audio Generation & Tracking

Sreejan
SreejanAuthor

This release introduces a major new capability: multimodal audio generation. The API has been significantly enhanced to handle audio inputs and outputs, along with a more granular system for tracking usage and costs across different modalities.

New Features

4
  • The Chat Completions API now supports audio generation for capable models (e.g., gpt-4o-audio-preview).
  • A new ChatMessageContentPartInputAudio schema has been added to the API payload to support audio data within chat messages.
  • A new AudioResponseData schema structures audio output in standardized API responses.
  • Model definitions now support a nested pricing object, allowing for separate pricing for different modalities like text and audio within a single model.

Improvements

4
  • The database schema and all usage tracking components (ModelUsageTracker, UserUsageUpdater, GlobalUsageUpdater) have been updated to separately track text tokens, audio tokens, and their respective costs.
  • The cost calculation logic for API usage has been refactored to handle multimodal responses, accurately calculating costs based on the new nested pricing structures.
  • Provider payload adapters and response transformers have been enhanced to support new audio-related parameters and data structures.
  • Streaming is now explicitly blocked for audio-capable models to prevent unsupported API requests and provide clearer error feedback.