v2.23.0•
Introduced Multimodal Audio Generation & Tracking
SreejanAuthor
This release introduces a major new capability: multimodal audio generation. The API has been significantly enhanced to handle audio inputs and outputs, along with a more granular system for tracking usage and costs across different modalities.
New Features
4
- The Chat Completions API now supports audio generation for capable models (e.g.,
gpt-4o-audio-preview). - A new
ChatMessageContentPartInputAudioschema has been added to the API payload to support audio data within chat messages. - A new
AudioResponseDataschema structures audio output in standardized API responses. - Model definitions now support a nested
pricingobject, allowing for separate pricing for different modalities like text and audio within a single model.
Improvements
4
- The database schema and all usage tracking components (
ModelUsageTracker,UserUsageUpdater,GlobalUsageUpdater) have been updated to separately track text tokens, audio tokens, and their respective costs. - The cost calculation logic for API usage has been refactored to handle multimodal responses, accurately calculating costs based on the new nested pricing structures.
- Provider payload adapters and response transformers have been enhanced to support new audio-related parameters and data structures.
- Streaming is now explicitly blocked for audio-capable models to prevent unsupported API requests and provide clearer error feedback.