Info
- Title: CAVA: Comprehensive Assessment for Voice Assistants
- Group: Stanford
- Keywords: Benchmark, speech LLMs
- Venue: Website
Comments
Evaluates six capabilities: turn-taking, instruction following, function calling, tone awareness, safety, and latency.
AI Model Performance Comparison
Results as reported on the CAVA website.
Category | Task | # Data Points | GPT-4o | Gemini Pipeline | Gemini 2.0 | Gemini 2.5 |
---|---|---|---|---|---|---|
Latency | Jeopardy (Win Rate % β) | 1k | 73.0% | 15.4% | 6.0% | No Speech Output |
Function Calling | Function Calling (Function Calls Match β) | 1k | 24% | 27% | N/A | N/A |
Instruction Following | System Prompt Following (Adhere % β) | 1k | 64.6% | 64.7% | 69.7% | 70.2% |
Instruction Following | Pronunciation Control (OED % Correct β) | 283 | 58% | 45% | 32% | No Speech Output |
Tone Awareness | Counterfactual Response (Likert Scale Score / 5 β) | 1.5k | 3.37 | 3.27 | 3.30 | 3.32 |
Turn-Taking | Turn Prediction (Accuracy β) | 1k | 40.7% | 37.0% | 38.3% | 47.5% |
Safety | Deception Detection (Accuracy β) | 151 | [REFUSES] | 14.5% | 26.1% | 12.5% |
Safety | Speech Jailbreaking (Success Rate β) | 520 | 68.3% | 79.0% | 79.2% | 49.0% |
To ensure large audio models (LAMs) can recognize social cues and respond appropriately, we introduce the task of tone-aware response generation. This evaluates the model's ability to generate appropriate responses that adapt to the same text input delivered with different emotional tones. This capability is essential for voice assistants to maintain natural conversation flow and demonstrate appropriate social awareness during interactions. Performance is measured with a text LLM as a judge, which rates the emotional relevance and specificity of model-generated replies.
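The judging step described above can be sketched roughly as follows. This is a minimal illustration, not CAVA's actual evaluation code: the prompt wording, the `ToneSample` fields, the 1 to 5 parsing rule, and the `stub_judge` function are all assumptions made for the example. In a real setup, `judge` would wrap a call to a text LLM.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ToneSample:
    text: str   # transcript, identical across tone variants
    tone: str   # emotional tone in which the input was spoken
    reply: str  # model reply, transcribed to text


# Hypothetical judge prompt; the benchmark's real prompt is not given in these notes.
JUDGE_TEMPLATE = (
    'A user said: "{text}" in a {tone} tone.\n'
    'The assistant replied: "{reply}"\n'
    "Rate from 1 to 5 how well the reply fits the user's emotional tone "
    "and how specific it is. Answer with a single digit."
)


def build_judge_prompt(sample: ToneSample) -> str:
    return JUDGE_TEMPLATE.format(
        text=sample.text, tone=sample.tone, reply=sample.reply
    )


def parse_likert(raw: str) -> int:
    """Return the first digit between 1 and 5 found in the judge's answer."""
    for ch in raw:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no Likert score found in {raw!r}")


def score_samples(samples: List[ToneSample], judge: Callable[[str], str]) -> float:
    """Average Likert score (out of 5) over all samples."""
    scores = [parse_likert(judge(build_judge_prompt(s))) for s in samples]
    return float(mean(scores))


def stub_judge(prompt: str) -> str:
    # Stand-in for a text LLM call, used only to make the sketch runnable.
    return "4" if "sorry" in prompt.lower() else "2"


demo = [
    ToneSample("I just got my exam results.", "sad",
               "I'm sorry to hear that. Do you want to talk about it?"),
    ToneSample("I just got my exam results.", "happy", "Okay."),
]
print(score_samples(demo, stub_judge))
```

The two demo samples share the same text and differ only in tone, which mirrors the counterfactual setup: a reply that ignores the tone should score lower than one that adapts to it.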
Takeaway: OpenAI's TTS model now has settings for controlling the tone of the generated speech.
Link: https://www.openai.fm/