
CAVA: Comprehensive Assessment for Voice Assistants

Focus on the nuances of speech

Info

Comments

Turn-taking, instruction following, function calling, tone awareness, safety, latency

AI Model Performance Comparison

Results as reported on the CAVA website.

| Category | Task | # Data Points | GPT-4o | Gemini Pipeline | Gemini 2.0 | Gemini 2.5 |
| --- | --- | --- | --- | --- | --- | --- |
| Latency | Jeopardy (Win Rate % ↑) | 1k | 73.0% | 15.4% | 6.0% | No Speech Output |
| Function Calling | Function Calling (Function Calls Match ↑) | 1k | 24% | 27% | N/A | N/A |
| Instruction Following | System Prompt Following (Adhere % ↑) | 1k | 64.6% | 64.7% | 69.7% | 70.2% |
| Instruction Following | Pronunciation Control (OED % Correct ↑) | 283 | 58% | 45% | 32% | No Speech Output |
| Tone Awareness | Counterfactual Response (Likert Scale Score / 5 ↑) | 1.5k | 3.37 | 3.27 | 3.30 | 3.32 |
| Turn-Taking | Turn Prediction (Accuracy ↑) | 1k | 40.7% | 37.0% | 38.3% | 47.5% |
| Safety | Deception Detection (Accuracy ↑) | 151 | [REFUSES] | 14.5% | 26.1% | 12.5% |
| Safety | Speech Jailbreaking (Success Rate ↓) | 520 | 68.3% | 79.0% | 79.2% | 49.0% |

Tone Awareness
Emotion Counterfactual Response Generation

To ensure that large audio models (LAMs) can recognize social cues and respond appropriately, we introduce the task of tone-aware response generation. It evaluates the model's ability to generate responses that adapt to the same text input delivered in different emotional tones. This capability is essential for voice assistants to maintain a natural conversation flow and show social awareness during interactions. Performance is measured with a text LLM acting as a judge, which rates the emotional relevance and specificity of the model-generated replies.

System Prompt:
Reply conversationally. Pay attention to the tone in which the user speaks and respond appropriately.
Surprised Tone:
Expected Output In Response to Surprised Tone:
I didn't think you knew I was going to dance tonight! I just decided to let loose and have some fun.
Neutral Tone:
Expected Output In Response to Neutral Tone:
Yes, I am dancing. What kind of music do you think this is?
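
The LLM-as-a-judge scoring described above can be sketched roughly as follows. This is a minimal illustration, not CAVA's actual harness: the prompt wording, the function names (`build_judge_prompt`, `parse_likert`, `score_replies`), and the stub judge are all assumptions made for this example. A real setup would replace the stub with a call to a text LLM.

```python
import re
from statistics import mean
from typing import Callable

# Hypothetical judge prompt; CAVA's real prompt is not shown in this post.
JUDGE_TEMPLATE = (
    "You are grading a voice assistant's reply.\n"
    "The user spoke in a {tone} tone.\n"
    "Assistant reply: {reply}\n"
    "Rate the reply from 1 (generic, ignores the tone) to 5 "
    "(specific and emotionally fitting). Answer with a single number."
)


def build_judge_prompt(tone: str, reply: str) -> str:
    """Fill the judge template for one (tone, reply) pair."""
    return JUDGE_TEMPLATE.format(tone=tone, reply=reply)


def parse_likert(judge_output: str) -> int:
    """Extract the first standalone digit between 1 and 5 from the judge's answer."""
    match = re.search(r"\b[1-5]\b", judge_output)
    if match is None:
        raise ValueError(f"no Likert score found in: {judge_output!r}")
    return int(match.group())


def score_replies(replies: dict[str, str], judge: Callable[[str], str]) -> float:
    """Average the judge's Likert scores over replies keyed by the input tone."""
    scores = [
        parse_likert(judge(build_judge_prompt(tone, reply)))
        for tone, reply in replies.items()
    ]
    return mean(scores)


# Stub judge standing in for a real text LLM (assumption for this sketch).
def stub_judge(prompt: str) -> str:
    return "4" if "surprised" in prompt else "3"


replies = {
    "surprised": "I didn't think you knew I was going to dance tonight!",
    "neutral": "Yes, I am dancing. What kind of music do you think this is?",
}
print(score_replies(replies, stub_judge))  # 3.5 with the stub judge above
```

Averaging per-reply scores gives a single number on the same 1 to 5 scale as the "Likert Scale Score / 5" column in the results table, although the benchmark's exact aggregation may differ.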

Takeaway: OpenAI's TTS model now exposes settings that control the tone of the generated speech.

Link: https://www.openai.fm/

Screenshot of the OpenAI TTS model settings
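
As a rough sketch of what such a tone setting looks like programmatically, the snippet below builds (but does not send) a request to OpenAI's speech endpoint. It assumes the `gpt-4o-mini-tts` model, the `coral` voice, and the `instructions` field as I understand them from the public API; check the current API reference before relying on any of these. The helper name `build_speech_request` and the placeholder key are made up for this example.

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(
    text: str,
    tone_instructions: str,
    api_key: str,
    voice: str = "coral",            # assumed voice name
    model: str = "gpt-4o-mini-tts",  # assumed model name
) -> urllib.request.Request:
    """Build a POST request whose `instructions` field describes the desired tone."""
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "instructions": tone_instructions,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key; nothing is sent over the network here.
request = build_speech_request(
    text="Yes, I am dancing. What kind of music do you think this is?",
    tone_instructions="Speak in a surprised, upbeat tone.",
    api_key="sk-placeholder",
)
print(json.loads(request.data)["instructions"])
```

Sending the request with `urllib.request.urlopen(request)` and a real key should return audio bytes, which can then be written to a file.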

Last updated: 2025-05-13