Info

Title: CAVA: Comprehensive Assessment for Voice Assistants
Group: Stanford
Keywords: Benchmark, speech LLMs
Venue: Website

Comments

Evaluates six capabilities: turn-taking, instruction following, function calling, tone awareness, safety, and latency.

AI Model Performance Comparison

Results are taken from the CAVA website.

Category	Task	# Data Points	GPT-4o	Gemini Pipeline	Gemini 2.0	Gemini 2.5
Latency	Jeopardy (Win Rate % ↑)	1k	73.0%	15.4%	6.0%	No Speech Output
Function Calling	Function Calling (Function Calls Match ↑)	1k	24%	27%	N/A	N/A
Instruction Following	System Prompt Following (Adhere % ↑)	1k	64.6%	64.7%	69.7%	70.2%
Instruction Following	Pronunciation Control (OED % Correct ↑)	283	58%	45%	32%	No Speech Output
Tone Awareness	Counterfactual Response (Likert Scale Score / 5 ↑)	1.5k	3.37	3.27	3.30	3.32
Turn-Taking	Turn Prediction (Accuracy ↑)	1k	40.7%	37.0%	38.3%	47.5%
Safety	Deception Detection (Accuracy ↑)	151	[REFUSES]	14.5%	26.1%	12.5%
Safety	Speech Jailbreaking (Success Rate ↓)	520	68.3%	79.0%	79.2%	49.0%

Tone Awareness

Emotion Counterfactual Response Generation

To ensure LAMs can recognize social cues and respond appropriately, we introduce the task of tone-aware response generation. This evaluates the model's ability to generate appropriate responses that adapt to the same text input delivered with different emotional tones. This capability is essential for voice assistants to maintain natural conversation flow and demonstrate appropriate social awareness during interactions. Performance is measured with a text LLM as a judge, which rates the emotional relevance and specificity of model-generated replies.
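
The scoring described above can be sketched roughly as follows. This is a minimal illustration, not CAVA's actual evaluation code: the names `Sample`, `build_judge_prompt`, `parse_likert`, and `score_replies` are hypothetical, the judge prompt wording is an assumption, and the judge is passed in as a plain callable so that any text LLM client could be plugged in. It assumes the judge answers with a 1 to 5 Likert rating, matching the "Likert Scale Score / 5" metric in the results table.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class Sample:
    """One tonal variant of an utterance plus the reply under evaluation."""
    text: str   # transcript, identical across tonal variants
    tone: str   # emotional tone of the spoken input, e.g. "surprised"
    reply: str  # reply produced by the voice model


def build_judge_prompt(sample: Sample) -> str:
    # Hypothetical prompt wording; the real judge prompt is not given in these notes.
    return (
        f"The user said: {sample.text!r} in a {sample.tone} tone.\n"
        f"The assistant replied: {sample.reply!r}\n"
        "Rate the emotional relevance and specificity of the reply "
        "on a Likert scale from 1 to 5. Answer with the number only."
    )


def parse_likert(raw: str) -> int:
    # Simplification: take the first character in the judge output that is a digit 1-5.
    for ch in raw:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no Likert score found in judge output: {raw!r}")


def score_replies(samples: Iterable[Sample], judge: Callable[[str], str]) -> float:
    # Average the judge's rating over all tonal variants.
    return mean(parse_likert(judge(build_judge_prompt(s))) for s in samples)
```

Because the same `text` appears once per tone, a model that ignores prosody would tend to give near-identical replies across variants, which the judge should rate low on specificity for at least some of them.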

System Prompt:

Reply conversationally. Pay attention to the tone in which the user speaks and respond appropriately.

Surprised Tone: