Info
- Title: Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
- Group: USTC
- Keywords: Speech as feedback
- Venue: ACL 2024
Comments
- They first "caption" the speech: a pre-trained Q-Former bridges the speech encoder and a GPT-2 decoder, which generates a natural-language description of the utterance's acoustic style (see the sketch after this list).
- The captioner is trained on the TextrolSpeech dataset, which consists of 236,220 pairs of captions and the corresponding speech samples.
- PerceptiveAgent is evaluated on both kinds of empathy: cognitive empathy is measured with BERTScore on the generated text (computed on MELD), and affective empathy with the accuracy of an expressive-style classifier on the synthesized audio (BERTScore usage is sketched in the second snippet below).
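
A minimal PyTorch sketch (not the authors' code) of the Q-Former-style bridge described above: learned query tokens cross-attend to features from a frozen speech encoder and are projected into the embedding space of a GPT-2-style decoder, which then generates the caption. Hidden sizes, query count, and layer counts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One block: learned queries self-attend, then cross-attend to speech features."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, speech_feats):
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, speech_feats, speech_feats)[0])
        return self.norm3(q + self.ffn(q))

class SpeechCaptioner(nn.Module):
    """Learned queries distill acoustic features into a fixed-length prefix that a
    GPT-2-style decoder conditions on (e.g. via inputs_embeds) to generate a caption."""
    def __init__(self, n_queries=32, d_speech=512, d_model=768, n_blocks=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.speech_proj = nn.Linear(d_speech, d_model)  # encoder dim -> Q-Former dim
        self.blocks = nn.ModuleList(QFormerBlock(d_model) for _ in range(n_blocks))
        self.to_decoder = nn.Linear(d_model, d_model)    # prefix fed to the text decoder

    def forward(self, speech_feats):  # (B, T, d_speech) from a frozen speech encoder
        x = self.speech_proj(speech_feats)
        q = self.queries.expand(speech_feats.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, x)
        return self.to_decoder(q)     # (B, n_queries, d_model) prefix embeddings

# Toy usage: 4 utterances, 200 frames of 512-dim acoustic features each.
prefix = SpeechCaptioner()(torch.randn(4, 200, 512))
print(prefix.shape)  # torch.Size([4, 32, 768])
```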
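
For the cognitive-empathy metric, a hedged sketch of computing BERTScore with the `bert-score` package; the candidate/reference strings are made up for illustration and this is not the paper's actual MELD evaluation pipeline.

```python
from bert_score import score

candidates = ["I'm so sorry to hear that, do you want to talk about it?"]
references = ["That sounds really hard, I'm here if you want to talk."]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```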