Kimi-Audio Technical Report

Info

Title: Kimi-Audio Technical Report
Group: Kimi
Keywords: audio llm
Venue: arXiv

Comments

They use 13 million hours of audio, wow! An audio delay (blank tokens) is inserted ahead of the subsequent audio generation.

Figure: Overview

They have open-sourced their evaluation kit.

Input: an audio tokenizer (converts audio to discrete semantic tokens) and a Whisper encoder (extracts continuous acoustic features); the two representations are added together.
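
To make the "added together" part concrete, here is a minimal sketch of the fusion step. It assumes the two streams are already aligned frame by frame and share the same vector dimension; `fuse_inputs` and `embedding_table` are hypothetical names for illustration, not identifiers from the Kimi-Audio code.

```python
def fuse_inputs(token_ids, acoustic_features, embedding_table):
    """Add each discrete token's embedding to the matching continuous feature frame.

    Assumption: one semantic token per acoustic frame, same vector size for both.
    """
    if len(token_ids) != len(acoustic_features):
        raise ValueError("token and feature streams must have the same number of frames")
    fused = []
    for token_id, feature in zip(token_ids, acoustic_features):
        token_vector = embedding_table[token_id]  # lookup for the discrete token
        if len(token_vector) != len(feature):
            raise ValueError("embedding and feature dimensions must match")
        # Element-wise sum of the semantic and acoustic representations.
        fused.append([t + f for t, f in zip(token_vector, feature)])
    return fused
```

The element-wise sum keeps the sequence length unchanged, so the language model sees one fused vector per audio frame.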

Model: the first few layers are shared, then the model branches into a text head and an audio head.
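
A toy sketch of the shared-trunk, two-head layout. The real model uses transformer layers; here each layer is simplified to a small matrix multiply with ReLU, and the class name `TwoHeadModel` and its parameters are assumptions made only for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class TwoHeadModel:
    """Shared layers followed by separate text and audio output projections."""

    def __init__(self, shared_layers, text_head, audio_head):
        self.shared_layers = shared_layers  # list of weight matrices used by both heads
        self.text_head = text_head          # projection to text logits
        self.audio_head = audio_head        # projection to audio-token logits

    def forward(self, hidden):
        # The shared trunk processes the input once.
        for weights in self.shared_layers:
            hidden = relu(matvec(weights, hidden))
        # Both heads read the same shared hidden state.
        return matvec(self.text_head, hidden), matvec(self.audio_head, hidden)
```

Sharing the early layers means the trunk is computed once per step, while each head specializes in its own output vocabulary.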

Output: uses a "look-ahead" mechanism: when generating a chunk, the detokenizer also takes the semantic tokens from the following chunk as input, and retains only the current chunk's mel-spectrogram.
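
A minimal sketch of chunked decoding with look-ahead, under stated assumptions: `decode_fn` stands in for the real detokenizer, every token maps to a fixed number of mel frames (`frames_per_token`), and `chunk_size` and `lookahead` are hypothetical parameters rather than values from the report.

```python
def chunked_decode_with_lookahead(tokens, chunk_size, lookahead, decode_fn, frames_per_token):
    """Decode tokens chunk by chunk, feeding a few future tokens as extra context.

    Only the frames that belong to the current chunk are kept; frames produced
    for the look-ahead tokens are discarded and regenerated with the next chunk.
    """
    if chunk_size <= 0 or lookahead < 0 or frames_per_token <= 0:
        raise ValueError("chunk_size and frames_per_token must be positive, lookahead non-negative")
    mel = []
    for start in range(0, len(tokens), chunk_size):
        end = min(start + chunk_size, len(tokens))
        # Current chunk plus up to `lookahead` tokens from the future.
        extended = tokens[start:end + lookahead]
        frames = decode_fn(extended)
        # Retain only the frames of the current chunk.
        mel.extend(frames[:(end - start) * frames_per_token])
    return mel


def toy_decoder(tokens):
    """Stand-in decoder: emits two one-dimensional 'mel frames' per token."""
    return [[float(t)] for t in tokens for _ in range(2)]
```

With `toy_decoder`, call it as `chunked_decode_with_lookahead(tokens, 2, 1, toy_decoder, 2)`. The future tokens give the decoder context at the chunk boundary, at the cost of decoding them twice.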

Surprised to see that Xu Tan is with the Kimi team now.