Kimi-Audio Technical Report

Info

Comments

They use 13 million hours of audio, wow! Audio delay (padding with blank tokens) is used for the subsequent audio generation.

Overview

They have open-sourced their evaluation kit.

Input: an audio tokenizer (transforms audio into discrete semantic tokens) and a Whisper encoder (extracts continuous acoustic features); the two representations are added together.
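To make the "added together" part concrete, here is a minimal pure-Python sketch. It assumes that both streams have already been brought to the same frame rate and the same feature dimension, and that the discrete tokens are turned into vectors with an embedding table before the sum. The names `embed_tokens`, `fuse`, `VOCAB`, and `DIM` are made up for illustration and are not from the report.

```python
import random

def embed_tokens(token_ids, table):
    """Look up one embedding vector per discrete semantic token."""
    return [table[t] for t in token_ids]

def fuse(token_embeddings, acoustic_features):
    """Element-wise sum of the two per-frame streams (assumed to be aligned)."""
    if len(token_embeddings) != len(acoustic_features):
        raise ValueError("both streams must have the same number of frames")
    return [
        [e + f for e, f in zip(emb, feat)]
        for emb, feat in zip(token_embeddings, acoustic_features)
    ]

# Toy sizes, chosen only for the example.
rng = random.Random(0)
VOCAB, DIM = 8, 4
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

token_ids = [3, 1, 7]  # stand-in for the audio tokenizer output
acoustic = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in token_ids]  # stand-in for Whisper features

fused = fuse(embed_tokens(token_ids, table), acoustic)
```

Each frame of `fused` carries both the semantic token identity and the acoustic detail, which is presumably the point of summing the two.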

Model: the first few layers are shared, then the model splits into a text head and an audio head.
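The shared-trunk-plus-two-heads layout can be sketched in a few lines. This is only a toy: the shared layers are plain linear maps with a ReLU, and each head is reduced to a single linear layer, which is a simplification on my side. `TwoHeadModel` and the weights are hypothetical.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

class TwoHeadModel:
    """Shared layers first, then two parallel heads reading the same hidden state."""

    def __init__(self, shared_layers, text_head, audio_head):
        self.shared_layers = shared_layers
        self.text_head = text_head
        self.audio_head = audio_head

    def forward(self, x):
        h = x
        for weights in self.shared_layers:
            h = relu(linear(weights, h))
        # Both heads branch off the same shared representation.
        return linear(self.text_head, h), linear(self.audio_head, h)

model = TwoHeadModel(
    shared_layers=[[[1, 0], [0, 1]]],       # one identity layer
    text_head=[[1, 0], [0, 1], [1, 1]],     # 3 "text" logits
    audio_head=[[2, 0], [0, 2]],            # 2 "audio" logits
)
text_logits, audio_logits = model.forward([1.0, -2.0])
```

The shared computation is done once per step, and each head then predicts its own token type from the same hidden state.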

Output: uses a “look-ahead” mechanism: when generating a chunk, the detokenizer also takes some future semantic tokens from the following chunk, and retains only the mel-spectrogram of the chunk being generated.
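A rough sketch of the look-ahead idea, under my own assumptions: tokens are processed in fixed-size chunks, each chunk is decoded together with the first few tokens of the next chunk, and the frames that belong to those extra tokens are thrown away. `fake_decode` is a hypothetical stand-in for the real detokenizer, and `chunk_size`, `lookahead`, and `frames_per_token` are illustrative parameters.

```python
def fake_decode(tokens, frames_per_token=2):
    """Hypothetical detokenizer stand-in: emits a fixed number of mel frames per token."""
    return [f"mel({t})" for t in tokens for _ in range(frames_per_token)]

def decode_with_lookahead(tokens, chunk_size, lookahead, frames_per_token=2):
    """Decode chunk by chunk, peeking at future tokens but keeping only each chunk's own frames."""
    mel = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        future = tokens[start + chunk_size:start + chunk_size + lookahead]
        frames = fake_decode(chunk + future, frames_per_token)
        # Retain only the frames for the chunk itself; drop the look-ahead part.
        mel.extend(frames[:len(chunk) * frames_per_token])
    return mel

tokens = [1, 2, 3, 4, 5]
streamed = decode_with_lookahead(tokens, chunk_size=2, lookahead=1)
```

With this context-free stand-in the streamed output is identical to decoding everything at once. A real detokenizer depends on context, and the peek at future tokens is what should reduce artifacts at chunk boundaries.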

Surprised to see that Xu Tan is with the Kimi team now.

Last updated: 2025-05-03