
Categories
2025
Seed1.5-VL Technical Report

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks

LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

CAVA: Comprehensive Assessment for Voice Assistants

ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Param Δ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

Kimi-Audio Technical Report

MR. Video: “MapReduce” is the Principle for Long Video Understanding

Φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

Memory-enhanced Retrieval Augmentation for Long Video Understanding

VACE: Video Tasks within an All-in-one Framework for Creation and Editing

Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models

MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

LeCun's talk on Advanced AI

OPTISHEAR: Towards Efficient and Adaptive Pruning of Large Language Models via Evolutionary Optimization

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Knowledge Bridger: Towards Training-Free Missing Modality Completion
