CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Info

Title: CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Group: MIT
Keywords: audio-visual, mask autoencoder, contrastive learning
Venue: CVPR 2025

Comments

CAV-MAE Sync improves audio-visual learning by aligning audio temporally with video, separating the contrastive and reconstruction objectives, and using register tokens for better spatial localization.
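To make the token layout concrete, here is a minimal sketch of one way extra tokens could sit alongside the patch tokens. It is an illustration under assumptions, not the authors' implementation: the helper names (build_token_sequence, split_encoder_output), the token counts, the use of a dedicated global token for the contrastive objective, and the identity "encoder" are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_token_sequence(patch_tokens, num_global=1, num_registers=4):
    """Prepend hypothetical global and register tokens to the patch tokens."""
    dim = patch_tokens.shape[-1]
    # Random arrays stand in for what would be learnable parameters in a real model.
    global_tokens = rng.normal(size=(num_global, dim))
    register_tokens = rng.normal(size=(num_registers, dim))
    return np.concatenate([global_tokens, register_tokens, patch_tokens], axis=0)

def split_encoder_output(encoded, num_global=1, num_registers=4):
    """Separate the encoder output by role; register outputs are discarded."""
    global_out = encoded[:num_global]                  # assumed input to the contrastive loss
    patch_out = encoded[num_global + num_registers:]   # assumed input to the reconstruction loss
    return global_out, patch_out

# Toy example: 196 patch tokens of width 768 (illustrative sizes only).
patches = rng.normal(size=(196, 768))
sequence = build_token_sequence(patches)
encoded = sequence  # placeholder: a real model would apply a Transformer encoder here
global_out, patch_out = split_encoder_output(encoded)
```

In this layout the patch tokens are left to serve reconstruction, while the extra tokens give the model somewhere else to store global information.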

Takeaways:

For temporal alignment, the method simply splits the audio into segments and pairs each segment with its corresponding video frame.
Unsurprisingly, adding a few “register tokens” helps improve the model's performance.
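The temporal alignment takeaway can be sketched as follows. This is a simplified illustration that assumes uniformly sampled frames and equal-length, non-overlapping audio segments; the function name split_audio_by_frames and all shapes are hypothetical, and the paper's exact windowing may differ.

```python
import numpy as np

def split_audio_by_frames(spectrogram: np.ndarray, num_frames: int) -> np.ndarray:
    """Split a (time, mel) spectrogram into num_frames equal, contiguous segments."""
    total_steps, num_mels = spectrogram.shape
    seg_len = total_steps // num_frames
    if seg_len == 0:
        raise ValueError("spectrogram has fewer time steps than video frames")
    # Drop trailing time steps that do not fill a whole segment.
    trimmed = spectrogram[: seg_len * num_frames]
    return trimmed.reshape(num_frames, seg_len, num_mels)

# Toy example: 1000 time steps, 128 mel bins, 10 video frames (illustrative sizes).
spec = np.arange(1000 * 128, dtype=np.float32).reshape(1000, 128)
frames = np.zeros((10, 3, 32, 32), dtype=np.float32)  # stand-in for sampled RGB frames
segments = split_audio_by_frames(spec, num_frames=len(frames))

# Each video frame is paired with the audio segment covering the same time span.
pairs = list(zip(frames, segments))
```

Each (frame, segment) pair can then serve as one audio-visual example, instead of pairing every frame with the full audio clip.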