CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Generating temporal alignment for video-text retrieval

Info

Comments

CAV-MAE Sync improves audio-visual learning in three ways: it aligns audio temporally with the video frames, it separates the contrastive and reconstruction objectives, and it uses register tokens for better spatial localization.

Takeaways:

  • For temporal alignment, the method simply splits the audio into segments so that each segment corresponds to one video frame.
  • It is not surprising that adding several “register tokens” improves the model's performance.
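
To make the first takeaway concrete, here is a minimal sketch of pairing audio segments with video frames. It is an illustration only, not the paper's implementation: the function name `split_audio_by_frames`, the even non-overlapping partition, and the toy spectrogram are all assumptions, and the real segment length and frame sampling may differ.

```python
def split_audio_by_frames(spectrogram, num_frames):
    """Split a spectrogram (a list of time steps) into num_frames contiguous segments.

    Segment i is the slice of audio paired with video frame i.
    Hypothetical helper: an even, non-overlapping partition is assumed here.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    total_steps = len(spectrogram)
    if total_steps < num_frames:
        raise ValueError("need at least one audio time step per video frame")
    # Integer boundaries guarantee every time step lands in exactly one segment.
    bounds = [i * total_steps // num_frames for i in range(num_frames + 1)]
    return [spectrogram[bounds[i]:bounds[i + 1]] for i in range(num_frames)]


# Toy input: 10 time steps, each a 2-bin feature vector, paired with 4 video frames.
toy_spectrogram = [[t, t + 0.5] for t in range(10)]
segments = split_audio_by_frames(toy_spectrogram, num_frames=4)

# Each (frame_index, audio_segment) pair would then be treated as one aligned
# audio-visual example, instead of pairing the whole clip's audio with a single frame.
aligned_pairs = list(enumerate(segments))
```

The point of the sketch is the granularity: each frame gets its own slice of audio, so the contrastive objective can compare a frame with the sound around it rather than with a clip-level summary.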
Last updated: 2025-05-07