SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Generating temporal alignment for video-text retrieval

Info

Comments

It uses VirtualHome to generate synthetic video-text pairs.

Method

Last updated: 2025-05-03