Abstract: While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and to reason about visual-to-musical influence across three question types: reasoning, prediction, and counterfactual. Unlike traditional datasets that require manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 multiple-choice questions (MCQs). We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding, especially for smaller models, establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| MSC classes: | 68T01 |
| ACM classes: | I.2.6; I.2.10; H.3.3 |
| Cite as: | arXiv:2605.08175 [cs.CV] |
| | (or arXiv:2605.08175v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2605.08175 (arXiv-issued DOI via DataCite) |
Submission history
From: Archishman Ghosh
[v1] Tue, 5 May 2026 06:48:39 UTC (3,131 KB)
