EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation


Abstract: Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and trustworthy long-horizon video generation.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2605.09378 [cs.CV]
	(or arXiv:2605.09378v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.09378 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Shuai Zhao
[v1] Sun, 10 May 2026 07:03:37 UTC (2,039 KB)