Abstract: Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In particular, the output space of joint trajectory distributions grows exponentially with the number of agents. Therefore, we decompose multi-agent motion forecasts into (1) marginal distributions for all modeled agents and (2) joint distributions for interacting agents. Using a transformer model, we generate joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. For each time step, we model the positional uncertainty using compressed exponential power distributions. Notably, our method achieves strong results in the Waymo Interaction Prediction Challenge and generalizes well to the Argoverse 2 and V2X-Seq datasets. Additionally, our method provides an interface for issuing instructions. We show that standard motion forecasting training implicitly enables the model to follow instructions and adapt them to the scene context. GitHub repository: this https URL
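As a rough illustration of the per-time-step uncertainty model mentioned in the abstract, the sketch below evaluates the negative log-likelihood of a one-dimensional exponential power (generalized normal) distribution with density f(x) = β / (2αΓ(1/β)) · exp(−(|x − μ|/α)^β). The function name `exp_power_nll`, the 1D setting, and the parameter names are assumptions made for this example. The abstract does not say which shape parameter range "compressed" refers to or how the parameters are predicted, so this should not be read as the authors' implementation.

```python
import math


def exp_power_nll(x: float, mu: float, alpha: float, beta: float) -> float:
    """Negative log-likelihood of a 1D exponential power distribution.

    mu is the location, alpha > 0 the scale, beta > 0 the shape.
    beta = 2 gives a Gaussian (with alpha = sqrt(2) * sigma),
    beta = 1 gives a Laplace distribution with scale alpha.
    All names here are hypothetical and chosen for illustration.
    """
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("alpha and beta must be positive")
    z = abs(x - mu) / alpha
    # log of the normalizing constant beta / (2 * alpha * Gamma(1 / beta))
    log_norm = math.log(beta) - math.log(2.0 * alpha) - math.lgamma(1.0 / beta)
    return z ** beta - log_norm


# Sanity checks against the two well-known special cases.
gaussian_nll = 0.5 * math.log(2.0 * math.pi) + 0.5 * 0.5 ** 2  # sigma = 1
laplace_nll = math.log(2.0) + 0.5                               # scale = 1
nll_beta2 = exp_power_nll(0.5, 0.0, math.sqrt(2.0), 2.0)
nll_beta1 = exp_power_nll(0.5, 0.0, 1.0, 1.0)
```

In a forecasting model, a loss of this form would typically be summed over time steps and coordinates, with μ, α, and possibly β produced by the network; that wiring is outside what the abstract describes.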
| Comments: | CVPRW26 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO) |
| Cite as: | arXiv:2505.20414 [cs.CV] (or arXiv:2505.20414v2 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2505.20414 (arXiv-issued DOI via DataCite) |
Submission history
From: Royden Wagner
[v1] Mon, 26 May 2025 18:05:59 UTC (920 KB)
[v2] Wed, 29 Apr 2026 08:48:06 UTC (1,175 KB)
