Abstract: Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene--instruction annotation mainly to supervise one start-conditioned rollout. To address these limitations, we present CoFL, an end-to-end policy that maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. CoFL reformulates navigation as workspace-conditioned field learning rather than start-conditioned trajectory prediction: it learns local motion vectors at arbitrary BEV locations, turning each scene--instruction annotation into dense spatial control supervision. Trajectories are generated from any start by numerical integration of the predicted field, enabling simple real-time rollout and closed-loop recovery. To enable large-scale training and evaluation, we build a dataset of over 500k BEV image--instruction pairs, each procedurally annotated with a flow field and a trajectory derived from semantic maps built on Matterport3D and ScanNet. Evaluating on strictly unseen scenes, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and trajectory generation policies in both navigation precision and safety, while maintaining real-time inference. Finally, we deploy CoFL zero-shot in real-world experiments with BEV observations across multiple layouts, maintaining feasible closed-loop control and a high success rate.
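
The rollout step described in the abstract (integrating a field from an arbitrary start) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_field` is a hypothetical analytic stand-in for the learned, language-conditioned field, `GOAL` is an assumed target in BEV coordinates, and forward Euler is just one possible integrator; the abstract does not say which scheme CoFL uses.

```python
import math

GOAL = (2.0, 1.0)  # hypothetical goal position in BEV coordinates (assumption)

def toy_field(x, y):
    """Stand-in for a learned field: points at GOAL, speed capped at 1."""
    dx, dy = GOAL[0] - x, GOAL[1] - y
    scale = max(math.hypot(dx, dy), 1.0)
    return dx / scale, dy / scale

def integrate(field, start, h=0.1, max_steps=500, tol=1e-2):
    """Forward-Euler rollout: p_{k+1} = p_k + h * field(p_k)."""
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        vx, vy = field(x, y)
        if math.hypot(vx, vy) < tol:  # field has nearly vanished: stop
            break
        x, y = x + h * vx, y + h * vy
        path.append((x, y))
    return path

# The same field yields a trajectory from any start, with no re-planning.
path_a = integrate(toy_field, (0.0, 0.0))
path_b = integrate(toy_field, (4.0, -1.0))
```

Because the field is defined everywhere in the workspace, a robot pushed off its path can simply keep integrating from its new position, which is the closed-loop recovery property the abstract refers to.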
| Comments: | 18 pages, 13 figures |
| Subjects: | Robotics (cs.RO); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2603.02854 [cs.RO] |
| | (or arXiv:2603.02854v2 [cs.RO] for this version) |
| | https://doi.org/10.48550/arXiv.2603.02854 (arXiv-issued DOI via DataCite) |
Submission history
From: Haokun Liu
[v1] Tue, 3 Mar 2026 11:02:55 UTC (5,458 KB)
[v2] Wed, 29 Apr 2026 04:47:16 UTC (6,974 KB)
