Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
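The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact Lyapunov condition, so the sketch assumes a simple per-step decrease test, V(s') <= (1 - alpha) * V(s), and the names `satisfies_lyapunov`, `select_prompts`, `toy_V`, `alpha`, and `max_prompts` are all hypothetical. Imagined trajectories are represented as plain lists of states rather than transformer rollouts.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Trajectory = List[State]


def satisfies_lyapunov(traj: Trajectory, V: Callable[[State], float],
                       alpha: float = 0.0) -> bool:
    """Assumed condition: V(s') <= (1 - alpha) * V(s) at every step."""
    return all(V(nxt) <= (1.0 - alpha) * V(cur)
               for cur, nxt in zip(traj, traj[1:]))


def select_prompts(candidates: List[Trajectory], V: Callable[[State], float],
                   alpha: float = 0.0, max_prompts: int = 2) -> List[Trajectory]:
    """Keep feasible trajectories, preferring those ending at the lowest V."""
    feasible = [t for t in candidates if satisfies_lyapunov(t, V, alpha)]
    feasible.sort(key=lambda t: V(t[-1]))
    return feasible[:max_prompts]


def toy_V(state: State) -> float:
    """Hypothetical Lyapunov candidate: squared distance to the origin."""
    return sum(x * x for x in state)


# Three hand-written "imagined" trajectories standing in for model rollouts.
candidates = [
    [[1.0, 1.0], [0.5, 0.5], [0.1, 0.1]],  # V decreases at every step
    [[1.0, 1.0], [1.5, 1.0], [0.2, 0.2]],  # V increases at the first step
    [[0.5, 0.0], [0.4, 0.0], [0.3, 0.0]],  # V decreases at every step
]
prompts = select_prompts(candidates, toy_V)
```

In this toy setting the second trajectory is rejected and the remaining two would be passed back to the agent as in-context prompts; how SAS formats and conditions on those prompts is not specified in the abstract.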
| Comments: | Accepted at AISTATS 2026. First two authors contributed equally. Project page: this https URL. Code: this https URL |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2604.26516 [cs.LG] |
| | (or arXiv:2604.26516v1 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2604.26516 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Seungyub Han
[v1] Wed, 29 Apr 2026 10:32:18 UTC (335 KB)
