Abstract: Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency from dense attention and from learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), an FM for event streams composed of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.
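To make the "sequence of multisets" framing concrete, here is a minimal sketch in Python. It is not the NEST implementation: the event codes, the helper names (`flatten`, `dense_pairs`, `nested_pairs`), and the two-level attention cost model (attention within each bag, then across one summary per bag) are all illustrative assumptions, since the abstract does not specify the architecture's attention pattern.

```python
from collections import Counter

# Hypothetical toy record: each encounter is a multiset (bag) of event codes.
# The codes are made up for illustration.
encounters = [
    Counter({"lab:glucose": 2, "dx:diabetes": 1}),
    Counter({"rx:metformin": 1}),
    Counter({"lab:glucose": 1, "lab:hba1c": 1, "dx:diabetes": 1}),
]


def flatten(seq_of_bags):
    """Flatten into a 1-D token list.

    The within-bag order is arbitrary (alphabetical here); the multiset
    itself carries no ordering information.
    """
    return [code for bag in seq_of_bags for code in sorted(bag.elements())]


def dense_pairs(seq_of_bags):
    """Token pairs scored by dense self-attention over the flattened sequence."""
    n = sum(sum(bag.values()) for bag in seq_of_bags)
    return n * n


def nested_pairs(seq_of_bags):
    """Pairs scored under an assumed two-level scheme.

    Assumption: attention runs within each bag, then across one summary
    vector per bag. This is one possible hierarchy, not the paper's design.
    """
    within = sum(sum(bag.values()) ** 2 for bag in seq_of_bags)
    across = len(seq_of_bags) ** 2
    return within + across


flat = flatten(encounters)
print(len(flat), dense_pairs(encounters), nested_pairs(encounters))
```

In this toy record the flattened sequence has 7 tokens, so dense attention scores 49 pairs, while the assumed nested scheme scores 28. The gap widens as encounters grow in number, because cross-encounter token pairs are never scored directly.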
| Comments: | 10-page main text |
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2602.00520 [cs.LG] |
| | (or arXiv:2602.00520v3 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2602.00520 (arXiv-issued DOI via DataCite) |
Submission history
From: Minghui Sun
[v1] Sat, 31 Jan 2026 05:21:27 UTC (484 KB)
[v2] Tue, 3 Feb 2026 03:10:51 UTC (484 KB)
[v3] Thu, 14 May 2026 03:32:56 UTC (655 KB)
