Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterance features, transcriptions, and annotated slots. The dataset supports three evaluation settings: speech-to-slots (SLU), transcription-to-slots (NLU), and speech-to-transcription (ASR). It provides a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
| Comments: | Accepted at LREC 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2604.26500 [cs.CL] (or arXiv:2604.26500v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.26500 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Marcely Zanon Boito
[v1] Wed, 29 Apr 2026 10:03:12 UTC (710 KB)
