Abstract:Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a formal kamon description language - "kamon yōgo" - description, a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
| Comments: | Preprint |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.13322 [cs.CV] |
| (or arXiv:2605.13322v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.13322 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Richard Sproat [view email]
[v1]
Wed, 13 May 2026 10:35:07 UTC (2,813 KB)
