Abstract: Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotation of citation segmentation and parsing, as well as cross-reference resolution, is ongoing.
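For readers unfamiliar with the reported metric, the following is a minimal sketch of how micro-averaged F1 is commonly computed: true positives, false positives, and false negatives are pooled across all documents before precision and recall are derived. The abstract does not specify the matching criterion or the evaluation code, so the `Counts` type, the `micro_f1` function, and the example counts below are hypothetical illustrations, not the authors' implementation or data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-document extraction counts (hypothetical container for illustration)."""
    tp: int  # references extracted and matched to a gold reference
    fp: int  # references extracted with no gold match
    fn: int  # gold references that were missed


def micro_f1(per_doc):
    """Pool counts over all documents, then compute precision, recall, and F1."""
    tp = sum(c.tp for c in per_doc)
    fp = sum(c.fp for c in per_doc)
    fn = sum(c.fn for c in per_doc)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for two documents, purely for illustration.
docs = [Counts(tp=8, fp=2, fn=2), Counts(tp=2, fp=0, fn=8)]
precision, recall, f1 = micro_f1(docs)
print(f"precision={precision:.3f} recall={recall:.3f} micro-F1={f1:.3f}")
```

Because counts are pooled, documents with many footnote references weigh more heavily than short ones, and reducing missed references (false negatives) raises recall and therefore F1 even when precision is unchanged.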
| Comments: | This is an extended abstract, peer-reviewed and presented at CiteX2026 |
| Subjects: | Digital Libraries (cs.DL); Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.01109 [cs.DL] |
| | (or arXiv:2606.01109v1 [cs.DL] for this version) |
| | https://doi.org/10.48550/arXiv.2606.01109 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Luca Foppiano
[v1] Sun, 31 May 2026 08:59:49 UTC (72 KB)
