Authors: Jônatas H. dos Santos, Julio C. S. Reis, Philipe Melo, João F. H. Olivetti, Thales H. Silva, Matheus Gontijo Guimaraes, Glaucio de Souza, Marcos A. Gonçalves, Fabricio Benevenuto, Filipe B. B. Zanovello, Marco A. G. Rodrigues, Cristiano X. Lima
Abstract: We introduce WhaVax, a new expert-annotated dataset of vaccine-related WhatsApp messages collected from large Brazilian public groups spanning multiple pandemic years. The dataset was constructed through a rigorous pipeline that integrates keyword-based data collection, semantic deduplication to remove near-duplicate content, and a multi-stage annotation protocol conducted by medical specialists. This process produced a high-quality gold-standard corpus, characterized by substantial inter-annotator agreement and strong reliability for downstream analysis. Additionally, we provide a detailed characterization of WhatsApp misinformation, revealing distinctive linguistic, structural, lexical, temporal, and group-level patterns, as well as a meaningful layer of ambiguous cases that reflects the complexity of health discourse in private messaging. We also benchmark classical models, fine-tuned Small Language Models, and zero- or few-shot Large Language Models under realistic data-scarcity constraints, demonstrating that strong embeddings and LLM approaches perform competitively, while domain alignment and data availability remain critical factors. This study provides a rare, high-quality resource to support misinformation research and computational modeling in encrypted communication environments.
| Comments: | 10 pages. This is a preprint version of a paper accepted for the International AAAI Conference on Web and Social Media (ICWSM'26). Please cite the conference version rather than this preprint. |
| Subjects: | Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY) |
| Cite as: | arXiv:2605.12510 [cs.SI] (or arXiv:2605.12510v1 [cs.SI] for this version) |
| | https://doi.org/10.48550/arXiv.2605.12510 (arXiv-issued DOI via DataCite) |
Submission history
From: Julio C. S. Reis
[v1] Wed, 25 Mar 2026 14:54:09 UTC (455 KB)
