Publications - AImageLab

Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IMAGE AND VISION COMPUTING

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
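As a rough illustration of the Mixup step mentioned in the abstract, the sketch below blends a real image with a synthetic counterpart using a Beta-sampled weight. It is not the authors' implementation (see the linked repository for that): the function name `mixup`, the `alpha` value, and the representation of images as flat lists of pixel intensities are assumptions made for this example, and how captions are handled during mixing is not described in the abstract and is not modelled here.

```python
import random


def mixup(real, synthetic, alpha=0.4, rng=None):
    """Blend two equally sized images pixel by pixel (hypothetical helper).

    `real` and `synthetic` are flat lists of pixel intensities; `alpha` is
    an assumed Beta-distribution parameter, not a value from the paper.
    """
    if len(real) != len(synthetic):
        raise ValueError("images must have the same number of pixels")
    if rng is None:
        rng = random.Random()
    # Standard Mixup samples the mixing weight from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the two images.
    mixed = [lam * r + (1.0 - lam) * s for r, s in zip(real, synthetic)]
    return mixed, lam


# Usage: mix a toy 2-pixel "real" image with its toy "synthetic" counterpart.
mixed_image, weight = mixup([0.0, 1.0], [1.0, 0.0], rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled weight reproducible, which can help when comparing training runs.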

2025 Journal article

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Authors: Cocchi, Federico; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.

2025 Conference proceedings paper

AURALYS: smart glasses to improve audio selection and perception in educational and working contexts

Authors: Filippini, Gianluca; Borghi, Guido; Giliberti, Enrico; Damiani, Paola; Vezzani, Roberto

2025 Conference proceedings paper

BarBeR - Barcode Benchmark Repository: Implementation and Reproducibility Notes

Authors: Vezzali, Enrico; Bolelli, Federico; Santi, Stefano; Grana, Costantino

This paper provides a detailed description of how to install, set up, and use "BarBeR" (Barcode Benchmark Repository) to reproduce the results presented in the ICPR 2024 paper "BarBeR: A Barcode Benchmarking Repository". The paper details the tests available in the repository and how the configuration parameters affect and influence experimental results.

2025 Conference proceedings paper

BarBeR: A Barcode Benchmarking Repository

Authors: Vezzali, E.; Bolelli, F.; Santi, S.; Grana, C.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Since their invention in 1949, barcodes have remained the preferred method for automatic data capture, playing a crucial role in supply chain management. To detect a barcode in an image, multiple algorithms have been proposed in the literature, with a significant increase of interest in the topic since the rise of deep learning. However, research in the field suffers from many limitations, including the scarcity of public datasets and code implementations, which hampers the reproducibility and reliability of published results. For this reason, we developed "BarBeR" (Barcode Benchmark Repository), a benchmark designed for testing and comparing barcode detection algorithms. This benchmark includes the code implementation of various detection algorithms for barcodes, along with a suite of useful metrics. It offers a range of test setups and can be expanded to include any localization algorithm. In addition, we provide a large, annotated dataset of 8748 barcode images, combining multiple public barcode datasets with standardized annotation formats for both detection and segmentation tasks. Finally, we share the results obtained from running the benchmark on our dataset, offering valuable insights into the performance of different algorithms.

2025 Conference proceedings paper

Benchmarking BERT-based Models for Latin: A Case Study on Biblical References in Ancient Christian Literature

Authors: Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita

Published in: CEUR WORKSHOP PROCEEDINGS

Transformer-based language models like BERT have revolutionized Natural Language Processing (NLP) research, but their application to historical languages remains underexplored. This paper investigates the adaptation of BERT-based embedding models for Latin, a language central to the study of the sacred texts of Christianity. Focusing on Jerome’s Vulgate, pre-Vulgate Latin translations of the Bible, and patristic commentaries such as Augustine’s De Genesi ad litteram, we address the challenges posed by Latin’s complex syntax, specialized vocabulary, and historical variations at the orthographic, morphological, and semantic levels. In particular, we propose fine-tuning existing BERT-based embedding models on annotated Latin corpora, using self-generated hard negatives to improve performance in detecting biblical references in early Christian literature in Latin. Experimental results demonstrate the ability of BERT-based models to identify citations of and allusions to the Bible(s) in ancient Christian commentaries while highlighting the complexities and challenges of this field. By integrating NLP techniques with humanistic expertise, this work provides a case study on intertextual analysis in Latin patristic works. It underscores the transformative potential of interdisciplinary approaches, advancing computational tools for sacred text studies and bridging the gap between philology and computational analysis.

2025 Conference proceedings paper

BioGaze: a Framework for Evaluating the Photographic Requirements of the ISO/IEC 39794-5 Standard

Authors: Elatfi, Osama; Domenico, Nicolò Di; Borghi, Guido; Franco, Annalisa; Maltoni, Davide

2025 Conference proceedings paper

Bits2Bites: Intra-oral Scans Occlusal Classification

Authors: Borghi, Lorenzo; Lumetti, Luca; Cremonini, Francesca; Rizzo, Federico; Grana, Costantino; Lombardo, Luca; Bolelli, Federico

We introduce Bits2Bites, the first publicly available dataset for occlusal classification from intra-oral scans, comprising 200 paired upper and lower dental arches annotated across multiple clinically relevant dimensions (sagittal, vertical, transverse, and midline relationships). Leveraging this resource, we propose a multi-task learning benchmark that jointly predicts five occlusal traits from raw 3D point clouds using state-of-the-art point-based neural architectures. Our approach includes extensive ablation studies assessing the benefits of multi-task learning against single-task baselines, as well as the impact of automatically-predicted anatomical landmarks as input features. Results demonstrate the feasibility of directly inferring comprehensive occlusion information from unstructured 3D data, achieving promising performance across all tasks. Our entire dataset, code, and pretrained models are publicly released to foster further research in automated orthodontic diagnosis.

2025 Conference proceedings paper

BRUM: Robust 3D Vehicle Reconstruction from 360° Sparse Images

Authors: Di Nucci, Davide; Tomei, Matteo; Borghi, Guido; Ciuffreda, Luca; Vezzani, Roberto; Cucchiara, Rita

2025 Conference proceedings paper

Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Parascandolo, Fiorenzo; Moratelli, Nicholas; Sangineto, Enver; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference proceedings paper
