Publications by Davide Caffagni

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Authors: Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.

2024 Conference proceedings paper

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate descriptive and accurate textual descriptions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.

2023 Conference proceedings paper

Page 2 of 2 • Total publications: 12