Publications by Sara Sarto

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Multimodal Understanding through Retrieval-Augmentation: from Models to Evaluation

Authors: Sarto, Sara

In the field of Artificial Intelligence (AI), the introduction of the attention mechanism and the Transformer architecture has made it possible to build models that process multiple modalities at an unprecedented scale. This breakthrough stems from the flexibility of the attention operator and the adaptability of the architecture, which have given rise to a new generation of vision-and-language systems. Among the tasks at the intersection of Computer Vision, Natural Language Processing, and Multimedia, image captioning, i.e., generating natural-language descriptions from visual content, has played a central role. In the era of Multimodal Large Language Models (MLLMs), captioning remains fundamental, alongside multimodal tasks such as Visual Question Answering (VQA). To strengthen such models, retrieval augmentation has emerged as a key strategy: enriching them with relevant external knowledge improves adaptability and enables more accurate, context-aware answers, especially in complex or specialized scenarios. This thesis follows the natural evolution of retrieval augmentation, moving from its early applications in image captioning to its integration into modern MLLMs. Each stage builds on the insights gained and the challenges encountered, addressing open problems related to evaluation and retrieval effectiveness. The first part of the thesis establishes the foundations of retrieval-augmented vision-and-language models. Classical cross-modal retrieval techniques are analyzed and extended to more complex settings, including multimodal queries and heterogeneous document collections. A central insight is that retrieval quality critically affects overall performance. In response, new multimodal retrievers, ReT and ReT-2, designed for such scenarios, are introduced. The thesis also investigates retrieval-augmented captioning architectures through the RA-Transformer, in which external knowledge is integrated directly into the generation process, providing signals that help produce richer and more precise captions. The work then extends retrieval augmentation to MLLMs, motivated by the observation that even large-scale pretraining falls short on knowledge-intensive or domain-specific queries. In particular, WikiLLaVA introduces retrieval-augmented MLLM architectures for knowledge-based VQA, in which retrieval mechanisms strengthen reasoning capabilities and adaptability to complex multimodal queries. Throughout this research, it emerges that progress in captioning models is held back by the lack of robust and reliable evaluation metrics. Traditional metrics, although widely used, often fail to capture semantic adequacy, factual grounding, and linguistic fluency. A further contribution of this thesis is therefore the design and analysis of new evaluation metrics for image captioning, namely PAC-S, BRIDGE, and an improved version of PAC-S. These metrics are designed to align with human judgment and to capture the quality of descriptions. The thesis also analyzes their application across different benchmarks and domains, including their ability to evaluate captions generated by MLLMs, reflecting the shift of captioning from a standalone task to a component of broader multimodal reasoning systems.
Overall, through new retrieval-augmented captioning architectures, multimodal retrievers, and evaluation metrics, this thesis provides methodologies, tools, and contributions that advance the state of the art in multimodal Artificial Intelligence.
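
As a rough illustration of the retrieval-augmentation pattern that runs through the thesis, the sketch below retrieves related text for an encoded query and conditions generation on it. The encoder, index, and decoder are placeholders, not the thesis models (RA-Transformer, ReT, WikiLLaVA).

```python
# Minimal sketch of a retrieval-augmentation loop (placeholder components,
# not the RA-Transformer / ReT / WikiLLaVA implementations).
import numpy as np

def encode(item: str) -> np.ndarray:
    # Stand-in for a CLIP-like encoder: a deterministic random vector per item.
    v = np.random.default_rng(abs(hash(item)) % 2**32).standard_normal(64)
    return v / np.linalg.norm(v)

knowledge = ["A basilica with a romanesque facade.",
             "A striker scoring a goal.",
             "A plate of tortellini in broth."]
index = np.stack([encode(t) for t in knowledge])   # external knowledge index

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ encode(query)                  # cosine similarity (unit vectors)
    return [knowledge[i] for i in np.argsort(-scores)[:k]]

def generate_caption(query: str) -> str:
    # A real decoder would attend over the retrieved passages during generation.
    return f"caption for '{query}', grounded on: {retrieve(query)}"

print(generate_caption("photo of a church"))
```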

2026 PhD thesis

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Authors: Mattioli, Gabriele; Turri, Evelyn; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources — such as APIs, computational utilities, and specialized models — to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
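
As a rough sketch of the retrieval-based formulation described above, the snippet below matches a structured task description against tool descriptions by embedding similarity; the sentence-transformers model and the tool cards are illustrative stand-ins, not the RaTA-Tool pipeline or dataset.

```python
# Hedged sketch of retrieval-based tool selection: a task description is matched
# against machine-readable tool descriptions by embedding similarity.
from sentence_transformers import SentenceTransformer, util

tools = {
    "image-segmentation": "Segments objects in an image and returns pixel masks.",
    "speech-to-text": "Transcribes spoken audio into written text.",
    "table-qa": "Answers questions over tabular data.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
tool_names = list(tools)
tool_embs = encoder.encode(list(tools.values()), convert_to_tensor=True)

def select_tool(task_description: str) -> str:
    # In RaTA-Tool the task description would come from an MLLM given a
    # multimodal query; here it is passed in directly.
    query_emb = encoder.encode(task_description, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, tool_embs)[0]
    return tool_names[int(scores.argmax())]

print(select_tool("Find the regions of the photo that contain a dog."))
```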

2026 Conference proceedings paper

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Compagnoni, Alberto; Morini, Marco; Sarto, Sara; Cocchi, Federico; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
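
The critic-filtering step described above can be pictured as follows: retrieve a coarse candidate set, score each passage against the question, and drop the noisy ones before answer generation. Below is a minimal sketch with a toy keyword-overlap critic standing in for ReAG's learned critic model.

```python
# Hedged sketch of critic-based passage filtering (toy critic, not ReAG's model).
from typing import Callable

def filter_passages(
    question: str,
    passages: list[str],
    critic_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> list[str]:
    # The critic assigns each (question, passage) pair a relevance score;
    # low-scoring, noisy passages are dropped before answer generation.
    return [p for p in passages if critic_score(question, p) >= threshold]

def toy_critic(question: str, passage: str) -> float:
    # Keyword overlap stands in for a learned relevance model.
    q, p = set(question.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

context = filter_passages(
    "which architect designed the museum shown in the image",
    ["The museum was designed by the architect Zaha Hadid.", "Opening hours: 9-17."],
    toy_critic,
    threshold=0.3,
)
print(context)
```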

2026 Conference proceedings paper

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Authors: Sarto, Sara; Cornia, Marcella; Cucchiara, Rita

The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
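
One of the dimensions mentioned above, correlation with human judgment, is typically measured by comparing a metric's scores against human ratings of the same captions, for instance with Kendall's tau. A minimal sketch with made-up scores:

```python
# Hedged sketch of the "correlation with human judgment" check: all scores below
# are invented for illustration, not results from the survey.
from scipy.stats import kendalltau

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0]       # e.g., averaged annotator scores
metric_scores = [0.81, 0.42, 0.65, 0.30, 0.90]  # scores from a captioning metric

tau, p_value = kendalltau(human_ratings, metric_scores)
print(f"Kendall tau: {tau:.3f} (p={p_value:.3f})")
```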

2025 Conference proceedings paper

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors: Cocchi, Federico; Moratelli, Nicholas; Caffagni, Davide; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

2025 Conference proceedings paper

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Authors: Sarto, Sara; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

2025 Journal article

Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Authors: Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, on both the query and the document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.
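
As a rough illustration of the LSTM-inspired gating mentioned above, the cell below fuses features coming from successive backbone layers through a learned sigmoidal gate; it is an illustrative stand-in, not the actual ReT recurrent cell.

```python
# Hedged sketch of sigmoidal gating across backbone layers (illustrative only).
import torch
import torch.nn as nn

class GatedFusionCell(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # produces the sigmoidal gate
        self.proj = nn.Linear(2 * dim, dim)   # candidate fused representation

    def forward(self, state: torch.Tensor, layer_feats: torch.Tensor) -> torch.Tensor:
        # `state` carries information from previous layers; `layer_feats` are the
        # visual or textual features extracted at the current backbone layer.
        joint = torch.cat([state, layer_feats], dim=-1)
        g = torch.sigmoid(self.gate(joint))   # how much new information to let in
        return g * torch.tanh(self.proj(joint)) + (1 - g) * state

cell = GatedFusionCell(dim=256)
state = torch.zeros(1, 256)
for feats in [torch.randn(1, 256) for _ in range(4)]:  # features from 4 layers
    state = cell(state, feats)
print(state.shape)  # torch.Size([1, 256])
```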

2025 Conference proceedings paper

Semantically Conditioned Prompts for Visual Recognition under Missing Modality Scenarios

Authors: Pipoli, Vittorio; Bolelli, Federico; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION

This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) we introduce a novel prompt learning module which is designed to produce sample-specific prompts and (ii) we show that modality-agnostic prompts can effectively adjust to diverse missing modality scenarios. Our model, termed SCP, exploits the semantic representation of available modalities to query a learnable memory bank, which allows the generation of prompts based on the semantics of the input. Notably, SCP distinguishes itself from existing methodologies through its capacity to self-adjust to both the missing-modality scenario and the semantic context of the input, without prior knowledge of the specific missing modality or the number of modalities. Through extensive experiments, we show the effectiveness of the proposed prompt learning framework and demonstrate enhanced performance and robustness across a spectrum of missing modality cases.
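
The memory-bank querying described above can be pictured as cross-attention from the available-modality features to a set of learnable memory slots, yielding sample-specific prompt tokens. The sketch below is illustrative only; the dimensions and attention layout are assumptions, not the SCP design.

```python
# Hedged sketch of prompt generation from a learnable memory bank (not SCP itself).
import torch
import torch.nn as nn

class PromptMemory(nn.Module):
    def __init__(self, dim: int = 256, memory_slots: int = 16, n_prompts: int = 4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))  # learnable bank
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.n_prompts = n_prompts

    def forward(self, avail_feats: torch.Tensor) -> torch.Tensor:
        # avail_feats: (batch, tokens, dim) features from whichever modalities exist.
        mem = self.memory.unsqueeze(0).expand(avail_feats.size(0), -1, -1)
        prompts, _ = self.attn(query=avail_feats, key=mem, value=mem)
        return prompts[:, : self.n_prompts]   # sample-specific prompt tokens

prompts = PromptMemory()(torch.randn(2, 8, 256))
print(prompts.shape)  # torch.Size([2, 4, 256])
```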

2025 Conference proceedings paper

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
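
A rough picture of the pseudo-caption mechanism described above: visual features are mapped into the text embedding space and spliced into a templated caption, which is then compared with the candidate. The sketch below illustrates only that mechanism; the mapper, template, and scoring are placeholders, not the BRIDGE architecture.

```python
# Hedged sketch of building and scoring a multimodal pseudo-caption (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
mapper = nn.Linear(768, dim)          # maps visual features into the text space
text_embed = nn.Embedding(1000, dim)  # stand-in caption token embedder

visual_feats = torch.randn(1, 4, 768)                    # e.g., pooled visual features
visual_tokens = mapper(visual_feats)                     # dense visual "words"
template_tokens = text_embed(torch.tensor([[1, 2, 3]]))  # "a photo of ..." stub

# Pseudo-caption: template tokens followed by the mapped visual tokens.
pseudo_caption = torch.cat([template_tokens, visual_tokens], dim=1).mean(dim=1)
candidate = text_embed(torch.tensor([[7, 8, 9, 10]])).mean(dim=1)

score = F.cosine_similarity(pseudo_caption, candidate)   # reference-free score stub
print(score)
```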

2024 Conference proceedings paper

Multi-Class Unlearning for Image Classification via Weight Filtering

Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.
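
The weight-filtering idea described above can be pictured as a per-class gate over a layer's weights: selecting a class's mask modulates the weights so the network behaves as if that class had been unlearned. Below is a single-layer sketch under those assumptions, not the paper's full framework.

```python
# Hedged sketch of per-class weight filtering on a single linear layer.
from typing import Optional

import torch
import torch.nn as nn

class ClassFilteredLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # One memory matrix of per-weight gates for every class to unlearn.
        self.class_masks = nn.Parameter(torch.ones(num_classes, out_dim, in_dim))

    def forward(self, x: torch.Tensor, unlearn_class: Optional[int] = None) -> torch.Tensor:
        weight = self.linear.weight
        if unlearn_class is not None:
            # Gate the weights with the selected class's (learned) mask.
            weight = weight * torch.sigmoid(self.class_masks[unlearn_class])
        return nn.functional.linear(x, weight, self.linear.bias)

layer = ClassFilteredLinear(16, 8, num_classes=10)
x = torch.randn(2, 16)
print(layer(x).shape, layer(x, unlearn_class=3).shape)
```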

2024 Journal article

Page 1 of 2 • Total publications: 18