Publications - AImageLab

MATE: Multimodal Agent that Talks and Empathizes

Authors: Rawal, Niyati; Xia, Matteo; Tessaro, David; Baraldi, Lorenzo; Cucchiara, Rita

2025 Relazione in Atti di Convegno

IRIS

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Authors: Quattrini, F.; Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Diffusion models have become the State-of-the-Art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference … (Read full abstract)

Diffusion models have become the State-of-the-Art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, which is referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade-off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantical coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrate that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.

2025 Relazione in Atti di Convegno

DOI IRIS

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Authors: Pipoli, Vittorio; Saporita, Alessia; Bolelli, Federico; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models … (Read full abstract)

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate the aforementioned issue called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while preventing the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis.

2025 Relazione in Atti di Convegno

IRIS

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Authors: Compagnoni, Alberto; Caffagni, Davide; Moratelli, Nicholas; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to … (Read full abstract)

Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward.

2025 Relazione in Atti di Convegno

IRIS

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Authors: Cartella, Giuseppe; Cuculo, Vittorio; D'Amelio, Alessandro; Cornia, Marcella; Boccignone, Giuseppe; Cucchiara, Rita

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. … (Read full abstract)

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.

2025 Relazione in Atti di Convegno

DOI IRIS

Monocular per-object distance estimation with Masked Object Modeling

Authors: Panariello, Aniello; Mancusi, Gianluca; Haj Ali, Fedy; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

2025 Articolo su rivista

DOI IRIS

Multimodal Dialogue for Empathetic Human-Robot Interaction

Authors: Rawal, Niyati; Singh Maharjan, Rahul; Salici, Giacomo; Catalini, Riccardo; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Vezzani, Roberto; Cucchiara, Rita; Cangelosi, Angelo

2025 Relazione in Atti di Convegno

IRIS

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

Authors: Singh Maharjan, Rahul; Rawal, Niyati; Romeo, Marta; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

Published in: PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING

2025 Relazione in Atti di Convegno

DOI IRIS

One transformer for all time series: representing and training with time-dependent heterogeneous tabular data

Authors: Luetto, S.; Garuti, F.; Sangineto, E.; Forni, L.; Cucchiara, R.

Published in: MACHINE LEARNING

There is a recent growing interest in applying Deep Learning techniques to tabular data in order to replicate the success … (Read full abstract)

There is a recent growing interest in applying Deep Learning techniques to tabular data in order to replicate the success of other Artificial Intelligence areas in this structured domain. Particularly interesting is the case in which tabular data have a time dependence, such as, for instance, financial transactions. However, the heterogeneity of the tabular values, in which categorical elements are mixed with numerical features, makes this adaptation difficult. In this paper we propose UniTTab, a Transformer based architecture whose goal is to uniformly represent heterogeneous time-dependent tabular data, in which both numerical and categorical features are described using continuous embedding vectors. Moreover, differently from common approaches, which use a combination of different loss functions for training with both numerical and categorical targets, UniTTab is uniformly trained with a unique Masked Token pretext task. Finally, UniTTab can also represent time series in which the individual row components have a variable internal structure with a variable number of fields, which is a common situation in many application domains, such as in real world transactional data. Using extensive experiments with five datasets of variable size and complexity, we empirically show that UniTTab consistently and significantly improves the prediction accuracy over several downstream tasks and with respect to both Deep Learning and more standard Machine Learning approaches. Our code and our models are available at: https://github.com/fabriziogaruti/UniTTab.

2025 Articolo su rivista

DOI IRIS

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Authors: Amoroso, Roberto; Zhang, Gengyuan; Koner, Rajat; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker

Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION

Video Question Answering (Video QA) is a critical and challenging task in video understanding, necessitating models to comprehend entire videos, … (Read full abstract)

Video Question Answering (Video QA) is a critical and challenging task in video understanding, necessitating models to comprehend entire videos, identify the most pertinent information based on the contextual cues from the question, and reason accurately to provide answers. Initial endeavors in harnessing Multimodal Large Language Models (MLLMs) have cast new light on Visual QA, particularly highlighting their commonsense and temporal reasoning capacities. Models that effectively align visual and textual elements can offer more accurate answers tailored to visual inputs. Nevertheless, an unresolved question persists regarding video content: How can we efficiently extract the most relevant information from videos over time and space for enhanced VQA? In this study, we evaluate the efficacy of various temporal modeling techniques in conjunction with MLLMs and introduce a novel component, T-Former, designed as a question-guided temporal querying transformer. T-Former bridges frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across various VideoQA benchmarks shows that T-Former, with its linear computational complexity, competes favorably with existing temporal modeling approaches and aligns with the latest advancements in Video QA tasks.

2025 Relazione in Atti di Convegno

DOI IRIS

Publications by Rita Cucchiara

MATE: Multimodal Agent that Talks and Empathizes

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Monocular per-object distance estimation with Masked Object Modeling

Multimodal Dialogue for Empathetic Human-Robot Interaction

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

One transformer for all time series: representing and training with time-dependent heterogeneous tabular data

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries