Publications by Marcella Cornia
Explore our research publications: papers, articles, and conference proceedings from AImageLab.
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Authors: Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries -- composed of both an image and a text -- and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.
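To make the gating idea concrete, here is a minimal sketch of a Transformer-based recurrent fusion cell with LSTM-style sigmoidal gates, of the kind the abstract describes. All names, dimensions, and the exact fusion scheme are illustrative assumptions, not the released ReT code linked above:

```python
# Illustrative sketch (not the released ReT implementation): a recurrent cell
# that fuses visual and textual features from successive backbone layers,
# using LSTM-inspired sigmoidal gates to control what enters the running state.
import torch
import torch.nn as nn

class GatedRecurrentFusionCell(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.input_gate = nn.Linear(2 * dim, dim)
        self.forget_gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, state, visual_feats, textual_feats):
        # Attend from the running state to the current layer's multimodal features.
        context = torch.cat([visual_feats, textual_feats], dim=1)
        attended, _ = self.cross_attn(state, context, context)
        gates_in = torch.cat([state, attended], dim=-1)
        i = torch.sigmoid(self.input_gate(gates_in))    # how much new information to admit
        f = torch.sigmoid(self.forget_gate(gates_in))   # how much of the old state to keep
        return self.norm(f * state + i * attended)

# Toy usage: iterate the cell over features from four backbone layers.
cell = GatedRecurrentFusionCell(dim=256)
state = torch.zeros(2, 16, 256)                         # (batch, query tokens, dim)
for layer_visual, layer_text in zip(
    [torch.randn(2, 49, 256) for _ in range(4)],
    [torch.randn(2, 32, 256) for _ in range(4)],
):
    state = cell(state, layer_visual, layer_text)
```

The final state would then serve as the multi-level query (or document) representation used for retrieval.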
Sanctuaria-Gaze: A Multimodal Egocentric Dataset for Human Attention Analysis in Religious Sites
Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Papasidero, Marco; Ruozzi, Federico; Cucchiara, Rita
Published in: ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE
We introduce Sanctuaria-Gaze, a multimodal dataset featuring egocentric recordings from 40 visits to four architecturally and culturally significant sanctuaries in Northern Italy. Collected using wearable devices with integrated eye trackers, the dataset offers RGB videos synchronized with streams of gaze coordinates, head motion, and environmental point cloud, resulting in over four hours of recordings. Along with the dataset, we provide a framework for automatic detection and analysis of Areas of Interest (AOIs). This framework fills a critical gap by offering an open-source, flexible tool for gaze-based research that adapts to dynamic settings without requiring manual intervention. Our study analyzes human visual attention to sacred, architectural, and cultural objects, providing insights into how visitors engage with these elements and how their background influences their interactions. By releasing both the dataset and the analysis framework, Sanctuaria-Gaze aims to advance interdisciplinary research on gaze behavior, human-computer interaction, and visual attention in real-world environments. Code and dataset are available at https://github.com/aimagelab/Sanctuaria-Gaze.
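As a rough illustration of the kind of gaze-to-AOI analysis such a framework automates, the sketch below accumulates dwell time per Area of Interest from per-frame gaze points and AOI bounding boxes. The data structures and field names are hypothetical, not the released Sanctuaria-Gaze API:

```python
# Hypothetical sketch of a gaze-to-AOI analysis step: given per-frame gaze
# samples and per-frame AOI bounding boxes, accumulate dwell time per AOI.
from collections import defaultdict

def dwell_time_per_aoi(gaze_samples, aoi_boxes, frame_dt=1 / 30):
    """gaze_samples: list of (frame_idx, x, y); aoi_boxes: {frame_idx: {name: (x0, y0, x1, y1)}}."""
    dwell = defaultdict(float)
    for frame_idx, x, y in gaze_samples:
        for name, (x0, y0, x1, y1) in aoi_boxes.get(frame_idx, {}).items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                dwell[name] += frame_dt
    return dict(dwell)

# Toy example: two gaze samples, one landing inside a hypothetical "altar" AOI.
print(dwell_time_per_aoi(
    [(0, 120, 80), (1, 400, 300)],
    {0: {"altar": (100, 50, 200, 150)}, 1: {"altar": (100, 50, 200, 150)}},
))
```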
Semantically Conditioned Prompts for Visual Recognition under Missing Modality Scenarios
Authors: Pipoli, Vittorio; Bolelli, Federico; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa
Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION
This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) we introduce a novel prompt learning module which is designed to produce sample-specific prompts and (ii) we show that modality-agnostic prompts can effectively adjust to diverse missing modality scenarios. Our model, termed SCP, exploits the semantic representation of available modalities to query a learnable memory bank, which allows the generation of prompts based on the semantics of the input. Notably, SCP distinguishes itself from existing methodologies by its capacity to self-adjust both to the missing modality scenario and to the semantic context of the input, without prior knowledge about the specific missing modality or the number of modalities. Through extensive experiments, we show the effectiveness of the proposed prompt learning framework and demonstrate enhanced performance and robustness across a spectrum of missing modality cases.
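A minimal sketch of the memory-bank idea follows: the pooled semantic representation of whatever modalities are available queries a set of learnable slots, and the resulting mixture yields sample-specific prompt tokens. Sizes and names are assumptions, not the SCP release:

```python
# Illustrative sketch (not the SCP implementation): sample-specific prompts
# obtained by querying a learnable memory bank with the semantic
# representation of the available modalities.
import torch
import torch.nn as nn

class PromptMemoryBank(nn.Module):
    def __init__(self, dim: int, bank_size: int = 64, prompt_len: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(bank_size, dim))
        self.values = nn.Parameter(torch.randn(bank_size, prompt_len, dim))

    def forward(self, semantic_repr):
        # semantic_repr: (batch, dim), e.g. mean-pooled features of the present modality.
        weights = torch.softmax(semantic_repr @ self.keys.t(), dim=-1)   # (batch, bank_size)
        # A weighted mixture of memory slots produces the prompt tokens.
        return torch.einsum("bk,kld->bld", weights, self.values)         # (batch, prompt_len, dim)

bank = PromptMemoryBank(dim=512)
available = torch.randn(4, 512)
prompts = bank(available)   # prepend these tokens to the multimodal Transformer input
```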
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Authors: Barsellotti, Luca; Bianchi, Lorenzo; Messina, Nicola; Carrara, Fabio; Cornia, Marcella; Baraldi, Lorenzo; Falchi, Fabrizio; Cucchiara, Rita
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
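The core alignment step can be sketched as a small learned projection that maps CLIP text embeddings into the DINOv2 patch-feature space, with per-patch similarity producing the open-vocabulary assignment. Dimensions and the mapping architecture below are assumptions for illustration, not the released Talk2DINO code:

```python
# Illustrative sketch (not the Talk2DINO release): project CLIP text embeddings
# into the DINOv2 patch space and score each patch against each class prompt.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchAlignment(nn.Module):
    def __init__(self, clip_dim: int = 512, dino_dim: int = 768):
        super().__init__()
        # Only this mapping is trained; both backbones stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, dino_dim), nn.Tanh(), nn.Linear(dino_dim, dino_dim)
        )

    def forward(self, text_emb, patch_feats):
        # text_emb: (num_classes, clip_dim); patch_feats: (num_patches, dino_dim)
        t = F.normalize(self.proj(text_emb), dim=-1)
        p = F.normalize(patch_feats, dim=-1)
        return p @ t.t()                                  # (num_patches, num_classes)

align = TextToPatchAlignment()
sims = align(torch.randn(3, 512), torch.randn(14 * 14, 768))
labels = sims.argmax(dim=-1).reshape(14, 14)              # coarse per-patch class map
```

Upsampling this coarse map (and thresholding against a background score) would give the final segmentation mask.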
TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes
Authors: D'Amelio, Alessandro; Cartella, Giuseppe; Cuculo, Vittorio; Lucchi, Manuele; Cornia, Marcella; Cucchiara, Rita; Boccignone, Giuseppe
Attention guides our gaze to fixate the proper location of the scene and holds it there for the appropriate amount of time given current processing demands, before shifting to the next one. As such, gaze deployment is crucially a temporal process. Existing computational models have made significant strides in predicting the spatial aspects of observers' visual scanpaths (where to look), while often relegating the temporal facet of attention dynamics (when) to the background. In this paper we present TPP-Gaze, a novel and principled approach to modelling scanpath dynamics based on Neural Temporal Point Processes (TPP), which jointly learns the temporal dynamics of fixation positions and durations, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches.
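To illustrate the modelling idea, the sketch below encodes the scanpath so far with a recurrent network and parameterises a simple distribution over the next fixation's onset time and location. The specific distributions and heads are deliberately simplified stand-ins, not the TPP-Gaze architecture:

```python
# Simplified sketch of a neural temporal point process over fixations
# (assumptions for illustration, not the TPP-Gaze release).
import torch
import torch.nn as nn

class ScanpathTPP(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)  # (dt, x, y) per fixation
        self.rate_head = nn.Linear(hidden, 1)   # intensity of the next event (inter-fixation time)
        self.loc_head = nn.Linear(hidden, 2)    # mean of the next fixation position

    def forward(self, events):
        # events: (batch, seq_len, 3) with inter-fixation times and normalised coordinates.
        h, _ = self.rnn(events)
        rate = torch.nn.functional.softplus(self.rate_head(h))   # strictly positive
        loc = torch.sigmoid(self.loc_head(h))                    # in [0, 1]^2
        return rate, loc

    def nll(self, events):
        # Negative log-likelihood with an exponential waiting time and an
        # isotropic Gaussian over position: a toy stand-in for the real model.
        rate, loc = self.forward(events[:, :-1])
        next_dt, next_xy = events[:, 1:, :1], events[:, 1:, 1:]
        time_ll = torch.log(rate) - rate * next_dt
        pos_ll = -((next_xy - loc) ** 2).sum(-1, keepdim=True)
        return -(time_ll + pos_ll).mean()

model = ScanpathTPP()
loss = model.nll(torch.rand(8, 12, 3))   # 8 scanpaths of 12 fixations
loss.backward()
```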
Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models
Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Zeng, Zifan; Zhang, Chongzhe; Zhang, Qunli; Cornia, Marcella; Baraldi, Lorenzo; Liu, Feng; Hu, Zheng; Cucchiara, Rita
Inference-time scaling has recently gained attention as an effective strategy for improving the performance of generative models without requiring additional training. Although this paradigm has been successfully applied in text and image generation tasks, its extension to video diffusion models remains relatively underexplored. Indeed, video generation presents unique challenges due to its spatiotemporal complexity, particularly in evaluating intermediate generated samples, a procedure that is required by inference-time scaling algorithms. In this work, we systematically investigate the role of the verifier: the scoring mechanism used to guide sampling. We show that current verifiers, when applied at early diffusion steps, face significant reliability challenges due to noisy samples. We further demonstrate that fine-tuning verifiers on partially denoised samples significantly improves early-stage evaluation and leads to gains in generation quality across multiple inference-time scaling algorithms, including Greedy Search, Beam Search, and a novel Successive Halving baseline.
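The Successive Halving idea can be sketched over a generic denoiser and verifier: start from many candidates, periodically score the partially denoised samples, and keep only the best half at each round. The function names below are placeholders, not the paper's code or any diffusion library API:

```python
# Hedged sketch of verifier-guided Successive Halving at inference time.
import random

def successive_halving(init_noise, denoise_step, verifier, num_candidates=16, num_steps=32):
    candidates = [init_noise() for _ in range(num_candidates)]
    checkpoints = {num_steps // 4, num_steps // 2, 3 * num_steps // 4}
    for step in range(num_steps):
        candidates = [denoise_step(c, step) for c in candidates]
        if step in checkpoints and len(candidates) > 1:
            # The verifier scores partially denoised samples; keep the top half.
            candidates.sort(key=verifier, reverse=True)
            candidates = candidates[: max(1, len(candidates) // 2)]
    return max(candidates, key=verifier)

# Toy stand-ins so the sketch runs end to end.
best = successive_halving(
    init_noise=lambda: random.random(),
    denoise_step=lambda sample, step: sample * 0.9,
    verifier=lambda sample: -abs(sample),   # higher is better
)
```

The paper's observation is that the quality of this loop hinges on the verifier being reliable on noisy, early-step samples, which is why fine-tuning it on partially denoised data helps.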
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita
Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.
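A purely illustrative sketch of the two-stage evaluation described above: first localise what changed between the two images, then score whether each change matches the edit instruction. All functions here are hypothetical placeholders, not the released DICE models:

```python
# Illustrative pipeline sketch: difference detection followed by coherence
# estimation, aggregated into a single edit-quality score.
from dataclasses import dataclass

@dataclass
class Difference:
    region: tuple          # bounding box (x0, y0, x1, y1)
    description: str       # textual description of the local change

def evaluate_edit(original, edited, instruction, difference_detector, coherence_estimator):
    differences = difference_detector(original, edited)                   # stage 1: what changed
    if not differences:
        return 0.0
    scores = [coherence_estimator(d, instruction) for d in differences]   # stage 2: was it requested?
    return sum(scores) / len(scores)

# Toy stand-ins so the sketch runs.
score = evaluate_edit(
    original="orig.png", edited="edit.png", instruction="make the sky pink",
    difference_detector=lambda a, b: [Difference((0, 0, 64, 32), "sky turned pink")],
    coherence_estimator=lambda diff, instr: 1.0 if "pink" in diff.description else 0.0,
)
```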
Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation
Authors: Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative models. Such technological progress necessitates the evolution of existing deepfake detection algorithms to counter new threats and protect the integrity of perceived reality. Although the prevailing approach among deepfake detection methodologies relies on large collections of generated and real data, the efficacy of these methods in adapting to scenarios characterized by data scarcity remains uncertain. This obstacle arises due to the introduction of novel generation algorithms and proprietary generative models that impose restrictions on access to large-scale datasets, thereby constraining the availability of generated images. In this paper, we first analyze how the performance of current deepfake detection methodologies, based on the CLIP embedding space, adapts in a few-shot setting over four state-of-the-art generators. Since the CLIP embedding space is not specifically tailored to the task, a fine-tuning stage is desirable, although the amount of data needed is often unavailable in a data scarcity scenario. To address this issue and limit possible overfitting, we introduce a novel approach through the Low-Rank Adaptation (LoRA) of the CLIP architecture, tailored for few-shot deepfake detection scenarios. Remarkably, the LoRA-modified CLIP, even when fine-tuned with merely 50 pairs of real and fake images, surpasses the performance of all evaluated deepfake detection models across the tested generators. Additionally, when LoRA CLIP is benchmarked against other models trained on 1,000 samples and evaluated on generative models not seen during training, it exhibits superior generalization capabilities.
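A minimal sketch of the low-rank adaptation idea follows: frozen linear projections are wrapped with trainable low-rank factors, and only those factors plus a small real-vs-fake head are updated on a handful of image pairs. The layer sizes, rank, and training setup are assumptions, not the paper's code:

```python
# Illustrative sketch of LoRA-style adaptation for few-shot real/fake classification.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # frozen backbone weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus a trainable low-rank update.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy stand-in for a CLIP projection layer, adapted with LoRA and topped with a
# real-vs-fake classifier; with ~50 pairs only A, B, and the head are updated.
proj = LoRALinear(nn.Linear(768, 512))
head = nn.Linear(512, 2)
trainable = [p for p in list(proj.parameters()) + list(head.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

features = torch.randn(100, 768)   # placeholder embeddings for 50 real + 50 fake images
labels = torch.cat([torch.zeros(50, dtype=torch.long), torch.ones(50, dtype=torch.long)])
loss = nn.functional.cross_entropy(head(proj(features)), labels)
loss.backward()
optimizer.step()
```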