Publications - AImageLab

Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models

Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Zeng, Zifan; Zhang, Chongzhe; Zhang, Qunli; Cornia, Marcella; Baraldi, Lorenzo; Liu, Feng; Hu, Zheng; Cucchiara, Rita

Inference-time scaling has recently gained attention as an effective strategy for improving the performance of generative models without requiring additional … (Read full abstract)

Inference-time scaling has recently gained attention as an effective strategy for improving the performance of generative models without requiring additional training. Although this paradigm has been successfully applied in text and image generation tasks, its extension to video diffusion models remains relatively underexplored. Indeed, video generation presents unique challenges due to its spatiotemporal complexity, particularly in evaluating intermediate generated samples, a procedure that is required by inference-time scaling algorithms. In this work, we systematically investigate the role of the verifier: the scoring mechanism used to guide sampling. We show that current verifiers, when applied at early diffusion steps, face significant reliability challenges due to noisy samples. We further demonstrate that fine-tuning verifiers on partially denoised samples significantly improves early-stage evaluation and leads to gains in generation quality across multiple inference-time scaling algorithms, including Greedy Search, Beam Search, and a novel Successive Halving baseline.

2025 Relazione in Atti di Convegno

IRIS

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most … (Read full abstract)

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

2025 Relazione in Atti di Convegno

IRIS

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Authors: Pippi, Vittorio; Quattrini, Fabio; Cascianelli, Silvia; Tonioni, Alessio; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

2025 Relazione in Atti di Convegno

DOI IRIS

Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

Authors: Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative … (Read full abstract)

The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative models. Such technological progress necessitates the evolution of existing deepfake detection algorithms to counter new threats and protect the integrity of perceived reality. Although the prevailing approach among deepfake detection methodologies relies on large collections of generated and real data, the efficacy of these methods in adapting to scenarios characterized by data scarcity remains uncertain. This obstacle arises due to the introduction of novel generation algorithms and proprietary generative models that impose restrictions on access to large-scale datasets, thereby constraining the availability of generated images. In this paper, we first analyze how the performance of current deepfake methodologies, based on the CLIP embedding space, adapt in a few-shot situation over four state-of-the-art generators. Being the CLIP embedding space not specifically tailored for the task, a fine-tuning stage is desirable, although the amount of data needed is often unavailable in a data scarcity scenario. To address this issue and limit possible overfitting, we introduce a novel approach through the Low-Rank Adaptation (LoRA) of the CLIP architecture, tailored for few-shot deepfake detection scenarios. Remarkably, the LoRA-modified CLIP, even when fine-tuned with merely 50 pairs of real and fake images, surpasses the performance of all evaluated deepfake detection models across the tested generators. Additionally, when LoRA CLIP is benchmarked against other models trained on 1,000 samples and evaluated on generative models not seen during training it exhibits superior generalization capabilities.

2024 Relazione in Atti di Convegno

IRIS

AIGeN: An Adversarial Approach for Instruction Generation in VLN

Authors: Rawal, Niyati; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

2024 Relazione in Atti di Convegno

DOI IRIS

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization

Authors: Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Few-shot learning (FSL) requires fine-tuning a pretrained model on a limited set of examples from novel classes. When applied to … (Read full abstract)

Few-shot learning (FSL) requires fine-tuning a pretrained model on a limited set of examples from novel classes. When applied to vision-and-language models, the dominant approach for FSL has been that of learning input prompts which can be concatenated to the input context of the model. Despite the considerable promise they hold, the effectiveness and expressive power of prompts are limited by the fact that they can only lie at the input of the architecture. In this article, we critically question the usage of learnable prompts, and instead leverage the concept of “implicit memory” to directly capture low- and high-level relationships within the attention mechanism at any layer of the architecture, thereby establishing an alternative to prompts in FSL. Our proposed approach, termed MemOp, exhibits superior performance across 11 widely recognized image classification datasets and a benchmark for contextual domain shift evaluation, effectively addressing the challenges associated with learnable prompts.

2024 Articolo su rivista

DOI IRIS

Binarizing Documents by Leveraging both Space and Frequency

Authors: Quattrini, F.; Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. … (Read full abstract)

Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.

2024 Relazione in Atti di Convegno

DOI IRIS

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like … (Read full abstract)

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

2024 Relazione in Atti di Convegno

IRIS

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Authors: Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses … (Read full abstract)

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.

2024 Relazione in Atti di Convegno

IRIS

Diffusion and Autoregressive Deep Learning models for Transactional Data Generation

Authors: Garuti, Fabrizio; Luetto, Simone; Sangineto Lorenzo Forni, Enver; Cucchiara, Rita

2024 Relazione in Atti di Convegno

IRIS

Publications by Rita Cucchiara

Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

AIGeN: An Adversarial Approach for Instruction Generation in VLN

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization

Binarizing Documents by Leveraging both Space and Frequency

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Diffusion and Autoregressive Deep Learning models for Transactional Data Generation