Publications - AImageLab

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization

Authors: Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Few-shot learning (FSL) requires fine-tuning a pretrained model on a limited set of examples from novel classes. When applied to vision-and-language models, the dominant approach for FSL has been that of learning input prompts which can be concatenated to the input context of the model. Despite the considerable promise they hold, the effectiveness and expressive power of prompts are limited by the fact that they can only lie at the input of the architecture. In this article, we critically question the usage of learnable prompts, and instead leverage the concept of “implicit memory” to directly capture low- and high-level relationships within the attention mechanism at any layer of the architecture, thereby establishing an alternative to prompts in FSL. Our proposed approach, termed MemOp, exhibits superior performance across 11 widely recognized image classification datasets and a benchmark for contextual domain shift evaluation, effectively addressing the challenges associated with learnable prompts.

2024 Journal article

DOI IRIS

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

2024 Conference proceedings paper

IRIS

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Authors: Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.

2024 Conference proceedings paper

IRIS

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Authors: Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as a reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which not only significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.

2024 Conference proceedings paper

IRIS

FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval

Authors: Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

2024 Conference proceedings paper

DOI IRIS

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Authors: Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

2024 Journal article

DOI IRIS

Intelligent Multimodal Artificial Agents that Talk and Express Emotions

Authors: Rawal, Niyati; Maharjan, Rahul Singh; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

2024 Conference proceedings paper

IRIS

Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Authors: Bigazzi, Roberto; Baraldi, Lorenzo; Kousik, Shreyas; Cucchiara, Rita; Pavone, Marco

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

2024 Conference proceedings paper

DOI IRIS

Multi-Class Unlearning for Image Classification via Weight Filtering

Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.

2024 Journal article

DOI IRIS

Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection

Authors: Betti, Federico; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

2024 Conference proceedings paper

IRIS
