Publications - AImageLab

Beyond the Surface: Comprehensive Analysis of Implicit Bias in Vision-Language Models

Authors: Capitani, Giacomo; Lucarini, Alice; Bonicelli, Lorenzo; Bolelli, Federico; Calderara, Simone; Vezzali, Loris; Ficarra, Elisa

Implicit biases, subtle and unconscious attitudes, permeate various facets of human decision-making and are similarly pervasive in Artificial Intelligence (AI) systems. These biases can stem from shortcut learning, where models rely on superficial patterns that do not capture the underlying phenomena. Inspired by social psychology studies, we introduce two novel metrics to analyze implicit biases in visual-language models. Our comprehensive analysis of 90 open-clip models reveals widespread anomalies related to ethnicity and gender. The first metric considers the cosine similarity between images and text prompts related to social stereotypes. The second metric adapts the Implicit Association Test (IAT), which evaluates prejudice and hidden discrimination within human behavior. Our findings illustrate that conventional text-based debiasing efforts can inadvertently amplify second-order biases instead of mitigating them. Furthermore, in expanding our evaluation to multimodal Large Language Models (LLMs), we demonstrate disparities in the tendency to generate semantically positive or negative outputs, depending on the ethnicity or gender of the individuals depicted in the input images.
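As a rough illustration of the first metric's basic ingredient, the sketch below computes the cosine similarity between one image embedding and a few text-prompt embeddings. The vectors and prompt names are hypothetical toy values, not outputs of any open-clip model, and the code does not reproduce the metric defined in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption: real ones would come from the
# image and text encoders of a vision-language model).
image_embedding = [1.0, 0.0, 1.0]
prompt_embeddings = {
    "prompt_a": [1.0, 0.0, 1.0],  # same direction as the image embedding
    "prompt_b": [0.0, 1.0, 0.0],  # orthogonal to the image embedding
}

# One similarity score per prompt for this single image.
scores = {
    name: cosine_similarity(image_embedding, emb)
    for name, emb in prompt_embeddings.items()
}
```

Comparing such scores across groups of images is one plausible way to surface differences in how prompts are associated with them; the exact aggregation used by the authors is not stated in the abstract.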

2024 Conference proceedings paper

Binarizing Documents by Leveraging both Space and Frequency

Authors: Quattrini, F.; Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.
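The contrast between local and global modeling can be illustrated in one dimension. The sketch below is not an implementation of Fast Fourier Convolutions or of the proposed model; it only shows, under toy assumptions (a hand-picked 8-sample signal and hand-picked frequency weights), that a pointwise operation in the frequency domain makes every output depend on every input, whereas a small spatial kernel only reaches its neighbours.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a 1D sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def spectral_filter(signal, weights):
    """Scale each frequency bin by a weight, then return to the signal domain."""
    spectrum = dft(signal)
    filtered = [s * w for s, w in zip(spectrum, weights)]
    return [value.real for value in idft(filtered)]

def local_filter(signal):
    """3-tap moving average with zero padding: each output sees 3 inputs."""
    padded = [0.0] + list(signal) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(signal))]

def count_changed(before, after, tol=1e-6):
    """Number of output positions that differ by more than tol."""
    return sum(1 for b, a in zip(before, after) if abs(a - b) > tol)

# Hypothetical toy signal; perturb a single sample and see how far it spreads.
signal = [0.0, 1.0, 0.5, 0.2, 0.9, 0.3, 0.7, 0.1]
perturbed = list(signal)
perturbed[4] += 1.0

# Hand-picked symmetric weights (w[k] == w[n - k]) so the output stays real.
weights = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.125, 0.25, 0.5]

changed_local = count_changed(local_filter(signal), local_filter(perturbed))
changed_spectral = count_changed(spectral_filter(signal, weights),
                                 spectral_filter(perturbed, weights))
```

With these values the local filter changes 3 outputs, while the spectral filter changes all 8, which is the global-context property the abstract refers to.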

2024 Conference proceedings paper

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

2024 Conference proceedings paper

CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Authors: Frascaroli, Emanuele; Panariello, Aniello; Buzzega, Pietro; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone

With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a prevalent strategy in Continual Learning. This has led to the development of numerous prompting strategies to adapt transformer-based models without incurring catastrophic forgetting. However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that significantly deviate from the pre-training data. In this work, we propose Continual Generative training for Incremental prompt-Learning, a simple and novel approach to mitigate forgetting while adapting CLIP. Briefly, we employ Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder. We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks. Through extensive experiments on different domains, we show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities, evaluated using a novel metric tailored for CL scenarios. Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.
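The idea of replaying synthetic embeddings drawn from class-conditioned distributions can be sketched with a much simpler stand-in. The code below fits an independent Gaussian per embedding dimension for each class and samples from it; this replaces the Variational Autoencoders named in the abstract with a deliberately simpler generator, and the embeddings, labels, and function names are hypothetical. It is not taken from the mammoth codebase.

```python
import random
import statistics

def fit_class_gaussians(embeddings, labels):
    """Per class, estimate a mean and standard deviation for each dimension."""
    by_class = {}
    for emb, label in zip(embeddings, labels):
        by_class.setdefault(label, []).append(emb)
    stats = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # one tuple of values per dimension
        means = [statistics.fmean(c) for c in columns]
        stds = [statistics.pstdev(c) for c in columns]
        stats[label] = (means, stds)
    return stats

def sample_embeddings(stats, label, count, rng):
    """Draw synthetic embeddings for one class from its fitted Gaussians."""
    means, stds = stats[label]
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(count)]

# Hypothetical toy embeddings standing in for visual-encoder outputs.
embeddings = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0], [0.8, 0.2]]
labels = ["cat", "cat", "dog", "dog"]

stats = fit_class_gaussians(embeddings, labels)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replay = sample_embeddings(stats, "cat", 500, rng)
```

In a continual setting, samples like `replay` could stand in for embeddings of earlier classes when training on later tasks, so the original images need not be stored.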

2024 Conference proceedings paper

ClusterFix: A Cluster-Based Debiasing Approach without Protected-Group Supervision

Authors: Capitani, Giacomo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa

The failures of Deep Networks can sometimes be ascribed to biases in the data or algorithmic choices. Existing debiasing approaches exploit prior knowledge to avoid unintended solutions; we acknowledge that, in real-world settings, it could be unfeasible to gather enough prior information to characterize the bias, or it could even raise ethical considerations. We hence propose a novel debiasing approach, termed ClusterFix, which does not require any external hint about the nature of biases. Such an approach alters the standard empirical risk minimization and introduces a per-example weight, encoding how critical and far from the majority an example is. Notably, the weights consider how difficult it is for the model to infer the correct pseudo-label, which is obtained in a self-supervised manner by dividing examples into multiple clusters. Extensive experiments show that the misclassification error incurred in identifying the correct cluster allows for identifying examples prone to bias-related issues. As a result, our approach outperforms existing methods on standard benchmarks for bias removal and fairness.
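A per-example weight changes empirical risk minimization in a simple way, which the following sketch illustrates. The weights here are hand-picked for illustration; the rule ClusterFix uses to derive them from cluster pseudo-labels is not given in the abstract and is not reproduced. Normalizing by the sum of the weights is also an assumption of this sketch.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])

def weighted_risk(batch_probs, labels, weights):
    """Weighted average of per-example losses (assumed normalization)."""
    total = sum(w * cross_entropy(p, y)
                for p, y, w in zip(batch_probs, labels, weights))
    return total / sum(weights)

# Hypothetical predicted class probabilities for three examples.
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]  # the third example is misclassified

# Standard ERM corresponds to uniform weights.
uniform_risk = weighted_risk(batch_probs, labels, [1.0, 1.0, 1.0])

# Up-weighting the hard third example makes it dominate the objective.
reweighted_risk = weighted_risk(batch_probs, labels, [1.0, 1.0, 3.0])
```

With uniform weights the function reduces to the usual mean loss, so the weighting can be read as a strict generalization of standard training.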

2024 Conference proceedings paper

Compact High-Resolution Multi-Wavelength LED Light Source for Eye Stimulation

Authors: Gibertoni, Giovanni; Borghi, Guido; Rovati, Luigi

Published in: ELECTRONICS

Eye stimulation research plays a critical role in advancing our understanding of visual processing and developing new therapies for visual impairments. Despite its importance, researchers and clinicians still face challenges with the availability of cost-effective, precise, and versatile tools for conducting these studies. Therefore, this study introduces a high-resolution, compact, and budget-friendly multi-wavelength LED light source tailored for precise and versatile eye stimulation, addressing the aforementioned needs in medical research and visual science. Accommodating standard 3 mm or 5 mm package LEDs, the system boasts broad compatibility, while its integration with any microcontroller capable of PWM generation and supporting SPI and UART communication ensures adaptability across diverse applications. Operating at high resolution (18 bits or more) with great linearity, the LED light source offers nuanced control for sophisticated eye stimulation protocols. The simple 3D printable optical design allows the coupling of up to seven different wavelengths while ensuring the cost-effectiveness of the device. The system’s output has been designed to be fiber-coupled with standard SMA connectors to be compatible with most solutions. The proposed implementation significantly undercuts the cost of commercially available solutions, providing a viable, budget-friendly option for advancing eye stimulation research.
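To put the quoted resolution in perspective, the short calculation below converts a bit depth into the number of distinct intensity levels and the size of one step relative to full scale. It is plain arithmetic on the "18 bits" figure and says nothing about how the device reaches that resolution; the function names are hypothetical.

```python
def intensity_levels(bits):
    """Number of distinct settings available at a given bit depth."""
    return 2 ** bits

def step_fraction(bits):
    """Smallest step as a fraction of full scale (levels span 0 to 2**bits - 1)."""
    return 1.0 / (2 ** bits - 1)

levels_18 = intensity_levels(18)  # 262144 levels at 18 bits
levels_8 = intensity_levels(8)    # 256 levels, for comparison
ratio = levels_18 // levels_8     # 1024 times finer than 8-bit control
smallest_step = step_fraction(18)
```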

2024 Journal article

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Authors: Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
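For readers unfamiliar with contrastive training, the sketch below implements a generic InfoNCE-style loss on toy vectors. It is a common textbook formulation, not the CoDE objective: the global-local similarity term mentioned above is not reproduced, and the embeddings and temperature are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive pair among all candidate pairs."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical 2D embeddings.
anchor = [1.0, 0.0]
close = [0.9, 0.1]
far = [0.0, 1.0]
opposite = [-1.0, 0.0]

# Low loss when the positive lies near the anchor, high loss otherwise.
loss_aligned = info_nce(anchor, close, [far, opposite])
loss_misaligned = info_nce(anchor, far, [close, opposite])
```

Minimizing a loss of this kind pulls embeddings of matching pairs together and pushes the others apart, which is the general mechanism that contrastive embedding spaces rely on.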

2024 Conference proceedings paper

D-SPDH: Improving 3D Robot Pose Estimation in Sim2Real Scenario via Depth Data

Authors: Simoni, A.; Borghi, G.; Garattoni, L.; Francesca, G.; Vezzani, R.

Published in: IEEE ACCESS

In recent years, there has been a notable surge in the significance attributed to technologies facilitating secure and efficient cohabitation and collaboration between humans and machines, with a particular interest in robotic systems. A pivotal element in actualizing this novel and challenging collaborative paradigm involves different technical tasks, including the comprehension of 3D poses exhibited by both humans and robots through the utilization of non-intrusive systems, such as cameras. In this scenario, vision-based systems capable of detecting the robot's pose in real time are needed as a first step towards a safe and effective interaction, for instance to avoid collisions. Therefore, in this work, we propose a vision-based system, referred to as D-SPDH, able to estimate the 3D robot pose. The system is based on a double-branch architecture and takes depth data as its only input; no additional information regarding the state of the internal encoders of the robot is required. The working scenario is Sim2Real, i.e., the system is trained only with synthetic data and then tested on real sequences, thus eliminating the time-consuming acquisition and annotation of real data, which are common phases in deep learning pipelines. Moreover, we introduce SimBa++, a dataset featuring both synthetic and real sequences with new real-world double-arm movements, which represents a challenging setting in which the proposed approach is tested. Experimental results show that our D-SPDH method achieves state-of-the-art and real-time performance, paving the way for possible future non-invasive systems to monitor human-robot interactions.

2024 Journal article

Differential Morphing Attack Detection via Triplet-Based Metric Learning and Artifact Extraction

Authors: Liu, Chengcheng; Ferrara, Matteo; Franco, Annalisa; Borghi, Guido; Zhong, Dexing

2024 Conference proceedings paper

Diffusion and Autoregressive Deep Learning models for Transactional Data Generation

Authors: Garuti, Fabrizio; Luetto, Simone; Sangineto Lorenzo Forni, Enver; Cucchiara, Rita

2024 Conference proceedings paper
