Publications - AImageLab

Towards Unbiased Continual Learning: Avoiding Forgetting in the Presence of Spurious Correlations

Authors: Capitani, Giacomo; Bonicelli, Lorenzo; Porrello, Angelo; Bolelli, Federico; Calderara, Simone; Ficarra, Elisa

Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION

2025 Conference proceedings paper

Tracing Information Flow in LLaMA Vision: A Step Toward Multimodal Understanding

Authors: Saporita, Alessia; Pipoli, Vittorio; Bolelli, Federico; Baraldi, Lorenzo; Acquaviva, Andrea; Ficarra, Elisa

Multimodal Large Language Models (MLLMs) have recently emerged as a powerful framework for extending the capabilities of Large Language Models (LLMs) to reason over non-textual modalities. However, despite their success, understanding how they integrate visual and textual information remains an open challenge. Among them, LLaMA 3.2-Vision represents a significant milestone in the development of open-source MLLMs, offering a reproducible and efficient architecture that competes with leading proprietary models, such as Claude 3 Haiku and GPT-4o mini. Motivated by these characteristics, we conduct the first systematic analysis of the information flow between vision and language in LLaMA 3.2-Vision. We analyze three visual question answering (VQA) benchmarks, covering the tasks of VQA on natural images (using both open-ended and multiple-choice question formats) as well as document VQA. These tasks require diverse reasoning capabilities, making them well-suited to reveal distinct patterns in multimodal reasoning. Our analysis unveils a four-stage reasoning strategy: an initial semantic interpretation of the question, an early-to-mid-layer multimodal fusion, a task-specific reasoning stage guided by the resulting multimodal embedding, and a final answer prediction stage. Furthermore, we reveal that multimodal fusion is task-dependent: in complex settings such as document VQA, the model postpones cross-modal integration until semantic reasoning over the question has been established. Overall, our findings offer new insights into the internal dynamics of MLLMs and contribute to advancing the interpretability of vision-language architectures. Our source code is available at https://github.com/AImageLab/MLLMs-FlowTracker.

2025 Conference proceedings paper

U-Net Transplant: The Role of Pre-training for Model Merging in 3D Medical Segmentation

Authors: Lumetti, Luca; Capitani, Giacomo; Ficarra, Elisa; Grana, Costantino; Calderara, Simone; Porrello, Angelo; Bolelli, Federico

Despite their remarkable success in medical image segmentation, the life cycle of deep neural networks remains a challenge in clinical applications. These models must be regularly updated to integrate new medical data and customized to meet evolving diagnostic standards, regulatory requirements, commercial needs, and privacy constraints. Model merging offers a promising solution, as it allows working with multiple specialized networks that can be created and combined dynamically instead of relying on monolithic models. While extensively studied in standard 2D classification, the potential of model merging for 3D segmentation remains unexplored. This paper presents an efficient framework that allows effective model merging in the domain of 3D image segmentation. Our approach builds upon theoretical analysis and encourages wide minima during pre-training, which we demonstrate to facilitate subsequent model merging. Using U-Net 3D, we evaluate the method on distinct anatomical structures with the ToothFairy2 and BTCV Abdomen datasets. To support further research, we release the source code and all the model weights in a dedicated repository: https://github.com/LucaLumetti/UNetTransplant

2025 Conference proceedings paper

Update Your Transformer to the Latest Release: Re-Basin of Task Vectors

Authors: Rinaldi, Filippo; Capitani, Giacomo; Bonicelli, Lorenzo; Crisostomi, Donato; Bolelli, Federico; Rodolà, Emanuele; Ficarra, Elisa; Calderara, Simone; Porrello, Angelo

Published in: PROCEEDINGS OF MACHINE LEARNING RESEARCH

Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to retrain, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called the task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint.
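As a rough illustration of the two ingredients named in this abstract, the toy sketch below computes a task vector (fine-tuned weights minus base weights) and checks the permutation symmetry that re-basin methods build on: reordering the hidden units of a small MLP leaves its output unchanged. It is not the paper's two-level spectral method; the two-layer network, the weight values, and all function names are assumptions made for the example.

```python
# Toy illustration only: NOT the paper's method. All names and values are hypothetical.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp(W1, W2, x):
    """Two-layer MLP without biases: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    W1_p = [W1[i] for i in perm]
    W2_p = [[row[i] for i in perm] for row in W2]
    return W1_p, W2_p

def task_vector(theta_ft, theta_base):
    """Element-wise difference between fine-tuned and base weights."""
    return [[a - b for a, b in zip(r_ft, r_b)]
            for r_ft, r_b in zip(theta_ft, theta_base)]

W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]     # 3 hidden units, 2 inputs
W2 = [[2.0, -1.0, 0.5]]                          # 1 output
x = [1.0, 2.0]

# Permuting hidden units changes the weights but not the function.
W1_p, W2_p = permute_hidden(W1, W2, perm=[2, 0, 1])
y = mlp(W1, W2, x)
y_p = mlp(W1_p, W2_p, x)

# A made-up "fine-tuned" first layer and its task vector.
W1_ft = [[1.5, -2.0], [0.5, 0.0], [-1.0, 3.0]]
tau = task_vector(W1_ft, W1)
```

Because many weight settings encode the same function, a task vector computed against one base checkpoint is not guaranteed to line up with the units of another checkpoint; aligning them through permutations is the general idea the abstract refers to.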

2025 Conference proceedings paper

A Graph-Based Multi-Scale Approach with Knowledge Distillation for WSI Classification

Authors: Bontempo, Gianpaolo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa

Published in: IEEE TRANSACTIONS ON MEDICAL IMAGING

The use of Multi Instance Learning (MIL) for classifying Whole Slide Images (WSIs) has recently increased. Due to their gigapixel size, the pixel-level annotation of such data is extremely expensive and time-consuming, and practically unfeasible. For this reason, multiple automatic approaches have been proposed in recent years to support clinical practice and diagnosis. Unfortunately, most state-of-the-art proposals apply attention mechanisms without considering the spatial instance correlation and usually work on a single-scale resolution. To leverage the full potential of pyramidal structured WSIs, we propose a graph-based multi-scale MIL approach, DAS-MIL. Our model comprises three modules: i) a self-supervised feature extractor; ii) a graph-based architecture that precedes the MIL mechanism and aims at creating a more contextualized representation of the WSI structure by considering the mutual (spatial) instance correlation both inter- and intra-scale; and iii) a (self) distillation loss between resolutions, introduced to compensate for their informative gap and significantly improve the final prediction. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining a +2.7% AUC and +3.7% accuracy on the popular Camelyon16 benchmark.

2024 Journal article

A State-of-the-Art Review with Code about Connected Components Labeling on GPUs

Authors: Bolelli, Federico; Allegretti, Stefano; Lumetti, Luca; Grana, Costantino

Published in: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

This article is about Connected Components Labeling (CCL) algorithms developed for GPU accelerators. The task itself is employed in many modern image-processing pipelines and represents a fundamental step in different scenarios, whenever object recognition is required. For this reason, considerable effort has gone into developing proposals that improve algorithm performance on different kinds of hardware accelerators. This paper focuses on GPU-based algorithmic solutions published in the last two decades, highlighting their distinctive traits and the improvements they leverage. The state-of-the-art review is accompanied by the source code, which makes it straightforward to reproduce all the algorithms in different experimental settings. A comprehensive evaluation on multiple environments is also provided, including different operating systems, compilers, and GPUs. Our assessments are performed by means of several tests, including real-case images and synthetically generated ones, highlighting the strengths and weaknesses of each proposal. Overall, the experimental results revealed that block-based algorithms outperform all the other algorithmic solutions on both 2D images and 3D volumes, regardless of the selected environment.
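For readers unfamiliar with the task, the sketch below labels the connected components of a small binary image with a classic sequential two-pass scan and a union-find structure. It is a plain CPU illustration of what CCL computes, not one of the GPU or block-based algorithms covered by the review; the function name and the choice of 4-connectivity are assumptions made for the example.

```python
# Sequential CPU sketch of CCL (4-connectivity). Hypothetical helper, not a GPU algorithm.

def label_components(image):
    """Label foreground pixels of a binary image given as a list of rows of 0/1.

    Returns (labels, count), where labels uses 0 for background and
    consecutive integers starting at 1 for the components.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(a):
        # Follow parents to the root, shortening the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record equivalences.
    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = up or left
                if up and left:
                    union(up, left)

    # Second pass: replace provisional labels with consecutive final labels.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


image = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
labels, count = label_components(image)
```

The raster scan and the shared equivalence table make this version inherently sequential, which is one reason GPU-oriented algorithms are designed differently.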

2024 Journal article

BarBeR: A Barcode Benchmarking Repository

Authors: Vezzali, Enrico; Bolelli, Federico; Santi, Stefano; Grana, Costantino

Since their invention in 1949, barcodes have remained the preferred method for automatic data capture, playing a crucial role in supply chain management. To detect a barcode in an image, multiple algorithms have been proposed in the literature, with a significant increase of interest in the topic since the rise of deep learning. However, research in the field suffers from many limitations, including the scarcity of public datasets and code implementations, which hampers the reproducibility and reliability of published results. For this reason, we developed "BarBeR" (Barcode Benchmark Repository), a benchmark designed for testing and comparing barcode detection algorithms. This benchmark includes the code implementation of various detection algorithms for barcodes, along with a suite of useful metrics. It offers a range of test setups and can be expanded to include any localization algorithm. In addition, we provide a large, annotated dataset of 8748 barcode images, combining multiple public barcode datasets with standardized annotation formats for both detection and segmentation tasks. Finally, we share the results obtained from running the benchmark on our dataset, offering valuable insights into the performance of different algorithms.

2024 Conference proceedings paper

Beyond the Surface: Comprehensive Analysis of Implicit Bias in Vision-Language Models

Authors: Capitani, Giacomo; Lucarini, Alice; Bonicelli, Lorenzo; Bolelli, Federico; Calderara, Simone; Vezzali, Loris; Ficarra, Elisa

Implicit biases, subtle and unconscious attitudes, permeate various facets of human decision-making and are similarly pervasive in Artificial Intelligence (AI) systems. These biases can stem from shortcut learning, where models rely on superficial patterns that do not capture the underlying phenomena. Inspired by social psychology studies, we introduce two novel metrics to analyze implicit biases in visual-language models. Our comprehensive analysis of 90 open-clip models reveals widespread anomalies related to ethnicity and gender. The first metric considers the cosine similarity between images and text prompts related to social stereotypes. The second metric adapts the Implicit Association Test (IAT), which evaluates prejudice and hidden discrimination within human behavior. Our findings illustrate that conventional text-based debiasing efforts can inadvertently amplify second-order biases instead of mitigating them. Furthermore, in expanding our evaluation to multimodal Large Language Models (LLMs), we demonstrate disparities in the tendency to generate semantically positive or negative outputs, depending on the ethnicity or gender of the individuals depicted in the input images.
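The first metric is described as building on the cosine similarity between image and text-prompt embeddings. The sketch below shows only that building block, plus a hypothetical "gap" between an image's similarity to two prompts. The embeddings are invented numbers rather than outputs of any real model, and `association_gap` is an illustrative helper, not the metric defined in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association_gap(image_emb, prompt_a, prompt_b):
    """Hypothetical helper: how much closer an image is to prompt A than to prompt B."""
    return cosine_similarity(image_emb, prompt_a) - cosine_similarity(image_emb, prompt_b)

# Made-up 3-dimensional embeddings, for illustration only.
image_emb = [0.9, 0.1, 0.0]
prompt_a = [1.0, 0.0, 0.0]
prompt_b = [0.0, 1.0, 0.0]

gap = association_gap(image_emb, prompt_a, prompt_b)
```

A positive gap means the image embedding lies closer to the first prompt; comparing such gaps across groups of images is one simple way to surface systematic differences.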

2024 Conference proceedings paper

ClusterFix: A Cluster-Based Debiasing Approach without Protected-Group Supervision

Authors: Capitani, Giacomo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa

The failures of Deep Networks can sometimes be ascribed to biases in the data or algorithmic choices. Existing debiasing approaches exploit prior knowledge to avoid unintended solutions; we acknowledge that, in real-world settings, it could be unfeasible to gather enough prior information to characterize the bias, or it could even raise ethical considerations. We hence propose a novel debiasing approach, termed ClusterFix, which does not require any external hint about the nature of biases. Such an approach alters the standard empirical risk minimization and introduces a per-example weight, encoding how critical and far from the majority an example is. Notably, the weights consider how difficult it is for the model to infer the correct pseudo-label, which is obtained in a self-supervised manner by dividing examples into multiple clusters. Extensive experiments show that the misclassification error incurred in identifying the correct cluster allows for identifying examples prone to bias-related issues. As a result, our approach outperforms existing methods on standard benchmarks for bias removal and fairness.
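The general idea of altering empirical risk minimization with a per-example weight can be sketched as a weighted average of per-example losses. How ClusterFix derives its weights from cluster pseudo-labels is not reproduced here: the weights, predicted probabilities, and function names below are placeholders chosen for the example.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one example."""
    return -math.log(probs[label])

def weighted_risk(batch_probs, labels, weights):
    """Weighted empirical risk: sum_i w_i * loss_i / sum_i w_i."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Two examples of class 0: one easy (p=0.9), one hard (p=0.4). Values are made up.
batch_probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 0]

uniform = weighted_risk(batch_probs, labels, weights=[1.0, 1.0])   # standard ERM
reweighted = weighted_risk(batch_probs, labels, weights=[1.0, 3.0])  # emphasize the hard example
```

With uniform weights the expression reduces to standard ERM; raising the weight of the hard example increases its share of the objective, which is the lever a reweighting-based debiasing method uses.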

2024 Conference proceedings paper

Enhancing Patch-Based Learning for the Segmentation of the Mandibular Canal

Authors: Lumetti, Luca; Pipoli, Vittorio; Bolelli, Federico; Ficarra, Elisa; Grana, Costantino

Published in: IEEE ACCESS

Segmentation of the Inferior Alveolar Canal (IAC) is a critical aspect of dentistry and maxillofacial imaging, garnering considerable attention in recent research endeavors. Deep learning techniques have shown promising results in this domain, yet their efficacy is still significantly hindered by the limited availability of 3D maxillofacial datasets. An inherent challenge is posed by the size of input volumes, which necessitates a patch-based processing approach that compromises the neural network performance due to the absence of global contextual information. This study introduces a novel approach that harnesses the spatial information within the extracted patches and incorporates it into a Transformer architecture, thereby enhancing the segmentation process through the use of prior knowledge about the patch location. Our method improves the Dice score by 4 points with respect to the previous work proposed by Cipriano et al., while also reducing the training steps required by the entire pipeline. By integrating spatial information and leveraging the power of Transformer architectures, this research not only advances the accuracy of IAC segmentation, but also streamlines the training process, offering a promising direction for improving dental and maxillofacial image analysis.
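For reference, the Dice score mentioned in this abstract measures the overlap between a predicted and a ground-truth mask as 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for the example.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (sequences of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks: one voxel overlaps, each mask has two foreground voxels.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_score(pred, target)
```

The score ranges from 0 (no overlap) to 1 (identical masks) and is often reported as a percentage, so an improvement of 4 points would typically correspond to a difference of 0.04 on this scale.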

2024 Journal article

Publications by Federico Bolelli
