Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Intrinsic Training Signals for Federated Learning Aggregation

Authors: Fiorini, Cosimo; Mosconi, Matteo; Buzzega, Pietro; Salami, Riccardo; Calderara, Simone

Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. While existing approaches for aggregating client-specific classification heads and adapted backbone parameters require architectural modifications or loss function changes, our method uniquely leverages intrinsic training signals already available during standard optimization. We present LIVAR (Layer Importance and VARiance-based merging), which introduces: i) a variance-weighted classifier aggregation scheme using naturally emergent feature statistics, and ii) an explainability-driven LoRA merging technique based on SHAP analysis of existing update parameter patterns. Without any architectural overhead, LIVAR achieves state-of-the-art performance on multiple benchmarks while maintaining seamless integration with existing FL methods. This work demonstrates that effective model merging can be achieved solely through existing training signals, establishing a new paradigm for efficient federated model aggregation. The code is available at https://github.com/aimagelab/fed-mammoth
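
The variance-weighted head aggregation idea can be sketched in a few lines. The function name, the inverse-variance weighting rule, and the scalar per-client variance are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def merge_heads(heads, feat_vars):
    """Merge per-client classifier heads, weighting each client by the
    inverse variance of its features -- a stand-in for the 'naturally
    emergent feature statistics' the abstract mentions (assumption).

    heads:     list of (num_classes, dim) weight matrices, one per client
    feat_vars: list of per-client scalar feature variances
    """
    w = np.array([1.0 / (v + 1e-8) for v in feat_vars])  # low variance -> high trust
    w = w / w.sum()                                      # normalize to a convex combination
    return sum(wi * h for wi, h in zip(w, heads))

# Two clients, 3 classes, 4-dim features
h1, h2 = np.ones((3, 4)), np.zeros((3, 4))
merged = merge_heads([h1, h2], feat_vars=[0.1, 0.4])
```

The appeal of this kind of scheme is that the statistics are already produced during standard local training, so no extra passes or architectural changes are required.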

2025 Conference Paper

Investigating the ABCDE Rule in Convolutional Neural Networks

Authors: Bolelli, Federico; Lumetti, Luca; Marchesini, Kevin; Candeloro, Ettore; Grana, Costantino

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Convolutional Neural Networks (CNNs) have been broadly employed in dermoscopic image analysis, mainly due to the large amount of data gathered by the International Skin Imaging Collaboration (ISIC). But where do neural networks look? Several authors have claimed that the ISIC dataset is affected by strong biases, i.e., spurious correlations between samples that machine learning models unfairly exploit while discarding the useful patterns they are expected to learn. These strong claims have been supported by showing that deep learning models maintain excellent performance even when "no information about the lesion remains" in the debiased input images. With this paper, we explore the interpretability of CNNs in dermoscopic image analysis by analyzing which characteristics are considered by autonomous classification algorithms. Starting from a standard setting, the experiments presented in this paper gradually conceal well-known crucial dermoscopic features and thoroughly investigate how CNN performance subsequently evolves. Experimental results on two well-known CNNs, EfficientNet-B3 and ResNet-152, demonstrate that neural networks autonomously learn to extract features that are notoriously important for melanoma detection. Even when some of these features are removed, the remaining ones are still enough to achieve satisfactory classification performance. The obtained results demonstrate that literature claims on biases are not supported by the experiments carried out. Finally, to demonstrate the generalization capabilities of state-of-the-art CNN models for skin lesion classification, a large private dataset has been employed as an additional test set.
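
To make the "gradually conceal dermoscopic features" protocol concrete, here is an illustrative way to occlude one such feature, the lesion border (the "B" in ABCDE), by blanking an annulus around an assumed circular boundary. The function and the circular-lesion assumption are ours, not the paper's:

```python
import numpy as np

def conceal_border(img, cx, cy, r, band=5):
    """Zero out an annulus of half-width `band` around a circular lesion
    boundary of radius `r` centered at (cx, cy) -- an illustrative
    stand-in for concealing the Border feature before feeding the image
    to a CNN."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
    mask = np.abs(dist - r) <= band
    out = img.copy()
    out[mask] = 0
    return out

img = np.ones((64, 64), dtype=np.float32)
masked = conceal_border(img, cx=32, cy=32, r=20, band=3)
```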

2025 Conference Paper

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

Authors: Baraldi, Lorenzo; Amoroso, Roberto; Cornia, Marcella; Pilzer, Andrea; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed k-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. Our code and models are released at https://github.com/aimagelab/MaPeT.
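
The permuted autoregressive objective can be illustrated through the attention mask it induces: each patch may only attend to the content of patches that precede it in a random factorization order. This is a generic sketch of the idea (names are ours), not MaPeT's exact masking scheme:

```python
import numpy as np

def permuted_attention_mask(num_patches, rng):
    """Build a content mask for permuted autoregressive prediction:
    patch i may attend to patch j only if j precedes i in a random
    factorization order (a sketch of the idea, not MaPeT's exact mask)."""
    order = rng.permutation(num_patches)
    rank = np.empty(num_patches, dtype=int)
    rank[order] = np.arange(num_patches)   # rank[p] = position of patch p in the order
    mask = rank[:, None] > rank[None, :]   # True where attention is allowed
    return order, mask

order, mask = permuted_attention_mask(6, np.random.default_rng(0))
```

The auxiliary positional information the abstract mentions would then let every prediction step know *where* the next patch is, even though its content is hidden.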

2025 Journal Article

Leveraging Digital Twin Technology with a Human-Centered Approach to Automate a Workstation in the Logistics Sector of Made in Italy: CHIMAR Use Case

Authors: Bertoli, Annalisa; Nini, Matteo; Cibrario, Valerio; Vargas, Manuela; Perona, Paolo; Rossi, Ludovico; Benedetti, Laura; Nicolinti, Alberto; Fantuzzi, Cesare

Published in: MACHINES

Industry 4.0 has driven the development of important technologies for industrial applications, but the focus has often been on technological advancement rather than on how operators interact with these systems. With the emergence of Industry 5.0, attention has shifted toward the role of the operators and their interaction with emerging technologies. This paper explores the automation of a fully manual operation in the logistics field while adopting a human-centered approach to reduce risky tasks and enhance the operator’s well-being. A motion capture system and digital human simulation software are utilized to create a digital twin of a real-world industrial case study. This approach enables the virtual testing of various automation solutions to identify the optimal scenario that meets the performance indicator parameters. This study highlights the importance of integrating ergonomic considerations into automation strategies.

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors: Cocchi, Federico; Moratelli, Nicholas; Caffagni, Davide; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

2025 Conference Paper

LLMs and Humanoid Robot Diversity: The Pose Generation Challenge

Authors: Catalini, Riccardo; Biagi, Federico; Salici, Giacomo; Borghi, Guido; Vezzani, Roberto; Biagiotti, Luigi

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Humanoid robots are increasingly being integrated into diverse scenarios, such as healthcare facilities, social settings, and workplaces. As the need for intuitive control by non-expert users grows, many studies have explored the use of Artificial Intelligence to enable communication and control. However, these approaches are often tailored to specific robots due to the absence of standardized conventions and notation. This study addresses the challenges posed by these inconsistencies and investigates their impact on the ability of Large Language Models (LLMs) to generate accurate 3D robot poses, even when detailed robot specifications are provided as input.

2025 Conference Paper

LLMs as NAO Robot 3D Motion Planners

Authors: Catalini, Riccardo; Salici, Giacomo; Biagi, Federico; Borghi, Guido; Biagiotti, Luigi; Vezzani, Roberto

In this study, we demonstrate the capabilities of state-of-the-art Large Language Models (LLMs) in teaching social robots to perform specific actions within a 3D environment. Specifically, we introduce the use of LLMs to generate sequences of 3D joint angles (in both zero-shot and one-shot prompting) that a humanoid robot must follow to perform a given action. This work is driven by the growing demand for intuitive interaction with social robots: indeed, LLMs could empower non-expert users to operate and benefit from robotic systems effectively. Additionally, this method makes it possible to generate synthetic data effortlessly, enabling privacy-focused use cases. To evaluate the output quality of seven different LLMs, we conducted a blind user study to compare the pose sequences. Participants were shown videos of the well-known NAO robot performing the generated actions and were asked to identify the intended action and choose the best match with the original instruction from a collection of candidates created by different LLMs. The results highlight that the majority of LLMs are indeed capable of planning correct, complete, and recognizable actions, offering a novel perspective on how AI can be applied to social robotics.
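
On the consumer side, turning an LLM reply into a pose sequence is a simple parsing step. Here we assume, purely for illustration, that the model was prompted to answer with a JSON list of joint-angle dictionaries; the prompt format and function name are hypothetical, while HeadYaw and HeadPitch are actual NAO joint names:

```python
import json

def parse_pose_sequence(llm_output, joint_names):
    """Parse an LLM reply into a list of poses, assuming the model was
    asked to reply with a JSON list of {joint_name: angle_in_radians}
    dictionaries (illustrative convention, not the paper's protocol)."""
    poses = json.loads(llm_output)
    return [[float(pose[name]) for name in joint_names] for pose in poses]

reply = '[{"HeadYaw": 0.0, "HeadPitch": 0.2}, {"HeadYaw": 0.5, "HeadPitch": 0.0}]'
seq = parse_pose_sequence(reply, ["HeadYaw", "HeadPitch"])
```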

2025 Conference Paper

Location Matters: Harnessing Spatial Information to Enhance the Segmentation of the Inferior Alveolar Canal in CBCTs

Authors: Lumetti, Luca; Pipoli, Vittorio; Bolelli, Federico; Ficarra, Elisa; Grana, Costantino

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The segmentation of the Inferior Alveolar Canal (IAC) plays a central role in maxillofacial surgery and draws significant attention in current research. Because of their outstanding results, deep learning methods are widely adopted for the segmentation of 3D medical volumes, including the IAC in Cone Beam Computed Tomography (CBCT) data. One of the main challenges when segmenting large volumes, including those obtained through CBCT scans, arises from the use of patch-based techniques, mandatory to fit memory constraints. Such training approaches compromise neural network performance due to the loss of global contextual information. Performance degradation is prominently evident when the target objects are small with respect to the background, as happens with the inferior alveolar nerve, which develops across the mandible but involves only a few voxels of the entire scan. To target this issue and push state-of-the-art performance in the segmentation of the IAC, we propose an innovative approach that exploits the spatial information of extracted patches and integrates it into a Transformer architecture. By incorporating prior knowledge about patch location, our model improves the state of the art by ~2 Dice points when integrated with the standard U-Net architecture. The source code of our proposal is publicly released.
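
One plausible way to inject patch location into a Transformer is to normalize the patch origin by the full volume size, project it, and add the result to every patch token. This is a minimal sketch under our own assumptions; the paper's actual integration may differ:

```python
import numpy as np

def add_location_prior(tokens, patch_origin, volume_shape, proj):
    """Encode the patch's normalized (z, y, x) origin inside the full
    CBCT volume and add it to every token of the patch (illustrative).

    tokens:       (n, d) patch token embeddings
    patch_origin: (3,) voxel coordinates of the patch corner
    volume_shape: (3,) full-volume size in voxels
    proj:         (3, d) projection matrix (would be learned; random here)
    """
    loc = np.asarray(patch_origin) / np.asarray(volume_shape)  # normalize to [0, 1]
    return tokens + loc @ proj                                  # same shift for every token

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
out = add_location_prior(tokens, patch_origin=(64, 128, 32),
                         volume_shape=(256, 256, 256), proj=rng.normal(size=(3, 16)))
```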

2025 Conference Paper

Machine Learning-Based Prediction of Emergency Department Prolonged Length of Stay: A Case Study from Italy

Authors: Perliti Scorzoni, Paolo; Giovanetti, Anita; Bolelli, Federico; Grana, Costantino

Overcrowding in Emergency Departments (EDs) is a pressing concern driven by high patient demand and limited resources. Prolonged Length of Stay (pLOS), a major contributor to this congestion, may lead to adverse outcomes, including patients leaving without being seen, suboptimal clinical care, increased mortality rates, provider burnout, and escalating healthcare costs. This study investigates the application of various Machine Learning (ML) algorithms to predict both LOS and pLOS. A retrospective analysis examined 32,967 visits to a northern Italian hospital's ED between 2022 and 2024. Twelve classification algorithms were evaluated for forecasting pLOS, using clinically relevant thresholds. Two data variants were employed for model comparison: one containing only structured data (e.g., demographics and clinical information) and a second that also includes features extracted from free-text nursing notes. To enhance the accuracy of LOS prediction, novel queue-based variables capturing the real-time state of the ED were incorporated as additional dynamic predictors. Compared to single-algorithm models, ensemble models demonstrated superior robustness in forecasting both ED-LOS and ED-pLOS. These findings highlight the potential of integrating ML into ED practice as an auxiliary tool providing valuable insight into patient flow. By identifying patients at high risk of pLOS, healthcare professionals can proactively implement strategies to expedite care, optimize resource allocation, and ultimately improve patient outcomes and ED efficiency, promoting more effective and sustainable public healthcare delivery.
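
A queue-based variable of the kind described might be computed as follows: for each patient, count how many earlier arrivals are still in the ED at that patient's arrival time. The feature definition and function name are illustrative assumptions, not the study's exact predictors:

```python
from datetime import datetime, timedelta

def queue_at_arrival(arrivals, departures):
    """For each patient i, count the other patients whose stay
    [arrival, departure) covers patient i's arrival time -- a plausible
    'real-time state of the ED' feature (hypothetical, for illustration)."""
    counts = []
    for i, t in enumerate(arrivals):
        busy = sum(1 for j in range(len(arrivals))
                   if j != i and arrivals[j] <= t < departures[j])
        counts.append(busy)
    return counts

t0 = datetime(2023, 1, 1, 8, 0)
arr = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=20)]
dep = [t0 + timedelta(minutes=15), t0 + timedelta(hours=2), t0 + timedelta(hours=3)]
load = queue_at_arrival(arr, dep)
```

Because such features are known the moment a patient walks in, they can feed a prediction model in real time rather than retrospectively.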

2025 Conference Paper

Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Authors: Mosconi, Matteo; Sorokin, Andriy; Panariello, Aniello; Porrello, Angelo; Bonato, Jacopo; Cotogni, Marco; Sabetta, Luigi; Calderara, Simone; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at https://github.com/Sperimental3/CHARON.
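
The uniform sampling and interpolation step can be sketched as a temporal resampling of the skeleton sequence to a fixed length. This is our own minimal version; CHARON's exact pipeline may differ:

```python
import numpy as np

def resample_skeleton(seq, target_len):
    """Uniformly resample a skeleton sequence of shape (T, joints, 3)
    to a fixed temporal length via linear interpolation between the
    two nearest source frames."""
    t = seq.shape[0]
    src = np.linspace(0.0, t - 1, num=target_len)   # fractional source indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    frac = (src - lo)[:, None, None]
    return (1.0 - frac) * seq[lo] + frac * seq[hi]

# Toy sequence: 10 frames, 1 joint, 3 coordinates, values 0..9 per frame
seq = np.arange(10, dtype=np.float32).reshape(10, 1, 1).repeat(3, axis=2)
out = resample_skeleton(seq, target_len=4)
```

Fixing the temporal length this way keeps memory use constant per sample, which matters in a continual-learning setting with a replay memory.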

2025 Conference Paper

Page 7 of 106 • Total publications: 1059