Publications - AImageLab

Depth-Based Privileged Information for Boosting 3D Human Pose Estimation on RGB

Authors: Simoni, A.; Marchetti, F.; Borghi, G.; Becattini, F.; Davoli, D.; Garattoni, L.; Francesca, G.; Seidenari, L.; Vezzani, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Conference Proceedings Paper

Improving Accomplice Detection in the Morphing Attack

Authors: Di Domenico, Nicolò; Borghi, Guido; Franco, Annalisa; Maltoni, Davide

Published in: MACHINE INTELLIGENCE RESEARCH

2025 Journal Article

LLMs and Humanoid Robot Diversity: The Pose Generation Challenge

Authors: Catalini, Riccardo; Biagi, Federico; Salici, Giacomo; Borghi, Guido; Vezzani, Roberto; Biagiotti, Luigi

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Humanoid robots are increasingly being integrated into diverse scenarios, such as healthcare facilities, social settings, and workplaces. As the need for intuitive control by non-expert users grows, many studies have explored the use of Artificial Intelligence to enable communication and control. However, these approaches are often tailored to specific robots due to the absence of standardized conventions and notation. This study addresses the challenges posed by these inconsistencies and investigates their impact on the ability of Large Language Models (LLMs) to generate accurate 3D robot poses, even when detailed robot specifications are provided as input.

2025 Conference Proceedings Paper

LLMs as NAO Robot 3D Motion Planners

Authors: Catalini, Riccardo; Salici, Giacomo; Biagi, Federico; Borghi, Guido; Biagiotti, Luigi; Vezzani, Roberto

In this study, we demonstrate the capabilities of state-of-the-art Large Language Models (LLMs) in teaching social robots to perform specific actions within a 3D environment. Specifically, we use LLMs to generate sequences of 3D joint angles, with both zero-shot and one-shot prompting, that a humanoid robot must follow to perform a given action. This work is driven by the growing demand for intuitive interactions with social robots: LLMs could enable non-expert users to operate and benefit from robotic systems effectively. Additionally, the method makes it possible to generate synthetic data effortlessly, enabling privacy-focused use cases. To evaluate the output quality of seven different LLMs, we conducted a blind user study comparing the generated pose sequences. Participants were shown videos of the well-known NAO robot performing the generated actions and were asked to identify the intended action and to choose, from a collection of candidates created by different LLMs, the one that best matched the original instruction. The results show that most LLMs are indeed capable of planning correct, complete, and recognizable actions, offering a new perspective on how AI can be applied to social robotics.
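The abstract describes LLMs emitting sequences of joint angles for the NAO robot. As a minimal sketch, assuming the LLM is prompted to return a JSON list of keyframes (one object per frame, mapping joint names to angles in radians), the output could be validated and clamped before being sent to the robot. The output format, the function name `parse_keyframes`, and the `JOINT_LIMITS` values are hypothetical placeholders, not taken from the paper or from the official NAO specifications.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical joint limits in radians (placeholders for illustration only;
# the real per-joint ranges are listed in the official NAO documentation).
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "HeadYaw": (-2.0, 2.0),
    "LShoulderPitch": (-2.0, 2.0),
    "RShoulderPitch": (-2.0, 2.0),
}


def parse_keyframes(llm_output: str) -> List[Dict[str, float]]:
    """Parse a JSON list of keyframes and clamp every angle to its joint limits.

    Assumes the hypothetical format [{"HeadYaw": 0.5, ...}, {...}, ...].
    Raises ValueError for malformed frames or unknown joint names.
    """
    frames = json.loads(llm_output)
    if not isinstance(frames, list):
        raise ValueError("expected a JSON list of keyframes")
    safe_frames = []
    for frame in frames:
        if not isinstance(frame, dict):
            raise ValueError("each keyframe must be a JSON object")
        safe = {}
        for joint, angle in frame.items():
            if joint not in JOINT_LIMITS:
                raise ValueError(f"unknown joint: {joint}")
            low, high = JOINT_LIMITS[joint]
            # Clamp out-of-range angles instead of forwarding them to the robot.
            safe[joint] = min(max(float(angle), low), high)
        safe_frames.append(safe)
    return safe_frames


frames = parse_keyframes('[{"HeadYaw": 0.5}, {"HeadYaw": 3.1, "LShoulderPitch": -0.4}]')
print(frames)  # [{'HeadYaw': 0.5}, {'HeadYaw': 2.0, 'LShoulderPitch': -0.4}]
```

Clamping keeps a mostly valid plan usable; a stricter variant could reject any frame containing an out-of-range angle.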

2025 Conference Proceedings Paper

San Vitale Challenge: Automatic Reconstruction of Ancient Colored Glass Windows

Authors: Di Domenico, N.; Borghi, G.; Franco, A.; Boschetti, M.; Giacomini, F.; Barzaghi, S.; Ferucci, S.; Zambruno, S.; Mularoni, L.; Gao, Q.; Che, C.; Li, G.; Zu, Y.; Hao, J.; Zhang, J.; Ducz, A.; Gego, L.; Imeri, K.; Nemkin, V.; Rakhmatillaev, A.; Szatmari, S.; Rowan, W.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The sixth-century Basilica of San Vitale in Ravenna, Italy, once featured intricate circular colored glass windows that illuminated its interior. Although these windows are now lost, several fragments were recovered during recent restorations. Unfortunately, reconstructing the original windows from these fragments is extremely complex and time-consuming and requires specialized expertise. Developing automatic reconstruction techniques based on Artificial Intelligence is therefore both important and challenging, owing to, for instance, uniformly colored fragments, damaged glass edges, and many outlier fragments. To this end, the San Vitale Challenge was organized to gather the best methods and algorithms, which this paper describes and summarizes. The challenge, split into several sub-tracks of increasing difficulty and realism, received several solutions, ranging from classical computer vision algorithms to purely deep learning-based approaches, whose results are quantitatively evaluated and compared. The final part of the paper discusses directions for the future development of such systems.

2025 Conference Proceedings Paper

TakuNet: an Energy-Efficient CNN for Real-Time Inference on Embedded UAV systems in Emergency Response Scenarios

Authors: Rossi, Daniel; Borghi, Guido; Vezzani, Roberto

Designing efficient neural networks for embedded devices is a critical challenge, particularly in applications requiring real-time performance, such as aerial imaging with drones and UAVs for emergency response. In this work, we introduce TakuNet, a novel lightweight architecture that employs techniques such as depth-wise convolutions and an early downsampling stem to reduce computational complexity while maintaining high accuracy. It leverages dense connections for fast convergence during training and uses 16-bit floating-point precision for optimization on embedded hardware accelerators. Experimental evaluation on two public datasets shows that TakuNet achieves near-state-of-the-art accuracy in classifying aerial images of emergency situations, despite its minimal parameter count. Real-world tests on embedded devices, namely the Jetson Orin Nano and the Raspberry Pi, confirm TakuNet's efficiency: it achieves more than 650 fps on the 15 W Jetson board, making it suitable for real-time AI processing on resource-constrained platforms and advancing the applicability of drones in emergency scenarios. The code and implementation details are publicly released.
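The parameter savings behind depth-wise convolutions can be checked with simple arithmetic. The sketch below compares the weight count of a standard k x k convolution with a depth-wise one followed by a 1 x 1 point-wise convolution, and the weight storage at 32-bit versus 16-bit precision. It ignores biases, and the layer sizes are illustrative assumptions rather than TakuNet's actual configuration.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored):
    each of the c_out filters spans all c_in input channels."""
    return k * k * c_in * c_out


def depthwise_conv_params(k: int, channels: int) -> int:
    """Weights of a depth-wise k x k convolution: one k x k filter per
    channel, with no mixing across channels."""
    return k * k * channels


def pointwise_conv_params(c_in: int, c_out: int) -> int:
    """Weights of a 1 x 1 (point-wise) convolution that mixes channels."""
    return c_in * c_out


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Storage needed for the weights at a given floating-point width."""
    return n_params * bits_per_param // 8


# Illustrative layer: 3 x 3 kernel, 64 input and 64 output channels.
std = standard_conv_params(3, 64, 64)  # 36,864 weights
dw_sep = depthwise_conv_params(3, 64) + pointwise_conv_params(64, 64)  # 576 + 4,096
print(std, dw_sep, round(std / dw_sep, 1))  # 36864 4672 7.9
print(weight_bytes(std, 32), weight_bytes(std, 16))  # 147456 73728
```

The saving grows with the channel count, since the depth-wise term scales with C rather than C squared; halving the floating-point width halves weight storage independently of the architecture.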

2025 Conference Proceedings Paper

TONO: A Synthetic Dataset for Face Image Compliance to ISO/ICAO Standard

Authors: Borghi, Guido; Franco, Annalisa; Di Domenico, Nicolò; Maltoni, Davide

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Conference Proceedings Paper

Towards on-device continual learning with Binary Neural Networks in industrial scenarios

Authors: Vorabbi, L.; Carraggi, A.; Maltoni, D.; Borghi, G.; Santi, S.

Published in: IMAGE AND VISION COMPUTING

This paper addresses the challenges of deploying deep learning models, specifically Binary Neural Networks (BNNs), on resource-constrained embedded devices in the Internet of Things (IoT) context. As deep learning gains traction in IoT applications, efficient models that can learn continuously from incremental data streams without extensive computational resources are increasingly needed. We propose a solution that integrates Continual Learning with BNNs, using replay memory to prevent catastrophic forgetting. Our method targets quantized neural networks and extends quantization to the backpropagation step as well, significantly reducing memory and computational requirements. Furthermore, we enhance the replay memory mechanism by storing intermediate feature maps (i.e., latent replay) with 1-bit precision instead of raw data, enabling efficient memory usage. In addition to well-known benchmarks, we introduce the DL-Hazmat dataset, which consists of over 140k high-resolution grayscale images of 64 hazardous material symbols. Experimental results show a significant improvement in model accuracy and a substantial reduction in memory requirements, demonstrating the effectiveness of our method in enabling deep learning applications on embedded devices in real-world scenarios. Our work extends Continual Learning and BNNs to efficient on-device training, offering a promising solution for IoT and other resource-constrained environments.
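The memory argument for 1-bit latent replay can be illustrated by packing sign-binarized feature maps into bits: a float32 activation takes 32 bits and a binarized one takes 1, so each stored feature map shrinks by roughly 32x. The sketch below rests on stated assumptions: sign binarization (non-negative values map to +1, negative values to -1), a first-in-first-out replacement policy, and the hypothetical class name `LatentReplayBuffer`. The paper's actual binarization and buffer-management choices may differ.

```python
from collections import deque
from typing import Deque, List, Tuple


def binarize(features: List[float]) -> List[int]:
    """Sign binarization: 1 encodes +1 (x >= 0), 0 encodes -1 (x < 0)."""
    return [1 if x >= 0 else 0 for x in features]


def pack_bits(bits: List[int]) -> bytes:
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)


def unpack_bits(data: bytes, n: int) -> List[int]:
    """Recover the first n bits packed by pack_bits."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]


class LatentReplayBuffer:
    """Hypothetical fixed-capacity replay memory of 1-bit latent features.

    Stores (packed_bits, length, label) tuples; when full, the oldest entry
    is evicted (FIFO, an assumption made here for brevity).
    """

    def __init__(self, capacity: int) -> None:
        self.items: Deque[Tuple[bytes, int, int]] = deque(maxlen=capacity)

    def add(self, features: List[float], label: int) -> None:
        self.items.append((pack_bits(binarize(features)), len(features), label))

    def get(self, index: int) -> Tuple[List[int], int]:
        """Return the stored feature map as +1/-1 values, plus its label."""
        packed, n, label = self.items[index]
        return [1 if b else -1 for b in unpack_bits(packed, n)], label

    def __len__(self) -> int:
        return len(self.items)


# A 1,024-value feature map: 4,096 bytes as float32, 128 bytes when packed.
print(1024 * 4, len(pack_bits(binarize([0.1] * 1024))))  # 4096 128
```

Storing replay samples in this compact form is what lets a small device keep many more past examples within the same memory budget.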

2025 Journal Article

Towards Zero-Shot ISO/ICAO Face Compliance Verification via CLIP-IQA and Natural Language Prompting

Authors: Di Domenico, Nicolò; Borghi, Guido; Franco, Annalisa; Maltoni, Davide

Ensuring that face images comply with ISO/ICAO quality standards is essential for streamlining the document enrollment process, since traditional manual checks are slow, subjective, and difficult to scale. We therefore propose a system that aims to fully automate compliance verification by directly analyzing the official requirements, without relying on predefined hand-crafted features or manual thresholds. Our method combines a Large Language Model, a novel prompt learning procedure, and a contrastive learning framework to evaluate whether a face image adheres to the quality requirements. Tested on a recent dataset, the proposed system achieves high accuracy, surpassing existing academic and commercial solutions. By simplifying both the implementation of and updates to the compliance rules, our approach represents a significant step toward simple, scalable, and regulation-driven image verification. Code and models are publicly available.
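The idea of checking an image against natural-language requirements can be sketched with the antonym-prompt scheme used by CLIP-IQA: each requirement is phrased as a pair of opposite prompts, and a softmax over the image's cosine similarities to the two prompt embeddings gives the probability that the requirement is met. In this sketch the embeddings are toy vectors standing in for the outputs of CLIP's image and text encoders, the temperature of 100 is an assumed value, and the names `compliance_score` and `verify` are hypothetical. The paper's method additionally relies on an LLM and a learned prompt procedure that are not reproduced here.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compliance_score(image: Vector, positive: Vector, negative: Vector,
                     temperature: float = 100.0) -> float:
    """Softmax over the two scaled similarities; returns P(positive prompt)."""
    s_pos = temperature * cosine(image, positive)
    s_neg = temperature * cosine(image, negative)
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(s_pos - m), math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


def verify(image: Vector, requirements: Dict[str, Tuple[Vector, Vector]],
           threshold: float = 0.5) -> Dict[str, bool]:
    """Check each requirement independently against its prompt pair."""
    return {name: compliance_score(image, pos, neg) > threshold
            for name, (pos, neg) in requirements.items()}


# Toy 3-d embeddings; real CLIP embeddings have hundreds of dimensions.
requirements = {
    # e.g. "a photo of a face with open eyes" vs "... with closed eyes"
    "eyes_open": ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    # e.g. "a face with a neutral expression" vs "a smiling face"
    "neutral_expression": ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]),
}
image_embedding = [0.9, 0.1, 0.2]
print(verify(image_embedding, requirements))
# {'eyes_open': True, 'neutral_expression': False}
```

Because each requirement is just a prompt pair, updating the rules means editing text rather than retraining a dedicated classifier, which matches the motivation stated in the abstract.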

2025 Conference Proceedings Paper

Adversarial Identity Injection for Semantic Face Image Synthesis

Authors: Tarollo, G.; Fontanini, T.; Ferrari, C.; Borghi, G.; Prati, A.

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Deep learning models now achieve remarkable performance in image generation. Many works address face generation and editing, to the point that both humans and automatic systems struggle to distinguish real faces from generated ones. While most systems reach excellent visual quality, they still have difficulty preserving the identity of the input subject. Among the explored techniques, Semantic Image Synthesis (SIS) methods, whose goal is to generate an image conditioned on a semantic segmentation mask, are the most promising, even though preserving the perceived identity of the input subject is not their main concern. In this paper, we therefore investigate identity preservation in face image generation and present an SIS architecture that uses a cross-attention mechanism to merge identity, style, and semantic features, generating faces whose identities are as similar as possible to the input ones. Experimental results show that the proposed method not only preserves identity but is also effective in a face recognition adversarial attack, i.e., hiding a second identity in the generated faces.
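Cross-attention lets one set of features (the queries) gather information from another (the keys and values). As a minimal sketch, assuming semantic-mask tokens act as queries and identity and style tokens are concatenated into the keys and values, scaled dot-product cross-attention computes softmax(Q K^T / sqrt(d)) V. The token layout and the omission of the learned query, key, and value projections are simplifications made here, not details taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product cross-attention: softmax(Q K^T / sqrt(d)) V.

    queries: n_q x d, keys: n_kv x d, values: n_kv x d_v.
    Learned projections W_Q, W_K, W_V are omitted for brevity.
    """
    d = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much this query attends to each key
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return output


# Hypothetical tokens: 2 semantic queries attend over identity + style tokens.
semantic_tokens = [[1.0, 0.0], [0.0, 1.0]]
identity_tokens = [[1.0, 0.0]]
style_tokens = [[0.0, 1.0]]
context = identity_tokens + style_tokens  # concatenated key/value sequence
fused = cross_attention(semantic_tokens, context, context)
print(fused)
```

Each output row is a convex combination of the value vectors, so a query aligned with the identity tokens is pulled toward identity information; this is the intuition behind using attention to fuse several feature streams.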

2024 Conference Proceedings Paper

Publications by Guido Borghi
