Publications - AImageLab

vHector and HeisenVec: Scalable Vector Graphics Generation Through Large Language Models

Authors: Zini, Leonardo; Frigieri, Elia; Aloscari, Sebastiano; Baraldi, Lorenzo

2025 Conference proceedings paper

IRIS

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

2025 Conference proceedings paper

IRIS

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Authors: Pippi, Vittorio; Quattrini, Fabio; Cascianelli, Silvia; Tonioni, Alessio; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

2025 Conference proceedings paper

DOI IRIS

A Graph-Based Multi-Scale Approach with Knowledge Distillation for WSI Classification

Authors: Bontempo, Gianpaolo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa

Published in: IEEE TRANSACTIONS ON MEDICAL IMAGING

The usage of Multi Instance Learning (MIL) for classifying Whole Slide Images (WSIs) has recently increased. Due to their gigapixel size, the pixel-level annotation of such data is extremely expensive and time-consuming, making it practically unfeasible. For this reason, multiple automatic approaches have been proposed in recent years to support clinical practice and diagnosis. Unfortunately, most state-of-the-art proposals apply attention mechanisms without considering the spatial instance correlation and usually work on a single-scale resolution. To leverage the full potential of pyramidal structured WSI, we propose a graph-based multi-scale MIL approach, DAS-MIL. Our model comprises three modules: i) a self-supervised feature extractor; ii) a graph-based architecture that precedes the MIL mechanism and aims at creating a more contextualized representation of the WSI structure by considering the mutual (spatial) instance correlation both inter- and intra-scale; and iii) a (self) distillation loss between resolutions, introduced to compensate for their informative gap and significantly improve the final prediction. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining +2.7% AUC and +3.7% accuracy on the popular Camelyon16 benchmark.

2024 Journal article

DOI IRIS

A State-of-the-Art Review with Code about Connected Components Labeling on GPUs

Authors: Bolelli, Federico; Allegretti, Stefano; Lumetti, Luca; Grana, Costantino

Published in: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

This article is about Connected Components Labeling (CCL) algorithms developed for GPU accelerators. The task itself is employed in many modern image-processing pipelines and represents a fundamental step in different scenarios, whenever object recognition is required. For this reason, considerable effort has gone into developing proposals that improve algorithm performance using different kinds of hardware accelerators. This paper focuses on GPU-based algorithmic solutions published in the last two decades, highlighting their distinctive traits and the improvements they leverage. The state-of-the-art review is accompanied by source code, which allows all the algorithms to be straightforwardly reproduced in different experimental settings. A comprehensive evaluation on multiple environments is also provided, including different operating systems, compilers, and GPUs. Our assessments are performed by means of several tests, including real-case images and synthetically generated ones, highlighting the strengths and weaknesses of each proposal. Overall, the experimental results revealed that block-based oriented algorithms outperform all the other algorithmic solutions on both 2D images and 3D volumes, regardless of the selected environment.

2024 Journal article

DOI IRIS
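
For readers unfamiliar with the task named in the entry above, the following is a minimal sequential sketch of Connected Components Labeling on a small binary grid. It is not one of the GPU algorithms covered by the review, and it is not taken from the review's source code: it is the classic two-pass, union-find formulation on the CPU, assuming 4-connectivity and a list-of-lists input. The function name `label_components` is hypothetical.

```python
def label_components(image):
    """Label 4-connected foreground regions of a binary grid (lists of 0/1).

    Illustrative CPU sketch only; returns (label grid, number of components).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    parent = [0]  # parent[0] is reserved for the background label 0

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge two equivalence classes, keeping the smaller root.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    labels = [[0] * cols for _ in range(rows)]

    # First pass: assign provisional labels and record label equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # create a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = up or left
                if up and left:
                    union(up, left)

    # Second pass: replace provisional labels with consecutive final labels.
    final = {0: 0}
    for r in range(rows):
        for c in range(cols):
            root = find(labels[r][c])
            if root not in final:
                final[root] = len(final)
            labels[r][c] = final[root]
    return labels, len(final) - 1


if __name__ == "__main__":
    grid = [
        [1, 0, 1],
        [1, 1, 1],
    ]
    labeled, count = label_components(grid)
    print(count)
    print(labeled)
```

The U-shaped grid in the usage example receives two provisional labels in the first pass, which are merged when the bottom row connects them; this equivalence-resolution step is the part that parallel formulations have to handle differently.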

Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

Authors: Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative models. Such technological progress necessitates the evolution of existing deepfake detection algorithms to counter new threats and protect the integrity of perceived reality. Although the prevailing approach among deepfake detection methodologies relies on large collections of generated and real data, the efficacy of these methods in adapting to scenarios characterized by data scarcity remains uncertain. This obstacle arises due to the introduction of novel generation algorithms and proprietary generative models that impose restrictions on access to large-scale datasets, thereby constraining the availability of generated images. In this paper, we first analyze how the performance of current deepfake methodologies, based on the CLIP embedding space, adapts in a few-shot situation over four state-of-the-art generators. Since the CLIP embedding space is not specifically tailored for the task, a fine-tuning stage is desirable, although the amount of data needed is often unavailable in a data scarcity scenario. To address this issue and limit possible overfitting, we introduce a novel approach through the Low-Rank Adaptation (LoRA) of the CLIP architecture, tailored for few-shot deepfake detection scenarios. Remarkably, the LoRA-modified CLIP, even when fine-tuned with merely 50 pairs of real and fake images, surpasses the performance of all evaluated deepfake detection models across the tested generators. Additionally, when LoRA CLIP is benchmarked against other models trained on 1,000 samples and evaluated on generative models not seen during training, it exhibits superior generalization capabilities.

2024 Conference proceedings paper

IRIS
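
As background for the entry above, the sketch below illustrates the generic Low-Rank Adaptation idea: a frozen weight matrix W of shape d x k is augmented with a trainable low-rank product, giving W + (alpha / r) * B A, where B is d x r and A is r x k. This is the standard LoRA formulation and not the authors' implementation; the abstract does not state the rank, the scaling factor, or which CLIP layers are adapted, so every number and function name here (`lora_effective_weight`, `trainable_params`, the 768 x 768 layer size, rank 8) is a hypothetical choice for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Compare parameter counts: full fine-tuning (d*k) vs. LoRA (r*(d+k))."""
    return d * k, r * (d + k)


if __name__ == "__main__":
    w = [[1, 0, 0], [0, 1, 0]]  # frozen 2 x 3 weight (toy values)
    b = [[1], [2]]              # trainable 2 x 1 factor
    a = [[1, 1, 1]]             # trainable 1 x 3 factor
    print(lora_effective_weight(w, b, a, alpha=2))
    print(trainable_params(768, 768, 8))  # assumed layer size and rank
```

The parameter comparison suggests why a low-rank update can suit a few-shot setting: with the assumed sizes, the adapter trains far fewer values than the full matrix, which limits the capacity available for overfitting to a handful of samples.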

Adversarial Identity Injection for Semantic Face Image Synthesis

Authors: Tarollo, G.; Fontanini, T.; Ferrari, C.; Borghi, G.; Prati, A.

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Nowadays, deep learning models have reached incredible performance in the task of image generation. Many works in the literature address the task of face generation and editing, with both humans and automatic systems struggling to distinguish real images from generated ones. Although most systems have reached excellent visual generation quality, they still face difficulties in preserving the identity of the starting input subject. Among all the explored techniques, Semantic Image Synthesis (SIS) methods, whose goal is to generate an image conditioned on a semantic segmentation mask, are the most promising, even though preserving the perceived identity of the input subject is not their main concern. Therefore, in this paper, we investigate the problem of identity preservation in face image generation and present an SIS architecture that exploits a cross-attention mechanism to merge identity, style, and semantic features to generate faces whose identities are as similar as possible to the input ones. Experimental results reveal that the proposed method is not only suitable for preserving the identity but is also effective in the face recognition adversarial attack, i.e. hiding a second identity in the generated faces.

2024 Conference proceedings paper

DOI IRIS

AIGeN: An Adversarial Approach for Instruction Generation in VLN

Authors: Rawal, Niyati; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

2024 Conference proceedings paper

DOI IRIS

An IoT-enabled Software Architecture for User-Friendly Fault Diagnosis and Identification: The Welding Cobot Use Case

Authors: Bertoli, Annalisa; Ferraguti, Federica; Fantuzzi, Cesare

This paper proposes a software architecture for monitoring and diagnostics of failures in industrial systems. The architecture aims to support the operator's decision-making process by enabling a real-time and intuitive understanding of system faults. The paper describes the methodology and implementation process applied to a real industrial case: a welding collaborative robotic application. However, the proposed software architecture can be easily extended to a broader range of industrial systems. The core of the idea is based on an ecosystem of Internet of Things (IoT) elements deployed in the automation systems that collect the system status and alarms and stream them to the cloud server. The industrial use case described in the paper is a collaborative robot-assisted welding solution for automated MIG/MAG welding produced by 'Industria Tecnologica Italiana', an Italian SME, under the brand name 'MyWelder'. We investigated the system's impact on the operator's work and its effectiveness in supporting the operator's decision-making process. Additionally, the validation process assessed the system's functionalities within this specific use case. The primary objective related to the use case is to establish a strategy that minimizes the production of defective parts, ultimately reducing waste.

2024

DOI IRIS

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization

Authors: Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Few-shot learning (FSL) requires fine-tuning a pretrained model on a limited set of examples from novel classes. When applied to vision-and-language models, the dominant approach for FSL has been that of learning input prompts which can be concatenated to the input context of the model. Despite the considerable promise they hold, the effectiveness and expressive power of prompts are limited by the fact that they can only lie at the input of the architecture. In this article, we critically question the usage of learnable prompts, and instead leverage the concept of “implicit memory” to directly capture low- and high-level relationships within the attention mechanism at any layer of the architecture, thereby establishing an alternative to prompts in FSL. Our proposed approach, termed MemOp, exhibits superior performance across 11 widely recognized image classification datasets and a benchmark for contextual domain shift evaluation, effectively addressing the challenges associated with learnable prompts.

2024 Journal article

DOI IRIS