Publications - AImageLab

Efficient and Adaptive Deep Learning Methods for Automatic Data Capture Systems (original title: Metodi di Deep Learning Efficienti e Adattivi per Sistemi di Automatic Data Capture)

Authors: Vezzali, Enrico

Automatic Data Capture (ADC) systems are a fundamental technology for modern logistics, retail, and manufacturing, enabling traceability, automation, and process monitoring through the rapid acquisition of visual or encoded information. Among these technologies, barcodes remain one of the most widespread and cost-effective solutions for product identification. Despite their maturity, however, recognizing codes and symbols is still difficult in real industrial conditions, where lighting variations, blur, long distances, or low resolution reduce readability. Traditional computer vision algorithms, based on geometric analysis, morphological operators, or the Hough transform, are reliable in controlled settings but not when acquisition conditions deviate from nominal parameters. Deep learning techniques, by contrast, offer greater flexibility and robustness but require substantial computational resources, which limits their use on embedded platforms. Bridging this gap between accuracy and efficiency is therefore essential for the next generation of intelligent ADC systems. The thesis analyzes benchmarking, optimization, and deployment strategies for efficient deep learning models in industrial ADC applications. The work, carried out in collaboration with Datalogic S.p.A., focuses on integrating adaptive neural architectures into constrained, real-time environments. The first part addresses the shortage of open-source data and reproducible benchmarks for barcode localization. To this end, BarBeR (Barcode Benchmark Repository) was developed: a public framework with 8,748 annotated images that unifies classical approaches and deep learning methods under common protocols, ensuring fair comparisons and reproducibility.
The tests confirmed that, although deep models outperform traditional ones in accuracy, their computational cost remains an obstacle to real-time execution on embedded devices. To overcome this limitation, BaFaLo was proposed: a lightweight segmentation-based localizer optimized to run on CPUs without accelerators. Inspired by the Fast-SCNN paradigm, BaFaLo balances speed and precision, detecting small or degraded codes in difficult conditions while maintaining real-time performance. Because localization alone is not enough and codes must also be read under adverse conditions, Mosaic-SR was introduced: an adaptive multi-pass super-resolution method that allocates computational resources to the most complex regions. Guided by an uncertainty estimate, Mosaic-SR improves accuracy and latency over uniform approaches, enabling high-quality reconstructions on embedded hardware. The final part, carried out at the Integrated Systems Laboratory of ETH Zurich, concerns the quantization and deployment of generative models. By combining advanced strategies such as SVDQuant and cache quantization, the required memory was reduced by more than 50% without compromising quality or stability. These results pave the way for using generative models on resource-constrained platforms and for creating synthetic datasets when real or open-source data are insufficient. In summary, the thesis shows how efficient and adaptive deep learning makes advanced visual capabilities accessible to real-time ADC systems. Through benchmarking, optimization, and deployment of neural architectures for detection, enhancement, and generation, the work contributes to the evolution of industrial vision: from rigid, rule-based pipelines to flexible, data-driven solutions that remain reliable in real operating conditions.
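
The idea of spending more computation on harder regions can be illustrated with a small sketch. This is not the Mosaic-SR algorithm: the abstract does not describe how uncertainty is estimated or how passes are scheduled, so the pixel-variance proxy, the function names, and the thresholds below are all assumptions made for illustration only.

```python
from statistics import pvariance


def tile_uncertainty(tile):
    """Hypothetical uncertainty proxy: pixel variance of a tile.

    A real system would use a learned uncertainty estimate instead.
    """
    flat = [pixel for row in tile for pixel in row]
    return pvariance(flat)


def allocate_passes(tiles, max_passes=3, threshold=100.0):
    """Give each tile between 1 and max_passes refinement passes.

    An extra pass is granted while the tile's uncertainty exceeds a
    threshold that grows linearly with the passes already assigned.
    """
    passes = []
    for tile in tiles:
        uncertainty = tile_uncertainty(tile)
        n = 1
        while n < max_passes and uncertainty > threshold * n:
            n += 1
        passes.append(n)
    return passes


# Toy 2x2 grayscale tiles: uniform, mildly textured, high contrast.
flat_tile = [[10, 10], [10, 10]]
mild_tile = [[0, 24], [24, 0]]
busy_tile = [[0, 255], [255, 0]]
print(allocate_passes([flat_tile, mild_tile, busy_tile]))
```

Under these assumptions, uniform tiles receive a single pass while high-contrast tiles receive the maximum, which is the sense in which a non-uniform budget can reduce latency compared with refining every region equally.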

2026 Doctoral thesis

IRIS

Modulation of Aerobic Glycolysis Genes During the Progression of Retinitis Pigmentosa

Authors: Adani, E.; Vasquez, S. S. V.; Lovino, M.; Bighinati, A.; Cappellino, L.; D'Alessandro, S.; Kalatzis, V.; Marigo, V.

Published in: INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE

PURPOSE. Photoreceptors are retinal cells with a high glucose metabolism and retinal degeneration, specifically retinitis pigmentosa (RP), affects glycolysis. We aimed to evaluate changes in the expression of genes related to glucose metabolism in rod photoreceptors at different stages of retinal degeneration in murine models and human retinal organoids. METHODS. RNA sequencing (RNA-seq) analysis was performed on a photoreceptor-like cell line induced to undergo degeneration and validated by real-time qPCR analysis of retinas from two murine models and one human organoid model of RP. Bioinformatic analysis was performed on published RNA-seq datasets from three murine RP models. Real-time qPCR analysis was also performed on retinas treated with an adeno-associated virus type 2 vector carrying the neurotrophic H105A peptide, derived from the pigment epithelium-derived factor. RESULTS. The aerobic glycolysis genes, Hk2, Pkm1, Pkm2, Ldha, and Slc6a6 and other glucose metabolism genes were found downregulated in the in vitro model of photoreceptor degeneration and in the in vivo RhoP23H/+, rd1, and rd10 models at early stages of the disease. The decreased expression of the aerobic glycolysis genes, except for PKM2, was confirmed in human organoids with mutations in the USH2A gene associated with RP. Expression was partially recovered in RhoP23H/+ retinas after treatment with the adeno-associated virus type 2 vector expressing the neurotrophic H105A peptide. CONCLUSIONS. Glucose metabolism gene expression was found altered during the progression of RP in murine and human models of the disease. Expression was partially recovered in a molecular response to the treatment with the neurotrophic factor H105A.

2026 Journal article

DOI IRIS

Multi-Structure Segmentation in CBCT Volumes: the ToothFairy2 Challenge

Authors: Bolelli, Federico; Lumetti, Luca; Van Nistelrooij, Niels; Vinayahalingam, Shankeeth; Di Bartolomeo, Mattia; Marchesini, Kevin; Pellacani, Arrigo; Candeloro, Ettore; Rosati, Gabriele; Xi, Tong; Isensee, Fabian; Kirchhoff, Yannick; Krämer, Lars; Rokuss, Maximilian; Ulrich, Constantin; Maier-Hein, Klaus; Jiang, Yuxian; Liu, Yusheng; Wang, Lisheng; Wang, Haoshen; Chen, Siyu; Cui, Zhiming; Shi, Pengcheng; Pan, Zhaohong; Liang, Xiaokun; Ma, Qi; Konukoglu, Ender; Wodzinski, Marek; Müller, Henning; Mai, Haipeng; Dang, Xiaobing; Bhandary, Shrajan; Grosu, Radu; Bergé, Stefaan; Anesi, Alexandre; Grana, Costantino

Published in: MEDICAL IMAGE ANALYSIS

Cone-beam computed tomography (CBCT) is widely used for dento-maxillofacial diagnostics and treatment planning, and comprehensive multi-structure segmentation remains time-consuming, limiting large-scale, reproducible research. In this article, we present ToothFairy2, a MICCAI 2024 challenge on multi-structure segmentation in maxillofacial CBCT. The accompanying dataset comprises 530 CBCT volumes (480 public training, 50 hidden test) with expert 3D annotations of 42 classes, including maxilla, mandible, crowns, bridges, implants, inferior alveolar canals, maxillary sinuses, pharynx, and teeth using the International Tooth Numbering System (FDI). 26 international teams participated in ToothFairy2, and their methods were run and evaluated for voxel-wise multi-class segmentation using a standardized protocol. This report extends the evaluation of teeth to also investigate the current capabilities of tooth detection and FDI numbering. Furthermore, ranking stability was analyzed to assess the robustness of the final challenge outcome. Overall, challenge participants achieved consistently high performance for large, high-contrast structures such as jawbones, pharynx, and most teeth, while maxillary sinuses, dental restorations, and fine structures remain challenging due to class imbalance and metal artifacts. Analysis of tooth-related metrics further revealed that assigning correct FDI numbers was more challenging than delineating individual teeth. By releasing CBCT data, 3D annotations, baseline models, and evaluation code, ToothFairy2 establishes a long-term benchmark to drive the development of automated methods for robust, clinically meaningful multi-structure segmentation in maxillofacial CBCT.

2026 Journal article

DOI IRIS

Multimodal Understanding via Retrieval Augmentation: From Models to Evaluation (original title: Multimodal Understanding tramite Retrieval-Augmentation: dai Modelli alla Valutazione)

Authors: Sarto, Sara

In the field of Artificial Intelligence (AI), the introduction of the attention mechanism and the Transformer architecture has enabled models capable of processing multiple modalities at an unprecedented scale. This breakthrough stems from the flexibility of the attention operator and the adaptability of the architecture, which gave rise to a new generation of vision-language systems. Among the tasks at the intersection of Computer Vision, Natural Language Processing, and Multimedia, image captioning, that is, generating natural-language descriptions from visual content, has played a central role. In the era of Multimodal Large Language Models (MLLMs), captioning remains fundamental, alongside multimodal tasks such as Visual Question Answering (VQA). To strengthen such models, retrieval augmentation has emerged as a key strategy. Enrichment with relevant external knowledge improves adaptability and enables more accurate, context-sensitive answers, especially in complex or specialized scenarios. This thesis follows the natural evolution of retrieval augmentation, from its first applications in image captioning to its integration into modern MLLMs. Each stage builds on the insights and challenges encountered, addressing open problems related to evaluation and retrieval effectiveness. The first part of the thesis lays the foundations of retrieval-augmented vision-language models. Classical cross-modal retrieval techniques are analyzed and extended to more complex scenarios, including multimodal queries and heterogeneous document collections. A central insight is that retrieval quality critically affects overall performance. In response, new multimodal retrievers designed for such scenarios, ReT and ReT-2, are introduced.
The thesis also investigates retrieval-augmented captioning architectures through the introduction of RA-Transformer, in which external knowledge is integrated directly into the generation process, providing signals that help produce richer and more precise captions. The work then extends retrieval augmentation to MLLMs, motivated by the fact that even large-scale pretraining shows limits when handling knowledge-intensive or domain-specific queries. In particular, WikiLLaVA introduces retrieval-augmented MLLM architectures for knowledge-based VQA, in which retrieval mechanisms strengthen reasoning capabilities and adaptability to complex multimodal queries. Over the course of the research, it emerges that progress in captioning models is limited by the lack of robust and reliable evaluation metrics. Traditional metrics, although widely used, often fail to capture semantic adequacy, factual grounding, and linguistic fluency. A further contribution of this thesis is therefore the design and analysis of new evaluation metrics for image captioning, namely PAC-S, BRIDGE, and an improved version of PAC-S. These metrics are designed to align with human judgment and to capture the quality of descriptions. The thesis also analyzes their application across different benchmarks and domains, including their ability to evaluate captions generated by MLLMs, reflecting the shift of captioning from a standalone task to a component of broader multimodal reasoning systems. Overall, through new retrieval-augmented captioning architectures, multimodal retrievers, and evaluation metrics, this thesis provides methodologies, tools, and contributions that advance the state of the art in multimodal Artificial Intelligence.

2026 Doctoral thesis

IRIS

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.

2026 Journal article

IRIS

Multiomic integration reveals tumoral heterogeneity of lipid dependence within lethal group 3 medulloblastoma

Authors: Bernardi, F.; Torrejon, J.; Basili, I.; Van Ommeren, R.; Marsaud, V.; Yu, H.; Talbot, J.; Souphron, J.; Indersie, E.; Forget, A.; Bonneau, B.; Massiot, A.; Alcazar, C.; Figeac, L.; Bonerandi, E.; Cancila, G.; Sirbu, O.; Yadav, N.; Mohanakrishnan, D.; Lombard, B.; Loew, D.; Poullet, P.; Liva, S.; Lovino, M.; Lin, I. H.; Nakashima, T.; Gharsalli, T.; Nicolas, P. A.; Yubuki, N.; Ribas, R. A.; Colsch, B.; Chu-Van, E.; Castelli, F.; Sampaio, J. L.; Leboucher, S.; Lasgi, C.; Besse, L.; Soler, M. N.; Lo Re, V.; Planque, N.; Abeysundara, N.; Balin, P.; Wang, H.; Su, H.; Wu, X.; Cavalli, F. M. G.; Saulnier, O.; Ficarra, E.; Di Marcotullio, L.; Kumegawa, K.; Maruyama, R.; Kawauchi, D.; Picard, D.; Remke, M.; Riffaud, L.; Puiseux, C.; Bouchoucha, Y.; Huybrechts, S.; Simbozel, M.; Bourdeaut, F.; Varlet, P.; Puget, S.; Blauwblomme, T.; Andrianteranagna, M.; Planchon, J. M.; Dugourd, A.; Saez-Rodriguez, J.; Barillot, E.; Servant, N.; Martignetti, L.; Rich, J.; Kool, M.; Pfister, S. M.; Agnihotri, S.

Published in: CANCER CELL

Medulloblastoma, the most common malignant brain tumor of childhood, exhibits significant biological complexity that demands deeper exploration. Here, we present a large multiomics dataset integrating data from 384 primary medulloblastoma patient samples across five omic layers: CpG methylome, transcriptome, proteome, phosphoproteome, and metabolome, paired with associated clinical metadata. Data integration revealed intertumoral heterogeneity of lipid metabolism across proteomic subtypes. Notably, while the MYC-FASN-SCD axis drives lipid biosynthesis, pathway inhibition elicits a compensatory escape mechanism in vivo through exogenous fatty acid uptake. Unexpectedly, we demonstrated that MYC triggers lipid storage, creating a unique dependency on lipid droplet-mitochondria communications to sustain tumor maintenance in vivo. Together, this comprehensive analysis reveals a targetable vulnerability downstream of MYC that constitutes a promising therapeutic approach to treat currently untreatable medulloblastoma subtypes.

2026 Journal article

DOI IRIS

PATHOS: Pathology attention framework for treatment response stratification in ovarian high-grade serous carcinomas following neoadjuvant chemotherapy on H&E images

Authors: Miccolis, F.; Lovino, M.; Lehtonen, O.; Hynninen, J.; Hautaniemi, S.; Virtanen, A.; Ficarra, E.

Published in: JOURNAL OF PATHOLOGY INFORMATICS

Ovarian high-grade serous carcinoma (ovarian HGSC) is a clinically challenging disease with a poor prognosis, particularly for patients receiving neoadjuvant chemotherapy (NACT) before debulking surgery. In this study, we evaluate the progression-free interval (PFI) after NACT based on hematoxylin and eosin-stained whole-slide images (WSIs) of omental tumor tissue. Digital pathology tools are emerging, aiming at assisting pathologists in diagnosis and analysis; however, distinguishing features associated with response to NACT remain elusive. Multiple instance learning (MIL) coupled with attention mechanisms has shown promise in predicting treatment response from WSIs. Additionally, segmentation tools can identify and delineate regions in WSIs. Whereas some efforts have been made to develop explainable models for clinical outcome, there remains a need for genuinely interpretable models for pathologists. This article introduces the PATHOS framework, a novel approach to explaining crucial features of treatment response based on the PFI time in NACT treated patients from WSIs. PATHOS is composed of three blocks: (1) MIL block to identify informative regions, (2) panoptic segmentation and downstream analysis block for feature computation, and (3) classification block to predict the PFI. The results demonstrate that PATHOS enhances the interpretability of response to NACT in ovarian HGSC patients by highlighting pathologically significant features relevant to PFI prediction, such as tumor cell morphology, stromal abundance, and the spatial distribution of stromal regions. Furthermore, PATHOS identifies approximately 10% of the total WSI area as an informative region for clinical outcome.
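
As a rough illustration of the first block, attention-based MIL pooling turns per-tile attention scores into normalized weights, aggregates tile embeddings into a slide-level vector, and exposes the highest-weighted tiles as candidate informative regions. The sketch below is generic attention pooling, not the PATHOS implementation: the function names are hypothetical, the scores are given rather than learned, and the 10% selection fraction simply mirrors the proportion reported in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, scores):
    """Aggregate tile embeddings into one bag-level vector.

    Returns the weighted sum of the embeddings and the attention weights.
    """
    weights = softmax(scores)
    dim = len(embeddings[0])
    bag = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
           for d in range(dim)]
    return bag, weights


def top_fraction(weights, fraction=0.10):
    """Indices of the highest-weighted tiles covering the given fraction."""
    k = max(1, round(len(weights) * fraction))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Ten toy tiles with 2-D embeddings; tile 7 gets a much higher raw score.
tile_embeddings = [[float(i), 1.0] for i in range(10)]
raw_scores = [0.0] * 10
raw_scores[7] = 5.0
slide_vector, attention = attention_pool(tile_embeddings, raw_scores)
print(top_fraction(attention))
```

In a trained model the raw scores would come from a small attention network over the tile embeddings; here they are hard-coded so the selection step can be followed by hand.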

2026 Journal article

DOI IRIS

PopEYE - Infrared Ocular Image Dataset for Eye State and Gaze-Direction Classification

Authors: Gibertoni, Giovanni; Borghi, Guido; Rovati, Luigi

The PopEYE dataset is a specialized collection of 14,976 near-infrared (NIR) images of the human eye region, specifically designed to support the development and benchmarking of computer vision algorithms for eye-state detection and coarse gaze-direction classification. Each image is provided in a fixed resolution of 772 × 520 pixels in 8-bit grayscale PNG format. The acquisition was performed frontally using a custom-developed Maxwellian-view optical configuration, consisting of a board-level CMOS camera and a specialized lens system where the subject's eye is precisely positioned at the focal point. This setup ensures a high-contrast representation of the anterior segment, making the pupil, iris, limbus, and portions of the sclera and eyelids clearly distinguishable under stable 850 nm infrared illumination. The dataset is categorized into six mutually exclusive classes identified through manual annotation supported by fixed visual aids and an expert system algorithm. The classification includes a correct positioning class for eyes open and properly aligned for clinical measurements (8,160 images), a closed class representing full eye closures such as blinks or sustained lid closure (1,790 images), and four directional classes representing gaze shifts relative to the central optical axis, specifically up (1,379 images), down (1,015 images), left (1,296 images), and right (1,336 images). The data captures the natural anatomical variability of 22 subjects and incorporates common real-world artifacts such as specular reflections from NIR sources and partial pupil occlusions by eyelashes or eyelids. By providing standardized labels and high-resolution NIR imagery, PopEYE serves as a robust resource for training machine learning models intended for real-time patient monitoring during ophthalmic examinations.
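
The class counts quoted above can be checked against the stated total, and they show a clear imbalance toward the correct-positioning class. The sketch below does that arithmetic and derives inverse-frequency class weights, one common way to compensate for imbalance during training. The label strings are assumptions (the abstract does not give the dataset's actual label names), and the weighting scheme is an illustrative choice, not something the dataset prescribes.

```python
# Per-class image counts as reported in the abstract.
# The key names are hypothetical labels chosen for readability.
class_counts = {
    "correct_positioning": 8160,
    "closed": 1790,
    "up": 1379,
    "down": 1015,
    "left": 1296,
    "right": 1336,
}

total = sum(class_counts.values())

# Fraction of the dataset belonging to each class.
shares = {name: count / total for name, count in class_counts.items()}

# Inverse-frequency weights: total / (number_of_classes * class_count).
class_weights = {
    name: total / (len(class_counts) * count)
    for name, count in class_counts.items()
}

for name in class_counts:
    print(f"{name:20s} {shares[name]:6.1%}  weight={class_weights[name]:.2f}")
```

The six counts sum to the 14,976 images stated in the abstract, with a little over half of them in the correct-positioning class.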

2026 Dataset

DOI IRIS

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Authors: Mattioli, Gabriele; Turri, Evelyn; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources — such as APIs, computational utilities, and specialized models — to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
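
The retrieval-based formulation can be sketched in a few lines: a task description is matched against textual tool descriptions, and the best-scoring tool is returned, so adding a tool only requires adding its description. This is a minimal sketch under strong assumptions. It uses bag-of-words cosine similarity in place of the learned representations the paper relies on, it skips the MLLM step that produces the task description, and the tool names and descriptions are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy text representation: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tool(task_description, tool_descriptions):
    """Return the name of the tool whose description best matches the task."""
    query = embed(task_description)
    return max(
        tool_descriptions,
        key=lambda name: cosine(query, embed(tool_descriptions[name])),
    )


# Hypothetical tool registry (names and descriptions are made up).
tools = {
    "depth-estimator": "estimate depth map from a single image",
    "speech-recognizer": "transcribe speech audio into text",
}
print(select_tool("estimate the depth of objects in this image", tools))

# Open-world extension: register a new tool without any retraining.
tools["image-captioner"] = "generate a caption describing an image"
print(select_tool("generate a caption for the photo", tools))
```

Because selection depends only on the descriptions available at query time, the second call can pick a tool that did not exist when the first call was made.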

2026 Conference proceedings paper

IRIS

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Compagnoni, Alberto; Morini, Marco; Sarto, Sara; Cocchi, Federico; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.

2026 Conference proceedings paper

IRIS