Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.


Histological Brain Imaging Super-resolution with Frequency-guided Diffusion Models

Authors: Casari, Giovanni; Bolelli, Federico; Grana, Costantino


High-resolution histological imaging provides essential detail for quantitative brain modeling, yet acquiring whole-brain data at micrometer scale remains technically and economically challenging. This work introduces Brain-SR, a diffusion-based super-resolution framework designed to reconstruct high-resolution cortical sections from low-resolution BigBrain data. Building upon the InvSR paradigm, our method performs resolution enhancement in the latent space of a pretrained variational autoencoder, guided by a task-specific noise-predictor network. A key contribution is a frequency-domain supervision term that compares the magnitude spectra of predicted and target patches, enforcing spectral consistency while remaining robust to local misalignments. Quantitative evaluations demonstrate that Brain-SR achieves substantial improvements in LPIPS (-27%) and FID (-58%) compared to the baseline diffusion super-resolution approach, while spectral analysis confirms accurate recovery of the frequency distribution. The resulting reconstructions preserve neuronal structures consistent with high-resolution references, offering a practical step toward large-scale, morphologically faithful brain histology reconstruction. The code is publicly available to support reproducibility: https://github.com/AImageLab-zip/Brain-SR.
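
The abstract does not give the exact form of the frequency-domain term, but a minimal sketch of such a loss, comparing FFT magnitude spectra and discarding phase so that small local misalignments are tolerated, could look like the following (PyTorch; the log-compression and loss weighting are assumptions, not the paper's formulation):

```python
import torch
import torch.nn.functional as F

def frequency_magnitude_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Compare magnitude spectra of predicted and target patches.

    Operating on |FFT| only discards phase, which is what makes this
    kind of term tolerant to small misalignments between patches.
    Expected shapes: (B, C, H, W).
    """
    pred_mag = torch.fft.rfft2(pred, norm="ortho").abs()
    target_mag = torch.fft.rfft2(target, norm="ortho").abs()
    # L1 on log-magnitudes compresses the dynamic range so that low
    # frequencies do not dominate the loss.
    return F.l1_loss(torch.log1p(pred_mag), torch.log1p(target_mag))

# Example: combine with a standard reconstruction term (weight is illustrative).
pred = torch.rand(4, 1, 64, 64, requires_grad=True)
target = torch.rand(4, 1, 64, 64)
loss = F.l1_loss(pred, target) + 0.1 * frequency_magnitude_loss(pred, target)
loss.backward()
```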

2026 Conference paper

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

Authors: Cappelletti, Silvia; Poppi, Tobia; Poppi, Samuele; Yong, Zheng-Xin; Garcia-Olano, Diego; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


Large Language Models (LLMs) are traditionally evaluated on multiple-choice question answering (MCQA) tasks using First-Token Probability (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (misalignment) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (misinterpretation), undermining the reliability of symbolic evaluation. We propose a simple solution: output prefilling, a structured natural-language prefix (e.g., 'The correct option is:') prepended to the model output. Originally explored in AI safety as an attack strategy, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Through extensive evaluation, we find that the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our analysis suggests that prefilling is a simple, robust, and zero-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
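
As a rough illustration of the idea, the sketch below scores answer options by first-token probability after appending the prefill string to the prompt. The model name, completion-style prompt layout (rather than a chat template), and A-D option format are placeholder assumptions; the paper's actual setup may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def first_token_answer(question, options, prefill="The correct option is:"):
    """Score each option letter by its first-token probability after
    prepending a structured prefix to the model's output."""
    prompt = (
        question + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip("ABCD", options))
        + "\n" + prefill  # the prefilled output prefix steers the next token
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    letter_ids = [tokenizer.encode(f" {l}", add_special_tokens=False)[0]
                  for l in "ABCD"]
    return "ABCD"[int(logits[letter_ids].argmax())]
```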

2026 Conference paper

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Authors: Lobba, Davide; Sanguigni, Fulvio; Ren, Bin; Cornia, Marcella; Cucchiara, Rita; Sebe, Nicu


Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category framework for VTOFF. Our architecture is built on a dual DiT-based backbone equipped with a multimodal attention mechanism that jointly exploits image, text, and mask information to resolve visual ambiguities and enable robust feature learning across garment categories. To explicitly mitigate detail degradation, we further design an alignment module that refines garment structures and textures, ensuring high-quality outputs. Extensive experiments on VITON-HD and Dress Code show that TEMU-VTOFF achieves new state-of-the-art performance, substantially improving both visual realism and consistency with target garments. Code and models are available at: https://temu-vtoff-page.github.io/.
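
The dual DiT-based backbone is beyond a short snippet, but the core idea of an attention layer that jointly exploits image, text, and mask information can be suggested by concatenating the three token streams into one shared attention context. All dimensions and projections below are invented for the sketch, not taken from TEMU-VTOFF:

```python
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    """Garment latents attend jointly over image, text, and mask tokens
    by concatenating the three streams into one key/value context."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, garment, image_tok, text_tok, mask_tok):
        context = torch.cat([image_tok, text_tok, mask_tok], dim=1)
        out, _ = self.attn(garment, context, context)
        return out

layer = MultimodalAttention()
out = layer(torch.rand(2, 64, 256),   # garment latent tokens (query)
            torch.rand(2, 196, 256),  # image tokens
            torch.rand(2, 77, 256),   # text tokens
            torch.rand(2, 196, 256))  # mask tokens
```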

2026 Conference paper

Modulation of Aerobic Glycolysis Genes During the Progression of Retinitis Pigmentosa

Authors: Adani, E.; Vasquez, S. S. V.; Lovino, M.; Bighinati, A.; Cappellino, L.; D'Alessandro, S.; Kalatzis, V.; Marigo, V.

Published in: INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE


PURPOSE. Photoreceptors are retinal cells with high glucose metabolism, and retinal degeneration, specifically retinitis pigmentosa (RP), affects glycolysis. We aimed to evaluate changes in the expression of genes related to glucose metabolism in rod photoreceptors at different stages of retinal degeneration in murine models and human retinal organoids. METHODS. RNA sequencing (RNA-seq) analysis was performed on a photoreceptor-like cell line induced to undergo degeneration and validated by real-time qPCR analysis of retinas from two murine models and one human organoid model of RP. Bioinformatic analysis was performed on published RNA-seq datasets from three murine RP models. Real-time qPCR analysis was also performed on retinas treated with an adeno-associated virus type 2 vector carrying the neurotrophic H105A peptide, derived from the pigment epithelium-derived factor. RESULTS. The aerobic glycolysis genes Hk2, Pkm1, Pkm2, Ldha, and Slc6a6, together with other glucose metabolism genes, were found to be downregulated in the in vitro model of photoreceptor degeneration and in the in vivo RhoP23H/+, rd1, and rd10 models at early stages of the disease. The decreased expression of the aerobic glycolysis genes, except for PKM2, was confirmed in human organoids with mutations in the USH2A gene associated with RP. Expression was partially recovered in RhoP23H/+ retinas after treatment with the adeno-associated virus type 2 vector expressing the neurotrophic H105A peptide. CONCLUSIONS. Glucose metabolism gene expression was found to be altered during the progression of RP in murine and human models of the disease. Expression was partially recovered as a molecular response to the treatment with the neurotrophic factor H105A.

2026 Journal article

Multimodal Understanding through Retrieval-Augmentation: from Models to Evaluation

Authors: Sarto, Sara


In the field of Artificial Intelligence (AI), the introduction of the attention mechanism and the Transformer architecture has enabled models capable of processing multiple modalities at an unprecedented scale. This breakthrough stems from the flexibility of the attention operator and the adaptability of the architecture, which have given rise to a new generation of vision-and-language systems. Among the tasks at the intersection of Computer Vision, Natural Language Processing, and Multimedia, image captioning, i.e., the generation of natural-language descriptions from visual content, has played a central role. In the era of Multimodal Large Language Models (MLLMs), captioning remains fundamental, alongside multimodal tasks such as Visual Question Answering (VQA). To strengthen such models, retrieval augmentation has emerged as a key strategy: enriching them with relevant external knowledge improves adaptability and enables more accurate, context-aware answers, especially in complex or specialized scenarios. This thesis traces the natural evolution of retrieval augmentation, moving from its early applications in image captioning to its integration into modern MLLMs. Each stage builds on the insights and challenges encountered along the way, addressing open problems related to evaluation and retrieval effectiveness.

The first part of the thesis establishes the foundations of retrieval-augmented vision-and-language models. Classical cross-modal retrieval techniques are analyzed and extended to more complex scenarios, including multimodal queries and heterogeneous document collections. A central insight is that retrieval quality critically affects overall performance. In response, new multimodal retrievers, ReT and ReT-2, designed for such scenarios, are introduced. The thesis also investigates retrieval-augmented captioning architectures through the introduction of the RA-Transformer, in which external knowledge is integrated directly into the generation process, providing signals that help produce richer and more precise captions. The work then extends retrieval augmentation to MLLMs, motivated by the observation that even large-scale pretraining falls short on knowledge-intensive or domain-specific queries. In particular, WikiLLaVA introduces retrieval-augmented MLLM architectures for knowledge-based VQA, in which retrieval mechanisms strengthen reasoning capabilities and adaptability to complex multimodal queries.

Throughout this research, it emerges that progress in captioning models is limited by the lack of robust and reliable evaluation metrics. Traditional metrics, although widely used, often fail to capture semantic adequacy, factual grounding, and linguistic fluency. A further contribution of this thesis is therefore the design and analysis of new evaluation metrics for image captioning, namely PAC-S, BRIDGE, and an improved version of PAC-S. These metrics are designed to align with human judgment and to capture the quality of descriptions. The thesis also analyzes their application across different benchmarks and domains, including their ability to evaluate captions generated by MLLMs, reflecting the shift of captioning from a standalone task to a component of broader multimodal reasoning systems.

Overall, through new retrieval-augmented captioning architectures, multimodal retrievers, and evaluation metrics, this thesis provides methodologies, tools, and contributions that advance the state of the art in multimodal Artificial Intelligence.
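
As a toy illustration of the cross-modal retrieval step that underpins retrieval augmentation, the following sketch ranks a document collection by cosine similarity to a query embedding. The random tensors are stand-ins; a real system would use trained multimodal encoders such as the ReT retrievers proposed in the thesis:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, doc_embs, k=5):
    """Rank a collection by cosine similarity to a (multimodal) query
    embedding and return the indices of the k best matches."""
    query = F.normalize(query_emb, dim=-1)
    docs = F.normalize(doc_embs, dim=-1)
    return (docs @ query).topk(k).indices  # cosine scores per document

# Random stand-in embeddings: 1000 documents, 512-dimensional space.
hits = retrieve_top_k(torch.rand(512), torch.rand(1000, 512))
```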

2026 Doctoral thesis

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS


Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating conditioning details at different levels of granularity. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
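
One way to picture the granularity-specific conditioning described above is a denoising block whose cross-attention attends either to text or to texture embeddings depending on its position in the network. The sketch below is illustrative only; the layer sizes, residual wiring, and text/texture assignment are assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Denoising block whose cross-attention reads either text or texture
    embeddings, mimicking granularity-specific conditioning."""
    def __init__(self, dim=320, heads=8, source="text"):
        super().__init__()
        self.source = source
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents, text_emb, texture_emb):
        cond = text_emb if self.source == "text" else texture_emb
        out, _ = self.cross_attn(latents, cond, cond)
        return latents + out  # residual connection, as in standard U-Net blocks

blocks = [ConditionedBlock(source=s) for s in ("text", "texture", "text")]
z = torch.rand(2, 64, 320)  # noisy latent tokens
for blk in blocks:
    z = blk(z, torch.rand(2, 77, 320), torch.rand(2, 16, 320))
```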

2026 Journal article

PopEYE - Infrared Ocular Image Dataset for Eye State and Gaze-Direction Classification

Authors: Gibertoni, Giovanni; Borghi, Guido; Rovati, Luigi


The PopEYE dataset is a specialized collection of 14,976 near-infrared (NIR) images of the human eye region, specifically designed to support the development and benchmarking of computer vision algorithms for eye-state detection and coarse gaze-direction classification. Each image is provided in a fixed resolution of 772 × 520 pixels in 8-bit grayscale PNG format. The acquisition was performed frontally using a custom-developed Maxwellian-view optical configuration, consisting of a board-level CMOS camera and a specialized lens system where the subject's eye is precisely positioned at the focal point. This setup ensures a high-contrast representation of the anterior segment, making the pupil, iris, limbus, and portions of the sclera and eyelids clearly distinguishable under stable 850 nm infrared illumination. The dataset is categorized into six mutually exclusive classes identified through manual annotation supported by fixed visual aids and an expert system algorithm. The classification includes a correct positioning class for eyes open and properly aligned for clinical measurements (8,160 images), a closed class representing full eye closures such as blinks or sustained lid closure (1,790 images), and four directional classes representing gaze shifts relative to the central optical axis, specifically up (1,379 images), down (1,015 images), left (1,296 images), and right (1,336 images). The data captures the natural anatomical variability of 22 subjects and incorporates common real-world artifacts such as specular reflections from NIR sources and partial pupil occlusions by eyelashes or eyelids. By providing standardized labels and high-resolution NIR imagery, PopEYE serves as a robust resource for training machine learning models intended for real-time patient monitoring during ophthalmic examinations.
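
A minimal PyTorch loader for the six classes described above might look as follows, assuming a folder-per-class layout; the actual distribution format of PopEYE may differ:

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

# Class names follow the six categories in the abstract; the
# folder-per-class layout is an assumption for this sketch.
CLASSES = ["correct", "closed", "up", "down", "left", "right"]

class PopEyeDataset(Dataset):
    """Loads 772x520 8-bit grayscale PNGs, one subfolder per class."""
    def __init__(self, root: str, transform=None):
        self.items = [
            (path, CLASSES.index(path.parent.name))
            for path in sorted(Path(root).glob("*/*.png"))
        ]
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        image = Image.open(path).convert("L")  # keep single-channel NIR
        if self.transform:
            image = self.transform(image)
        return image, label
```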

2026 Dataset

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Authors: Mattioli, Gabriele; Turri, Evelyn; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources — such as APIs, computational utilities, and specialized models — to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
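
The retrieval-based formulation can be sketched as matching an MLLM-produced task description against embedded tool descriptions. The encoder, the toy tool catalog, and the tool names below are placeholders, not the paper's components:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical catalog of machine-readable tool descriptions (in RaTA-Tool
# these are derived from Hugging Face model cards).
TOOLS = {
    "image-captioner": "Generates a natural-language description of an input image.",
    "ocr-reader": "Extracts printed or handwritten text from an image.",
    "speech-transcriber": "Transcribes spoken audio into text.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in text encoder
tool_embs = encoder.encode(list(TOOLS.values()), normalize_embeddings=True)

def select_tool(task_description: str) -> str:
    """Match a structured task description (assumed here to be produced by
    an MLLM from the multimodal query) against the tool descriptions."""
    query = encoder.encode(task_description, normalize_embeddings=True)
    scores = tool_embs @ query  # cosine similarity (embeddings are normalized)
    return list(TOOLS)[int(scores.argmax())]

print(select_tool("Read the text visible on the sign in this photo."))
```

Because new tools only require adding a description to the catalog, this formulation extends to unseen tools without retraining, which is the extensibility property the abstract emphasizes.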

2026 Conference paper

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Compagnoni, Alberto; Morini, Marco; Sarto, Sara; Cocchi, Federico; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita


Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
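
The retrieve-filter-generate loop described above can be outlined as follows; every component (retriever, critic, generator) is a placeholder callable, and the cutoff values are invented for the sketch:

```python
def reasoning_augmented_answer(question, image, retriever, critic, generator,
                               coarse_k=20, fine_k=5, threshold=0.5):
    """Two-stage retrieve-filter-generate loop.

    retriever(question, image, k)        -> list of candidate passages
    critic(question, passage)            -> relevance score in [0, 1]
    generator(question, image, passages) -> answer string
    """
    # Coarse retrieval casts a wide net over the document collection.
    candidates = retriever(question, image, k=coarse_k)
    # The critic model filters noisy passages before generation.
    scored = sorted(((critic(question, p), p) for p in candidates),
                    key=lambda t: t[0], reverse=True)
    kept = [p for s, p in scored if s >= threshold][:fine_k]
    # Generation is conditioned only on the retained high-quality context.
    return generator(question, image, kept)
```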

2026 Conference paper

Searching for New Possible Peripheral Biomarkers of Cognitive Decline in Down Syndrome: The Role of IL-18 Pathway and its Interaction with TGF-β1 and TNF-α

Authors: Grasso, M.; Fidilio, A.; L'Episcopo, F.; Recupero, M.; Barone, C.; Lovino, M.; Alboni, S.; Bacalini, M. G.; Caruso, G.; Greco, D.; Buono, S.; De La Torre, R.; Tascedda, F.; Blom, J. M.; Benatti, C.; Caraci, F.

Published in: NEUROMOLECULAR MEDICINE


Down syndrome (DS) represents one of the most common genetic disorders, attributable to a partial or complete trisomy of chromosome 21, and affects about 1 in 700 individuals at birth. The diagnosis of Alzheimer's Disease (AD)-related cognitive decline in this population requires new approaches and new biomarkers that comprehensively assess health status and early cognitive decline. In this observational study, we explored for the first time the relationship of IL-18, a cytokine of the IL-1 family involved in both innate and acquired immune responses, with DS-associated cognitive decline. We observed that plasma total IL-18 in subjects with DS over 35 years of age, with and without AD-related cognitive decline, and plasma concentrations of its binding protein in subjects with DS (19-35 years) were correlated with lower plasma concentrations of Transforming Growth Factor beta 1 (TGF-β1), which are linked to an increased rate of cognitive decline in adults with DS. In addition, we found a significant association between low baseline concentrations of Free IL-18, the active form of the cytokine, and an increased rate of cognitive decline at 12 months, calculated as the delta of the Test for Severe Impairment (dTSI), in individuals with DS (19-35 years). Finally, we demonstrated a reduction of the Free IL-18/TNF-α ratio, considered a possible new composite biomarker, in both young and older adult DS subjects without AD-related cognitive decline (area under the receiver operating characteristic curve (AUC) of 0.82 and 0.71, respectively), suggesting the advantage of composite biomarkers over single biomarkers in discriminating patients from healthy individuals.
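
For readers unfamiliar with the AUC analysis, the snippet below shows how such a composite-biomarker discrimination would be scored with scikit-learn. All values are synthetic and purely illustrative, not the study's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic example: a composite Free IL-18/TNF-α ratio evaluated as a
# classifier between groups (label 1 = no AD-related cognitive decline,
# label 0 = decline), mirroring the AUC analysis described above.
rng = np.random.default_rng(0)
ratio = np.concatenate([rng.normal(0.8, 0.2, 30),   # group without decline
                        rng.normal(1.2, 0.3, 30)])  # group with decline
labels = np.array([1] * 30 + [0] * 30)
# The ratio is reported as reduced in subjects without decline, so the
# discriminating score is the negated ratio.
print("AUC:", roc_auc_score(labels, -ratio))
```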

2026 Journal article
