Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Tip: type @ to pick an author and # to pick a keyword.

Enabling 8B Bitwise Autoregressive Image Generation on Edge GPUs

Authors: Vezzali, Enrico; Bolelli, Federico; Grana, Costantino; Benini, Luca; Li, Yawei

Visual Autoregressive (VAR) models face a severe "Memory Wall" on edge devices due to large model size and substantial KV-cache … (Read full abstract)

Visual Autoregressive (VAR) models face a severe "Memory Wall" on edge devices due to large model size and substantial KV-cache requirements. In this work, we analyze the Infinity VAR family (2B and 8B) and propose a compression pipeline for deployment on constrained NVIDIA Jetson systems. We diagnose critical bottlenecks: activation outliers reaching 353x the median and channel-skewed cache variance. To address this, we propose a hybrid pipeline combining SVDQuant—to structurally decouple weight outliers—and Asymmetric Per-Channel KV8 quantization. Our approach reduces the Infinity-8B footprint by 64% (37.1GB →13.3GB), fitting it on the mid-range Orin NX with a 4.1x speedup over Flux.1-dev (W4A4), while achieving superior aesthetic alignment (ImageReward 1.13 vs 0.935). Crucially, we also unlock entry-level feasibility for the Infinity-2B, compressing it from 16.0 to 7.71 GB to enable deployment on the Orin Nano. These results establish a new efficiency standard for high-fidelity generative AI at the edge. The code is available at https://github.com/Henvezz95/deepcompressor.

2026 Relazione in Atti di Convegno

Evoluzione della Conoscenza nell’Intelligenza Artificiale: Verso Reti Neurali Profonde Robuste e Modulari

Authors: Capitani, Giacomo

Le reti neurali profonde sono diventate un pilastro fondamentale dell’Intelligenza Artificiale moderna grazie alla loro straordinaria efficacia e versatilità. Tuttavia, … (Read full abstract)

Le reti neurali profonde sono diventate un pilastro fondamentale dell’Intelligenza Artificiale moderna grazie alla loro straordinaria efficacia e versatilità. Tuttavia, le loro capacità di generalizzazione dipendono tipicamente dall’assunzione che i dati siano indipendenti e distribuiti in modo identico, una condizione raramente soddisfatta negli scenari reali, dinamici ed evolutivi. Quando le distribuzioni dei dati variano, i modelli tendono a sfruttare scorciatoie (inclusi bias spurî e impliciti), a soffrire di catastrophic forgetting e a mostrare capacità compositive limitate. La presente tesi esplora come i modelli neurali possano essere guidati ad adattare, preservare, trasferire e comporre le proprie capacità oltre il semplice data fitting. La prima parte si concentra sulla mitigazione del bias in assenza di attributi protetti espliciti. Si sfruttano cluster latenti per formare gruppi semantici proxy che orientano l’ottimizzazione lontano dall’apprendimento di scorciatoie, migliorando così la robustezza. L’analisi viene poi estesa al continual learning, dove le strategie basate su rehearsal possono introdurre o amplificare correlazioni spurie se i segnali di debiasing non vengono gestiti correttamente. Per affrontare tale problema, vengono proposti meccanismi di rehearsal bilanciati, capaci di mantenere l’equilibrio in termini di valori di loss e mitigare correlazioni spurie sotto cambiamenti di distribuzione. La seconda parte indaga i modelli multimodali visione–linguaggio, rivelando che architetture simili a CLIP manifestano bias impliciti analoghi a quelli umani. Si introducono tecniche leggere di prompt steering per ridurre i bias impliciti nei compiti di image retrieval e classificazione. Successivamente, viene analizzato lo spazio dei parametri per determinare quando i task vector mantengono conoscenza trasferibile tra modelli addestrati su dataset distinti, e vengono definite procedure di allineamento basate su permutazioni per consentire il trasporto di conoscenza tra modelli. Infine, si dimostra che le proprietà geometriche del loss landscape, in particolare la sua piattezza, predicono la compatibilità tra modelli fine-tuned derivati da un pretraining comune, con applicazioni pratiche nella segmentazione medica 3D. Analisi sperimentali approfondite su diversi dataset e paradigmi di apprendimento supportano questi risultati. Complessivamente, i contributi delineano un quadro a quattro assi della generalizzazione nelle reti neurali: (i) mitigazione dell’apprendimento di scorciatoie a livello di dati e feature; (ii) prevenzione delle correlazioni spurie nel continual learning; (iii) disambiguazione semantica nell’allineamento multimodale; (iv) manipolazione della geometria dello spazio dei parametri per il trasferimento di conoscenza e il model merging. Attraverso questa prospettiva, la tesi propone principi e metodologie per lo sviluppo di sistemi neurali adattivi le cui capacità possano essere mantenute, trasferite e composte in modo robusto.

2026 Tesi di dottorato

Experience-dependent modulation of neural mechanisms underlying olfactory identification: preliminary evidence

Authors: Ricci, F.; Casadio, C.; Zanelli, V.; Carpentiero, O.; Caselli, M.; Nandi, A.; Masino, F.; Lui, F.; Benuzzi, F

2026 Relazione in Atti di Convegno

FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation

Authors: Saporita, Alessia; Pipoli, Vittorio; Bolelli, Federico; Baraldi, Lorenzo; Acquaviva, Andrea; Ficarra, Elisa

Multimodal Large Language Models (MLLMs) have achieved impressive performance across a variety of vision–language tasks. However, their internal working mechanisms … (Read full abstract)

Multimodal Large Language Models (MLLMs) have achieved impressive performance across a variety of vision–language tasks. However, their internal working mechanisms remain largely underexplored. In his work, we introduce FG-TRACER, a framework designed to analyze the information flow between visual and textual modalities in MLLMs in free-form generation. Notably, our numerically stabilized computational method enables the first systematic analysis of multimodal information flow in underexplored domains such as image captioning and chain-of-thought (CoT) reasoning. We apply FG-TRACER to two state-of-the-art MLLMs—LLaMA 3.2-Vision and LLaVA 1.5—across three vision–language benchmarks—TextVQA, COCO 2014, and ChartQA—and we conduct a word-level analysis of multimodal integration. Our findings uncover distinct patterns of multimodal fusion across models and tasks, demonstrating that fusion dynamics are both model- and task-dependent. Overall, FG-TRACER offers a robust methodology for probing the internal mechanisms of MLLMs in free-form settings, providing new insights into their multimodal reasoning strategies. Our source code is publicly available at https://anonymous.4open.science/r/FG-TRACER-CB5A/.

2026 Relazione in Atti di Convegno

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Authors: Caffagni, Davide.; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella.; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual … (Read full abstract)

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.

2026 Relazione in Atti di Convegno

Gradient-sign Masking for Task Vector Transport Across Pre-Trained Models

Authors: Rinaldi, Filippo; Panariello, Aniello; Salici, Giacomo; Liu, Fengyuan; Ciccone, Marco; Porrello, Angelo; Calderara, Simone

When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same … (Read full abstract)

When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same task was already tackled in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, these vectors often fail to transfer across different pre-trained models because their parameter spaces are misaligned. In this work, we show that successful transfer depends strongly on the gradient-sign structure of the new model. Based on this insight, we propose GradFix, which approximates the ideal sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: we only compute a few target-model gradients without parameter updates and mask the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. We further show that transporting task vectors improves multi-task and multi-source model merging. Code is available at https://github.com/fillo-rinaldi/GradFix.

2026 Relazione in Atti di Convegno

GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

Authors: D'Oronzio, Fabio; Putamorsi, Federico; Zini, Leonardo; Cornia, Marcella; Baraldi, Lorenzo

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those … (Read full abstract)

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using L2 loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism.

2026 Relazione in Atti di Convegno

Histological Brain Imaging Super-resolution with Frequency-guided Diffusion Models

Authors: Casari, Giovanni; Bolelli, Federico; Grana, Costantino

High-resolution histological imaging provides essential detail for quantitative brain modeling, yet acquiring whole-brain data at micrometer scale remains technically and … (Read full abstract)

High-resolution histological imaging provides essential detail for quantitative brain modeling, yet acquiring whole-brain data at micrometer scale remains technically and economically challenging. This work introduces Brain-SR, a diffusion-based super-resolution framework designed to reconstruct high-resolution cortical sections from low-resolution BigBrain data. Building upon the InvSR paradigm, our method performs resolution enhancement in the latent space of a pretrained variational autoencoder, guided by a task-specific noise-predictor network. A key contribution is a frequency-domain supervision term that compares the magnitude spectra of predicted and target patches, enforcing spectral consistency while remaining robust to local misalignments. Quantitative evaluations demonstrate that Brain-SR achieves substantial improvements in LPIPS (-27%) and FID (-58%) compared to baseline diffusion Super-Resolution, while spectral analysis confirms accurate recovery of the frequency distribution. The resulting reconstructions preserve neuronal structures consistent with high-resolution references, offering a practical step toward large-scale, morphologically faithful brain histology reconstruction. The code is publicly available to support reproducibility: https://github.com/AImageLab-zip/Brain-SR.

2026 Relazione in Atti di Convegno

HyperMIL: Hypergraph-based channel reasoning for Multiple Instance Learning on Multivariate Time Series

Authors: Del Gaudio, Livia; Cuculo, Vittorio; Cucchiara, Rita

Multivariate time series classification often relies on Multiple Instance Learning (MIL) due to the scarcity of fine-grained labels. However, existing … (Read full abstract)

Multivariate time series classification often relies on Multiple Instance Learning (MIL) due to the scarcity of fine-grained labels. However, existing MIL methods typically ignore high-order dependencies between channels, which are critical for capturing coordinated sensor dynamics. We propose HyperMIL, a framework that leverages hypergraph-based reasoning to model these complex interactions. HyperMIL constructs dynamic hypergraphs by mapping multivariate signals to self-learned latent prototypes, allowing the model to group channels into high-order hyperedges without a predefined topology. These enriched representations are then aggregated via a MIL pooling mechanism for bag-level classification. Our experiments demonstrate that HyperMIL achieves state-of-the-art performance across several benchmarks and provides interpretability by identifying key coordinated channel patterns.

2026 Relazione in Atti di Convegno

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

Authors: Cappelletti, Silvia; Poppi, Tobia; Poppi, Samuele; Yong, Zheng-Xin; Garcia-Olano, Diego; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Large Language Models (LLMs) are traditionally evaluated on multiple-choice question answering (MCQA) tasks using First-Token Probability (FTP), which selects the … (Read full abstract)

Large Language Models (LLMs) are traditionally evaluated on multiple-choice question answering (MCQA) tasks using First-Token Probability (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (misalignment) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (misinterpretation), undermining the reliability of symbolic evaluation. We propose a simple solution: output prefilling, a structured natural-language prefix (e.g., 'The correct option is:') prepended to the model output. Originally explored in AI safety as an attack strategy, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Through extensive evaluation, we find that the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our analysis suggests that prefilling is a simple, robust, and zero-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.

2026 Relazione in Atti di Convegno

Page 2 of 110 • Total publications: 1091