Publications - AImageLab

Apprendere attraverso tempo, task e modelli: il trasferimento di conoscenza in sistemi in evoluzione (Learning across time, tasks, and models: knowledge transfer in evolving systems)

Authors: Panariello, Aniello

As artificial intelligence technologies become more widespread, modern learning systems operate in increasingly dynamic environments, in which data distributions, tasks, and objectives evolve over time. Traditional static paradigms struggle to keep pace with these changes, leading to performance degradation, loss of acquired knowledge, or inefficient retraining. Addressing these challenges requires mechanisms capable of transferring knowledge across temporal and structural dimensions, from sequential data to task streams, up to the reuse and combination of entire pre-existing models. This thesis analyzes how learning systems can evolve together with their environments by exploiting structured information and prior knowledge. The work is organized along three main directions: temporal data understanding, continual learning, and model composition, with the aim of understanding how information learned in one context can be reused or adapted in another. The first part is devoted to temporal learning from visual data, treating video streams as structured time series. It proposes a formulation based on temporal consistency for anomaly localization (CSL-TAL) without frame-level annotations, followed by probabilistic models and flow-based representations for multi-object tracking (TrackFlow), and an approach for estimating object distance from monocular vision (DistFormer), which integrates object-centric reasoning into temporal processes. Taken together, these studies show how temporal information can be exploited to obtain more general and interpretable visual representations. The second part addresses continual learning, in which data arrive as a stream. CHARON proposes an efficient framework for skeleton-based action recognition that combines masking and compression to optimize memory and stability. CGIL instead introduces a continual adaptation method for large vision-language models through generative latent replay, preserving zero-shot capabilities and enabling incremental prompt learning. These contributions reinterpret continual learning as a structured temporal progression, understood as a sequence of evolving tasks. The final part explores knowledge transfer between models through model merging and model arithmetic. Instead of adapting a single model over time, the goal is to combine pre-trained models to generate new capabilities. The PASTA framework shows how specialized components can be composed in parameter space to generalize across domains. Next, low-rank (MoDER and Core Space) and gradient-based (GradFix) techniques for merging models are analyzed, enabling the creation of new networks through direct operations on parameters. These approaches make it possible to synthesize task-specific networks, representing a new form of evolution in model space rather than in data space. Overall, the thesis offers a unified perspective on knowledge transfer in evolving systems. By connecting temporal learning, continual adaptation, and model composition, it reinterprets time-series analysis as a general principle of transfer between representations that change over time. The emerging picture highlights the role of structure, modularity, and reuse in building scalable, adaptive, and resilient learning systems, capable not only of interpreting a changing world but also of transforming themselves in response to it.

2026 Doctoral thesis

IRIS

Gradient-sign Masking for Task Vector Transport Across Pre-Trained Models

Authors: Rinaldi, Filippo; Panariello, Aniello; Salici, Giacomo; Liu, Fengyuan; Ciccone, Marco; Porrello, Angelo; Calderara, Simone

When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same task was already tackled in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, these vectors often fail to transfer across different pre-trained models because their parameter spaces are misaligned. In this work, we show that successful transfer depends strongly on the gradient-sign structure of the new model. Based on this insight, we propose GradFix, which approximates the ideal sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: we only compute a few target-model gradients without parameter updates and mask the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. We further show that transporting task vectors improves multi-task and multi-source model merging. Code is available at https://github.com/fillo-rinaldi/GradFix.
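
The masking step described in this abstract can be illustrated with a toy sketch. It assumes one plausible reading of "gradient-sign masking": an entry of the source task vector is kept only when its sign opposes the averaged target-model gradient, so that the masked update cannot increase the loss to first order. The function names, the averaging over a few samples, and the flat-list parameter layout are all hypothetical and are not taken from the GradFix code base.

```python
def mean_gradient(per_sample_grads):
    """Average a handful of per-sample gradients (one flat list per sample)."""
    n = len(per_sample_grads[0])
    k = len(per_sample_grads)
    return [sum(g[i] for g in per_sample_grads) / k for i in range(n)]


def sign_mask_task_vector(task_vector, grad):
    """Zero out task-vector entries that point in the same direction as the gradient.

    Assumption: an entry t is kept only if t * g < 0, i.e. it moves the
    parameter against the target gradient (a descent direction).
    """
    return [t if t * g < 0 else 0.0 for t, g in zip(task_vector, grad)]


def first_order_change(update, grad):
    """First-order estimate of the loss change: the inner product <grad, update>."""
    return sum(u * g for u, g in zip(update, grad))


# Toy source task vector and two target-model gradients (made-up numbers).
tau = [0.5, -0.2, 0.3, -0.4]
grads = [[-1.0, -0.5, 0.2, 0.1], [-0.6, -0.1, 0.4, 0.3]]

g = mean_gradient(grads)
masked = sign_mask_task_vector(tau, g)
print(masked)  # [0.5, 0.0, 0.0, -0.4]
print(first_order_change(masked, g) <= 0)  # True
```

Every kept entry contributes a non-positive term to the inner product with the gradient, which is why the masked update is a first-order descent direction in this sketch. No parameters of the target model are updated when computing the mask.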

2026 Conference proceedings paper

IRIS

Accurate and Efficient Low-Rank Model Merging in Core Space

Authors: Panariello, Aniello; Marczak, Daniel; Magistri, Simone; Porrello, Angelo; Twardowski, Bartłomiej; Bagdanov, Andrew D.; Calderara, Simone; Van De Weijer, Joost

2025 Conference proceedings paper

IRIS

Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Authors: Mosconi, Matteo; Sorokin, Andriy; Panariello, Aniello; Porrello, Angelo; Bonato, Jacopo; Cotogni, Marco; Sabetta, Luigi; Calderara, Simone; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at https://github.com/Sperimental3/CHARON.
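
As a rough illustration of the uniform sampling and interpolation mentioned in this abstract, the sketch below keeps every k-th skeleton frame and later rebuilds the full-length sequence by linear interpolation. How CHARON actually applies these steps is not specified here, so the pairing of the two functions, the names `uniform_sample` and `linear_interpolate`, and the flat per-frame coordinate layout are assumptions. The masking-based training stage is not covered.

```python
def uniform_sample(frames, keep_every):
    """Keep one frame out of every `keep_every` (frames 0, k, 2k, ...)."""
    return frames[::keep_every]


def linear_interpolate(samples, keep_every, length):
    """Rebuild `length` frames from samples taken every `keep_every` frames.

    Each frame is a flat list of joint coordinates. Frames past the last
    sample are filled by repeating that sample.
    """
    last = len(samples) - 1
    out = []
    for t in range(length):
        lo = min(t // keep_every, last)
        hi = min(lo + 1, last)
        # Fractional position of frame t between its two neighbouring samples.
        w = (t - lo * keep_every) / keep_every if hi != lo else 0.0
        out.append([(1 - w) * a + w * b for a, b in zip(samples[lo], samples[hi])])
    return out


# Toy sequence: 9 frames, 2 coordinates each, moving linearly over time.
frames = [[float(t), 2.0 * t] for t in range(9)]
stored = uniform_sample(frames, keep_every=4)  # 3 frames instead of 9
rebuilt = linear_interpolate(stored, keep_every=4, length=9)
print(len(stored), rebuilt == frames)  # 3 True
```

The toy motion is linear, so the reconstruction is exact. Real skeleton sequences would only be approximated, with the error growing as `keep_every` increases.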

2025 Conference proceedings paper

DOI IRIS

Modular embedding recomposition for incremental learning

Authors: Panariello, Aniello; Frascaroli, Emanuele; Buzzega, Pietro; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone

2025 Conference proceedings paper

IRIS

Monocular per-object distance estimation with Masked Object Modeling

Authors: Panariello, Aniello; Mancusi, Gianluca; Haj Ali, Fedy; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

2025 Journal article

DOI IRIS

CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Authors: Frascaroli, Emanuele; Panariello, Aniello; Buzzega, Pietro; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone

With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a prevalent strategy in Continual Learning. This has led to the development of numerous prompting strategies to adapt transformer-based models without incurring catastrophic forgetting. However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that significantly deviate from the pre-training data. In this work, we propose Continual Generative training for Incremental prompt-Learning, a simple and novel approach to mitigate forgetting while adapting CLIP. Briefly, we employ Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder. We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks. Through extensive experiments on different domains, we show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities, evaluated using a novel metric tailored for CL scenarios. Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.

2024 Conference proceedings paper

IRIS

Is Multiple Object Tracking a Matter of Specialization?

Authors: Mancusi, Gianluca; Bernardi, Mattia; Panariello, Aniello; Porrello, Angelo; Cucchiara, Rita; Calderara, Simone

Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS

End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference - where the model learns conflicting scene-specific parameters - and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce Parameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g., camera viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22, demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code.
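
Combining expert modules in parameter space can be pictured with the minimal sketch below. It assumes each PEFT module is stored as a parameter delta of the same shape as the base weights and that the selected modules are merged by a weighted average before inference. The merging rule, the attribute names, and the function `compose_modules` are hypothetical and may differ from what PASTA actually does.

```python
def compose_modules(base, deltas, weights=None):
    """Add a weighted sum of module deltas to the base parameters.

    base:    flat list of base-model parameters
    deltas:  list of flat lists, one parameter delta per selected module
    weights: optional mixing weights (defaults to a uniform average)
    """
    if weights is None:
        weights = [1.0 / len(deltas)] * len(deltas)
    merged = list(base)
    for w, delta in zip(weights, deltas):
        for i, d in enumerate(delta):
            merged[i] += w * d
    return merged


# Hypothetical library of attribute-specific modules (toy numbers).
library = {
    "camera_viewpoint=high": [0.5, 0.0],
    "lighting=night": [0.0, -1.0],
    "lighting=day": [0.0, 1.0],
}

base = [1.0, 2.0]
# Pick the modules matching the attributes of the new scene.
selected = [library["camera_viewpoint=high"], library["lighting=night"]]
merged = compose_modules(base, selected)
print(merged)  # [1.25, 1.5]
```

Because the modules are folded into a single set of weights, the merged model has the same size as the base model, which is consistent with the abstract's claim that inference time does not increase.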

2024 Conference proceedings paper

IRIS

Consistency-Based Self-supervised Learning for Temporal Anomaly Localization

Authors: Panariello, A.; Porrello, A.; Calderara, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2023 Conference proceedings paper

DOI IRIS

TrackFlow: Multi-Object Tracking with Normalizing Flows

Authors: Mancusi, Gianluca; Panariello, Aniello; Porrello, Angelo; Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

The field of multi-object tracking has recently seen a renewed interest in the good old scheme of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim to extend tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information, e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they assume these costs are independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
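
The idea of using a negative log-likelihood as the association cost can be sketched as follows. A correlated bivariate Gaussian stands in for the deep density estimator (TrackFlow uses normalizing flows, and conditions the density on context, neither of which is reproduced here). The two cues, the Gaussian parameters, and the brute-force assignment are illustrative assumptions only.

```python
import math
from itertools import permutations


def gaussian_nll(x, y, mean, std, rho):
    """NLL of (x, y) under a bivariate Gaussian with correlation rho.

    The correlation term lets the two cues be modelled jointly rather
    than as independent costs.
    """
    zx = (x - mean[0]) / std[0]
    zy = (y - mean[1]) / std[1]
    one_minus = 1.0 - rho * rho
    log_norm = math.log(2.0 * math.pi * std[0] * std[1] * math.sqrt(one_minus))
    return log_norm + (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus)


def best_assignment(cost):
    """Brute-force minimum-cost matching for a small square cost matrix."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))


# cues[i][j] = (1 - IoU, depth difference) between track i and detection j.
cues = [[(0.10, 0.2), (0.80, 2.0)],
        [(0.90, 1.5), (0.15, 0.3)]]

# Hypothetical density of correct associations, e.g. fitted on training data.
mean, std, rho = (0.1, 0.2), (0.1, 0.3), 0.5

cost = [[gaussian_nll(x, y, mean, std, rho) for (x, y) in row] for row in cues]
assignment = best_assignment(cost)
print(assignment)  # (0, 1): track 0 -> detection 0, track 1 -> detection 1
```

A single learned density replaces hand-tuned weights between the IoU cost and the depth cost. In practice the brute-force search would be replaced by a proper assignment solver such as the Hungarian algorithm.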

2023 Conference proceedings paper

DOI IRIS
