Publications - AImageLab

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Authors: Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality … (Read full abstract)

Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.

2024 Relazione in Atti di Convegno

IRIS

FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval

Authors: Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

2024 Relazione in Atti di Convegno

DOI IRIS

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Authors: Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both … (Read full abstract)

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need of object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.

2024 Articolo su rivista

DOI IRIS

Intelligent Multimodal Artificial Agents that Talk and Express Emotions

Authors: Rawal, Niyati; Maharjan, Rahul Singh; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

2024 Relazione in Atti di Convegno

IRIS

Is Multiple Object Tracking a Matter of Specialization?

Authors: Mancusi, Gianluca; Bernardi, Mattia; Panariello, Aniello; Porrello, Angelo; Cucchiara, Rita; Calderara, Simone

Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS

End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant … (Read full abstract)

End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference - where the model learns conflicting scene-specific parameters - and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce Parameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g, camera-viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22 demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code.

2024 Relazione in Atti di Convegno

IRIS

KRONC: Keypoint-based Robust Camera Optimization for 3D Car Reconstruction

Authors: Di Nucci, Davide; Simoni, Alessandro; Tomei, Matteo; Ciuffreda, Luca; Vezzani, Roberto; Cucchiara, Rita

2024 Relazione in Atti di Convegno

IRIS

Large-Scale Transformer models for Transactional Data

Authors: Garuti, F.; Luetto, S.; Sangineto, E.; Cucchiara, R.

Published in: CEUR WORKSHOP PROCEEDINGS

Following the spread of digital channels for everyday activities and electronic payments, huge collections of online transactions are available from … (Read full abstract)

Following the spread of digital channels for everyday activities and electronic payments, huge collections of online transactions are available from financial institutions. These transactions are usually organized as time series, i.e., a time-dependent sequence of tabular data, where each element of the series is a collection of heterogeneous fields (e.g., dates, amounts, categories, etc.). Transactions are usually evaluated by automated or semi-automated procedures to address financial tasks and gain insights into customers’ behavior. In the last years, many Trees-based Machine Learning methods (e.g., RandomForest, XGBoost) have been proposed for financial tasks, but they do not fully exploit in an end-to-end pipeline all the information richness of individual transactions, neither they fully model the underling temporal patterns. Instead, Deep Learning approaches have proven to be very effective in modeling complex data by representing them in a semantic latent space. In this paper, inspired by the multi-modal Deep Learning approaches used in Computer Vision and NLP, we propose UniTTab, an end-to-end Deep Learning Transformer model for transactional time series which can uniformly represent heterogeneous time-dependent data in a single embedding. Given the availability of large sets of tabular transactions, UniTTab defines a pre-training self-supervised phase to learn useful representations which can be employed to solve financial tasks such as churn prediction and loan default prediction. A strength of UniTTab is its flexibility since it can be adopted to represent time series of arbitrary length and composed of different data types in the fields. The flexibility of our model in solving different types of tasks (e.g., detection, classification, regression) and the possibility of varying the length of the input time series, from a few to hundreds of transactions, makes UniTTab a general-purpose Transformer architecture for bank transactions.

2024 Relazione in Atti di Convegno

IRIS

Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Authors: Bigazzi, Roberto; Baraldi, Lorenzo; Kousik, Shreyas; Cucchiara, Rita; Pavone, Marco

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

2024 Relazione in Atti di Convegno

DOI IRIS

Multi-Class Unlearning for Image Classification via Weight Filtering

Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods … (Read full abstract)

Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.

2024 Articolo su rivista

DOI IRIS

Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection

Authors: Betti, Federico; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

2024 Relazione in Atti di Convegno

IRIS

Publications by Rita Cucchiara

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Intelligent Multimodal Artificial Agents that Talk and Express Emotions

Is Multiple Object Tracking a Matter of Specialization?

KRONC: Keypoint-based Robust Camera Optimization for 3D Car Reconstruction

Large-Scale Transformer models for Transactional Data

Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Multi-Class Unlearning for Image Classification via Weight Filtering

Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection