Publications by Lorenzo Baraldi
Explore our research publications: papers, articles, and conference proceedings from AImageLab.
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Authors: Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita
Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets of noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
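As a rough illustration of the style-token idea described in the abstract above, the sketch below builds a text prefix from a style token and a list of retrieved keywords. The token strings, the prompt layout, and the function name `build_prompt` are assumptions made for this example; the abstract does not specify them, and the retrieval step is replaced by a hard-coded keyword list.

```python
from typing import List

# Hypothetical style tokens, one per kind of data source (names are illustrative).
STYLE_TOKENS = {"human": "<style:human>", "web": "<style:web>"}


def build_prompt(keywords: List[str], style: str) -> str:
    """Concatenate a style token and retrieved keywords into a text prefix."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style: {style!r}")
    return f"{STYLE_TOKENS[style]} keywords: {', '.join(keywords)} caption:"


# Same keywords under two style conditions: a web-collected training sample,
# and an inference request that asks for the human-annotated style.
train_prompt = build_prompt(["dog", "frisbee", "park"], style="web")
infer_prompt = build_prompt(["dog", "frisbee", "park"], style="human")
```

Keeping the style token separate from the keywords is what would let a model trained on mixed sources be asked for the cleaner style at inference time, at least under the assumptions of this sketch.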
Intelligent Multimodal Artificial Agents that Talk and Express Emotions
Authors: Rawal, Niyati; Maharjan, Rahul Singh; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo
Multi-Class Unlearning for Image Classification via Weight Filtering
Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Published in: IEEE INTELLIGENT SYSTEMS
Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.
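The weight-modulation idea in the abstract above can be pictured with a toy example. The sketch below assumes a memory matrix with one row per class, where each entry is a gate in [0, 1] applied to one output unit of a layer. The matrix layout, the gating rule, and the name `filter_weights` are assumptions for illustration and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def filter_weights(weights: Matrix, memory: Matrix, forget_class: int) -> Matrix:
    """Scale each output row of `weights` by the memory gate for `forget_class`."""
    gates = memory[forget_class]
    if len(gates) != len(weights):
        raise ValueError("one gate per output unit is required")
    return [[g * w for w in row] for g, row in zip(gates, weights)]


# Toy layer with 3 output units and 2 inputs.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical memory matrix: row c holds the gates used to forget class c.
memory = [
    [0.0, 1.0, 1.0],  # forgetting class 0 switches off unit 0
    [1.0, 0.0, 1.0],  # forgetting class 1 switches off unit 1
]
filtered = filter_weights(weights, memory, forget_class=1)
```

In this toy setting the original weights are never modified, so any class can be selected for forgetting after training by picking a different row of the memory matrix.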
Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection
Authors: Betti, Federico; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images
Authors: Amoroso, Roberto; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Del Bimbo, Alberto; Cucchiara, Rita
Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and put new pressure on fake image detection. In this work, we pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively, extracted from CLIP-based models and ResNet or Vision Transformer (ViT)-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized by different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2 million images generated from the original COCO image–caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments
Authors: Barsellotti, Luca; Bigazzi, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task called Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.
Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
Authors: Bucciarelli, Davide; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs (such as GPT-4V and Gemini), which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning them for specific domains while keeping their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.
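Of the fine-tuning methods named in the abstract above, low-rank adaptation is the easiest to show in a few lines. The sketch below uses the standard formulation, in which a frozen weight matrix W is adapted as W + scale * B A with B and A of rank r. The function names, matrix sizes, and scale value are illustrative assumptions and do not reflect the paper's experimental setup.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Return the adapted weight W + scale * (B @ A); W itself stays frozen."""
    delta = matmul(b, a)
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters for full fine-tuning versus a rank-`rank` update."""
    return d_out * d_in, rank * (d_out + d_in)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
b = [[1.0], [2.0]]             # 2 x 1 factor (rank 1)
a = [[0.5, 0.0]]               # 1 x 2 factor (rank 1)
adapted = lora_weight(w, b, a)
full, low_rank = lora_param_counts(4096, 4096, rank=8)
```

For a 4096 x 4096 layer at rank 8, the update trains 65,536 parameters instead of 16,777,216, which is why this family of methods is attractive for adapting large models to a new domain.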
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
Authors: Moratelli, Nicholas; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics.
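For readers unfamiliar with the CLIP-Score metric mentioned in the abstract above, the sketch below computes its reference-free form, w * max(cos(image, text), 0) with w = 2.5, from two embedding vectors. The embeddings are assumed to be precomputed by a CLIP model and are replaced here by toy vectors. This illustrates the metric only, not the DiCO training procedure.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def clip_score(image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5) -> float:
    """Reference-free CLIP-Score: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy embeddings standing in for CLIP image and caption features.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])
```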