Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Cucchiara, Rita

Published in: IEEE SIGNAL PROCESSING LETTERS

Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. Taking a different approach, in our work we leverage human semantic knowledge and investigate whether it can be incorporated into fake image detection frameworks. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.

2024 Journal article

Video Surveillance and Privacy: A Solvable Paradox?

Authors: Cucchiara, Rita; Baraldi, Lorenzo; Cornia, Marcella; Sarto, Sara

Published in: COMPUTER

Video Surveillance started decades ago to remotely monitor specific areas and allow control by human inspectors. Later, Computer Vision gradually replaced human monitoring, first through motion alerts and now with Deep Learning techniques. From the beginning of this journey, people have worried about the risk of privacy violations. This article surveys the main steps of Computer Vision in Video Surveillance, from early approaches for people detection and tracking to action analysis and language description, outlining the most relevant research directions for dealing with privacy concerns. We show that the relationship between Video Surveillance and privacy is a biased paradox, since surveillance provides increased safety but does not necessarily require identifying people. Through experiments on action recognition and natural language description, we showcase that the paradox of surveillance and privacy can be solved by Artificial Intelligence and that respecting human rights is not an impossible chimera.

2024 Journal article

What’s Outside the Intersection? Fine-grained Error Analysis for Semantic Segmentation Beyond IoU

Authors: Bernhard, Maximilian; Amoroso, Roberto; Kindermann, Yannic; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker; Schubert, Matthias

2024 Conference paper

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Authors: Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to work beyond the pure textual modality. While research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Using this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.

2024 Conference paper

CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle Components

Authors: Di Nucci, D.; Simoni, A.; Tomei, M.; Ciuffreda, L.; Vezzani, R.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly effective technique for representing 3D reconstructions of objects and scenes derived from sets of images. Despite their efficiency, NeRF models can pose challenges in certain scenarios such as vehicle inspection, where the lack of sufficient data or the presence of challenging elements (e.g. reflections) strongly impact the accuracy of the reconstruction. To this aim, we introduce CarPatch, a novel synthetic benchmark of vehicles. In addition to a set of images annotated with their intrinsic and extrinsic camera parameters, the corresponding depth maps and semantic segmentation masks have been generated for each view. Global and part-based metrics have been defined and used to evaluate, compare, and better characterize some state-of-the-art techniques. The dataset is publicly released at https://aimagelab.ing.unimore.it/go/carpatch and can be used as an evaluation guide and as a baseline for future work on this challenging topic.

2023 Conference paper

Consistency-Based Self-supervised Learning for Temporal Anomaly Localization

Authors: Panariello, A.; Porrello, A.; Calderara, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2023 Conference paper

Deep Learning and Large Scale Models for Bank Transactions

Authors: Garuti, Fabrizio; Luetto, Simone; Cucchiara, Rita; Sangineto, Enver

Published in: CEUR WORKSHOP PROCEEDINGS

The success of Artificial Intelligence (AI) in different research and application areas has increased the interest in adopting Deep Learning techniques also in the financial field. Particularly interesting is the case of financial transactional data, which represent one of the most valuable sources of information for banks and other financial institutes. However, the heterogeneity of the data, composed of both numerical and categorical attributes, makes the use of standard Deep Learning methods difficult. In this paper, we present UniTTAB, a Transformer network for transactional time series, which can uniformly represent heterogeneous time-dependent data and which is trained on a very large scale of real transactional data. As far as we know, the dataset we used for training is the largest real bank transactions dataset used for Deep Learning methods in this field, as all other common datasets are either much smaller or synthetically generated. The use of this very large real training dataset makes our UniTTAB the first foundation model for transactional data.

2023 Conference paper

Depth-based 3D human pose refinement: Evaluating the refinet framework

Authors: D'Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In recent years, Human Pose Estimation has achieved impressive results on RGB images. The advent of deep learning architectures and large annotated datasets has contributed to these achievements. However, little has been done towards estimating the human pose using depth maps, and especially towards obtaining a precise 3D body joint localization. To fill this gap, this paper presents RefiNet, a depth-based 3D human pose refinement framework. Given a depth map and an initial coarse 2D human pose, RefiNet regresses a fine 3D pose. The framework is composed of three modules, based on different data representations, i.e. 2D depth patches, 3D human skeletons, and point clouds. An extensive experimental evaluation is carried out to investigate the impact of the model hyper-parameters and to compare RefiNet with off-the-shelf 2D methods and literature approaches. Results confirm the effectiveness of the proposed framework and its limited computational requirements.

2023 Journal article

Embodied Agents for Efficient Exploration and Smart Scene Description

Authors: Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last few years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlation between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.

2023 Conference paper

Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval

Authors: Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2023 Conference paper

Page 9 of 52 • Total publications: 517