Publications - AImageLab

DAS-MIL: Distilling Across Scales for MIL Classification of Histological WSIs

Authors: Bontempo, Gianpaolo; Porrello, Angelo; Bolelli, Federico; Calderara, Simone; Ficarra, Elisa

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The adoption of Multi-Instance Learning (MIL) for classifying Whole-Slide Images (WSIs) has increased in recent years. Indeed, pixel-level annotation of gigapixel WSIs is mostly unfeasible and time-consuming in practice. For this reason, MIL approaches have been profitably integrated with the most recent deep-learning solutions for WSI classification to support clinical practice and diagnosis. Nevertheless, the majority of such approaches overlook the multi-scale nature of WSIs; the few existing hierarchical MIL proposals simply flatten the multi-scale representations by concatenating or summing feature vectors, neglecting the spatial structure of the WSI. Our work aims to unleash the full potential of pyramidally structured WSIs; to do so, we propose a graph-based multi-scale MIL approach, termed DAS-MIL, that exploits message passing to let information flow across multiple scales. By means of a knowledge distillation scheme, the alignment between the latent space representations at different resolutions is encouraged while preserving the diversity in the informative content. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining +1.9% AUC and +3.3% accuracy on the popular Camelyon16 benchmark.

2023 Conference proceedings paper

DOI IRIS

Deep Learning and Large Scale Models for Bank Transactions

Authors: Garuti, Fabrizio; Luetto, Simone; Cucchiara, Rita; Sangineto, Enver

Published in: CEUR WORKSHOP PROCEEDINGS

The success of Artificial Intelligence (AI) in different research and application areas has increased the interest in adopting Deep Learning techniques in the financial field as well. Particularly interesting is the case of financial transactional data, which represent one of the most valuable sources of information for banks and other financial institutions. However, the heterogeneity of the data, composed of both numerical and categorical attributes, makes the use of standard Deep Learning methods difficult. In this paper, we present UniTTAB, a Transformer network for transactional time series, which can uniformly represent heterogeneous time-dependent data, and which is trained on a very large collection of real transactional data. As far as we know, the dataset we used for training is the largest real bank transactions dataset used for Deep Learning methods in this field, since all the other common datasets are either much smaller or synthetically generated. The use of this very large real training dataset makes our UniTTAB the first foundation model for transactional data.

2023 Conference proceedings paper

IRIS

Depth-based 3D human pose refinement: Evaluating the RefiNet framework

Authors: D'Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In recent years, Human Pose Estimation has achieved impressive results on RGB images. The advent of deep learning architectures and large annotated datasets have contributed to these achievements. However, little has been done towards estimating the human pose using depth maps, and especially towards obtaining a precise 3D body joint localization. To fill this gap, this paper presents RefiNet, a depth-based 3D human pose refinement framework. Given a depth map and an initial coarse 2D human pose, RefiNet regresses a fine 3D pose. The framework is composed of three modules, based on different data representations, i.e. 2D depth patches, 3D human skeletons, and point clouds. An extensive experimental evaluation is carried out to investigate the impact of the model hyper-parameters and to compare RefiNet with off-the-shelf 2D methods and literature approaches. Results confirm the effectiveness of the proposed framework and its limited computational requirements.

2023 Journal article

DOI IRIS

Detecting Morphing Attacks via Continual Incremental Training

Authors: Pellegrini, Lorenzo; Borghi, Guido; Franco, Annalisa; Maltoni, Davide

Scenarios in which restrictions on data transfer and storage limit the possibility of composing a single dataset (possibly drawing on different data sources) for batch-based training make the development of robust models particularly challenging. We hypothesize that the recent Continual Learning (CL) paradigm may represent an effective solution to enable incremental training, even across multiple sites. Indeed, a basic assumption of CL is that once a model has been trained, old data can no longer be used in successive training iterations and in principle can be deleted. Therefore, in this paper, we investigate the performance of different Continual Learning methods in this scenario, simulating a learning model that is updated every time a new chunk of data, even of variable size, is available. Experimental results reveal that a particular CL method, namely Learning without Forgetting (LwF), is one of the best-performing algorithms. We then investigate its usage and parametrization in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that becomes available.

2023 Conference proceedings paper

DOI IRIS

Embodied Agents for Efficient Exploration and Smart Scene Description

Authors: Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The development of embodied agents that can communicate with humans in natural language has gained increasing interest in recent years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlations between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.

2023 Conference proceedings paper

DOI IRIS

Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval

Authors: Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2023 Conference proceedings paper

DOI IRIS

Enhancing PFI Prediction with GDS-MIL: A Graph-based Dual Stream MIL Approach

Authors: Bontempo, Gianpaolo; Bartolini, Nicola; Lovino, Marta; Bolelli, Federico; Virtanen, Anni; Ficarra, Elisa

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Whole-Slide Images (WSIs) are emerging as a promising resource for studying biological tissues, demonstrating great potential in aiding cancer diagnosis and improving patient treatment. However, the manual pixel-level annotation of WSIs is extremely time-consuming and practically unfeasible in real-world scenarios. Multi-Instance Learning (MIL) has gained attention as a weakly supervised approach able to address tasks that lack annotations. MIL models aggregate patches (e.g., crops of a WSI) into bag-level representations (e.g., WSI label), but neglect the spatial information of the WSIs, which is crucial for histological analysis. In the High-Grade Serous Ovarian Cancer (HGSOC) context, spatial information is essential to predict a prognosis indicator (the Platinum-Free Interval, PFI) from WSIs. Such a prediction would bring highly valuable insights both for patient treatment and prognosis of chemotherapy resistance. Indeed, NeoAdjuvant ChemoTherapy (NACT) induces changes in tumor tissue morphology and composition, making the prediction of PFI from WSIs extremely challenging. In this paper, we propose GDS-MIL, a method that integrates a state-of-the-art MIL model with a Graph ATtention layer (GAT in short) to inject a local context into each instance before MIL aggregation. Our approach achieves a significant improvement in accuracy on the "Ome18" PFI dataset. In summary, this paper presents a novel solution for enhancing PFI prediction in HGSOC, with the potential of significantly improving treatment decisions and patient outcomes.
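
As a rough illustration of how a MIL model can aggregate patch-level features into a single bag-level representation, the following minimal sketch applies a generic attention-style pooling. It is not the GDS-MIL architecture: the function names, the fixed scoring vector `w`, and the toy patch features are all assumptions made for illustration, and a real model would learn the scoring parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, w):
    """Aggregate instance (patch) features into one bag-level vector.

    instances: list of equal-length feature vectors, one per patch.
    w: hypothetical scoring vector (learned in a real model, fixed here).
    Returns the bag vector and the attention weight of each instance.
    """
    # Score each instance with a dot product, then normalise the scores.
    scores = [sum(x * wi for x, wi in zip(inst, w)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    # The bag representation is the attention-weighted sum of the instances.
    bag = [
        sum(wt * inst[d] for wt, inst in zip(weights, instances))
        for d in range(dim)
    ]
    return bag, weights


# Toy bag of three patches with 2-dimensional features (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bag, weights = attention_pool(patches, w=[2.0, 0.0])
```

Note that this kind of pooling treats the bag as an unordered set, which is the limitation the abstract points to: nothing in it encodes where each patch sits in the slide.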

2023 Conference proceedings paper

DOI IRIS

Evaluating synthetic pre-Training for handwriting processing tasks

Authors: Pippi, V.; Cascianelli, S.; Baraldi, L.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objective. The vector representations of the images obtained from the pre-trained ConvNet can then be considered as encodings of the handwriting style. We exploit such representations for Writer Retrieval, Writer Identification, Writer Verification, and Writer Classification and demonstrate that our pre-training strategy allows extracting rich representations of the writers' style that enable the aforementioned tasks with competitive results with respect to task-specific State-of-the-Art approaches. © 2023 Elsevier B.V. All rights reserved.

2023 Journal article

DOI IRIS

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Authors: Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: SENSORS

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
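
The kNN access to an external textual memory mentioned in the abstract can be pictured with the minimal sketch below. It is only an illustration under stated assumptions: the cosine-similarity scoring, the function names, the 3-dimensional embeddings, and the example descriptions are all hypothetical and are not taken from the paper, which may index and query its memory differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_retrieve(query, memory, k):
    """Return the k memory texts whose keys are most similar to the query.

    memory: list of (key_vector, text) pairs; results are ordered best first.
    """
    scored = [(cosine_similarity(query, key), text) for key, text in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical memory of (embedding, description) pairs with made-up values.
memory = [
    ([1.0, 0.0, 0.0], "long-sleeve cotton shirt"),
    ([0.0, 1.0, 0.0], "leather ankle boots"),
    ([0.9, 0.1, 0.0], "striped linen shirt"),
]
query = [1.0, 0.05, 0.0]  # stands in for an image-derived query embedding
retrieved = knn_retrieve(query, memory, k=2)
```

In a captioning model of the kind described, the retrieved texts would then be encoded and attended to through cross-attention; that step is omitted here.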

2023 Journal article

DOI IRIS

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Since 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

2023 Journal article

DOI IRIS