Publications - AImageLab

Learning to Describe Salient Objects in Images through Vision and Language (original title: Imparare a descrivere gli oggetti salienti presenti nelle immagini tramite la visione e il linguaggio)

Authors: Cornia, Marcella

Replicating the human ability to connect vision and language has recently gained considerable attention in computer vision and artificial intelligence, resulting in new models and architectures capable of automatically describing images with textual sentences. This task, called “image captioning”, requires not only recognizing the salient objects in an image and understanding their interactions, but also expressing them in natural language. This thesis presents state-of-the-art solutions to these problems, addressing all the aspects involved in generating textual descriptions. Indeed, when humans describe a scene, they look at an object before naming it in a sentence. This happens thanks to selective mechanisms that draw the human gaze to the salient and relevant parts of the scene. Motivated by the importance of automatically estimating the focus of human attention on images, the first part of this dissertation introduces two different saliency prediction models based on neural networks. The first model uses a combination of visual features extracted at different levels of a convolutional neural network to estimate the saliency of an image. The second model instead uses a recurrent architecture together with neural attentive mechanisms that focus on the most salient regions of the image in order to iteratively refine the predicted saliency map. Although saliency prediction identifies the most relevant regions of an image, it has never been incorporated into an architecture for automatic natural language description.
This thesis therefore also shows how to incorporate saliency prediction to improve the quality of image descriptions, and introduces a model that considers both the salient regions and the context of the image while generating the textual description. Inspired by the recent spread of fully-attentive models, the thesis also investigates the use of the Transformer model for automatic image description and proposes a new architecture that completely abandons the recurrent networks previously used in this context. Classical captioning approaches provide no control over which image regions are described and how much importance is given to each of them. This lack of controllability limits the applicability of captioning algorithms to complex scenarios in which some form of control over the generation process is needed. To address these problems, a model is presented that can generate diverse natural language descriptions based on a control signal given as a set of image regions that must be described. On a different line, the thesis also explores the possibility of naming the characters that appear in movies with their proper names, which again requires a certain degree of controllability over the captioning model. The last part of the thesis presents solutions for “cross-modal retrieval”, another task that combines vision and language and that consists of finding the images corresponding to a textual query and vice versa. Finally, the application of these retrieval techniques to cultural heritage and digital humanities is shown, obtaining promising results with both supervised and unsupervised models.

2020 Doctoral thesis

Meshed-Memory Transformer for Image Captioning

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
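One possible reading of "integrating learned a priori knowledge" is an attention operator whose keys and values are extended with extra learned slots. The sketch below illustrates that idea in plain Python under this assumption; the function name `memory_augmented_attention`, the toy vectors, and the fixed memory values are hypothetical and are not taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_augmented_attention(queries, keys, values, mem_keys, mem_values):
    """Scaled dot-product attention over region keys/values plus memory slots.

    mem_keys and mem_values stand in for learned parameters (assumption):
    they do not depend on the input image, so they can carry prior knowledge.
    """
    all_keys = keys + mem_keys        # list concatenation: regions, then memory
    all_values = values + mem_values
    scale = math.sqrt(len(queries[0]))
    dim = len(all_values[0])
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in all_keys])
        outputs.append([sum(wi * v[d] for wi, v in zip(w, all_values))
                        for d in range(dim)])
        weights.append(w)
    return outputs, weights


# Toy input: two image regions with 2-d features and a single memory slot.
regions = [[1.0, 0.0], [0.0, 1.0]]
mem_k = [[0.5, 0.5]]
mem_v = [[2.0, 2.0]]
outputs, weights = memory_augmented_attention(regions, regions, regions, mem_k, mem_v)
print(len(weights[0]))  # each query attends over 2 regions + 1 memory slot
```

In a trained model the memory slots would be optimized together with the other weights; here they are constants so that the example runs on its own.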

2020 Conference proceedings paper

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The ability to generate natural language explanations conditioned on visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performance on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.

2020 Conference proceedings paper

Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.

2019 Conference proceedings paper

Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.

2019 Conference proceedings paper

Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

State-of-the-art Computer Vision pipelines perform poorly on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic and real images, on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results in the case of the translation of landscapes, portraits and of paintings coming from four different styles using automatic distance metrics. Also, we analyze the response of pre-trained architectures for classification, detection and segmentation both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we also provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.

2019 Conference proceedings paper

M-VAD Names: a Dataset for Video Captioning with Naming

Authors: Pini, Stefano; Cornia, Marcella; Bolelli, Federico; Baraldi, Lorenzo; Cucchiara, Rita

Published in: MULTIMEDIA TOOLS AND APPLICATIONS

Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag. The lack of movie description datasets with characters' visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.

2019 Journal article

Recognizing social relationships from an egocentric vision perspective

Authors: Alletto, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, and their head pose and 3D location are estimated. Following the formalism of the f-formation, we use orientation and distance to define an inherently social pairwise feature capable of describing how two people stand in relation to one another. We present a Structural SVM based approach to learn how to weight each component of the feature vector depending on the social situation it is applied to. To better understand the social dynamics, we also estimate what we call the social relevance of each subject in a group using a saliency attentive model. Extensive tests on two publicly available datasets show that our solution achieves encouraging results when detecting social groups and their relevant subjects in challenging egocentric scenarios.
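To make the idea of a pairwise feature built from orientation and distance concrete, the following is a minimal sketch in plain Python. The parametrization (ground-plane positions, head yaw in radians, three feature components) and the name `pairwise_social_feature` are assumptions chosen for illustration; the chapter's actual feature definition may differ.

```python
import math


def angle_diff(a, b):
    """Absolute difference between two angles, wrapped into [0, pi]."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)


def pairwise_social_feature(pos_a, yaw_a, pos_b, yaw_b):
    """Return (distance, a_off_axis, b_off_axis) for two people.

    pos_*: (x, y) position on the ground plane (hypothetical units).
    yaw_*: head orientation in radians, 0 meaning "facing +x".
    *_off_axis: how far each person's gaze is from pointing at the other.
    """
    dx = pos_b[0] - pos_a[0]
    dy = pos_b[1] - pos_a[1]
    distance = math.hypot(dx, dy)
    bearing_ab = math.atan2(dy, dx)    # direction from A towards B
    bearing_ba = math.atan2(-dy, -dx)  # direction from B towards A
    return (distance,
            angle_diff(yaw_a, bearing_ab),
            angle_diff(yaw_b, bearing_ba))


# Two people 2 units apart, first facing each other, then back to back.
facing = pairwise_social_feature((0.0, 0.0), 0.0, (2.0, 0.0), math.pi)
back_to_back = pairwise_social_feature((0.0, 0.0), math.pi, (2.0, 0.0), 0.0)
print(facing)
print(back_to_back)
```

A learned weighting, such as the Structural SVM mentioned above, would then decide how much each component matters; the sketch stops at computing the raw feature.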

2019 Book chapter/Essay

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.

2019 Conference proceedings paper

Towards Cycle-Consistent Models for Text and Image Retrieval

Authors: Cornia, Marcella; Baraldi, Lorenzo; Rezazadegan Tavakoli, Hamed; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Cross-modal retrieval has recently become a research hot spot, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space into which text and images can be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.
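A cycle-consistency criterion can be sketched as follows: translating a text feature to the image domain and back should return the original feature, and likewise for an image feature. The plain-Python example below uses linear maps as stand-ins for the two translators; the names, the squared-error penalty, and the toy matrices are assumptions for illustration, not the paper's model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_consistency_loss(text_vec, image_vec, text_to_image, image_to_text):
    """Penalty for text -> image -> text and image -> text -> image round trips."""
    text_cycle = matvec(image_to_text, matvec(text_to_image, text_vec))
    image_cycle = matvec(text_to_image, matvec(image_to_text, image_vec))
    return sq_dist(text_cycle, text_vec) + sq_dist(image_cycle, image_vec)


text_vec = [1.0, 2.0]
image_vec = [3.0, 1.0]
text_to_image = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical translator weights
exact_inverse = [[0.5, 0.0], [0.0, 0.5]]   # undoes text_to_image exactly
identity = [[1.0, 0.0], [0.0, 1.0]]        # does not undo it

loss_inverse = cycle_consistency_loss(text_vec, image_vec, text_to_image, exact_inverse)
loss_mismatch = cycle_consistency_loss(text_vec, image_vec, text_to_image, identity)
print(loss_inverse)   # 0.0
print(loss_mismatch)  # 15.0
```

The penalty vanishes only when the two translators invert each other on the given features, which is what makes it usable as a regularizer alongside a retrieval objective.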

2019 Conference proceedings paper
