Publications by Marcella Cornia

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Tip: type @ to pick an author and # to pick a keyword.

Active filters (Clear): Author: Marcella Cornia

Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

State-of-the-art Computer Vision pipelines show poor performances on artworks and data coming from the artistic domain, thus limiting the applicability … (Read full abstract)

State-of-the-art Computer Vision pipelines show poor performances on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of the cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic and real images, on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results in the case of the translation of landscapes, portraits and of paintings coming from four different styles using automatic distance metrics. Also, we analyze the response of pre-trained architecture for classification, detection and segmentation both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we also provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.

2019 Relazione in Atti di Convegno

M-VAD Names: a Dataset for Video Captioning with Naming

Authors: Pini, Stefano; Cornia, Marcella; Bolelli, Federico; Baraldi, Lorenzo; Cucchiara, Rita

Published in: MULTIMEDIA TOOLS AND APPLICATIONS

Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" … (Read full abstract)

Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag. The lack of movie description datasets with characters' visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.

2019 Articolo su rivista

Recognizing social relationships from an egocentric vision perspective

Authors: Alletto, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the … (Read full abstract)

In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, their head pose and 3D location are estimated. Following the formalism of the f-formation, we define with the orientation and distance an inherently social pairwise feature capable of describing how two people stand in relation to one another. We present a Structural SVM based approach to learn how to weight each component of the feature vector depending on the social situation is applied to. To better understand the social dynamics, we also estimate what we call social relevance of each subject in a group using a saliency attentive model. Extensive tests on two publicly available datasets show that our solution achieves encouraging results when detecting social groups and their relevant subjects in the challenging egocentric scenarios.

2019 Capitolo/Saggio

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As … (Read full abstract)

Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state of the art performances on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.

2019 Relazione in Atti di Convegno

Towards Cycle-Consistent Models for Text and Image Retrieval

Authors: Cornia, Marcella; Baraldi, Lorenzo; Rezazadegan Tavakoli, Hamed; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Cross-modal retrieval has been recently becoming an hot-spot research, thanks to the development of deeply-learnable architectures. Such architectures generally learn … (Read full abstract)

Cross-modal retrieval has been recently becoming an hot-spot research, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space in which text and images could be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domain. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.

2019 Relazione in Atti di Convegno

Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach

Authors: Carraggi, Angelo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, … (Read full abstract)

Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected in a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performances on fully supervised settings, its application to semi-supervised scenarios has been rarely investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred on a target dataset in which the pairing between images and sentences is not known, or not useful for training due to the limited size of the set. Experiments are performed on two target unsupervised scenarios, respectively related to the fashion and cultural heritage domain. Results show that our model is able to effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.

2019 Relazione in Atti di Convegno

What was Monet seeing while painting? Translating artworks to photo-realistic images

Authors: Tomei, Matteo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

State of the art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured … (Read full abstract)

State of the art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This brings to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images to realistic copies. In this paper, we present a model capable of translating paintings to photo-realistic images, trained without paired examples. The idea is to enforce a patch level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on Monet, Cezanne and Van Gogh paintings translation tasks, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.

2019 Relazione in Atti di Convegno

Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities

Authors: Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita

While several approaches to bring vision and language together are emerging, none of them has yet addressed the digital humanities … (Read full abstract)

While several approaches to bring vision and language together are emerging, none of them has yet addressed the digital humanities domain, which, nevertheless, is a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscript from the Renaissance, which we manually annotate aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and devise future works on the same line.

2018 Relazione in Atti di Convegno

Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

Authors: Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita

Published in: INTELLIGENZA ARTIFICIALE

Estimating the focus of attention of a person looking at an image or a video is a crucial step which … (Read full abstract)

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we discuss the effectiveness of convolutional neural networks (CNNs) models in saliency prediction. We present a set of Deep Learning architectures developed by us, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We will show how these deep networks closely recall the early saliency models, although improved with the semantics learned from the human ground-truth. Eventually, we will present a use-case in which saliency prediction is used to improve the automatic description of images.

2018 Articolo su rivista

Automatic Image Cropping and Selection using Saliency: an Application to Historical Manuscripts

Authors: Cornia, Marcella; Pini, Stefano; Baraldi, Lorenzo; Cucchiara, Rita

Published in: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE

Automatic image cropping techniques are particularly important to improve the visual quality of cropped images and can be applied to … (Read full abstract)

Automatic image cropping techniques are particularly important to improve the visual quality of cropped images and can be applied to a wide range of applications such as photo-editing, image compression, and thumbnail selection. In this paper, we propose a saliency-based image cropping method which produces significant cropped images by only relying on the corresponding saliency maps. Experiments on standard image cropping datasets demonstrate the benefit of the proposed solution with respect to other cropping methods. Moreover, we present an image selection method that can be effectively applied to automatically select the most representative pages of historical manuscripts thus improving the navigation of historical digital libraries.

2018 Relazione in Atti di Convegno

Page 10 of 11 • Total publications: 109