Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.



POSEidon: Face-from-Depth for Driver Pose Estimation

Authors: Borghi, Guido; Venturelli, Marco; Vezzani, Roberto; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Fast and accurate upper-body and head pose estimation is a key task for automatic monitoring of driver attention, a challenging context characterized by severe illumination changes, occlusions and extreme poses. In this work, we present a new deep learning framework for head localization and pose estimation on depth images. The core of the proposal is a regression neural network, called POSEidon, which is composed of three independent convolutional nets followed by a fusion layer, specially conceived for understanding the pose from depth. In addition, to recover the intrinsic value of face appearance for understanding head position and orientation, we propose a new Face-from-Depth approach for learning face images from depth. Results in face reconstruction are qualitatively impressive. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Results show that our method outperforms all recent state-of-the-art approaches, running in real time at more than 30 frames per second.
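As a rough illustration of the fusion idea described in the abstract (not the authors' code; all function names, dimensions and weights below are hypothetical stand-ins), three independent feature extractors over a depth crop can be concatenated and fed to a small regression head predicting three head-pose angles:

```python
import random

def branch(depth_crop, rnd):
    # Stand-in for one independent convolutional net: here just a
    # random weighted sum of the flattened depth values, producing
    # a small feature vector.
    flat = [v for row in depth_crop for v in row]
    return [sum(v * rnd.uniform(-1, 1) for v in flat) for _ in range(8)]

def poseidon_like(depth_crop, seed=0):
    rnd = random.Random(seed)
    # Three independent branches feeding a fusion layer (the paper fuses
    # depth, a Face-from-Depth reconstruction, and a motion-like input).
    fused = []
    for _ in range(3):
        fused += branch(depth_crop, rnd)
    # Fusion layer regressing three head-pose angles (yaw, pitch, roll).
    return [sum(f * rnd.uniform(-1, 1) for f in fused) for _ in range(3)]

crop = [[(x + y) % 7 / 7.0 for x in range(8)] for y in range(8)]
print(len(poseidon_like(crop)))  # 3
```

The key design choice sketched here is that each branch is trained and evaluated independently before a late-fusion layer combines them, rather than feeding all inputs to a single network.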

2017 Conference proceedings paper

Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks

Authors: Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON MULTIMEDIA

In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: 1) semantic feature extraction which builds video specific concept detectors; 2) multimodal feature embedding learning, that maps the feature vector of a shot to a space in which the Euclidean distance has task specific semantic properties. The proposed method is able to decompose the video in annotated temporal segments which allow for a query specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted and a strategy to overcome the problem is suggested.

2017 Journal article

Segmentation models diversity for object proposals

Authors: Manfredi, Marco; Grana, Costantino; Cucchiara, Rita; Smeulders, Arnold W. M.

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

In this paper we present a segmentation proposal method which employs a box-hypotheses generation step followed by a lightweight segmentation strategy. Inspired by interactive segmentation, for each automatically placed bounding-box we compute a precise segmentation mask. We introduce diversity in segmentation strategies, enhancing the performance of a generic model by exploiting class-independent regional appearance features. Foreground probability scores are learned from groups of objects with peculiar characteristics to specialize segmentation models. We demonstrate results comparable to the state-of-the-art on PASCAL VOC 2012 and a further improvement by merging our proposals with those of a recent solution. The ability to generalize to unseen object categories is demonstrated on Microsoft COCO 2014.

2017 Journal article

Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach

Authors: Pini, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: firstly, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Secondly, to underline and quantify the challenges of the task of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.

2017 Conference proceedings paper

Tracking social groups within and across cameras

Authors: Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the problem as a supervised clustering problem where a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set on the problem, it comes with the proposal of a suitable evaluation measure. Results of adopting learning for the task are encouraging, scoring a +15% improvement in F1 measure over a non-learning based clustering baseline. To our knowledge this is the first proposal of this kind dealing with multi-camera group tracking.
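A minimal sketch of the two ideas in this abstract (the learned similarity and the orthogonal inter-/intra-camera encoding); this is an illustrative toy, not the paper's implementation, and the weights below are hypothetical rather than learned by a Structural SVM:

```python
def similarity(f1, f2, weights):
    # Learned linear similarity over per-feature absolute differences
    # (in the paper, a Structural SVM learns `weights` from annotated tracks).
    return -sum(w * abs(a - b) for w, a, b in zip(weights, f1, f2))

def encode(features, same_camera):
    # "Orthogonal" encoding: intra- and inter-camera observations occupy
    # disjoint slots of the vector, so the classifier can weight the same
    # feature differently depending on whether cameras match.
    zeros = [0.0] * len(features)
    return features + zeros if same_camera else zeros + features

a = encode([0.2, 0.8], same_camera=True)
b = encode([0.3, 0.7], same_camera=True)
w = [1.0, 2.0, 0.5, 0.5]  # hypothetical learned weights
print(similarity(a, b, w))  # approximately -0.3
```

Because intra- and inter-camera comparisons never share slots, a single linear classifier effectively learns two distance metrics at once.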

2017 Journal article

Video registration in egocentric vision under day and night illumination changes

Authors: Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

With the spread of wearable devices and head mounted cameras, a wide range of applications requiring precise user localization is now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on a matching process of local keypoints extracted on the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the actual quality of this 2D-3D matching, and can degrade if environmental conditions such as steep changes in lighting like the ones between day and night occur. To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated using unconstrained egocentric video sequences both in terms of matching quality and resulting registration performance using different 3D models of historical landmarks. The results show that the proposed method can outperform state-of-the-art registration algorithms, in particular when dealing with the challenges of night and day sequences.

2017 Journal article

Visual Saliency for Image Captioning in New Multimedia Services

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives makes this task still far from being solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, some forms of unsupervised machine attention mechanisms have been spreading, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to generate captions for images, which can be exploited for many services on cloud and on multimedia data. Experimental evaluations are conducted on the SALICON dataset, which provides ground truths for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used for image captioning.

2017 Conference proceedings paper

A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation

Authors: Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita

This paper presents a novel video access and retrieval system for edited videos. The key element of the proposal is that videos are automatically decomposed into semantically coherent parts (called scenes) to provide a more manageable unit for browsing, tagging and searching. The system features an automatic annotation pipeline, with which videos are tagged by exploiting both the transcript and the video itself. Scenes can also be retrieved with textual queries; the best thumbnail for a query is selected according to both semantic and aesthetic criteria.

2016 Conference proceedings paper

A Deep Multi-Level Network for Saliency Prediction

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

This paper presents a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network, that weights low and high level feature maps, and a prior learning network. We compare our solution with state-of-the-art saliency models on two public benchmark datasets. Results show that our model outperforms competing approaches under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.
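The multi-level combination described in the abstract can be sketched as follows; this is a toy illustration under assumed shapes and weights (the actual model learns the combination weights and the prior end-to-end), not the authors' code:

```python
def upsample(fmap, size):
    # Nearest-neighbour upsampling of a square feature map to size x size,
    # so maps taken at different CNN depths share one resolution.
    n = len(fmap)
    return [[fmap[i * n // size][j * n // size] for j in range(size)]
            for i in range(size)]

def multilevel_saliency(maps, weights, prior):
    # Weighted combination of feature maps from different network levels,
    # modulated element-wise by a learned center-bias-like prior map.
    size = len(prior)
    up = [upsample(m, size) for m in maps]
    return [[sum(w * u[i][j] for w, u in zip(weights, up)) * prior[i][j]
             for j in range(size)] for i in range(size)]

low  = [[0.1] * 8 for _ in range(8)]   # low-level map (fine resolution)
mid  = [[0.5] * 4 for _ in range(4)]   # mid-level map
high = [[0.9] * 2 for _ in range(2)]   # high-level map (coarse, semantic)
prior = [[1.0] * 8 for _ in range(8)]
sal = multilevel_saliency([low, mid, high], [0.2, 0.3, 0.5], prior)
print(sal[0][0])  # approximately 0.62
```

The point of the design is that fine localization (early layers) and semantic context (deep layers) contribute jointly to each saliency value, instead of relying on the last layer alone.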

2016 Conference proceedings paper

A location-aware architecture for an IoT-based smart museum

Authors: Del Fiore, Giuseppe; Mainetti, Luca; Mighali, Vincenzo; Patrono, Luigi; Alletto, Stefano; Cucchiara, Rita; Serra, Giuseppe

Published in: INTERNATIONAL JOURNAL OF ELECTRONIC GOVERNMENT RESEARCH

The Internet of Things, whose main goal is to automatically predict users' desires, can find very interesting opportunities in the art and culture field, as tourism is one of the main driving engines of modern society. Currently, the innovation process in this field is growing at a slower pace, so cultural heritage remains the prerogative of a restricted category of users. To address this issue, a significant technological improvement is necessary in culture-dedicated locations, which do not usually allow the installation of hardware infrastructures. In this paper, we design and validate a non-invasive indoor location-aware architecture able to enhance the user experience in a museum. The system relies on the user's smartphone and a wearable device (with image recognition and localization capabilities) to automatically deliver personalized cultural contents related to the observed artworks. The proposal was validated in the MUST museum in Lecce (Italy).

2016 Journal article

Page 25 of 51 • Total publications: 509