Publications - AImageLab

Unsupervised vehicle re-identification using triplet networks

Authors: Marin-Reyes, P. A.; Bergamini, L.; Lorenzo-Navarro, J.; Palazzi, A.; Calderara, S.; Cucchiara, R.

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Vehicle re-identification plays a major role in modern smart surveillance systems. Specifically, the task requires predicting the identity of a given vehicle, given a dataset of known associations collected from different views and surveillance cameras. Generally, it can be cast as a ranking problem: given a probe image of a vehicle, the model needs to rank all database images by their similarity to the probe image. In line with recent research, we devise a metric learning model that employs supervision based on local constraints. In particular, we leverage pairwise and triplet constraints to train a network capable of assigning a high degree of similarity to samples sharing the same identity, while keeping different identities distant in feature space. Finally, we show how vehicle tracking can be exploited to automatically generate a weakly labelled dataset that can be used to train the deep network for the task of vehicle re-identification. Learning and evaluation are carried out on the NVIDIA AI City Challenge videos.
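The triplet constraint and the ranking step described in this abstract can be illustrated with a minimal sketch. It assumes the common hinge form of the triplet loss over Euclidean distances, which may differ from the exact formulation used in the paper. The function names, the margin value, and the toy 2-D embeddings are all hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank_gallery(probe, gallery):
    """Return gallery ids sorted from most to least similar to the probe.

    `gallery` maps a hypothetical vehicle id to its embedding.
    """
    return sorted(gallery, key=lambda vid: euclidean(probe, gallery[vid]))


# Toy embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # same identity, distance 1 from the anchor
far_negative = [3.0, 4.0]   # different identity, distance 5 from the anchor
near_negative = [1.0, 0.0]  # different identity, distance 1 from the anchor

satisfied = triplet_loss(anchor, positive, far_negative)   # constraint met
violated = triplet_loss(anchor, positive, near_negative)   # constraint violated

gallery = {"car_a": [0.0, 1.0], "car_b": [3.0, 4.0], "car_c": [0.5, 0.0]}
ranking = rank_gallery(anchor, gallery)
```

In this toy case the first triplet already satisfies the margin and contributes no loss, while the second is penalised because the negative is as close to the anchor as the positive is.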

2018 Conference proceedings paper

Using Kinect camera for investigating intergroup non-verbal human interactions

Authors: Vezzali, Loris; Di Bernardo, Gian Antonio; Cadamuro, Alessia; Cocco, Veronica Margherita; Crapolicchio, Eleonora; Bicocchi, Nicola; Calderara, Simone; Giovannini, Dino; Zambonelli, Franco; Cucchiara, Rita

A long tradition in social psychology has focused on nonverbal behaviour displayed during dyadic interactions, generally relying on evaluations from external coders. However, in addition to the fact that external coders may be biased, they may not capture certain types of behavioural indices. We designed three studies examining explicit and implicit prejudice as predictors of nonverbal behaviour as reflected in objective indices provided by Kinect cameras. In the first study, we considered White-Black relations from the perspective of 36 White participants. Results revealed that implicit prejudice was associated with a reduction in interpersonal distance and in the volume of space between Whites and Blacks (vs. Whites and Whites), which in turn were associated with evaluations by collaborators taking part in the interaction. In the second study, 37 non-HIV participants interacted with HIV individuals. We found that implicit prejudice was associated with reduced volume of space between interactants over time (a process of bias overcorrection) only when they tried hard to control their behaviour (as captured by a Stroop test). In the third study, 35 non-disabled children interacted with disabled children. Results revealed that implicit prejudice was associated with reduced interpersonal distance over time.

2018 Abstract in conference proceedings

A new era in the study of intergroup nonverbal behaviour: Studying intergroup dyadic interactions “online”

Authors: Di Bernardo, Gian Antonio; Vezzali, Loris; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia

We examined predictors and consequences of intergroup nonverbal behaviour by relying on new technologies and new objective indices. In three studies, both in the laboratory and in the field with children, behaviour was a function of implicit prejudice.

2017 Abstract in conference proceedings

A Video Library System Using Scene Detection and Automatic Tagging

Authors: Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

We present a novel video browsing and retrieval system for edited videos, in which videos are automatically decomposed into meaningful storytelling parts (i.e. scenes) and tagged according to their transcript. The system relies on a Triplet Deep Neural Network which exploits multimodal features, and has been implemented as a set of extensions to the eXo Platform Enterprise Content Management System (ECMS). This set of extensions enables the interactive visualization of a video, its automatic and semi-automatic annotation, as well as keyword-based search inside the video collection. The platform also allows natural integration with third-party add-ons, so that automatic annotations can be exploited outside the proposed platform.

2017 Conference proceedings paper

Affective level design for a role-playing videogame evaluated by a brain–computer interface and machine learning methods

Authors: Balducci, Fabrizio; Grana, Costantino; Cucchiara, Rita

Published in: THE VISUAL COMPUTER

Game science has become a research field that attracts industry attention due to a rich worldwide sales market. To understand the player experience, concepts like flow or boredom mental states require formalization and empirical investigation, taking advantage of the objective data that psychophysiological methods like electroencephalography (EEG) can provide. This work studies affective ludology and presents two different game levels for Neverwinter Nights 2 developed with the aim of manipulating emotions; two sets of affective design guidelines are presented, with a rigorous formalization that considers the characteristics of the role-playing genre and its specific gameplay. An empirical investigation with a brain–computer interface headset has been conducted: by extracting numerical data features, machine learning techniques classify the different activities of the gaming sessions (tasks and events) to verify whether their design differentiation coincides with the affective one. The observed results, also supported by subjective questionnaire data, confirm the validity of the proposed guidelines, suggesting that this evaluation methodology could be extended to other evaluation tasks.

2017 Journal article

Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

Authors: Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform any early approach when compared on public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures, which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Finally, we present a video-specific architecture based on the C3D network, which can extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amounts of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned from human ground truth.

2017 Conference proceedings paper

Editorial Message from the Program Chairs

Authors: Cucchiara, R.; Matsushita, Y.; Sebe, N.; Soatto, S.

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

2017 Conference proceedings paper

Embedded Recurrent Network for Head Pose Estimation in Car

Authors: Borghi, Guido; Gasparini, Riccardo; Vezzani, Roberto; Cucchiara, Rita

Accurate and fast estimation of the driver's head pose is a rich source of information, in particular in the automotive context. Head pose is a key element for investigating driver behavior, pose analysis, and attention monitoring, and is also a useful component for improving the efficacy of Human-Car Interaction systems. In this paper, a Recurrent Neural Network is exploited to tackle the problem of driver head pose estimation, working directly and only on depth images to be more reliable in the presence of varying or insufficient illumination. Experimental results, obtained from two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP Database, prove the efficacy of the proposed method, which outperforms state-of-the-art works. In addition, the entire system is implemented and tested on two embedded boards with real-time performance.

2017 Conference proceedings paper

Fast and Accurate Facial Landmark Localization in Depth Images for In-car Applications

Authors: Frigieri, Elia; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

A correct and reliable localization of facial landmarks enables several applications in many fields, ranging from Human Computer Interaction to video surveillance. For instance, it can provide a valuable input to monitor the driver's physical state and attention level in the automotive context. In this paper, we tackle the problem of facial landmark localization through a deep approach. The developed system runs in real time and, in particular, is more reliable than state-of-the-art competitors, especially in the presence of light changes and poor illumination, thanks to the use of depth images as input. We also collected and shared a new realistic dataset inside a car, called MotorMark, to train and test the system. In addition, we exploited the public Eurecom Kinect Face Dataset for the evaluation phase, achieving promising results both in terms of accuracy and computational speed.

2017 Conference proceedings paper

From Depth Data to Head Pose Estimation: a Siamese approach

Authors: Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

The correct estimation of the head pose is a problem of great importance for many applications. For instance, it is an enabling technology in the automotive field for driver attention monitoring. In this paper, we tackle the pose estimation problem through a deep learning network working in a regression manner. Traditional methods usually rely on visual facial features, such as facial landmarks or nose tip position. In contrast, we exploit a Convolutional Neural Network (CNN) to perform head pose estimation directly from depth data. We exploit a Siamese architecture and we propose a novel loss function to improve the learning of the regression network layer. The system has been tested on two public datasets, Biwi Kinect Head Pose and ICT-3DHP database. The reported results demonstrate the improvement in accuracy with respect to current state-of-the-art approaches and the real-time capabilities of the overall framework.

2017 Conference proceedings paper
