Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Signal Processing and Machine Learning for Diplegia Classification

Authors: Bergamini, Luca; Calderara, Simone; Bicocchi, Nicola; Ferrari, Alberto; Vitetta, Giorgio

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Diplegia is one of the most common forms of a broad family of motion disorders named cerebral palsy (CP), affecting the voluntary muscular system. In recent years, various classification criteria have been proposed for CP to assist in diagnosis, clinical decision-making and communication. In this manuscript, we divide the spastic forms of CP into four categories according to a previous classification criterion and propose a machine learning approach for automatically classifying patients. Training and validation of our approach are based on data from 200 patients acquired with 19 markers and high-frequency VICON cameras in an Italian hospital. Our approach makes use of the latest deep learning techniques; more specifically, it involves a multi-layer perceptron (MLP) combined with Fourier analysis. An encouraging classification performance is obtained for two of the four classes.
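
For readers who want a concrete starting point, here is a minimal sketch (not the authors' code) of the kind of pipeline the abstract describes: Fourier features extracted from 3D marker trajectories fed to a multi-layer perceptron. The array shapes, number of retained frequencies and network size are assumptions for illustration only.

    # Hypothetical sketch, not the authors' implementation: classify gait sequences
    # by feeding Fourier-domain features of 3D marker trajectories to an MLP.
    # Assumed layout: trajectories is (n_patients, 19 markers * 3 coords, n_frames).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    n_patients, n_channels, n_frames = 200, 19 * 3, 512
    trajectories = rng.normal(size=(n_patients, n_channels, n_frames))   # placeholder data
    labels = rng.integers(0, 4, size=n_patients)                         # placeholder CP classes

    # Fourier analysis: keep the magnitude of the lowest-frequency components
    # of every marker coordinate as a fixed-length feature vector.
    n_freq = 32
    spectra = np.abs(np.fft.rfft(trajectories, axis=-1))[:, :, :n_freq]
    features = spectra.reshape(n_patients, -1)

    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))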

2017 Conference Proceedings

Taking the Hidden Route: Deep Mapping of Affect via 3D Neural Networks

Authors: Ceruti, C.; Cuculo, V.; D’Amelio, A.; Grossi, G.; Lanzarotti, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

In this note we address the problem of providing a fast, automatic, and coarse processing of the early mapping from emotional facial expression stimuli to the basic continuous dimensions of the core affect representation of emotions, namely valence and arousal. Drawing on results in affective neuroscience, we assume this mapping to be the earliest stage of a complex unfolding of processes that eventually entail detailed perception and emotional reaction involving the body proper. Thus, differently from the vast majority of approaches in the field of affective facial expression processing, we design such a feedforward mechanism as a preliminary step that provides a suitable prior to the subsequent core affect dynamics, in which recognition is actually grounded. To this end we conceive and exploit a 3D spatiotemporal deep network as a suitable architecture to instantiate this early component, and experiments on the MAHNOB dataset demonstrate the soundness of the approach.
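
As an illustration of the kind of architecture the abstract refers to, the following is a toy 3D spatiotemporal network in PyTorch that maps a short facial-expression clip to the two core-affect dimensions (valence, arousal). Layer sizes, input resolution and clip length are assumptions, not the paper's configuration.

    # Illustrative sketch only: a tiny 3D convolutional network mapping a short
    # facial-expression clip to (valence, arousal).
    import torch
    import torch.nn as nn

    class Tiny3DAffectNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),          # global spatiotemporal pooling
            )
            self.head = nn.Linear(32, 2)          # -> (valence, arousal)

        def forward(self, clip):                  # clip: (batch, 3, frames, height, width)
            h = self.features(clip).flatten(1)
            return self.head(h)

    clip = torch.randn(4, 3, 16, 64, 64)          # placeholder batch of 16-frame clips
    print(Tiny3DAffectNet()(clip).shape)          # torch.Size([4, 2])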

2017 Conference Proceedings

Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach

Authors: Pini, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions; we annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches that tackle the naming problem on already generated captions.
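
A toy illustration of the naming step on an already generated caption: each generic "someone" tag is replaced by the best-ranked character among those whose face tracks appear in the clip. The ranking here is a placeholder dictionary, not the multi-modal scoring used in the paper.

    # Hypothetical sketch of caption naming; scores come from an assumed
    # per-clip relevance model, not from the paper's approach.
    import re

    def name_caption(caption, track_scores):
        """track_scores: dict mapping character name -> relevance score for this clip."""
        ranked = sorted(track_scores, key=track_scores.get, reverse=True)
        def next_name(match, it=iter(ranked)):
            return next(it, match.group(0))       # keep "someone" if we run out of names
        return re.sub(r"\bsomeone\b", next_name, caption, flags=re.IGNORECASE)

    print(name_caption("Someone opens the door and someone walks in.",
                       {"Harry": 0.9, "Ron": 0.7}))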

2017 Conference Proceedings

Tracking social groups within and across cameras

Authors: Solera, Francesco; Calderara, Simone; Ristani, Ergys; Tomasi, Carlo; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

We propose a method for tracking groups from single and multiple cameras with disjoint fields of view. Our formulation follows the tracking-by-detection paradigm, where groups are the atomic entities and are linked over time to form long and consistent trajectories. To this end, we formulate the task as a supervised clustering problem in which a Structural SVM classifier learns a similarity measure appropriate for group entities. Multi-camera group tracking is handled inside the framework by adopting an orthogonal feature encoding that allows the classifier to learn inter- and intra-camera feature weights differently. Experiments were carried out on a novel annotated group tracking data set, the DukeMTMC-Groups data set. Since this is the first data set for this problem, we also propose a suitable evaluation measure. Results of adopting learning for the task are encouraging, with a +15% improvement in F1 measure over a non-learning-based clustering baseline. To our knowledge, this is the first proposal of its kind dealing with multi-camera group tracking.
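
The orthogonal feature encoding mentioned above can be illustrated with a small sketch (assumed feature layout): a pairwise feature vector is placed into one of two disjoint blocks depending on whether the two observations come from the same camera, so a single linear weight vector, such as the one learned by the Structural SVM, weighs intra- and inter-camera evidence independently.

    # Illustrative sketch of orthogonal feature encoding for camera-aware pairs.
    import numpy as np

    def encode_pair(features, same_camera):
        """features: 1-D pairwise affinity features (e.g. appearance and motion distances)."""
        f = np.asarray(features, dtype=float)
        block = np.zeros(2 * f.size)
        offset = 0 if same_camera else f.size      # intra-camera block vs inter-camera block
        block[offset:offset + f.size] = f
        return block

    intra = encode_pair([0.2, 0.8], same_camera=True)    # -> [0.2, 0.8, 0. , 0. ]
    inter = encode_pair([0.2, 0.8], same_camera=False)   # -> [0. , 0. , 0.2, 0.8]
    print(intra, inter)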

2017 Journal Article

Two More Strategies to Speed Up Connected Components Labeling Algorithms

Authors: Bolelli, Federico; Cancilla, Michele; Grana, Costantino

Published in: LECTURE NOTES IN COMPUTER SCIENCE

This paper presents two strategies that can be used to improve the speed of Connected Components Labeling algorithms. The first one operates on optimal decision trees that take into account the occurrences of image patterns, while the second one shows how two-scan algorithms can be parallelized using multi-threading. Experimental results demonstrate that the proposed methodologies reduce the total execution time of state-of-the-art two-scan algorithms.
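
For context, below is a minimal two-scan connected components labeling baseline (4-connectivity, union-find equivalence table) of the kind such strategies accelerate; it is a didactic sketch, not the optimized algorithms proposed in the paper.

    # Baseline two-scan labeling with a union-find equivalence table.
    import numpy as np

    def label_two_scan(img):
        labels = np.zeros(img.shape, dtype=int)
        parent = [0]                                  # parent[i] = representative of label i

        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i

        next_label = 1
        h, w = img.shape
        for y in range(h):                            # first scan: provisional labels + equivalences
            for x in range(w):
                if not img[y, x]:
                    continue
                up = labels[y - 1, x] if y else 0
                left = labels[y, x - 1] if x else 0
                if up and left:
                    a, b = find(up), find(left)
                    labels[y, x] = min(a, b)
                    parent[max(a, b)] = min(a, b)     # merge the two equivalence classes
                elif up or left:
                    labels[y, x] = up or left
                else:
                    labels[y, x] = next_label
                    parent.append(next_label)
                    next_label += 1
        for y in range(h):                            # second scan: resolve equivalences
            for x in range(w):
                if labels[y, x]:
                    labels[y, x] = find(labels[y, x])
        return labels

    img = np.array([[1, 1, 0, 1],
                    [0, 1, 0, 1],
                    [0, 0, 0, 1]])
    print(label_two_scan(img))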

2017 Conference Proceedings

Video registration in egocentric vision under day and night illumination changes

Authors: Alletto, Stefano; Serra, Giuseppe; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

With the spread of wearable devices and head-mounted cameras, a wide range of applications requiring precise user localization is now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on matching local keypoints extracted from the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the quality of this 2D-3D matching, and can degrade under steep changes in lighting such as those between day and night. To effectively register an egocentric video sequence under these conditions, we tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated on unconstrained egocentric video sequences, both in terms of matching quality and resulting registration performance, using different 3D models of historical landmarks. The results show that the proposed method can outperform state-of-the-art registration algorithms, in particular when dealing with the challenges of night and day sequences.
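
A heavily simplified sketch of the matching idea (assumed feature names and weights, not the embedding proposed in the paper): candidate 2D-3D matches are scored by combining descriptor distance, agreement between the keypoint position and the 3D point's projection from a previously registered frame, and the temporal robustness of the keypoint track.

    # Hypothetical scoring of 2D-3D match candidates from three cues.
    import numpy as np

    def match_scores(desc2d, desc3d, kp_xy, proj_xy, stability,
                     w_desc=1.0, w_spatial=0.01, w_time=0.5):
        d_desc = np.linalg.norm(desc2d[:, None, :] - desc3d[None, :, :], axis=2)
        d_spatial = np.linalg.norm(kp_xy[:, None, :] - proj_xy[None, :, :], axis=2)
        return w_desc * d_desc + w_spatial * d_spatial - w_time * stability[:, None]

    rng = np.random.default_rng(0)
    n_kp, n_pts = 6, 10
    scores = match_scores(rng.normal(size=(n_kp, 64)), rng.normal(size=(n_pts, 64)),
                          rng.uniform(0, 640, (n_kp, 2)), rng.uniform(0, 640, (n_pts, 2)),
                          rng.uniform(0, 1, n_kp))
    print(scores.argmin(axis=1))        # index of the best 3D point for each 2D keypoint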

2017 Journal Article

Virtual EMG via Facial Video Analysis

Authors: Boccignone, G.; Cuculo, V.; Grossi, G.; Lanzarotti, R.; Migliaccio, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

In this note, we address the problem of simulating electromyographic signals arising from muscles involved in facial expressions, markedly those conveying affective information, by relying solely on facial landmarks detected on video sequences. We propose a method that uses the framework of Gaussian Process regression to predict the facial electromyographic signal from videos in which people display non-posed affective expressions. To this end, experiments have been conducted on the OPEN EmoRec II multimodal corpus.
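
A minimal sketch of the regression step with synthetic data: Gaussian Process regression from per-frame facial landmark displacements to an EMG amplitude, in the spirit of the "virtual EMG" described above. The kernel choice and feature layout are assumptions.

    # Sketch with synthetic data: GP regression from landmark features to EMG.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    n_frames, n_landmarks = 300, 68
    X = rng.normal(size=(n_frames, 2 * n_landmarks))              # per-frame landmark displacements (x, y)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n_frames)    # synthetic EMG envelope

    gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gpr.fit(X[:200], y[:200])
    pred, std = gpr.predict(X[200:], return_std=True)             # predicted signal + uncertainty
    print(pred[:5], std[:5])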

2017 Conference Proceedings

Vision and language integration: Moving beyond objects

Authors: Shekhar, R.; Pezzelle, S.; Herbelot, A.; Nabi, M.; Sangineto, E.; Bernardi, R.

Recent years have seen an explosion of work on the integration of vision and language data. New tasks like Image Captioning and Visual Question Answering have been proposed and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of those models. To this end, several datasets have been proposed to challenge the state of the art. Those datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations fare somewhat better, but only in attention models.

2017 Conference Proceedings

Visual Saliency for Image Captioning in New Multimedia Services

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search, and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives makes this task still far from being solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, some forms of unsupervised machine attention mechanisms have been spreading, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to provide captions for images, which can be exploited for many services on the cloud and on multimedia data. Experimental evaluations are conducted on the SALICON dataset, which provides ground truth for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used dataset for image captioning.
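
A toy sketch of saliency-driven attention (assumed tensor layout, not the paper's model): attention logits over image regions are biased by a predicted saliency map before pooling region features for the captioning decoder.

    # Illustrative sketch: saliency biases spatial attention over region features.
    import torch
    import torch.nn.functional as F

    def saliency_attention(region_feats, attn_logits, saliency, alpha=1.0):
        """region_feats: (B, R, D); attn_logits: (B, R); saliency: (B, R), values in (0, 1)."""
        biased = attn_logits + alpha * torch.log(saliency + 1e-6)   # boost salient regions
        weights = F.softmax(biased, dim=1)                          # (B, R)
        return (weights.unsqueeze(-1) * region_feats).sum(dim=1)    # (B, D) pooled context

    feats = torch.randn(2, 49, 512)                 # e.g. a 7x7 grid of CNN features
    logits = torch.randn(2, 49)
    sal = torch.rand(2, 49)
    print(saliency_attention(feats, logits, sal).shape)             # torch.Size([2, 512])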

2017 Conference Proceedings

A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation

Authors: Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita

This paper presents a novel video access and retrieval system for edited videos. The key element of the proposal is that videos are automatically decomposed into semantically coherent parts (called scenes) to provide a more manageable unit for browsing, tagging and searching. The system features an automatic annotation pipeline with which videos are tagged by exploiting both the transcript and the video itself. Scenes can also be retrieved with textual queries; the best thumbnail for a query is selected according to both semantic and aesthetic criteria.
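
The thumbnail-selection criterion can be illustrated with a tiny sketch (hypothetical scores, not the system's actual models): candidate keyframes of a retrieved scene are ranked by a weighted combination of semantic similarity to the query and an aesthetic score.

    # Hypothetical ranking of candidate thumbnails by semantics + aesthetics.
    def best_thumbnail(candidates, w_semantic=0.7, w_aesthetic=0.3):
        """candidates: list of (frame_id, semantic_similarity, aesthetic_score), scores in [0, 1]."""
        return max(candidates,
                   key=lambda c: w_semantic * c[1] + w_aesthetic * c[2])[0]

    frames = [("f_012", 0.81, 0.40), ("f_045", 0.78, 0.92), ("f_103", 0.60, 0.95)]
    print(best_thumbnail(frames))       # -> "f_045"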

2016 Conference Proceedings
