Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach

Authors: Pini, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Current approaches for movie description lack the ability to name characters with their proper names, and can only indicate people with a generic "someone" tag. In this paper we present two contributions towards the development of video description architectures with naming capabilities: first, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches to solve the naming problem on already generated captions.

2017 Paper in Conference Proceedings

Visual Saliency for Image Captioning in New Multimedia Services

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search, and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives makes this task still far from being solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, unsupervised machine attention mechanisms have gained popularity, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to generate captions for images, which can be exploited for many services on the cloud and on multimedia data. Experimental evaluations are conducted on the SALICON dataset, which provides ground truth for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used for image captioning.

2017 Paper in Conference Proceedings

A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation

Authors: Baraldi, Lorenzo; Grana, Costantino; Messina, Alberto; Cucchiara, Rita

This paper presents a novel video access and retrieval system for edited videos. The key element of the proposal is that videos are automatically decomposed into semantically coherent parts (called scenes) to provide a more manageable unit for browsing, tagging and searching. The system features an automatic annotation pipeline, with which videos are tagged by exploiting both the transcript and the video itself. Scenes can also be retrieved with textual queries; the best thumbnail for a query is selected according to both semantic and aesthetic criteria.

2016 Paper in Conference Proceedings

A Deep Multi-Level Network for Saliency Prediction

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

This paper presents a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction employ Fully Convolutional Networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network that weights low- and high-level feature maps, and a prior learning network. We compare our solution with state-of-the-art saliency models on two public benchmark datasets. Results show that our model outperforms the state of the art under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark.
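The multi-level combination described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature maps, channel counts, and encoding weights are random stand-ins, and the learned 1x1 combination is reduced to a per-pixel weighted sum across concatenated channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from three depths of a CNN, assumed already
# upsampled to a common spatial size (8x8) with different channel counts.
low = rng.random((64, 8, 8))
mid = rng.random((128, 8, 8))
high = rng.random((256, 8, 8))

# Feature encoding: concatenate levels and collapse channels with a
# learned 1x1 convolution, i.e. a per-pixel weighted sum across channels.
features = np.concatenate([low, mid, high], axis=0)   # (448, 8, 8)
w = rng.standard_normal(448) * 0.05                   # stand-in weights
saliency = np.tensordot(w, features, axes=(0, 0))     # (8, 8)

# Normalize to [0, 1], as is common for saliency maps.
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())
print(saliency.shape)
```

In the actual model the combination weights are learned end-to-end, and a separate branch learns a center-bias prior; both are omitted here.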

2016 Paper in Conference Proceedings

Analysis and Re-use of Videos in Educational Digital Libraries with Automatic Scene Detection

Authors: Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

Published in: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE

The advent of modern approaches to education, like Massive Open Online Courses (MOOCs), has made video the basic medium for educating and transmitting knowledge. However, IT tools are still not adequate to allow video content re-use, tagging, annotation and personalization. In this paper we analyze the problem of identifying coherent sequences, called scenes, in order to provide users with a more manageable editing unit. A simple spectral clustering technique is proposed and compared with state-of-the-art results. We also discuss correct ways to evaluate the performance of automatic scene detection algorithms.
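The spectral clustering step can be sketched with scikit-learn. This is an illustrative toy, not the paper's method: the shot descriptors are random stand-ins, the number of scenes is fixed by hand, and the actual algorithm would also enforce temporal contiguity between shots of the same scene.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)

# Hypothetical shot descriptors: one feature vector per shot
# (e.g., a color histogram); values here are random placeholders.
shot_features = rng.random((12, 16))

# Pairwise shot similarity via an RBF kernel on the descriptors.
affinity = rbf_kernel(shot_features, gamma=0.5)

# Group shots into a fixed number of scenes by spectral clustering
# on the precomputed affinity matrix.
labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(affinity)

print(labels)  # one scene label per shot
```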

2016 Paper in Conference Proceedings

Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager

Authors: Paci, Francesco; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita; Benini, Luca

Published in: LECTURE NOTES IN COMPUTER SCIENCE

With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to finding efficient, stable and long-running solutions; however, devices are too power-hungry for truly always-on operation, and are aggressively duty-cycled to achieve acceptable lifetimes. In this paper we present a wearable system for context change detection based on an egocentric camera with ultra-low power consumption that can collect data 24/7. Although the resolution of the captured images is low, experimental results in real scenarios demonstrate how our approach, based on Siamese Neural Networks, can achieve visual context awareness. In particular, we compare our solution with hand-crafted features and with a state-of-the-art technique, and propose a novel and challenging dataset composed of roughly 30,000 low-resolution images.
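The Siamese idea behind the detector can be sketched as follows: two frames pass through the same embedding network, and a large distance between their embeddings signals a context change. Everything below is an illustrative stand-in (a single linear layer instead of the paper's CNN, random weights, a hand-picked threshold), not the published model.

```python
import numpy as np

def embed(img, W):
    """Toy embedding: a single linear layer plus ReLU, standing in
    for the shared CNN branch of a Siamese network."""
    return np.maximum(W @ img.ravel(), 0.0)

def context_changed(img_a, img_b, W, threshold=1.0):
    """Both frames go through the SAME weights W (the Siamese idea);
    a large embedding distance indicates a context change."""
    d = np.linalg.norm(embed(img_a, W) - embed(img_b, W))
    return d > threshold

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64)) * 0.1   # shared, untrained weights
frame1 = rng.random((8, 8))              # low-resolution egocentric frame
frame2 = frame1 + 0.01                   # nearly identical next frame
print(context_changed(frame1, frame2, W))
```

In the actual system the shared weights are trained with pairs of frames labeled as same-context or different-context, so that the embedding distance becomes discriminative.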

2016 Paper in Conference Proceedings

Historical Document Digitization through Layout Analysis and Deep Content Classification

Authors: Corbelli, Andrea; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation, as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by a Random Forest classifier fed with features extracted from a Convolutional Neural Network. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
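The region classification stage (a Random Forest over CNN features) can be sketched with scikit-learn. This is a toy illustration under stated assumptions: descriptors, labels, and class names are random or hypothetical stand-ins, not the paper's data or trained network.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical CNN descriptors for segmented document regions
# (128-D each) with made-up class ids, e.g. 0=text, 1=image, 2=table.
X_train = rng.random((200, 128))
y_train = rng.integers(0, 3, 200)

# Random Forest consuming the CNN features.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

region = rng.random((1, 128))   # descriptor of one new region
print(clf.predict(region))      # predicted region class
```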

2016 Paper in Conference Proceedings

Multi-Level Net: a Visual Saliency Prediction Model

Authors: Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

State-of-the-art approaches for saliency prediction are based on Fully Convolutional Networks, in which saliency maps are built using the last layer. In contrast, we here present a novel model that predicts saliency maps by exploiting a non-linear combination of features coming from different layers of the network. We also present a new loss function to deal with the imbalance of saliency masks. Extensive results on three public datasets demonstrate the robustness of our solution. Our model outperforms the state of the art on SALICON, the largest unconstrained dataset available, and obtains competitive results on the MIT300 and CAT2000 benchmarks.

2016 Paper in Conference Proceedings

Optimized Connected Components Labeling with Pixel Prediction

Authors: Grana, Costantino; Baraldi, Lorenzo; Bolelli, Federico

Published in: LECTURE NOTES IN COMPUTER SCIENCE

In this paper we propose a new paradigm for connected components labeling, which employs a general approach to minimize the number of memory accesses by exploiting the information provided by already seen pixels, removing the need to check them again. The scan phase of our proposed algorithm is ruled by a forest of decision trees connected into a single graph. Every tree derives from a reduction of the complete optimal decision tree. Experimental results demonstrate that on low-density images our method is slightly faster than the fastest conventional labeling algorithms.
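For context, the classic baseline that such algorithms optimize is two-pass labeling with union-find: a first raster scan assigns provisional labels and records equivalences, a second scan resolves them. The sketch below implements that baseline in plain Python (8-connectivity); the paper's contribution, decision-tree-driven scans that avoid re-reading already seen pixels, is an optimization layered on top of this idea.

```python
import numpy as np

def label_components(img):
    """Two-pass 8-connectivity connected components labeling
    with a simple union-find for label equivalences."""
    parent = [0]                      # parent[i]: union-find forest

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        parent[max(ra, rb)] = min(ra, rb)

    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    # First pass: provisional labels from already-visited neighbors.
    for y in range(h):
        for x in range(w):
            if not img[y, x]:
                continue
            neigh = [labels[y2, x2]
                     for y2, x2 in ((y, x - 1), (y - 1, x - 1),
                                    (y - 1, x), (y - 1, x + 1))
                     if 0 <= y2 and 0 <= x2 < w and labels[y2, x2]]
            if neigh:
                m = min(neigh)
                labels[y, x] = m
                for n in neigh:
                    union(m, n)
            else:                      # new provisional label
                parent.append(len(parent))
                labels[y, x] = len(parent) - 1
    # Second pass: resolve equivalences to final labels.
    for y in range(h):
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels

img = np.array([[1, 1, 0, 1],
                [0, 0, 0, 1],
                [1, 0, 1, 1]], dtype=bool)
out = label_components(img)
print(len(np.unique(out)) - 1)  # number of connected components
```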

2016 Paper in Conference Proceedings

Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features

Authors: Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita

This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and to represent them with thumbnails which are at the same time semantically meaningful and aesthetically pleasing. Videos are first segmented into coherent, story-telling scenes; then a retrieval algorithm based on deep learning retrieves the most significant scenes for a textual query. A ranking strategy based on deep features is finally used to select the best thumbnail. Qualitative and quantitative experiments are conducted on a collection of edited videos to demonstrate the effectiveness of our approach.

2016 Paper in Conference Proceedings

Page 14 of 15 • Total publications: 149