Publications

SeeFar: Vehicle Speed Estimation and Flow Analysis from a Moving UAV

Authors: Ning, M.; Ma, X.; Lu, Y.; Calderara, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Visual perception from drones has recently been widely investigated for Intelligent Traffic Monitoring Systems (ITMS). In this paper, we introduce SeeFar to achieve vehicle speed estimation and traffic flow analysis from a moving drone, based on YOLOv5 and DeepSORT. SeeFar differs from previous works in three key ways: the speed estimation and flow analysis components are integrated into a unified framework; our method of predicting car speed imposes the fewest constraints while maintaining high accuracy; and our flow analyzer is direction-aware and outlier-aware. Specifically, we design the speed estimator using only the camera imaging geometry, where the transformation between world space and image space is performed via a variable Ground Sampling Distance. Moreover, because previous papers do not evaluate their speed estimators at scale owing to the difficulty of obtaining ground truth, we propose a simple yet efficient approach to estimate the true speeds of vehicles from the known size of road signs. We evaluate SeeFar on our ten videos, which contain 929 vehicle samples. Experiments on these sequences demonstrate the effectiveness of SeeFar, which achieves 98.0% accuracy in speed estimation and 99.1% accuracy in traffic volume prediction.

2022 Conference proceedings paper

Self-configuring BLE deep sleep network for fault tolerant WSN

Authors: Rosati, C. A.; Cervo, A.; Bertoli, A.; Santacaterina, M.; Battilani, N.; Fantuzzi, C.

Published in: IFAC-PAPERSONLINE

This paper focuses on Wireless Sensor Networks (WSNs) that use Bluetooth Low Energy (BLE) connectivity for low-energy applications and are tolerant to communication path failures. The topic is important for creating robust sensorized environments in industrial contexts or smart infrastructure, enabling scheduled monitoring with low power consumption. Currently, BLE applications are mainly designed for smart home solutions, health care and positioning systems, where the BLE nodes are continuously powered by external supplies. Our goal is to design a self-configuring network with synchronized deep sleep behavior, aimed at optimizing energy consumption, with an overall active time interval constraint optimized with a data-driven method. The aim is to find a tradeoff between the on time and the ability to collect data from all nodes while keeping power consumption low. Our research is based on BLE protocols, the interaction between edge systems for data collection and a cloud system for data analysis, and a software agent optimization system. The paper analyses different configurations and describes possible optimization algorithms for the software agent design, in order to reach fine-tuned control that improves the fault tolerance and fault diagnosis of the system. Finally, experimental results are compared with the estimates obtained via a software simulation tool implemented for this architectural pattern.

2022

Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps

Authors: Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Published in: IEEE ROBOTICS AND AUTOMATION LETTERS

Knowing the exact 3D location of workers and robots in a collaborative environment enables several real applications, such as the detection of unsafe situations or the study of mutual interactions for statistical and social purposes. In this paper, we propose a non-invasive and light-invariant framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera. The method can be applied to any robot without requiring hardware access to the internal states. We introduce a novel representation of the predicted pose, namely Semi-Perspective Decoupled Heatmaps (SPDH), to accurately compute 3D joint locations in world coordinates by adapting efficient deep networks designed for 2D Human Pose Estimation. The proposed approach, which takes as input a depth representation based on XYZ coordinates, can be trained on synthetic depth data and applied to real-world settings without the need for domain adaptation techniques. To this end, we present the SimBa dataset, based on both synthetic and real depth images, and use it for the experimental evaluation. Results show that the proposed approach, consisting of a specific depth map representation and the SPDH, outperforms the current state of the art.

2022 Journal article

Sfruttare e Trasferire conoscenza a priori nelle Architetture di Deep Learning

Authors: Porrello, Angelo

Over the last decade, Deep Learning has become a hot topic as well as a disruptive tool in Machine Learning and Computer Vision. It rests on a learning paradigm in which data (for example, videos acquired by video-surveillance cameras placed on a public road) play a crucial role. By exploiting a large number of examples, it is possible to learn complex, human-like tasks (for example, recognizing anomalous actions in a video stream) with impressive results. However, while data availability is the greatest strength of Deep Learning techniques, it also hides their greatest weakness: the development of applications and services is often limited by this requirement, since acquiring and maintaining a huge amount of data are expensive activities that require expert personnel and suitable equipment. Nevertheless, the design of modern Deep Learning architectures offers several degrees of freedom, which can be exploited to mitigate the lack of training data, whether partial or complete. The underlying idea is to compensate for this lack by incorporating the prior knowledge that humans (in particular, those who control and guide the learning process) hold about the domain at hand. Indeed, intrinsic rules and properties extend well beyond the training data and can often be identified and imposed on the learning model. If we consider image classification, the success of Convolutional Neural Networks (CNNs) over earlier solutions (such as multilayer neural networks) can be attributed mainly to this practice. Indeed, the design principles of their fundamental building block (that is, the convolution between two 2D signals) naturally reflect what we knew about natural images: the correlations that hold between neighboring image regions thus provided a powerful insight for developing efficient and effective models, as CNNs still are. The aim of this thesis is to investigate and propose new ways of modeling and injecting prior knowledge into Deep Learning architectures. Importantly, this discussion is cross-cutting: it focuses on different data domains (for example, images, videos, graph-structured data, etc.) and involves different levels of the overall pipeline. On the latter point, the reader is guided through this research by the following threefold categorization: i) parameter-based approaches, which restrict the space of possible solutions to those regions that reflect the geometric properties of the data; ii) goal-driven approaches, which steer the learning process towards solutions that embody certain advantageous properties; iii) data-driven approaches, which exploit the data to extract knowledge that is later used to condition the training algorithm. Together with a complete description of both the settings and the tools involved, we present extensive experimental results and ablation studies that demonstrate the value of the techniques proposed in this research.

2022 Doctoral thesis

SHREC 2022 track on online detection of heterogeneous gestures

Authors: Emporio, M.; Caputo, A.; Giachetti, A.; Cristani, M.; Borghi, G.; D'Eusanio, A.; Le, M. -Q.; Nguyen, H. -D.; Tran, M. -T.; Ambellan, F.; Hanik, M.; Nava-Yazdani, E.; Von Tycowicz, C.

Published in: COMPUTERS & GRAPHICS

This paper presents the outcomes of a contest organized to evaluate methods for the online recognition of heterogeneous gestures from sequences of 3D hand poses. The task is the detection of gestures belonging to a dictionary of 16 classes characterized by different pose and motion features. The dataset features continuous sequences of hand tracking data where the gestures are interleaved with non-significant motions. The data have been captured using the Hololens 2 finger tracking system in a realistic use-case of mixed reality interaction. The evaluation is based not only on detection performance but also on latency and false positives, making it possible to assess the feasibility of practical interaction tools based on the proposed algorithms. The outcomes of the contest's evaluation demonstrate the need for further research to reduce recognition errors, while the computational cost of the proposed algorithms is sufficiently low.

2022 Journal article

Special Section on AI-empowered Multimedia Data Analytics for Smart Healthcare

Authors: Hossain, M. S.; Cucchiara, R.; Muhammad, G.; Tobon, D. P.; El Saddik, A.

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

2022 Journal article

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

Authors: Landi, Federico; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field require the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases, which instead require executing multiple tasks in the same environment. Even if the environment changes over time, the agent could still count on its global knowledge about the scene while trying to adapt its internal representation to the current state of the environment. To make a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout within a fixed time budget. To this end, we collect a new dataset of occupancy maps starting from existing datasets of 3D spaces and generating a number of possible layouts for a single environment. This dataset can be employed in the popular Habitat simulator and is fully compliant with existing methods that employ reconstructed occupancy maps during navigation. Furthermore, we propose an exploration policy that can take advantage of previous knowledge of the environment and identify changes in the scene faster and more effectively than existing agents. Experimental results show that the proposed architecture outperforms existing state-of-the-art models for exploration in this new setting.

2022 Conference proceedings paper

Temporal Alignment for History Representation in Reinforcement Learning

Authors: Ermolov, A.; Sangineto, E.; Sebe, N.

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Environments in Reinforcement Learning are usually only partially observable. To address this problem, a possible solution is to provide the agent with information about the past. However, providing complete observations of numerous steps can be excessive. Inspired by human memory, we propose to represent history with only important changes in the environment and, in our approach, to obtain this representation automatically using self-supervision. Our method (TempAl) aligns temporally-close frames, revealing a general, slowly varying state of the environment. This procedure is based on a contrastive loss, which pulls embeddings of nearby observations towards each other while pushing away other samples from the batch. It can be interpreted as a metric that captures the temporal relations of observations. We propose to combine the common instantaneous representation with our history representation, and we evaluate TempAl on all available Atari games from the Arcade Learning Environment. TempAl surpasses the instantaneous-only baseline in 35 environments out of 49. The source code of the method and of all the experiments is available at https://github.com/htdt/tempal.

2022 Conference proceedings paper

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Authors: Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting – even of the same author over a wide time-span – and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.

2022 Conference proceedings paper

The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis

Authors: Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.

2022 Conference proceedings paper
