Publications by Roberto Vezzani

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Tip: type @ to pick an author and # to pick a keyword.

Active filters (Clear): Author: Roberto Vezzani

A Systematic Comparison of Depth Map Representations for Face Recognition

Authors: Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Maltoni, Davide; Cucchiara, Rita

Published in: SENSORS

2021 Articolo su rivista

Extracting accurate long-term behavior changes from a large pig dataset

Authors: Bergamini, L.; Pini, S.; Simoni, A.; Vezzani, R.; Calderara, S.; Eath, R. B. D.; Fisher, R. B.

Visual observation of uncontrolled real-world behavior leads to noisy observations, complicated by occlusions, ambiguity, variable motion rates, detection and tracking … (Read full abstract)

Visual observation of uncontrolled real-world behavior leads to noisy observations, complicated by occlusions, ambiguity, variable motion rates, detection and tracking errors, slow transitions between behaviors, etc. We show in this paper that reliable estimates of long-term trends can be extracted given enough data, even though estimates from individual frames may be noisy. We validate this concept using a new public dataset of approximately 20+ million daytime pig observations over 6 weeks of their main growth stage, and we provide annotations for various tasks including 5 individual behaviors. Our pipeline chains detection, tracking and behavior classification combining deep and shallow computer vision techniques. While individual detections may be noisy, we show that long-term behavior changes can still be extracted reliably, and we validate these results qualitatively on the full dataset. Eventually, starting from raw RGB video data we are able to both tell what pigs main daily activities are, and how these change through time.

2021 Relazione in Atti di Convegno

Improving Car Model Classification through Vehicle Keypoint Localization

Authors: Simoni, Alessandro; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

In this paper, we present a novel multi-task framework which aims to improve the performance of car model classification leveraging … (Read full abstract)

In this paper, we present a novel multi-task framework which aims to improve the performance of car model classification leveraging visual features and pose information extracted from single RGB images. In particular, we merge the visual features obtained through an image classification network and the features computed by a model able to predict the pose in terms of 2D car keypoints. We show how this approach considerably improves the performance on the model classification task testing our framework on a subset of the Pascal3D dataset containing the car classes. Finally, we conduct an ablation study to demonstrate the performance improvement obtained with respect to a single visual classifier network.

2021 Relazione in Atti di Convegno

Multi-Category Mesh Reconstruction From Image Collections

Authors: Simoni, Alessandro; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a … (Read full abstract)

Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a single RGB image. However, current methods are trained on image collections of a single category in order to exploit specific priors, and they often make use of category-specific 3D templates. In this paper, we present an alternative approach that infers the textured mesh of objects combining a series of deformable 3D models and a set of instance-specific deformation, pose, and texture. Differently from previous works, our method is trained with images of multiple object categories using only foreground masks and rough camera poses as supervision. Without specific 3D templates, the framework learns category-level models which are deformed to recover the 3D shape of the depicted object. The instance-specific deformations are predicted independently for each vertex of the learned 3D mesh, enabling the dynamic subdivision of the mesh during the training process. Experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner. Predicted shapes are smooth and can leverage from multiple steps of subdivision during the training process, obtaining comparable or state-of-the-art results on two public datasets. Models and code are publicly released.

2021 Relazione in Atti di Convegno

RefiNet: 3D Human Pose Refinement with Depth Maps

Authors: D’Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely … (Read full abstract)

Human Pose Estimation is a fundamental task for many applications in the Computer Vision community and it has been widely investigated in the 2D domain, i.e. intensity images. Therefore, most of the available methods for this task are mainly based on 2D Convolutional Neural Networks and huge manually-annotated RGB datasets, achieving stunning results. In this paper, we propose RefiNet, a multi-stage framework that regresses an extremely-precise 3D human pose estimation from a given 2D pose and a depth map. The framework consists of three different modules, each one specialized in a particular refinement and data representation, i.e. depth patches, 3D skeleton and point clouds. Moreover, we present a new dataset, called Baracca, acquired with RGB, depth and thermal cameras and specifically created for the automotive context. Experimental results confirm the quality of the refinement procedure that largely improves the human pose estimations of off-the-shelf 2D methods.

2021 Relazione in Atti di Convegno

SHREC 2021: Skeleton-based hand gesture recognition in the wild

Authors: Caputo, Ariel; Giacchetti, Andrea; Soso, Simone; Pintani, Deborah; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Simoni, Alessandro; Vezzani, Roberto; Cucchiara, Rita; Ranieri, Andrea; Giannini, Franca; Lupinetti, Katia; Monti, Marina; Maghoumi, Mehran; Laviola Jr, Joseph; Le, Minh-Quan; Nguyen, Hai-Dang; Tran, Minh-Triet

Published in: COMPUTERS & GRAPHICS

This paper presents the results of the Eurographics 2019 SHape Retrieval Contest track on online gesture recognition. The goal of … (Read full abstract)

This paper presents the results of the Eurographics 2019 SHape Retrieval Contest track on online gesture recognition. The goal of this contest was to test state-of-the-art methods that can be used to online detect command gestures from hands' movements tracking on a basic benchmark where simple gestures are performed interleaving them with other actions. Unlike previous contests and benchmarks on trajectory-based gesture recognition, we proposed an online gesture recognition task, not providing pre-segmented gestures, but asking the participants to find gestures within recorded trajectories. The results submitted by the participants show that an online detection and recognition of sets of very simple gestures from 3D trajectories captured with a cheap sensor can be effectively performed. The best methods proposed could be, therefore, directly exploited to design effective gesture-based interfaces to be used in different contexts, from Virtual and Mixed reality applications to the remote control of home devices.

2021 Articolo su rivista

Video Frame Synthesis combining Conventional and Event Cameras

Authors: Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Published in: INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE

Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output … (Read full abstract)

Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output a corresponding stream of asynchronous events. Despite having multiple advantages with respect to conventional cameras, their use is limited due to the scarce compatibility of asynchronous event streams with traditional data processing and vision algorithms. In this regard, we present a framework that synthesizes RGB frames from the output stream of an event camera and an initial or a periodic set of color key-frames. The deep learning-based frame synthesis framework consists of an adversarial image-to-image architecture and a recurrent module. Two public event-based datasets, DDD17 and MVSEC, are used to obtain qualitative and quantitative per-pixel and perceptual results. In addition, we converted into event frames two additional wellknown datasets, namely Kitti and Cityscapes, in order to present semantic results, in terms of object detection and semantic segmentation accuracy. Extensive experimental evaluation confirm the quality and the capability of the proposed approach of synthesizing frame sequences from color key-frames and sequences of intermediate events.

2021 Articolo su rivista

A Transformer-Based Network for Dynamic Hand Gesture Recognition

Authors: D’Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their … (Read full abstract)

Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that the employment of a single active depth sensor, specifically the usage of depth maps and the surface normals estimated from them, achieves state-of-the-art results, overcoming all the methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available with common RGB-D devices, such as infrared and color data. We also assess the performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.

2020 Relazione in Atti di Convegno

Baracca: a Multimodal Dataset for Anthropometric Measurements in Automotive

Authors: Pini, Stefano; D'Eusanio, Andrea; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

2020 Relazione in Atti di Convegno

Face-from-Depth for Head Pose Estimation on Depth Images

Authors: Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Depth cameras allow to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination … (Read full abstract)

Depth cameras allow to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make unusable common RGB sensors. Therefore, we propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives as input three types of images and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image. We empirically demonstrate that this positively impacts the system performances. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method overcomes several recent state-of-art works based on both intensity and depth input data, running in real-time at more than 30 frames per second.

2020 Articolo su rivista

Page 3 of 13 • Total publications: 124