Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation

Authors: Fabbri, Matteo; Lanzi, Fabio; Calderara, Simone; Alletto, Stefano; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene.
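
As a rough illustration of the representation involved (not the authors' implementation), the sketch below builds a single-joint volumetric heatmap by placing a 3D Gaussian in a voxel grid; the grid size, normalized-coordinate convention, and sigma are hypothetical choices:

```python
import numpy as np

def volumetric_heatmap(joint_xyz, grid=(32, 32, 32), sigma=1.5):
    """Place a 3D Gaussian at a joint location inside a voxel grid.

    joint_xyz: normalized coordinates in [0, 1]^3 (hypothetical convention).
    Returns a (D, H, W) array peaking at the joint's voxel.
    """
    zs, ys, xs = np.meshgrid(
        np.arange(grid[0]), np.arange(grid[1]), np.arange(grid[2]),
        indexing="ij",
    )
    # Map normalized coordinates to (possibly fractional) voxel centers.
    cz, cy, cx = [c * (g - 1) for c, g in zip(joint_xyz, grid)]
    d2 = (zs - cz) ** 2 + (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

hm = volumetric_heatmap((0.4, 0.25, 0.75))
peak = np.unravel_index(np.argmax(hm), hm.shape)
print(peak)  # voxel nearest the normalized joint position
```

For K joints this representation grows as K x D x H x W, which is exactly the size problem the paper's autoencoder-based compression addresses.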

2020 Relazione in Atti di Convegno

Conditional Channel Gated Networks for Task-Aware Continual Learning

Authors: Abati, Davide; Tomczak, Jakub; Blankevoort, Tijmen; Calderara, Simone; Cucchiara, Rita; Bejnordi, Babak Ehteshami

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

2020 Relazione in Atti di Convegno

Explaining Digital Humanities by Aligning Images and Textual Descriptions

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

Published in: PATTERN RECOGNITION LETTERS

Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and the Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, when realistic images and sentences with simple semantics are employed and when paired training data is provided. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only advertises more complex visual and semantic structures but also features a significant lack of training data which makes the use of fully-supervised approaches infeasible. With this aim, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision. This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.
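
The paper's unpaired, transfer-based objective is not reproduced here, but the hinge-based triplet ranking loss that underlies most visual-semantic embeddings can be sketched as follows (a generic formulation with a hypothetical margin; it assumes matching image/text pairs share a row index):

```python
import numpy as np

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge ranking loss over cosine similarities.

    A generic sketch, not the paper's exact objective: each image is
    pushed closer to its caption than to any other caption (and vice
    versa) by at least `margin`.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                 # pairwise cosine similarities
    pos = np.diag(sim)                # matching-pair scores
    cost_img = np.maximum(0.0, margin + sim - pos[:, None])  # image -> captions
    cost_txt = np.maximum(0.0, margin + sim - pos[None, :])  # caption -> images
    n = sim.shape[0]
    mask = ~np.eye(n, dtype=bool)     # exclude the positive pairs themselves
    return (cost_img[mask].sum() + cost_txt[mask].sum()) / n

loss = triplet_ranking_loss(np.eye(3), np.eye(3))
print(loss)  # perfectly aligned orthogonal pairs -> 0.0
```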

2020 Articolo su rivista

Face-from-Depth for Head Pose Estimation on Depth Images

Authors: Borghi, Guido; Fabbri, Matteo; Vezzani, Roberto; Calderara, Simone; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Depth cameras enable reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination renders common RGB sensors unusable. We therefore propose a complete framework for estimating head and shoulder pose from depth images only. A head detection and localization module is also included, yielding a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives three types of images as input and outputs the 3D angles of the pose. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image, and we empirically demonstrate that this positively impacts system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method outperforms several recent state-of-the-art approaches based on both intensity and depth input data, running in real time at more than 30 frames per second.
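
The "3D angles of the pose" output can be made concrete with a small sketch: composing yaw/pitch/roll into a rotation matrix, the standard way such regressed angles are consumed downstream. The Z-Y-X intrinsic axis convention here is an illustrative assumption, not necessarily the one used by POSEidon+:

```python
import numpy as np

def ypr_to_matrix(yaw, pitch, roll):
    """Compose a rotation matrix from yaw/pitch/roll (radians).

    Uses a Z-Y-X intrinsic convention (an assumption for illustration).
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])   # yaw
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll
    return Rz @ Ry @ Rx

R = ypr_to_matrix(0.1, -0.2, 0.05)
print(np.allclose(R @ R.T, np.eye(3)))  # rotation matrices are orthonormal
```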

2020 Articolo su rivista

Mercury: a vision-based framework for Driver Monitoring

Authors: Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

In this paper, we propose a complete framework, namely Mercury, that combines Computer Vision and Deep Learning algorithms to continuously monitor the driver during the driving activity. The proposed solution complies with the requirements imposed by the challenging automotive context. First, light invariance, in order to have a system able to work regardless of the time of day and the weather conditions: therefore, infrared-based images, i.e. depth maps (in which each pixel corresponds to the distance between the sensor and that point in the scene), have been exploited in conjunction with traditional intensity images. Second, non-invasiveness, since the driver's movements must not be impeded during the driving activity: in this context, the use of cameras and vision-based algorithms is one of the best solutions. Finally, real-time performance, since a monitoring system must react immediately as soon as a situation of potential danger is detected.

2020 Relazione in Atti di Convegno

Meshed-Memory Transformer for Image Captioning

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performances when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
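
The full model lives in the linked repository; purely as a sketch of the memory idea (learned key/value slots appended to the region-derived keys and values before standard scaled dot-product attention), with all shapes and initializations hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(queries, keys, values, mem_k, mem_v):
    """Scaled dot-product attention whose keys/values are extended with
    learned "memory" slots encoding a priori knowledge.

    Shapes: queries (n, d); keys/values (m, d); mem_k/mem_v (s, d).
    """
    k = np.concatenate([keys, mem_k], axis=0)   # (m + s, d)
    v = np.concatenate([values, mem_v], axis=0)
    d = queries.shape[-1]
    attn = softmax(queries @ k.T / np.sqrt(d))  # attend over regions + memory
    return attn @ v

rng = np.random.default_rng(0)
n, d, s = 4, 8, 2
x = rng.normal(size=(n, d))        # region features (stand-ins)
mem_k = rng.normal(size=(s, d))    # "learned" memory keys, randomly
mem_v = rng.normal(size=(s, d))    # initialized here for illustration
out = memory_attention(x, x, x, mem_k, mem_v)
print(out.shape)  # (4, 8)
```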

2020 Relazione in Atti di Convegno

Multimodal Hand Gesture Classification for the Human-Car Interaction

Authors: D’Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Published in: INFORMATICS

2020 Articolo su rivista

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performances on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.

2020 Relazione in Atti di Convegno

Welcome message from the general chairs

Authors: Chen, C. W.; Cucchiara, R.; Hua, X.-S.

2020 Relazione in Atti di Convegno

Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.
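
As a toy sketch of the class-constrained similarity matching step (the feature extraction and the weakly supervised semantic labeling are assumed to have happened elsewhere; all names and shapes are hypothetical):

```python
import numpy as np

def retrieve_patch(art_feat, real_feats, real_classes, target_class):
    """Return the index of the real-photo patch most similar (cosine)
    to an artwork patch, restricted to the same semantic class."""
    mask = real_classes == target_class
    cands = real_feats[mask]
    a = art_feat / np.linalg.norm(art_feat)
    c = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    best_local = int(np.argmax(c @ a))
    return int(np.flatnonzero(mask)[best_local])

# Tiny demo: four real patches, two classes; the query should match
# patch 3 (same class "sky", closest feature direction).
real_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]])
real_classes = np.array(["tree", "sky", "sky", "sky"])
idx = retrieve_patch(np.array([1.0, 0.2]), real_feats, real_classes, "sky")
print(idx)
```

Restricting retrieval to the same semantic class is what keeps, say, a painted sky from being "realized" with grass texture.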

2019 Relazione in Atti di Convegno
