Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from Depth Maps

Authors: Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Published in: IEEE ROBOTICS AND AUTOMATION LETTERS

Knowing the exact 3D location of workers and robots in a collaborative environment enables several real applications, such as the detection of unsafe situations or the study of mutual interactions for statistical and social purposes. In this paper, we propose a non-invasive and light-invariant framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera. The method can be applied to any robot without requiring hardware access to its internal states. We introduce a novel representation of the predicted pose, namely Semi-Perspective Decoupled Heatmaps (SPDH), to accurately compute 3D joint locations in world coordinates by adapting efficient deep networks designed for 2D Human Pose Estimation. The proposed approach, which takes as input a depth representation based on XYZ coordinates, can be trained on synthetic depth data and applied to real-world settings without the need for domain adaptation techniques. To this end, we present the SimBa dataset, based on both synthetic and real depth images, and use it for the experimental evaluation. Results show that the proposed approach, combining a specific depth map representation with the SPDH, outperforms the current state of the art.
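
As a concrete illustration of the input representation mentioned above (a depth map converted to XYZ coordinates), the following sketch back-projects each pixel into camera-space coordinates with the standard pinhole model. It is a minimal sketch, not the paper's code, and the intrinsics in the usage comment are placeholder values.

```python
import numpy as np

def depth_to_xyz(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, into an XYZ image (H, W, 3)
    using the pinhole camera model. The intrinsics are assumed to be known."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    x = (u - cx) * depth / fx  # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy  # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

# Hypothetical usage with illustrative intrinsics:
# xyz = depth_to_xyz(depth_map, fx=570.3, fy=570.3, cx=320.0, cy=240.0)
```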

2022 Journal article

Exploiting and Transferring Prior Knowledge in Deep Learning Architectures

Authors: Porrello, Angelo

Over the last decade, Deep Learning has become a hot topic as well as a disruptive tool in the context of Machine Learning and Computer Vision. It rests on a learning paradigm in which data (e.g., videos acquired by surveillance cameras placed on a public road) play a crucial role. By exploiting a large number of examples, it is possible to learn complex, human-like tasks (e.g., recognizing anomalous actions in a video stream) with impressive results. However, if the availability of data is the greatest strength of Deep Learning techniques, it also hides their greatest weakness: the development of applications and services is often limited by this requirement, since acquiring and maintaining a huge amount of data are costly activities that demand expert personnel and suitable equipment. Nevertheless, the design of modern Deep Learning architectures offers several degrees of freedom, which can be exploited to mitigate the lack of training data, whether partial or complete. The underlying idea is to compensate for this lack by incorporating prior knowledge that humans (in particular, those who oversee and guide the learning process) hold about the domain at hand. Indeed, intrinsic rules and properties extend well beyond the training data and can often be identified and imposed on the learning model. If we consider image classification, the success of Convolutional Neural Networks (CNNs) over past solutions (such as Multilayer Neural Networks) can be mainly attributed to this practice. Indeed, the design principles of their fundamental building block (i.e., the convolution between two 2D signals) naturally reflect what we knew about natural images: the correlations among neighboring image regions provided a powerful intuition for the development of models as efficient and effective as CNNs still are. The purpose of this thesis concerns the investigation and proposal of new ways of modeling and injecting prior knowledge into Deep Learning architectures. Importantly, this discussion is transversal: it spans several data domains (e.g., images, videos, graph-structured data, etc.) and involves different levels of the overall pipeline. On this last point, the reader is guided through this research by the following threefold categorization: i) parameter-based approaches, which restrict the space of possible solutions to those regions that reflect the geometric properties of the data; ii) goal-driven approaches, which steer the learning process towards solutions that embody some advantageous properties; iii) data-driven approaches, which exploit the data to extract knowledge that is later used to condition the training algorithm. Along with a thorough description of the settings and the tools involved, we present extensive experimental results and ablation studies that demonstrate the value of the techniques proposed in this research.
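
To make the convolution example above concrete, here is a minimal PyTorch comparison (ours, not from the thesis) of the parameter counts of a convolutional layer versus a fully connected layer computing a map of the same size: weight sharing and locality, i.e., the prior knowledge about natural images, reduce the parameters by five orders of magnitude.

```python
import torch.nn as nn

# Both layers map a 3x32x32 image to 16 output maps of size 32x32.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # local, shared weights
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)       # no structural prior

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))   # 448 parameters
print(count(dense))  # 50,348,032 parameters
```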

2022 Doctoral thesis

SHREC 2022 track on online detection of heterogeneous gestures

Authors: Emporio, M.; Caputo, A.; Giachetti, A.; Cristani, M.; Borghi, G.; D'Eusanio, A.; Le, M. -Q.; Nguyen, H. -D.; Tran, M. -T.; Ambellan, F.; Hanik, M.; Nava-Yazdani, E.; Von Tycowicz, C.

Published in: COMPUTERS & GRAPHICS

This paper presents the outcomes of a contest organized to evaluate methods for the online recognition of heterogeneous gestures from sequences of 3D hand poses. The task is the detection of gestures belonging to a dictionary of 16 classes characterized by different pose and motion features. The dataset features continuous sequences of hand tracking data where the gestures are interleaved with non-significant motions. The data have been captured using the Hololens 2 finger tracking system in a realistic use-case of mixed reality interaction. The evaluation is based not only on the detection performance but also on the latency and the false positives, making it possible to assess the feasibility of practical interaction tools based on the proposed algorithms. The outcomes of the contest's evaluation demonstrate the need for further research to reduce recognition errors, while the computational cost of the proposed algorithms is already sufficiently low.
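
As a hedged sketch of this kind of online evaluation, the snippet below scores a stream of timestamped detections against annotated gesture intervals: a detection counts as correct only if its label matches and it fires within the gesture interval plus a latency tolerance; anything else is a false positive. The matching rules and the tolerance are our own assumptions, not the track's exact protocol.

```python
def score_online(detections, gestures, tolerance=0.5):
    """detections: list of (time_s, label); gestures: list of (start_s, end_s, label).
    Returns (true_positives, false_positives, mean_latency_s). Hypothetical rules."""
    matched, latencies, false_positives = set(), [], 0
    for t, label in detections:
        hit = next((i for i, (s, e, g) in enumerate(gestures)
                    if i not in matched and g == label and s <= t <= e + tolerance),
                   None)
        if hit is None:
            false_positives += 1
        else:
            matched.add(hit)
            latencies.append(t - gestures[hit][0])  # delay w.r.t. gesture onset
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return len(matched), false_positives, mean_latency
```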

2022 Journal article

Special Section on AI-empowered Multimedia Data Analytics for Smart Healthcare

Authors: Hossain, M. S.; Cucchiara, R.; Muhammad, G.; Tobon, D. P.; El Saddik, A.

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

2022 Journal article

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

Authors: Landi, Federico; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field require the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the agent could still count on its global knowledge about the scene while trying to adapt its internal representation to the current state of the environment. To take a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout in a fixed time budget. To this end, we collect a new dataset of occupancy maps starting from existing datasets of 3D spaces and generating a number of possible layouts for a single environment. This dataset can be employed in the popular Habitat simulator and is fully compliant with existing methods that employ reconstructed occupancy maps during navigation. Furthermore, we propose an exploration policy that can take advantage of previous knowledge of the environment and identify changes in the scene faster and more effectively than existing agents. Experimental results show that the proposed architecture outperforms existing state-of-the-art models for exploration in this new setting.
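
A minimal sketch (not the authors' code) of the comparison at the heart of the task: given the outdated occupancy map and the map reconstructed during exploration, the changed cells are simply the disagreement mask over the area observed so far, which a policy could use to prioritize navigation goals. The {0: free, 1: occupied} encoding is an assumption.

```python
import numpy as np

def changed_cells(outdated_map, observed_map, observed_mask):
    """Compare two occupancy grids of shape (H, W) with values {0: free, 1: occupied}.
    observed_mask marks the cells actually seen so far; only those can be judged."""
    disagreement = (outdated_map != observed_map) & observed_mask
    return np.argwhere(disagreement)  # (row, col) of each detected change
```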

2022 Conference proceedings paper

Temporal Alignment for History Representation in Reinforcement Learning

Authors: Ermolov, A.; Sangineto, E.; Sebe, N.

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Environments in Reinforcement Learning are usually only partially observable. To address this problem, a possible solution is to provide the agent with information about the past. However, providing complete observations of numerous steps can be excessive. Inspired by human memory, we propose to represent history with only important changes in the environment and, in our approach, to obtain this representation automatically using self-supervision. Our method (TempAl) aligns temporally-close frames, revealing a general, slowly varying state of the environment. This procedure is based on a contrastive loss, which pulls embeddings of nearby observations towards each other while pushing away other samples from the batch. It can be interpreted as a metric that captures the temporal relations of observations. We propose to combine the common instantaneous representation with our history representation, and we evaluate TempAl on all available Atari games from the Arcade Learning Environment. TempAl surpasses the instantaneous-only baseline in 35 environments out of 49. The source code of the method and of all the experiments is available at https://github.com/htdt/tempal.
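
The contrastive objective described above can be sketched in a few lines of PyTorch. This is an InfoNCE-style formulation under our own assumptions, not necessarily the paper's exact loss: each frame embedding should match the embedding of a temporally-close frame, with the rest of the batch acting as negatives.

```python
import torch
import torch.nn.functional as F

def temporal_alignment_loss(z_t, z_near, temperature=0.1):
    """z_t, z_near: (B, D) embeddings of frames and of temporally-close frames.
    Positives sit on the diagonal of the similarity matrix; other batch
    entries are pushed away, as in standard InfoNCE."""
    z_t, z_near = F.normalize(z_t, dim=1), F.normalize(z_near, dim=1)
    logits = z_t @ z_near.T / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)
```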

2022 Conference proceedings paper

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Authors: Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting (even of the same author over a wide time-span), and the scarcity of data from ancient, poorly represented languages. With the aim of fostering research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting that takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
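
A hedged sketch of the date-based configuration: lines are assigned to train or test according to the period in which they were written, so that test periods are unseen during training. The field name and the period boundaries below are illustrative, not the dataset's actual split.

```python
def split_by_period(samples, train_periods, test_periods):
    """samples: list of dicts with a 'year' field (plus image/transcription data).
    Each line goes to train or test depending on the period its year falls in."""
    in_any = lambda year, periods: any(lo <= year <= hi for lo, hi in periods)
    train = [s for s in samples if in_any(s["year"], train_periods)]
    test = [s for s in samples if in_any(s["year"], test_periods)]
    return train, test

# Illustrative usage: train on early decades, test on a later, unseen one.
# train, test = split_by_period(lines, [(1690, 1719)], [(1740, 1749)])
```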

2022 Conference proceedings paper

The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis

Authors: Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.
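
For readers who want to reproduce the basic setup, this minimal sketch uses OpenAI's clip package to extract image features that a captioning decoder could consume; the decoder itself is left as a placeholder, since the paper compares several architectures and CLIP variants.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # one of several CLIP variants

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    visual_features = model.encode_image(image)  # (1, 512) for ViT-B/32

# The features would then condition a language decoder, e.g.:
# caption = decoder.generate(visual_features)  # decoder is a placeholder
```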

2022 Conference proceedings paper

Transfer without Forgetting

Authors: Boschini, Matteo; Bonicelli, Lorenzo; Porrello, Angelo; Bellitto, Giovanni; Pennisi, Matteo; Palazzo, Simone; Spampinato, Concetto; Calderara, Simone

Published in: LECTURE NOTES IN COMPUTER SCIENCE

This work investigates the entanglement between Continual Learning (CL) and Transfer Learning (TL). In particular, we shed light on the widespread application of network pretraining, highlighting that it is itself subject to catastrophic forgetting. Unfortunately, this issue leads to the under-exploitation of knowledge transfer during later tasks. On this ground, we propose Transfer without Forgetting (TwF), a hybrid Continual Transfer Learning approach built upon a fixed pretrained sibling network, which continuously propagates the knowledge inherent in the source domain through a layer-wise loss term. Our experiments indicate that TwF steadily outperforms other CL methods across a variety of settings, with an average gain of 4.81% in Class-Incremental accuracy over different datasets and buffer sizes.
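
A minimal sketch, under our own simplifying assumptions, of a layer-wise transfer term in the spirit of TwF: a frozen pretrained sibling provides per-layer activations, and the continually trained network is penalized for drifting away from them. A plain MSE is used here for brevity; the paper's actual term is more elaborate.

```python
import torch.nn.functional as F

def layerwise_transfer_loss(student_feats, sibling_feats):
    """student_feats / sibling_feats: lists of per-layer activation tensors from
    the trainable network and from the fixed pretrained sibling, respectively."""
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, sibling_feats))

# During training (lambda_tf is a hypothetical weighting hyperparameter):
# total_loss = task_loss + lambda_tf * layerwise_transfer_loss(feats, sibling_feats)
```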

2022 Conference proceedings paper

Transform, Warp, and Dress: A New Transformation-Guided Model for Virtual Try-On

Authors: Fincato, Matteo; Cornia, Marcella; Landi, Federico; Cesari, Fabio; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

Virtual try-on has recently emerged in the computer vision and multimedia communities with the development of architectures that can generate realistic images of a target person wearing a custom garment. This research interest is motivated by the large role played by e-commerce and online shopping in our society. Indeed, the virtual try-on task can offer many opportunities to improve the efficiency of preparing fashion catalogs and to enhance the online user experience. The problem is far from being solved: current architectures do not reach sufficient accuracy with respect to manually generated images and can only be trained on image pairs with limited variety. Existing virtual try-on datasets have two main limits: they contain only female models, and all the images are available only in low resolution. This not only affects the generalization capabilities of the trained architectures but also makes deployment in real applications impractical. To overcome these issues, we present Dress Code, a new dataset for virtual try-on that contains high-resolution images of a large variety of upper-body clothes and both male and female models. Leveraging this enriched dataset, we propose a new model for virtual try-on capable of generating high-quality and photo-realistic images using a three-stage pipeline. The first two stages perform two different geometric transformations to warp the desired garment and make it fit the target person's body pose and shape. Then, we generate the new image of that same person wearing the try-on garment using a generative network. We test the proposed solution on the most widely used dataset for this task as well as on our newly collected dataset, and demonstrate its effectiveness when compared to current state-of-the-art methods. Through extensive analyses on our Dress Code dataset, we show the adaptability of our model, which can generate try-on images even at a higher resolution.
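
To ground the warping stages, here is a hedged sketch of the generic operation they rely on: once a geometric transform has produced a sampling grid, the garment image is warped onto the target pose with PyTorch's grid_sample. How the grid is predicted (e.g., by a learned thin-plate-spline regressor) is the model-specific part and is omitted here.

```python
import torch.nn.functional as F

def warp_garment(garment, grid):
    """garment: (B, 3, H, W) image tensor; grid: (B, H, W, 2) sampling grid with
    coordinates in [-1, 1], as produced by a learned geometric transformation."""
    return F.grid_sample(garment, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```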

2022 Journal article
