Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Improving Car Model Classification through Vehicle Keypoint Localization

Authors: Simoni, Alessandro; D'Eusanio, Andrea; Pini, Stefano; Borghi, Guido; Vezzani, Roberto

In this paper, we present a novel multi-task framework which aims to improve the performance of car model classification by leveraging visual features and pose information extracted from single RGB images. In particular, we merge the visual features obtained through an image classification network and the features computed by a model able to predict the pose in terms of 2D car keypoints. We show how this approach considerably improves the performance on the model classification task by testing our framework on a subset of the Pascal3D dataset containing the car classes. Finally, we conduct an ablation study to demonstrate the performance improvement obtained with respect to a single visual classifier network.

2021 Conference proceedings paper

Improving Indoor Semantic Segmentation with Boundary-level Objectives

Authors: Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been partially under-investigated, despite being a relevant task with applications in augmented reality, image retrieval, and personalized robotics. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we develop and propose two novel boundary-level training objectives, which foster the generation of accurate boundaries between different semantic classes. In particular, we take inspiration from the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries, and propose modified geometric distance functions that improve predictions at the boundary level. Through experiments on the NYUDv2 dataset, we assess the appropriateness of our proposal in terms of accuracy and quality of boundary prediction and demonstrate its accuracy gain.

2021 Conference proceedings paper

L'intelligenza non è artificiale. La rivoluzione tecnologica che sta già cambiando il mondo

Authors: Cucchiara, Rita

2021 Monograph/Scientific treatise

Learning to Read L'Infinito: Handwritten Text Recognition with Synthetic Training Data

Authors: Cascianelli, Silvia; Cornia, Marcella; Baraldi, Lorenzo; Piazzi, Maria Ludovica; Schiuma, Rosiana; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Deep learning-based approaches to Handwritten Text Recognition (HTR) have shown remarkable results on publicly available large datasets, both modern and historical. However, it is often the case that historical manuscripts are preserved in small collections, most of the time with unique characteristics in terms of paper support, author handwriting style, and language. State-of-the-art HTR approaches struggle to obtain good performance on such small manuscript collections, for which few training samples are available. In this paper, we focus on HTR on small historical datasets and propose a new historical dataset, which we call Leopardi, with the typical characteristics of small manuscript collections, consisting of letters by the poet Giacomo Leopardi, and devise strategies to deal with the training data scarcity scenario. In particular, we explore the use of carefully designed but cost-effective synthetic data for pre-training HTR models to be applied to small single-author manuscripts. Extensive experiments validate the suitability of the proposed approach, and both the Leopardi dataset and synthetic data will be made available to foster further research in this direction.

2021 Conference proceedings paper

Learning to Select: A Fully Attentive Approach for Novel Object Captioning

Authors: Cagrandi, Marco; Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase. In this paper, we present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set, and to constrain the generative process of a language model accordingly. Our architecture is fully-attentive and end-to-end trainable, also when incorporating constraints. We perform experiments on the held-out COCO dataset, where we demonstrate improvements over the state of the art, both in terms of adaptability to novel objects and caption quality.

2021 Conference proceedings paper

Metric-Learning-Based Deep Hashing Network for Content-Based Retrieval of Remote Sensing Images

Authors: Roy, Subhankar; Sangineto, Enver; Demir, Begum; Sebe, Nicu

Published in: IEEE GEOSCIENCE AND REMOTE SENSING LETTERS

Hashing methods have recently been shown to be very effective in the retrieval of remote sensing (RS) images due to their computational efficiency and fast search speed. Common hashing methods in RS are based on hand-crafted features on top of which they learn a hash function, which provides the final binary codes. However, these features are not optimized for the final task (i.e., retrieval using binary codes). On the other hand, modern deep neural networks (DNNs) have shown impressive success in learning optimized features for a specific task in an end-to-end fashion. Unfortunately, typical RS data sets are composed of only a small number of labeled samples, which makes the training (or fine-tuning) of big DNNs problematic and prone to overfitting. To address this problem, in this letter, we introduce a metric-learning-based hashing network, which: 1) implicitly uses a big, pretrained DNN as an intermediate representation step without the need for retraining or fine-tuning; 2) learns a semantic-based metric space where the features are optimized for the target retrieval task; and 3) computes compact binary hash codes for fast search. Experiments carried out on two RS benchmarks highlight that the proposed network significantly improves the retrieval performance under the same retrieval time when compared to the state-of-the-art hashing methods in RS.

2021 Journal article

MOTSynth: How Can Synthetic Data Help Pedestrian Detection and Tracking?

Authors: Fabbri, Matteo; Braso, Guillem; Maugeri, Gianluca; Cetintas, Orcun; Gasparini, Riccardo; Osep, Aljosa; Calderara, Simone; Leal-Taixe, Laura; Cucchiara, Rita

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

2021 Conference proceedings paper

Multi-Category Mesh Reconstruction From Image Collections

Authors: Simoni, Alessandro; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

Recently, learning frameworks have shown the capability of inferring the accurate shape, pose, and texture of an object from a single RGB image. However, current methods are trained on image collections of a single category in order to exploit specific priors, and they often make use of category-specific 3D templates. In this paper, we present an alternative approach that infers the textured mesh of objects by combining a series of deformable 3D models and a set of instance-specific deformation, pose, and texture. Unlike previous works, our method is trained with images of multiple object categories using only foreground masks and rough camera poses as supervision. Without specific 3D templates, the framework learns category-level models which are deformed to recover the 3D shape of the depicted object. The instance-specific deformations are predicted independently for each vertex of the learned 3D mesh, enabling the dynamic subdivision of the mesh during the training process. Experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner. Predicted shapes are smooth and can benefit from multiple steps of subdivision during the training process, obtaining comparable or state-of-the-art results on two public datasets. Models and code are publicly released.

2021 Conference proceedings paper

Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

Authors: Landi, Federico; Baraldi, Lorenzo; Cornia, Marcella; Corsini, Massimiliano; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities -- natural language, images, and low-level actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perceptual modalities. We experimentally validate our model on two datasets: PTA achieves promising results in low-level VLN on R2R and achieves good performance in the recently proposed R4R benchmark. Our code is publicly available at https://github.com/aimagelab/perceive-transform-and-act.

2021 Journal article

Optimizing Quality Inspection and Control in Powder Bed Metal Additive Manufacturing: Challenges and Research Directions

Authors: Di Cataldo, Santa; Vinco, Sara; Urgese, Gianvito; Calignano, Flaviana; Ficarra, Elisa; Macii, Alberto; Macii, Enrico

Published in: PROCEEDINGS OF THE IEEE

2021 Journal article