Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Matching Faces and Attributes Between the Artistic and the Real Domain: the PersonArt Approach

Authors: Cornia, Marcella; Tomei, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

In this article, we present an approach for retrieving similar faces between the artistic and the real domain. The application we refer to is an interactive exhibition inside a museum, in which visitors can take a photo of themselves and search for a lookalike in the collection of paintings. The task requires not only identifying faces but also extracting discriminative features from artistic and photo-realistic images, tackling a significant domain shift. Our method integrates feature extraction networks which account for the aesthetic similarity of two faces and their correspondences in terms of semantic attributes. Also, it addresses the domain shift between realistic images and paintings by translating photo-realistic images into the artistic domain. Notably, by exploiting the same technique, our model does not need to rely on annotated data in the artistic domain. Experiments are conducted on different paired datasets to show the effectiveness of the proposed solution in terms of identity and attribute preservation. The approach is also evaluated on unpaired settings and in combination with an interactive relevance feedback strategy. Finally, we show how the proposed algorithm has been implemented in a real showcase at the Gallerie Estensi museum in Italy, with the participation of more than 1,100 visitors in just three days.

2022 Journal article

Retrieval-Augmented Transformer for Image Captioning

Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.

2022 Conference proceedings paper

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

Authors: Landi, Federico; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field require agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the agent could still count on its global knowledge about the scene while trying to adapt its internal representation to the current state of the environment. To make a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout in a fixed time budget. To this end, we collect a new dataset of occupancy maps starting from existing datasets of 3D spaces and generating a number of possible layouts for a single environment. This dataset can be employed in the popular Habitat simulator and is fully compliant with existing methods that employ reconstructed occupancy maps during navigation. Furthermore, we propose an exploration policy that can take advantage of previous knowledge of the environment and identify changes in the scene faster and more effectively than existing agents. Experimental results show that the proposed architecture outperforms existing state-of-the-art models for exploration in this new setting.

2022 Conference proceedings paper

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Authors: Cascianelli, Silvia; Pippi, Vittorio; Maarand, Martin; Cornia, Marcella; Baraldi, Lorenzo; Kermorvant, Christopher; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting – even of the same author over a wide time-span – and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.

2022 Conference proceedings paper

The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis

Authors: Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.

2022 Conference proceedings paper

A Novel Attention-based Aggregation Function to Combine Vision and Language

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements - like regions and words - proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.

2021 Conference proceedings paper

Assessing the Role of Boundary-level Objectives in Indoor Semantic Segmentation

Authors: Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Providing fine-grained and accurate segmentation maps of indoor scenes is a challenging task with relevant applications in the fields of augmented reality, image retrieval, and personalized robotics. While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been partially under-investigated. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we focus on the analysis of boundary-level objectives, which foster the generation of fine-grained boundaries between different semantic classes and which have never been explored in the case of indoor segmentation. In particular, we test and devise variants of both the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries. Through experiments on the NYUDv2 dataset, we quantify the role of such losses in terms of accuracy and quality of boundary prediction and demonstrate the accuracy gain of the proposed variants.

2021 Conference proceedings paper

Estimating (and fixing) the Effect of Face Obfuscation in Video Recognition

Authors: Tomei, Matteo; Baraldi, Lorenzo; Bronzin, Simone; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

2021 Conference proceedings paper

Explore and Explain: Self-supervised Navigation and Recounting

Authors: Bigazzi, Roberto; Landi, Federico; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

2021 Conference proceedings paper

Improving Indoor Semantic Segmentation with Boundary-level Objectives

Authors: Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been partially under-investigated, despite being a relevant task with applications in augmented reality, image retrieval, and personalized robotics. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we develop and propose two novel boundary-level training objectives, which foster the generation of accurate boundaries between different semantic classes. In particular, we take inspiration from the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries, and propose modified geometric distance functions that improve predictions at the boundary level. Through experiments on the NYUDv2 dataset, we assess the appropriateness of our proposal in terms of accuracy and quality of boundary prediction and demonstrate its accuracy gain.

2021 Conference proceedings paper
