Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.



Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

Authors: Landi, Federico; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita


In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination guided only by a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotating, before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visual information and the linguistic description in an efficient way. Differently from some previous works that abstract from the agent perspective and use high-level navigation spaces, we design a policy which decodes the information provided by dynamic convolution into a series of low-level, agent-friendly actions. Results show that our model exploiting dynamic filters performs better than other architectures with traditional convolution, setting the new state of the art for embodied VLN in the low-level action space. Additionally, we categorize recent work on VLN according to its architectural choices and distinguish two main groups, which we call low-level action and high-level action models. To the best of our knowledge, we are the first to propose this analysis and categorization for VLN.
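
To make the dynamic-filter idea concrete, here is a minimal sketch (not the authors' implementation; the embedding size, the number of filters, and the linear filter generator are assumptions): the encoded instruction is reshaped into convolutional kernels that are slid over the visual feature map, so the resulting response maps depend on the language.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: 512-d instruction embedding, 512-channel visual features.
instr_emb = torch.randn(1, 512)            # encoded natural language instruction
visual_feat = torch.randn(1, 512, 7, 7)    # CNN feature map of the current view

# Generate dynamic filters from the instruction: here, 8 kernels of size 1x1x512.
filter_gen = torch.nn.Linear(512, 8 * 512)
dyn_filters = filter_gen(instr_emb).view(8, 512, 1, 1)
dyn_filters = F.normalize(dyn_filters.flatten(1), dim=1).view(8, 512, 1, 1)

# Convolve the visual features with the instruction-conditioned filters:
# the 8 response maps highlight regions relevant to the instruction.
response = F.conv2d(visual_feat, dyn_filters)   # shape: (1, 8, 7, 7)

# A policy head (not shown) would decode these responses into low-level
# actions such as "rotate left", "rotate right", or "step ahead".
print(response.shape)
```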

2019 Conference Proceedings

Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE


State-of-the-art Computer Vision pipelines show poor performance on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic and real images, on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks into photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results for the translation of landscapes, portraits, and paintings from four different styles, using automatic distance metrics. We also analyze the response of pre-trained architectures for classification, detection and segmentation, both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.
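
As a concrete illustration of the entropy analysis mentioned above, the sketch below (with assumed tensor shapes and purely synthetic logits, not the paper's evaluation code) averages the softmax entropy of a pre-trained classifier over a set of images; lower entropy on translated images would indicate that the network is more confident on the photo-realistic versions.

```python
import torch
import torch.nn.functional as F

def mean_prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average entropy of softmax predictions, one row of logits per image."""
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return entropy.mean()

# Hypothetical logits from a pre-trained classifier on 100 paintings
# before and after image-to-image translation (1000 ImageNet classes).
logits_paintings = torch.randn(100, 1000)
logits_translated = torch.randn(100, 1000) * 3.0   # sharper, more confident

print(mean_prediction_entropy(logits_paintings))
print(mean_prediction_entropy(logits_translated))
```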

2019 Conference Proceedings

M-VAD Names: a Dataset for Video Captioning with Naming

Authors: Pini, Stefano; Cornia, Marcella; Bolelli, Federico; Baraldi, Lorenzo; Cucchiara, Rita

Published in: MULTIMEDIA TOOLS AND APPLICATIONS


Current movie captioning architectures are not capable of mentioning characters by their proper names, replacing them with a generic "someone" tag. The lack of movie description datasets with visual annotations of characters surely plays a relevant role in this limitation. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this approach on videos outside the M-VAD Names dataset.
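
The naming task can be pictured as a substitution step downstream of a multimodal classifier; the helper below is a hypothetical sketch (the function name and interface are not part of the dataset's tooling) that assumes one predicted character identity per "someone" tag.

```python
import re

def fill_character_names(caption: str, predicted_names: list) -> str:
    """Replace each "someone" tag with the next predicted character name."""
    names = iter(predicted_names)
    return re.sub(r"\bsomeone\b", lambda _: next(names), caption, flags=re.IGNORECASE)

caption = "Someone opens the door while someone waits outside."
print(fill_character_names(caption, ["Harry", "Ron"]))
# -> "Harry opens the door while Ron waits outside."
```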

2019 Journal Article

Recognizing social relationships from an egocentric vision perspective

Authors: Alletto, Stefano; Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita


In this chapter we address the problem of partitioning social gatherings into interacting groups in egocentric scenarios. People in the scene are tracked, and their head pose and 3D location are estimated. Following the f-formation formalism, we use orientation and distance to define an inherently social pairwise feature capable of describing how two people stand in relation to one another. We present a Structural SVM based approach that learns how to weight each component of the feature vector depending on the social situation it is applied to. To better understand the social dynamics, we also estimate what we call the social relevance of each subject in a group using a saliency attentive model. Extensive tests on two publicly available datasets show that our solution achieves encouraging results when detecting social groups and their relevant subjects in challenging egocentric scenarios.
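
A simplified, hypothetical version of such a pairwise feature is sketched below (2D ground-plane positions and head yaw angles are assumed inputs; the actual feature and its Structural SVM weighting are described in the chapter):

```python
import numpy as np

def pairwise_social_feature(pos_a, pos_b, yaw_a, yaw_b):
    """Distance and relative head orientations between two people (2D sketch).

    pos_*: (x, z) ground-plane position in metres; yaw_*: head yaw in radians.
    Returns a small feature vector describing how the pair stand in relation
    to one another, in the spirit of the f-formation formalism.
    """
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    dist = np.linalg.norm(pos_b - pos_a)
    # Angle from a to b, and how much each person's gaze deviates from it.
    bearing = np.arctan2(pos_b[1] - pos_a[1], pos_b[0] - pos_a[0])
    dev_a = np.abs(np.angle(np.exp(1j * (yaw_a - bearing))))            # is a looking at b?
    dev_b = np.abs(np.angle(np.exp(1j * (yaw_b - (bearing + np.pi)))))  # is b looking at a?
    return np.array([dist, dev_a, dev_b])

print(pairwise_social_feature((0, 0), (1.2, 0), yaw_a=0.0, yaw_b=np.pi))
# Two people 1.2 m apart, facing each other: angular deviations close to zero.
```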

2019 Book Chapter

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION


Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable or explainable from the outside. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of both caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell.
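
A rough sketch of what a control signal and the resulting grounded chunks could look like is given below; the field names and values are illustrative placeholders, not the data format used by the released code at the repository above.

```python
# Hypothetical control signal: an ordered sequence of image regions (boxes),
# each to be described by one textual chunk, as in controllable captioning.
control_signal = [
    {"region_id": 3, "box": (34, 50, 210, 180), "label": "dog"},
    {"region_id": 7, "box": (220, 40, 400, 200), "label": "frisbee"},
]

# A chunk-level decoder would emit one noun-phrase chunk per region, in order,
# so the final caption is explicitly grounded on the requested regions.
chunks = ["a brown dog", "catching a red frisbee"]
caption = " ".join(chunks)

grounding = list(zip(chunks, (r["region_id"] for r in control_signal)))
print(caption)     # "a brown dog catching a red frisbee"
print(grounding)   # [('a brown dog', 3), ('catching a red frisbee', 7)]
```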

2019 Conference Proceedings

Towards Cycle-Consistent Models for Text and Image Retrieval

Authors: Cornia, Marcella; Baraldi, Lorenzo; Rezazadegan Tavakoli, Hamed; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE


Cross-modal retrieval has recently become a hot research topic, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space in which text and images can be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domain. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.
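
The cycle-consistency regularizer can be sketched with two toy linear translators between assumed 300-d textual and 2048-d visual feature spaces (a minimal illustration, not the paper's architecture or loss weighting):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy translators between the two modalities (dimensions are assumptions).
txt2img = nn.Linear(300, 2048)
img2txt = nn.Linear(2048, 300)

text_feat = torch.randn(32, 300)     # batch of encoded sentences
image_feat = torch.randn(32, 2048)   # batch of paired image features

# Translation losses: each modality mapped onto its paired counterpart.
translation_loss = F.mse_loss(txt2img(text_feat), image_feat) + \
                   F.mse_loss(img2txt(image_feat), text_feat)

# Cycle-consistency: translating to the other domain and back should
# reconstruct the original feature.
cycle_loss = F.mse_loss(img2txt(txt2img(text_feat)), text_feat) + \
             F.mse_loss(txt2img(img2txt(image_feat)), image_feat)

loss = translation_loss + 0.1 * cycle_loss   # the weight is a placeholder
print(loss.item())
```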

2019 Conference Proceedings

Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach

Authors: Carraggi, Angelo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE


Visual-semantic embeddings have been extensively used as a powerful model for cross-modal retrieval of images and sentences. In this setting, data coming from different modalities can be projected into a common embedding space, in which distances can be used to infer the similarity between pairs of images and sentences. While this approach has shown impressive performance in fully supervised settings, its application to semi-supervised scenarios has rarely been investigated. In this paper we propose a domain adaptation model for cross-modal retrieval, in which the knowledge learned from a supervised dataset can be transferred to a target dataset in which the pairing between images and sentences is not known, or not useful for training due to the limited size of the set. Experiments are performed on two unsupervised target scenarios, related to the fashion and cultural heritage domains respectively. Results show that our model is able to effectively transfer the knowledge learned on ordinary visual-semantic datasets, achieving promising results. As an additional contribution, we collect and release the dataset used for the cultural heritage domain.
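
The supervised side of such a model typically builds on a hinge-based triplet ranking loss over the joint embedding; the sketch below shows that loss in isolation, with assumed embedding sizes and margin (how the unpaired target data is exploited is the paper's contribution and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                   # cosine similarities
    pos = scores.diag().view(-1, 1)                  # matching pairs on the diagonal
    cost_caption = (margin + scores - pos).clamp(min=0)      # wrong captions for each image
    cost_image = (margin + scores - pos.t()).clamp(min=0)    # wrong images for each caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_caption.masked_fill(mask, 0).sum() + cost_image.masked_fill(mask, 0).sum()

print(triplet_ranking_loss(torch.randn(16, 1024), torch.randn(16, 1024)))
```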

2019 Conference Proceedings

What was Monet seeing while painting? Translating artworks to photo-realistic images

Authors: Tomei, Matteo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE


State-of-the-art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This leads to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images into realistic copies. In this paper, we present a model capable of translating paintings into photo-realistic images, trained without paired examples. The idea is to enforce a patch-level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on translation tasks for Monet, Cezanne and Van Gogh paintings, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.
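
The patch-level similarity can be pictured as a nearest-neighbour lookup into a memory bank of real patches; the sketch below uses cosine similarity over flattened patches, with sizes chosen only for illustration (the actual losses and patch extraction in the paper differ):

```python
import torch
import torch.nn.functional as F

def nearest_real_patches(gen_patches: torch.Tensor, memory_bank: torch.Tensor):
    """For each generated patch, find its most similar patch in a bank of real patches.

    gen_patches:  (N, D) flattened generated patches
    memory_bank:  (M, D) flattened patches cropped from real photographs
    Returns the retrieved real patches and their cosine similarities, which
    could drive the generator towards photo-realistic local detail.
    """
    gen = F.normalize(gen_patches, dim=1)
    bank = F.normalize(memory_bank, dim=1)
    sims = gen @ bank.t()                      # (N, M) cosine similarity
    best_sim, best_idx = sims.max(dim=1)
    return memory_bank[best_idx], best_sim

gen = torch.randn(64, 3 * 16 * 16)       # 64 generated 16x16 RGB patches
bank = torch.randn(5000, 3 * 16 * 16)    # memory bank of real patches
retrieved, sim = nearest_real_patches(gen, bank)
print(retrieved.shape, sim.mean())
```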

2019 Conference Proceedings

A Hierarchical Quasi-Recurrent approach to Video Captioning

Authors: Bolelli, Federico; Baraldi, Lorenzo; Grana, Costantino


Video captioning has attracted considerable attention thanks to the use of Recurrent Neural Networks, which can be employed both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and exploit the layered structure of the video. Differently from the established encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose to employ Quasi-Recurrent Neural Networks, extending their basic cell with a boundary detector which can recognize discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We assess our approach on a large-scale dataset, the Montreal Video Annotation dataset. Experiments demonstrate that our approach can discover suitable levels of representation of the input information, while reducing the computational requirements.
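
A simplified sketch of the boundary-aware recurrence is given below: a per-timestep boundary probability cuts the temporal connection so that a new segment starts from a fresh state. Shapes, gating details, and how boundaries are predicted are assumptions for illustration, not the paper's exact cell.

```python
import torch
import torch.nn as nn

class BoundaryAwarePooling(nn.Module):
    """Simplified QRNN-style pooling with a boundary gate (a sketch, not the paper's cell).

    z, f are candidate and forget activations computed convolutionally over the
    frame sequence; b is a per-timestep boundary probability that, when high,
    cuts the temporal connection so a new segment starts fresh.
    """

    def forward(self, z, f, b):
        # z, f: (batch, time, hidden); b: (batch, time, 1); f, b in [0, 1]
        h = torch.zeros_like(z[:, 0])
        outputs = []
        for t in range(z.size(1)):
            h = h * (1.0 - b[:, t])                       # reset state at boundaries
            h = f[:, t] * h + (1.0 - f[:, t]) * z[:, t]   # forget-gated pooling
            outputs.append(h)
        return torch.stack(outputs, dim=1)

B, T, H = 2, 10, 32
pool = BoundaryAwarePooling()
out = pool(torch.tanh(torch.randn(B, T, H)),
           torch.sigmoid(torch.randn(B, T, H)),
           torch.sigmoid(torch.randn(B, T, 1)))
print(out.shape)   # (2, 10, 32)
```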

2018 Conference Proceedings

Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities

Authors: Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita


While several approaches to bring vision and language together are emerging, none of them has yet addressed the digital humanities domain, which, nevertheless, is a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts of the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and suggest future work along the same lines.
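
Once a joint embedding is learned, the alignment step itself reduces to nearest-neighbour matching; the sketch below assumes pre-computed illustration and commentary embeddings and uses cosine similarity (a hypothetical helper, not the paper's pipeline):

```python
import numpy as np

def align_illustrations_to_text(illus_emb: np.ndarray, text_emb: np.ndarray):
    """Assign each illustration to its most similar textual commentary.

    illus_emb: (N, D) embeddings of manuscript illustrations
    text_emb:  (M, D) embeddings of textual commentaries
    Returns, for each illustration, the index of the best-matching commentary.
    """
    illus = illus_emb / np.linalg.norm(illus_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = illus @ text.T          # cosine similarity matrix
    return sims.argmax(axis=1)

matches = align_illustrations_to_text(np.random.randn(5, 256), np.random.randn(40, 256))
print(matches)   # one commentary index per illustration
```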

2018 Conference Proceedings
