Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Authors: Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: SENSORS

Research related to fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of an external textual memory that can be accessed through k-nearest-neighbor (kNN) searches. From an architectural point of view, the proposed transformer can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD), which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and of the architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
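
As a rough illustration of the retrieval-augmented mechanism described above, the sketch below implements a gated cross-attention read over kNN-retrieved memory entries in PyTorch. It shows only the general idea, not the paper's exact architecture; all module and parameter names are hypothetical.

```python
# Minimal sketch of retrieval-augmented cross-attention with a learned gate.
# Illustrates the general idea (a kNN-retrieved external memory read through
# cross-attention, modulated by a gate), NOT the authors' exact architecture.
import torch
import torch.nn as nn

class GatedMemoryCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate conditioned on both the query states and the retrieved summary.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, hidden, memory):
        # hidden: (B, T, d) decoder states; memory: (B, K, d) retrieved items.
        retrieved, _ = self.attn(hidden, memory, memory)  # cross-attention read
        g = self.gate(torch.cat([hidden, retrieved], dim=-1))
        return hidden + g * retrieved                     # gated residual fusion

x = torch.randn(2, 10, 512)  # toy decoder states
m = torch.randn(2, 5, 512)   # toy retrieved memory entries
print(GatedMemoryCrossAttention()(x, m).shape)  # torch.Size([2, 10, 512])
```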

2023 Journal article

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims to provide a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
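
To make the surveyed encoder-decoder pipeline concrete, here is a minimal PyTorch sketch of the generic design the survey covers: precomputed visual features act as cross-attention memory for an autoregressive transformer decoder. Sizes and names are illustrative assumptions, not any specific surveyed model.

```python
# Minimal sketch of the generic "visual encoder + language model" captioning
# pipeline: image features are encoded once, and a transformer decoder
# attends to them while generating tokens. Purely illustrative.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, visual_feats):
        # tokens: (B, T) caption prefix; visual_feats: (B, R, d) region features.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(self.embed(tokens), visual_feats, tgt_mask=causal)
        return self.lm_head(h)  # next-token logits, (B, T, vocab)

logits = TinyCaptioner()(torch.randint(0, 10000, (2, 7)), torch.randn(2, 36, 512))
print(logits.shape)  # torch.Size([2, 7, 10000])
```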

2023 Journal article

Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Authors: Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

Controllable image captioning has recently gained attention as a way to increase the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which must be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.
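
A toy sketch of the control mechanism at the heart of this family of models: the decoder focuses on one region of the ordered control signal at a time and learns when to shift to the next. This mirrors a common region-controlled generation scheme (a learned shift gate over the ordered regions), not the paper's exact attention operators; every name below is hypothetical.

```python
# Minimal sketch of region-controlled decoding: at each step the decoder
# reads the "current" control region and emits a shift probability that
# advances a pointer over the ordered region sequence. Illustrative only.
import torch
import torch.nn as nn

class RegionControlledStep(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.shift = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, hidden, regions, ptr):
        # hidden: (B, d) decoder state; regions: (B, N, d) ordered control
        # signal; ptr: (B,) index of the region currently being described.
        current = regions[torch.arange(regions.size(0)), ptr]  # (B, d)
        p_shift = self.shift(hidden + current).squeeze(-1)     # (B,)
        # Advance the pointer once the model deems the region fully described.
        ptr = torch.where(p_shift > 0.5,
                          (ptr + 1).clamp(max=regions.size(1) - 1), ptr)
        return hidden + current, ptr

step = RegionControlledStep()
h, ptr = step(torch.randn(2, 512), torch.randn(2, 4, 512),
              torch.zeros(2, dtype=torch.long))
print(ptr)  # pointer after one decoding step
```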

2023 Journal article

Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

Authors: Betti, Federico; Staiano, Jacopo; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

2023 Conference proceedings paper

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Authors: Sarto, Sara; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

The CLIP model has recently been proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language models. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S), which unifies, in a novel way, the learning of a contrastive visual-semantic space with the addition of generated images and texts on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. We publicly release our source code and trained models.
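
For context, the snippet below computes a plain CLIP-Score-style similarity between an image and a candidate caption with the Hugging Face CLIP implementation. PAC-S builds on this idea but additionally trains the embedding space with positive-augmented contrastive learning, which this sketch does not reproduce.

```python
# Minimal sketch of a CLIP-Score-style reference-free caption metric:
# cosine similarity between CLIP image and text embeddings. PAC-S refines
# the embedding space; this snippet shows only the plain scoring step.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_style_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()  # cosine similarity in [-1, 1]

# Usage: score = clip_style_score(Image.open("photo.jpg"), "a dog on grass")
```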

2023 Conference proceedings paper

Sharing Cultural Heritage—The Case of the Lodovico Media Library

Authors: Al Kalak, Matteo; Baraldi, Lorenzo

Published in: MULTIMODAL TECHNOLOGIES AND INTERACTION

2023 Journal article

Superpixel Positional Encoding to Improve ViT-based Semantic Segmentation Models

Authors: Amoroso, Roberto; Tomei, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

2023 Conference proceedings paper

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate accurate and detailed textual descriptions of input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and on using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.
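
The augmentation recipe itself is easy to sketch: generate images from existing captions with an off-the-shelf latent diffusion model and add the synthetic pairs to the training pool. The checkpoint and mixing strategy below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the synthetic-augmentation recipe: generate images from
# existing captions with a latent diffusion model and add the resulting
# (synthetic image, caption) pairs to the captioning training set.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

captions = ["a plate of pasta on a wooden table",
            "two children flying a kite on the beach"]

synthetic_pairs = []
for caption in captions:
    image = pipe(caption, num_inference_steps=30).images[0]  # PIL.Image
    synthetic_pairs.append((image, caption))  # extra training pair

# synthetic_pairs can now be mixed with the real image-text pairs when
# building training batches for the captioning model.
```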

2023 Conference proceedings paper

Towards Explainable Navigation and Recounting

Authors: Poppi, Samuele; Rawal, Niyati; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Explainability and interpretability of deep neural networks have become of crucial importance over the years in Computer Vision, concurrently with the need to understand increasingly complex models. This necessity has fostered research on approaches that facilitate human comprehension of neural methods. In this work, we propose an explainable setting for visual navigation, in which an autonomous agent needs to explore an unseen indoor environment while portraying and explaining interesting scenes with natural language descriptions. We combine recent advances in ongoing research fields, employing an explainability method on images generated through agent-environment interaction. Our approach uses explainable maps to visualize model predictions and highlight the correlation between the observed entities and the generated words, to focus on prominent objects encountered during the environment exploration. The experimental section demonstrates that our approach can identify the regions of the images that the agent concentrates on to describe its point of view, improving explainability.
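
As a minimal, generic illustration of such word-to-pixel explanations, the sketch below computes a vanilla gradient saliency map for one generated word. It is a stand-in for, not a reproduction of, the specific explainability method used in the paper.

```python
# Minimal sketch of a gradient-based "explainable map": the score of one
# generated word is backpropagated to the input image, and the absolute
# gradient serves as a saliency map over pixels. Generic illustration only.
import torch

def word_saliency(model, image, word_index):
    # model: callable mapping a (1, 3, H, W) image to (1, vocab) word scores.
    image = image.clone().requires_grad_(True)
    score = model(image)[0, word_index]
    score.backward()
    return image.grad.abs().sum(dim=1).squeeze(0)  # (H, W) saliency map

# Toy check with a stand-in "model" (a flattened linear word scorer):
toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 100))
sal = word_saliency(toy, torch.randn(1, 3, 32, 32), word_index=7)
print(sal.shape)  # torch.Size([32, 32])
```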

2023 Conference proceedings paper

Unveiling the Impact of Image Transformations on Deepfake Detection: An Experimental Analysis

Authors: Cocchi, Federico; Baraldi, Lorenzo; Poppi, Samuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

With the recent explosion of interest in visual Generative AI, the field of deepfake detection has gained a lot of attention. In fact, deepfake detection might be the only measure to counter the potential proliferation of generated media in support of fake news and its consequences. While many of the available works limit detection to a pure and direct classification of fake versus real, this does not translate well to a real-world scenario. Indeed, malevolent users can easily apply post-processing techniques to generated content, changing the underlying distribution of fake data. In this work, we provide an in-depth analysis of the robustness of a deepfake detection pipeline, considering different image augmentations, transformations, and other pre-processing steps. These transformations are applied only in the evaluation phase, thus simulating a practical situation in which the detector is not trained on all the possible augmentations that can be used by the attacker. In particular, we analyze the performance of a k-NN and a linear probe detector on the COCOFake dataset, using image features extracted from pre-trained models like CLIP and DINO. Our results demonstrate that while the CLIP visual backbone outperforms DINO in deepfake detection with no augmentation, its performance varies significantly in the presence of any transformation, favoring the robustness of DINO.
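
The evaluation protocol is easy to replicate in outline: train a linear probe on features of clean images, then test it on post-processed copies. Below is a self-contained sketch with a stubbed feature extractor (in the paper, features come from frozen CLIP/DINO backbones) and JPEG compression as the unseen test-time transformation.

```python
# Minimal sketch of the protocol: a linear probe is trained on backbone
# features of clean real/fake images, then tested on features of
# JPEG-compressed copies that were never seen during training.
import io
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def jpeg_compress(image: Image.Image, quality: int = 40) -> Image.Image:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def extract_features(images) -> np.ndarray:
    # Stand-in for a frozen CLIP/DINO encoder; replace with real embeddings.
    return np.stack([np.asarray(im.resize((8, 8)), dtype=np.float32).ravel()
                     for im in images])

# Toy data: labels 1 = fake, 0 = real (random images, for protocol only).
rng = np.random.default_rng(0)
train_imgs = [Image.fromarray(rng.integers(0, 255, (64, 64, 3), dtype=np.uint8))
              for _ in range(40)]
labels = rng.integers(0, 2, 40)

probe = LogisticRegression(max_iter=1000).fit(extract_features(train_imgs), labels)

# Apply the transformation only at evaluation time, as in the paper's setting.
test_imgs = [jpeg_compress(im) for im in train_imgs]
print(probe.score(extract_features(test_imgs), labels))
```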

2023 Conference proceedings paper
