Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Authors: Pipoli, Vittorio; Saporita, Alessia; Bolelli, Federico; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose Retrieval-Augmented Generation for missing modalities (MissRAG), a novel framework that mitigates this issue. It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy, designed to enhance model robustness by mitigating the impact of absent modalities while avoiding the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis.
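
The abstract does not detail MissRAG's retrieval mechanics, so the following is only a minimal sketch of the general idea: when a modality is absent, use the embedding of an available modality to retrieve textual descriptions from a corpus and splice them into the prompt, avoiding any additional instruction tuning. All names (retrieve_substitute, build_prompt, the toy corpus) are hypothetical, not taken from the paper.

    # Hypothetical sketch of a retrieval-augmented prompt for a missing modality.
    import torch
    import torch.nn.functional as F

    def retrieve_substitute(query_emb: torch.Tensor,
                            corpus_embs: torch.Tensor,
                            corpus_texts: list[str],
                            k: int = 3) -> list[str]:
        """Return the k corpus descriptions closest to the available-modality embedding."""
        sims = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs, dim=-1)
        top = sims.topk(k).indices.tolist()
        return [corpus_texts[i] for i in top]

    def build_prompt(question: str, retrieved: list[str]) -> str:
        """Compose a prompt that tells the MLLM the audio track is unavailable
        and supplies retrieved descriptions as a textual stand-in."""
        context = "\n".join(f"- {t}" for t in retrieved)
        return (
            "The audio modality is missing for this sample. "
            "Use the following retrieved descriptions as a substitute:\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )

    # Toy usage with random embeddings standing in for a real multimodal encoder.
    corpus_texts = ["a dog barking", "applause from a crowd", "rain on a window"]
    corpus_embs = torch.randn(len(corpus_texts), 512)
    visual_emb = torch.randn(512)
    prompt = build_prompt("What sound is most likely present?",
                          retrieve_substitute(visual_emb, corpus_embs, corpus_texts))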

2025 · Conference proceedings paper

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Authors: Compagnoni, Alberto; Caffagni, Davide; Moratelli, Nicholas; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have emerged as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is, to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward.
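
As a rough illustration of the recipe described above (not the authors' code), the sketch below ranks two candidate captions with a CHAIR-style hallucination count and applies the standard DPO objective to winner/loser sequence log-probabilities. The object vocabulary, helper names, and toy numbers are assumptions made for the example.

    # Illustrative sketch: CHAIR-style preference labeling followed by the DPO loss.
    import torch
    import torch.nn.functional as F

    def chair_score(caption: str, gt_objects: set[str], vocab: set[str]) -> float:
        """Fraction of mentioned vocabulary objects NOT present in the image (lower is better)."""
        mentioned = {w for w in caption.lower().split() if w in vocab}
        if not mentioned:
            return 0.0
        hallucinated = mentioned - gt_objects
        return len(hallucinated) / len(mentioned)

    def pick_preference(cap_a: str, cap_b: str, gt_objects: set[str], vocab: set[str]):
        """Winner = the caption with the lower CHAIR-style hallucination score."""
        if chair_score(cap_a, gt_objects, vocab) <= chair_score(cap_b, gt_objects, vocab):
            return cap_a, cap_b
        return cap_b, cap_a

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        """Standard DPO objective on winner/loser log-probabilities under policy and reference."""
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        return -F.logsigmoid(beta * margin).mean()

    # Toy usage with scalar log-probabilities standing in for real model outputs.
    winner, loser = pick_preference("a dog on a sofa", "a dog and a cat on a sofa",
                                    gt_objects={"dog", "sofa"}, vocab={"dog", "cat", "sofa"})
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                    torch.tensor([-12.5]), torch.tensor([-11.5]))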

2025 · Conference proceedings paper

Multimodal Dialogue for Empathetic Human-Robot Interaction

Authors: Rawal, Niyati; Singh Maharjan, Rahul; Salici, Giacomo; Catalini, Riccardo; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Vezzani, Roberto; Cucchiara, Rita; Cangelosi, Angelo

2025 · Conference proceedings paper

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

Authors: Singh Maharjan, Rahul; Rawal, Niyati; Romeo, Marta; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing

2025 · Conference proceedings paper

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Authors: Amoroso, Roberto; Zhang, Gengyuan; Koner, Rajat; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker

Video Question Answering (Video QA) is a critical and challenging task in video understanding, necessitating models to comprehend entire videos, identify the most pertinent information based on the contextual cues from the question, and reason accurately to provide answers. Initial endeavors in harnessing Multimodal Large Language Models (MLLMs) have cast new light on Visual QA, particularly highlighting their commonsense and temporal reasoning capacities. Models that effectively align visual and textual elements can offer more accurate answers tailored to visual inputs. Nevertheless, an unresolved question persists regarding video content: How can we efficiently extract the most relevant information from videos over time and space for enhanced VQA? In this study, we evaluate the efficacy of various temporal modeling techniques in conjunction with MLLMs and introduce a novel component, T-Former, designed as a question-guided temporal querying transformer. T-Former bridges frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across various VideoQA benchmarks shows that T-Former, with its linear computational complexity, competes favorably with existing temporal modeling approaches and aligns with the latest advancements in Video QA tasks.
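
A question-guided temporal querying module of this kind could, in principle, look like the sketch below: a fixed set of learned queries, shifted by the question embedding, cross-attends over frame-level features, so the cost grows linearly with the number of frames. This is an illustrative approximation written under assumptions, not the T-Former implementation, and all module and parameter names are assumed.

    # Minimal sketch of question-guided temporal querying over frame features.
    import torch
    import torch.nn as nn

    class TemporalQuerying(nn.Module):
        def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, frame_feats: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
            # frame_feats: (B, T, D) frame-level visual features
            # question_emb: (B, D) pooled question embedding used to condition the queries
            B = frame_feats.size(0)
            q = self.queries.unsqueeze(0).expand(B, -1, -1) + question_emb.unsqueeze(1)
            out, _ = self.cross_attn(q, frame_feats, frame_feats)  # cost is linear in T
            return self.norm(out)  # (B, num_queries, D) tokens to be passed on to the LLM

    # Toy usage: 4 videos of 64 frames each.
    module = TemporalQuerying()
    tokens = module(torch.randn(4, 64, 768), torch.randn(4, 768))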

2025 · Conference proceedings paper

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Authors: Sarto, Sara; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: International Journal of Computer Vision

2025 · Journal article

Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Authors: Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Cross-modal retrieval is growing in effectiveness and attracting increasing interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we take a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.
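
The following is a simplified sketch, under stated assumptions, of the kind of LSTM-inspired gated recurrent cell described above: a running state cross-attends over one layer's visual and textual features and is updated through sigmoidal forget and input gates. It is not the released ReT code (see the repository linked above); shapes and names are illustrative.

    # Rough sketch of a gated recurrent fusion cell iterated over backbone layers.
    import torch
    import torch.nn as nn

    class GatedFusionCell(nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.forget_gate = nn.Linear(2 * dim, dim)
            self.input_gate = nn.Linear(2 * dim, dim)

        def forward(self, state, visual_l, textual_l):
            # state: (B, Q, D) running representation; visual_l / textual_l: features
            # taken from one layer of the visual and textual backbones.
            feats = torch.cat([visual_l, textual_l], dim=1)
            update, _ = self.attn(state, feats, feats)
            gates_in = torch.cat([state, update], dim=-1)
            f = torch.sigmoid(self.forget_gate(gates_in))   # how much old state to keep
            i = torch.sigmoid(self.input_gate(gates_in))    # how much new evidence to add
            return f * state + i * torch.tanh(update)

    # Toy usage: iterate the cell over features from 4 backbone layers.
    cell, state = GatedFusionCell(), torch.zeros(2, 16, 512)
    for layer in range(4):
        state = cell(state, torch.randn(2, 49, 512), torch.randn(2, 20, 512))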

2025 · Conference proceedings paper

Semantically Conditioned Prompts for Visual Recognition under Missing Modality Scenarios

Authors: Pipoli, Vittorio; Bolelli, Federico; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Published in: IEEE Winter Conference on Applications of Computer Vision

This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) we introduce a novel prompt learning module designed to produce sample-specific prompts, and (ii) we show that modality-agnostic prompts can effectively adjust to diverse missing modality scenarios. Our model, termed SCP, exploits the semantic representation of the available modalities to query a learnable memory bank, which allows the generation of prompts based on the semantics of the input. Notably, SCP distinguishes itself from existing methodologies through its capacity to self-adjust to both the missing modality scenario and the semantic context of the input, without prior knowledge about the specific missing modality or the number of modalities. Through extensive experiments, we show the effectiveness of the proposed prompt learning framework and demonstrate enhanced performance and robustness across a spectrum of missing modality cases.
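
One plausible reading of the memory-bank mechanism, sketched below purely for illustration: the pooled semantic embedding of the available modalities attends over learnable keys, and the resulting weights mix learnable value slots into sample-specific prompt tokens. Class and parameter names are assumptions, not taken from the paper.

    # Illustrative sketch of sample-specific prompt generation from a learnable memory bank.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticPromptBank(nn.Module):
        def __init__(self, dim: int = 768, bank_size: int = 64, prompt_len: int = 8):
            super().__init__()
            self.keys = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
            self.values = nn.Parameter(torch.randn(bank_size, prompt_len, dim) * 0.02)

        def forward(self, semantic_emb: torch.Tensor) -> torch.Tensor:
            # semantic_emb: (B, D) pooled embedding of whatever modalities are present.
            attn = F.softmax(semantic_emb @ self.keys.t() / semantic_emb.size(-1) ** 0.5, dim=-1)
            # A weighted mix of memory slots yields sample-specific prompt tokens.
            return torch.einsum("bk,kld->bld", attn, self.values)  # (B, prompt_len, D)

    # Toy usage: prompts generated for a batch where only one modality is available.
    prompts = SemanticPromptBank()(torch.randn(4, 768))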

2025 · Conference proceedings paper

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Barsellotti, Luca; Bianchi, Lorenzo; Messina, Nicola; Carrara, Fabio; Cornia, Marcella; Baraldi, Lorenzo; Falchi, Fabrizio; Cucchiara, Rita

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
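
A minimal sketch of the core alignment idea, assuming a small learned head that maps CLIP text embeddings into DINOv2 patch space and scores each patch by cosine similarity; the actual Talk2DINO mapping, attention-based training objective, and post-processing are richer than this, and the shapes below are assumptions.

    # Simplified sketch: map a CLIP text embedding into DINOv2 patch space and
    # threshold per-patch similarities into a coarse open-vocabulary mask.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextToPatchMapper(nn.Module):
        def __init__(self, clip_dim: int = 512, dino_dim: int = 768):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(clip_dim, dino_dim), nn.GELU(),
                                      nn.Linear(dino_dim, dino_dim))

        def forward(self, text_emb: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
            # text_emb: (B, clip_dim) CLIP text embedding of the queried concept
            # patch_feats: (B, H*W, dino_dim) DINOv2 patch features (backbones kept frozen)
            mapped = F.normalize(self.proj(text_emb), dim=-1)
            patches = F.normalize(patch_feats, dim=-1)
            return torch.einsum("bd,bpd->bp", mapped, patches)  # per-patch similarity

    # Toy usage: similarity map over a 16x16 patch grid, thresholded at its mean.
    sim = TextToPatchMapper()(torch.randn(1, 512), torch.randn(1, 256, 768))
    mask = (sim.view(1, 16, 16) > sim.mean()).float()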

2025 · Conference proceedings paper

Tracing Information Flow in LLaMA Vision: A Step Toward Multimodal Understanding

Authors: Saporita, Alessia; Pipoli, Vittorio; Bolelli, Federico; Baraldi, Lorenzo; Acquaviva, Andrea; Ficarra, Elisa

Multimodal Large Language Models (MLLMs) have recently emerged as a powerful framework for extending the capabilities of Large Language Models (LLMs) to reason over non-textual modalities. However, despite their success, understanding how they integrate visual and textual information remains an open challenge. Among them, LLaMA 3.2-Vision represents a significant milestone in the development of open-source MLLMs, offering a reproducible and efficient architecture that competes with leading proprietary models, such as Claude 3 Haiku and GPT-4o mini. Motivated by these characteristics, we conduct the first systematic analysis of the information flow between vision and language in LLaMA 3.2-Vision. We analyze three visual question answering (VQA) benchmarks, covering the tasks of VQA on natural images, using both open-ended and multiple-choice question formats, as well as document VQA. These tasks require diverse reasoning capabilities, making them well-suited to reveal distinct patterns in multimodal reasoning. Our analysis unveils a four-stage reasoning strategy: an initial semantic interpretation of the question, an early-to-mid-layer multimodal fusion, a task-specific reasoning stage guided by the resulting multimodal embedding, and a final answer prediction stage. Furthermore, we reveal that multimodal fusion is task-dependent: in complex settings such as document VQA, the model postpones cross-modal integration until semantic reasoning over the question has been established. Overall, our findings offer new insights into the internal dynamics of MLLMs and contribute to advancing the interpretability of vision-language architectures. Our source code is available at https://github.com/AImageLab/MLLMs-FlowTracker.
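
As a hint of what such an analysis can look like in practice, the sketch below measures, layer by layer, how much attention the final text position pays to image positions. This is a generic attention-flow probe written for illustration, not the authors' released tooling (see the repository linked above), and the tensor shapes are assumptions.

    # Illustrative probe: per-layer attention mass flowing from the last token to image tokens.
    import torch

    def image_attention_per_layer(attentions: list[torch.Tensor],
                                  image_positions: torch.Tensor) -> list[float]:
        """attentions: list over layers of (B, heads, seq, seq) attention maps.
        Returns, per layer, the attention mass from the last token to image positions."""
        flow = []
        for attn in attentions:
            last_token = attn[:, :, -1, :]                      # (B, heads, seq)
            mass = last_token[..., image_positions].sum(-1)     # sum over image positions
            flow.append(mass.mean().item())                     # average over batch and heads
        return flow

    # Toy usage with random attention maps: 8 layers, sequences of 40 tokens,
    # positions 0..15 treated as image tokens.
    maps = [torch.softmax(torch.randn(1, 8, 40, 40), dim=-1) for _ in range(8)]
    profile = image_attention_per_layer(maps, torch.arange(16))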

2025 · Conference proceedings paper

Total publications: 144