Publications by Lorenzo Baraldi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.



FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation

Authors: Saporita, Alessia; Pipoli, Vittorio; Bolelli, Federico; Baraldi, Lorenzo; Acquaviva, Andrea; Ficarra, Elisa

Multimodal Large Language Models (MLLMs) have achieved impressive performance across a variety of vision–language tasks. However, their internal working mechanisms remain largely underexplored. In this work, we introduce FG-TRACER, a framework designed to analyze the information flow between visual and textual modalities in MLLMs during free-form generation. Notably, our numerically stabilized computational method enables the first systematic analysis of multimodal information flow in underexplored domains such as image captioning and chain-of-thought (CoT) reasoning. We apply FG-TRACER to two state-of-the-art MLLMs—LLaMA 3.2-Vision and LLaVA 1.5—across three vision–language benchmarks—TextVQA, COCO 2014, and ChartQA—and we conduct a word-level analysis of multimodal integration. Our findings uncover distinct patterns of multimodal fusion across models and tasks, demonstrating that fusion dynamics are both model- and task-dependent. Overall, FG-TRACER offers a robust methodology for probing the internal mechanisms of MLLMs in free-form settings, providing new insights into their multimodal reasoning strategies. Our source code is publicly available at https://anonymous.4open.science/r/FG-TRACER-CB5A/.

2026

AIGeN-Llama: An Adversarial Approach for Instruction Generation in VLN using Llama2 Model

Authors: Rawal, Niyati; Baraldi, Lorenzo; Cucchiara, Rita

Published in: CEUR WORKSHOP PROCEEDINGS

2025 Conference paper

Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IMAGE AND VISION COMPUTING

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
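
The abstract above describes blending real images with their Stable Diffusion counterparts through Mixup data augmentation. A minimal sketch of that blending step, assuming images are float arrays in [0, 1] and a standard Beta-distributed mixing coefficient (the paper's exact recipe, coefficient range, and label handling may differ):

```python
import numpy as np

def mixup_real_synthetic(real, synthetic, alpha=0.2, rng=None):
    """Blend a real image with its synthetic counterpart via Mixup.

    Both images are float arrays of the same shape with values in [0, 1];
    the mixing coefficient lam is drawn from Beta(alpha, alpha), as in
    standard Mixup. Illustrative sketch, not the paper's implementation.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    # Convex combination keeps pixel values inside [0, 1].
    return lam * real + (1.0 - lam) * synthetic

# Toy example: a white and a black 2x2 single-channel "image".
real = np.ones((2, 2))
synthetic = np.zeros((2, 2))
mixed = mixup_real_synthetic(real, synthetic)
```

In practice the same coefficient would also mix the associated training targets (here, caption supervision), which this sketch omits.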

2025 Journal article

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Authors: Cocchi, Federico; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.
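
The reflective-token mechanism described above can be pictured as a decoding loop in which a special control token gates retrieval from an external database. The sketch below is illustrative only: the token names, the gating scheme, and the injection strategy are assumptions, not ReflectiVA's exact recipe:

```python
def generate_with_reflection(model_step, retrieve, prompt_tokens):
    """Sketch of reflective-token-gated retrieval during decoding.

    `model_step` maps the current token sequence to the next token.
    When it emits the hypothetical control token "<RET>", we query an
    external database and splice the retrieved tokens into the context
    before continuing; otherwise we append the token as usual.
    """
    tokens = list(prompt_tokens)
    while True:
        nxt = model_step(tokens)
        if nxt == "<EOS>":
            break
        if nxt == "<RET>":
            # The model asked for external knowledge: inject it.
            tokens.extend(retrieve(tokens))
        else:
            tokens.append(nxt)
    return tokens
```

Because retrieval is only triggered when the model emits the control token, fluency on questions that need no external knowledge is unaffected, which mirrors the design goal stated in the abstract.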

2025 Conference paper

Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Parascandolo, Fiorenzo; Moratelli, Nicholas; Sangineto, Enver; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference paper

Hallucination Early Detection in Diffusion Models

Authors: Betti, Federico; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

2025 Journal article

Hyperbolic Safety-Aware Vision-Language Models

Authors: Poppi, Tobia; Kasarla, Tejaswi; Mettes, Pascal; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference paper

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

Authors: Baraldi, Lorenzo; Amoroso, Roberto; Cornia, Marcella; Pilzer, Andrea; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed k-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. We release an implementation of our code and models at https://github.com/aimagelab/MaPeT.
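
The permuted autoregressive objective described above can be sketched as follows: draw a random permutation of the patch indices, then predict each patch conditioned on the patches that precede it in the permutation. The function below only builds the (target, visible-context) schedule for such a factorization; MaPeT's actual training pipeline, tokenizers, and positional handling are not reproduced here:

```python
import numpy as np

def permuted_prediction_order(num_patches, rng=None):
    """Sketch of a permuted autoregressive factorization order.

    For each prediction step, returns the index of the patch to predict
    and the set of patch indices visible as context (those earlier in
    the sampled permutation), in the spirit of permuted modeling.
    Illustrative sketch, not the paper's exact training recipe.
    """
    rng = rng or np.random.default_rng(0)
    order = rng.permutation(num_patches)
    steps = []
    for t, target in enumerate(order):
        # Context = every patch already "seen" along this permutation.
        context = set(order[:t].tolist())
        steps.append((int(target), context))
    return steps
```

Unlike random masking, every patch is eventually predicted with a clean (unmasked) context drawn from the other patches, which is the discrepancy-reduction idea the abstract attributes to autoregressive and permuted prediction.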

2025 Journal article

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors: Cocchi, Federico; Moratelli, Nicholas; Caffagni, Davide; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

2025 Conference paper

MATE: Multimodal Agent that Talks and Empathizes

Authors: Rawal, Niyati; Xia, Matteo; Tessaro, David; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference paper

Page 1 of 15 • Total publications: 144