Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Authors: Compagnoni, Alberto; Caffagni, Davide; Moratelli, Nicholas; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have emerged as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is, to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively reduces the number of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward.

2025 Conference proceedings paper
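
The CHAIR-based pair selection described in the CHAIR-DPO abstract above can be illustrated with a minimal sketch. It assumes the objects mentioned in each answer have already been extracted and that a ground-truth object set is available for the image. The names `chair_i` and `pick_preference_pair`, the `objects` field, and the tie-handling rule are hypothetical and are not taken from the paper.

```python
def chair_i(mentioned, ground_truth):
    """Per-instance CHAIR: fraction of mentioned objects not present in the image."""
    mentioned = set(mentioned)
    if not mentioned:
        return 0.0  # assumption: an answer that mentions no objects has no hallucinations
    hallucinated = mentioned - set(ground_truth)
    return len(hallucinated) / len(mentioned)


def pick_preference_pair(answer_a, answer_b, ground_truth):
    """Return (winner, loser) by lower CHAIR score, or None when the scores tie."""
    score_a = chair_i(answer_a["objects"], ground_truth)
    score_b = chair_i(answer_b["objects"], ground_truth)
    if score_a == score_b:
        return None  # assumption: ties carry no preference signal and are skipped
    return (answer_a, answer_b) if score_a < score_b else (answer_b, answer_a)
```

The resulting (winner, loser) pairs are the kind of input a DPO trainer expects as chosen and rejected responses; how the authors actually filter or weight pairs is not stated in the abstract.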

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Authors: Cartella, Giuseppe; Cuculo, Vittorio; D'Amelio, Alessandro; Cornia, Marcella; Boccignone, Giuseppe; Cucchiara, Rita

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.

2025 Conference proceedings paper

Modular embedding recomposition for incremental learning

Authors: Panariello, Aniello; Frascaroli, Emanuele; Buzzega, Pietro; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone

2025 Conference proceedings paper

Monocular per-object distance estimation with Masked Object Modeling

Authors: Panariello, Aniello; Mancusi, Gianluca; Haj Ali, Fedy; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

2025 Journal article

Mosaic-SR: An Adaptive Multi-step Super-Resolution Method for Low-Resolution 2D Barcodes

Authors: Vezzali, Enrico; Vorabbi, Lorenzo; Grana, Costantino; Bolelli, Federico

QR and Datamatrix codes are widely used in warehouse logistics and high-speed production pipelines. Still, distant or small barcodes often yield low-pixel-density images that are hard to read. Conventional solutions rely on costly hardware or enhanced lighting, raising expenses and potentially reducing depth of field. We propose Mosaic-SR, a multi-step, adaptive super-resolution (SR) method that devotes more computation to barcode regions than uniform backgrounds. For each patch, it predicts an uncertainty value to decide how many refinement steps are required. Our experiments show that Mosaic-SR surpasses state-of-the-art SR models on 2D barcode images, achieving higher PSNR and decoding rates in less time. All code and trained models are publicly available at https://github.com/Henvezz95/mosaic-sr.

2025 Conference proceedings paper

Multimodal Dialogue for Empathetic Human-Robot Interaction

Authors: Rawal, Niyati; Singh Maharjan, Rahul; Salici, Giacomo; Catalini, Riccardo; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Vezzani, Roberto; Cucchiara, Rita; Cangelosi, Angelo

2025 Conference proceedings paper

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

Authors: Singh Maharjan, Rahul; Rawal, Niyati; Romeo, Marta; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

Published in: PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING

2025 Conference proceedings paper

No More Slice Wars: Towards Harmonized Brain MRI Synthesis for the BraSyn Challenge

Authors: Carpentiero, Omar; Marchesini, Kevin; Grana, Costantino; Bolelli, Federico

The synthesis of missing MRI modalities has emerged as a critical solution to address incomplete multi-parametric imaging in brain tumor diagnosis and treatment planning. While recent advances in generative models, especially GANs and diffusion-based approaches, have demonstrated promising results in cross-modality MRI generation, challenges remain in preserving anatomical fidelity and minimizing synthesis artifacts. In this work, we build upon the Hybrid Fusion GAN (HF-GAN) framework, introducing several enhancements aimed at improving synthesis quality and generalization across tumor types. Specifically, we incorporate z-score normalization, optimize network components for faster and more stable training, and extend the pipeline to support multi-view generation across various brain tumor categories, including gliomas, metastases, and meningiomas. Our approach focuses on refining 2D slice-based generation to ensure intra-slice coherence and reduce intensity inconsistencies, ultimately supporting more accurate and robust tumor segmentation in scenarios with missing imaging modalities. Our source code is available at https://github.com/AImageLab-zip/BraSyn25.

2025 Conference proceedings paper
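
The z-score normalization mentioned in the BraSyn abstract above can be sketched as follows. This is a generic illustration, not the authors' code: computing the statistics over non-zero voxels only (a common convention for skull-stripped MRI) is an assumption, and the function name `zscore_normalize` and the `eps` parameter are hypothetical.

```python
import numpy as np


def zscore_normalize(volume, eps=1e-8):
    """Rescale foreground voxels to zero mean and unit variance; background stays 0."""
    volume = np.asarray(volume, dtype=np.float32)
    mask = volume > 0  # assumption: zero-valued voxels are background
    out = np.zeros_like(volume)
    if not mask.any():
        return out  # empty volume or slice: nothing to normalize
    mean = volume[mask].mean()
    std = volume[mask].std()
    out[mask] = (volume[mask] - mean) / (std + eps)  # eps guards against constant regions
    return out
```

Normalizing each modality this way puts intensities from different scanners on a comparable scale before they are fed to a generator.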

One transformer for all time series: representing and training with time-dependent heterogeneous tabular data

Authors: Luetto, S.; Garuti, F.; Sangineto, E.; Forni, L.; Cucchiara, R.

Published in: MACHINE LEARNING

There is growing interest in applying Deep Learning techniques to tabular data in order to replicate the success of other Artificial Intelligence areas in this structured domain. Particularly interesting is the case in which tabular data have a time dependence, such as financial transactions. However, the heterogeneity of the tabular values, in which categorical elements are mixed with numerical features, makes this adaptation difficult. In this paper we propose UniTTab, a Transformer-based architecture whose goal is to uniformly represent heterogeneous time-dependent tabular data, in which both numerical and categorical features are described using continuous embedding vectors. Moreover, unlike common approaches, which use a combination of different loss functions for training with both numerical and categorical targets, UniTTab is uniformly trained with a single Masked Token pretext task. Finally, UniTTab can also represent time series in which the individual row components have a variable internal structure with a variable number of fields, which is a common situation in many application domains, such as real-world transactional data. Using extensive experiments with five datasets of variable size and complexity, we empirically show that UniTTab consistently and significantly improves the prediction accuracy over several downstream tasks and with respect to both Deep Learning and more standard Machine Learning approaches. Our code and our models are available at: https://github.com/fabriziogaruti/UniTTab.

2025 Journal article

Optimizing Resource Allocation in Public Healthcare: A Machine Learning Approach for Length-of-Stay Prediction

Authors: Perliti Scorzoni, Paolo; Giovanetti, Anita; Bolelli, Federico; Grana, Costantino

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Effective hospital resource management hinges on established metrics such as Length of Stay (LOS) and Prolonged Length of Stay (pLOS). Reducing pLOS is associated with improved patient outcomes and optimized resource utilization (e.g., bed allocation). This study investigates several Machine Learning (ML) models for both LOS and pLOS prediction. We conducted a retrospective study analyzing data from general inpatients discharged between 2022 and 2023 at a northern Italian hospital. Sixteen regression and twelve classification algorithms were compared in forecasting LOS as either a continuous or multi-class variable (1-3 days, 4-10 days, >10 days). Additionally, the same models were assessed for pLOS prediction (defined as LOS exceeding 8 days). All models were evaluated using two variants of the same dataset: one containing only structured data (e.g., demographics and clinical information), and a second one also containing features extracted from free-text diagnosis. Ensemble models, leveraging the combined strengths of multiple ML algorithms, demonstrated superior accuracy in predicting both LOS and pLOS compared to single-algorithm models, particularly when utilizing both structured and unstructured data extracted from diagnoses. Integration of ML, particularly ensemble models, has the potential to significantly improve LOS prediction and identify patients at high risk of pLOS. Such insights can empower healthcare professionals and bed managers to optimize patient care and resource allocation, promoting overall healthcare efficiency and sustainability.

2025 Conference proceedings paper
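
The target definitions in the length-of-stay abstract above (three LOS classes and a pLOS threshold of 8 days) can be made concrete with a short sketch. The function names are hypothetical, LOS is assumed to be a whole number of days, and rejecting stays shorter than one day is an assumption, since the abstract does not say how they are handled.

```python
def los_class(days):
    """Map a length of stay in days to the three classes named in the abstract."""
    if days < 1:
        # assumption: stays under one day fall outside the stated classes
        raise ValueError("length of stay must be at least 1 day")
    if days <= 3:
        return "1-3 days"
    if days <= 10:
        return "4-10 days"
    return ">10 days"


def is_prolonged(days, threshold=8):
    """pLOS label: True when the stay exceeds the threshold (8 days in the abstract)."""
    return days > threshold
```

Note that the two targets do not line up: a 9-day stay is prolonged but still falls in the "4-10 days" class, which is why the study treats pLOS as a separate prediction task.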
