LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Authors: Cocchi, Federico; Moratelli, Nicholas; Caffagni, Davide; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
Explore our research publications: papers, articles, and conference proceedings from AImageLab.
Authors: Catalini, Riccardo; Biagi, Federico; Salici, Giacomo; Borghi, Guido; Vezzani, Roberto; Biagiotti, Luigi
Humanoid robots are increasingly being integrated into diverse scenarios, such as healthcare facilities, social settings, and workplaces. As the need for intuitive control by non-expert users grows, many studies have explored the use of Artificial Intelligence to enable communication and control. However, these approaches are often tailored to specific robots due to the absence of standardized conventions and notation. This study addresses the challenges posed by these inconsistencies and investigates their impact on the ability of Large Language Models (LLMs) to generate accurate 3D robot poses, even when detailed robot specifications are provided as input.
Authors: Catalini, Riccardo; Salici, Giacomo; Biagi, Federico; Borghi, Guido; Biagiotti, Luigi; Vezzani, Roberto
In this study, we demonstrate the capabilities of state-of-the-art Large Language Models (LLMs) in teaching social robots to perform specific actions within a 3D environment. Specifically, we introduce the use of LLMs to generate sequences of 3D joint angles - in both zero-shot and one-shot prompting - that a humanoid robot must follow to perform a given action. This work is driven by the growing demand for intuitive interactions with social robots: indeed, LLMs could empower non-expert users to operate and benefit from robotic systems effectively. Additionally, this method leverages the ability to generate synthetic data effortlessly, enabling privacy-focused use cases. To evaluate the output quality of seven different LLMs, we conducted a blind user study comparing the generated pose sequences. Participants were shown videos of the well-known NAO robot performing the generated actions and were asked to identify the intended action and choose the best match with the original instruction from a collection of candidates created by different LLMs. The results highlight that the majority of LLMs are indeed capable of planning correct, complete, and recognizable actions, offering a novel perspective on how AI can be applied to social robotics.
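As an illustrative sketch of the zero-shot setup described above, the snippet below builds a prompt asking an LLM for a pose sequence and validates its JSON reply. The joint subset, prompt wording, and JSON schema are assumptions of this sketch, not the paper's actual protocol.

```python
import json

# Small subset of NAO joint names, used here purely for illustration.
NAO_JOINTS = ["HeadYaw", "HeadPitch", "LShoulderPitch", "RShoulderPitch"]

def build_prompt(action: str) -> str:
    """Zero-shot prompt asking an LLM for a pose sequence as JSON."""
    return (
        f"You control a NAO humanoid robot. Produce a sequence of poses that "
        f"performs the action: '{action}'. Answer ONLY with a JSON list; each "
        f"pose maps the joint names {NAO_JOINTS} to angles in radians."
    )

def parse_pose_sequence(llm_output: str) -> list:
    """Validate the LLM's reply: every pose must cover all joints."""
    poses = json.loads(llm_output)
    for pose in poses:
        missing = [j for j in NAO_JOINTS if j not in pose]
        if missing:
            raise ValueError(f"pose missing joints: {missing}")
    return poses
```

A validated sequence could then be replayed on the robot (e.g., frame by frame through its motion API) to produce the videos shown to study participants.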
Authors: Lumetti, Luca; Pipoli, Vittorio; Bolelli, Federico; Ficarra, Elisa; Grana, Costantino
Published in: LECTURE NOTES IN COMPUTER SCIENCE
The segmentation of the Inferior Alveolar Canal (IAC) plays a central role in maxillofacial surgery and has drawn significant attention in current research. Owing to their outstanding results, deep learning methods are widely adopted for segmenting 3D medical volumes, including the IAC in Cone Beam Computed Tomography (CBCT) data. One of the main challenges when segmenting large volumes, such as CBCT scans, arises from patch-based techniques, which are mandatory to fit memory constraints. Such training approaches compromise neural network performance by reducing the available global contextual information. The degradation is particularly evident when the target objects are small with respect to the background, as happens with the inferior alveolar nerve, which runs across the mandible yet involves only a few voxels of the entire scan. To address this issue and push state-of-the-art performance in IAC segmentation, we propose an innovative approach that exploits the spatial information of extracted patches and integrates it into a Transformer architecture. By incorporating prior knowledge about patch location, our model improves the state of the art by ~2 Dice points when integrated with the standard U-Net architecture. The source code of our proposal is publicly released.
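One simple way to inject the global location of a patch, in the spirit of the abstract above, is to encode the patch's origin inside the full volume and add it to the patch tokens. This is a minimal sketch under assumed names and a sinusoidal encoding; the paper's exact architecture is not reproduced here.

```python
import numpy as np

def patch_position_encoding(coords, dim):
    """Sinusoidal encoding of a patch's (z, y, x) origin inside the full
    CBCT volume. Three coordinates, each mapped to sin/cos features."""
    coords = np.asarray(coords, dtype=np.float64)       # shape (3,)
    freqs = 1.0 / (10000 ** (np.arange(dim // 6) * 6.0 / dim))
    angles = coords[:, None] * freqs[None, :]           # (3, dim // 6)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()
    return np.pad(enc, (0, dim - enc.size))             # pad to exactly `dim`

def add_location_prior(patch_tokens, coords):
    """Broadcast-add the location code to every token of the patch."""
    d = patch_tokens.shape[-1]
    return patch_tokens + patch_position_encoding(coords, d)
```

The encoded tokens would then be fed to the Transformer, so attention can distinguish otherwise identical-looking patches cut from different parts of the mandible.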
Authors: Perliti Scorzoni, Paolo; Giovanetti, Anita; Bolelli, Federico; Grana, Costantino
Overcrowding in Emergency Departments (EDs) is a pressing concern driven by high patient demand and limited resources. Prolonged Length of Stay (pLOS), a major contributor to this congestion, may lead to adverse outcomes, including patients leaving without being seen, suboptimal clinical care, increased mortality rates, provider burnout, and escalating healthcare costs. This study investigates the application of various Machine Learning (ML) algorithms to predict both LOS and pLOS. A retrospective analysis examined 32,967 accesses at a northern Italian hospital's ED between 2022 and 2024. Twelve classification algorithms were evaluated in forecasting pLOS, using clinically relevant thresholds. Two data variants were employed for model comparison: one containing only structured data (e.g., demographics and clinical information), and a second one also including features extracted from free-text nursing notes. To enhance the accuracy of LOS prediction, novel queue-based variables capturing the real-time state of the ED were incorporated as additional dynamic predictors. Compared to single-algorithm models, ensemble models demonstrated superior robustness in forecasting both ED-LOS and ED-pLOS. These findings highlight the potential for integrating ML into ED practices as auxiliary tools that provide valuable insights into patient flow. By identifying patients at high risk of pLOS, healthcare professionals can proactively implement strategies to expedite care, optimize resource allocation, and ultimately improve patient outcomes and ED efficiency, promoting a more effective and sustainable public healthcare delivery.
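A queue-based dynamic predictor of the kind mentioned above can be as simple as the current ED occupancy at the moment a patient arrives. The function below is one illustrative example; the study's exact feature set is not reproduced here.

```python
def patients_in_ed(arrivals, departures, t):
    """Count patients present in the ED at time t: arrived at or before t
    and not yet departed (None marks a patient still in the department).
    One illustrative 'queue-based' real-time feature."""
    return sum(
        a <= t and (d is None or d > t)
        for a, d in zip(arrivals, departures)
    )
```

Evaluated at each new patient's arrival time, such counts give the model a snapshot of congestion that static demographic or clinical variables cannot capture.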
Authors: Mosconi, Matteo; Sorokin, Andriy; Panariello, Aniello; Porrello, Angelo; Bonato, Jacopo; Cotogni, Marco; Sabetta, Luigi; Calderara, Simone; Cucchiara, Rita
Published in: LECTURE NOTES IN COMPUTER SCIENCE
The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at https://github.com/Sperimental3/CHARON.
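Of the techniques the abstract lists, uniform sampling with interpolation is easy to sketch: a variable-length skeleton sequence is resampled to a fixed number of frames by linearly interpolating joint positions along time. This is a sketch of the general preprocessing idea, not the authors' exact implementation.

```python
import numpy as np

def uniform_resample(seq, target_len):
    """Uniformly resample a skeleton sequence of shape (T, J, 3) to
    target_len frames via linear interpolation along the time axis."""
    seq = np.asarray(seq, dtype=np.float64)
    T = seq.shape[0]
    pos = np.linspace(0, T - 1, target_len)   # fractional frame indices
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (pos - lo)[:, None, None]             # per-frame interpolation weight
    return (1 - w) * seq[lo] + w * seq[hi]
```

Fixed-length sequences make batching trivial and keep the per-sample compute constant, which matters in a continual, memory-constrained setting.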
Authors: Rawal, Niyati; Xia, Matteo; Tessaro, David; Baraldi, Lorenzo; Cucchiara, Rita
Authors: Li, Jianning; Zhou, Zongwei; Yang, Jiancheng; Pepe, Antonio; Gsaxner, Christina; Luijten, Gijs; Qu, Chongyu; Zhang, Tiezheng; Chen, Xiaoxi; Li, Wenxuan; Wodzinski, Marek Michal; Friedrich, Paul; Xie, Kangxian; Jin, Yuan; Ambigapathy, Narmada; Nasca, Enrico; Solak, Naida; Melito, Gian Marco; Vu, Viet Duc; Memon, Afaque R.; Schlachta, Christopher; De Ribaupierre, Sandrine; Patel, Rajnikant; Eagleson, Roy; Chen, Xiaojun; Mächler, Heinrich; Kirschke, Jan Stefan; De La Rosa, Ezequiel; Christ, Patrick Ferdinand; Li, Hongwei Bran; Ellis, David G.; Aizenberg, Michele R.; Gatidis, Sergios; Küstner, Thomas; Shusharina, Nadya; Heller, Nicholas; Andrearczyk, Vincent; Depeursinge, Adrien; Hatt, Mathieu; Sekuboyina, Anjany; Löffler, Maximilian T.; Liebl, Hans; Dorent, Reuben; Vercauteren, Tom; Shapey, Jonathan; Kujawa, Aaron; Cornelissen, Stefan; Langenhuizen, Patrick; Ben-Hamadou, Achraf; Rekik, Ahmed; Pujades, Sergi; Boyer, Edmond; Bolelli, Federico; Grana, Costantino; Lumetti, Luca; Salehi, Hamidreza;
Published in: BIOMEDIZINISCHE TECHNIK
Objectives: Shape is commonly used to describe objects. State-of-the-art algorithms in medical imaging predominantly diverge from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from the growing popularity of ShapeNet (51,300 models) and Princeton ModelNet (127,915 models). However, a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments is missing. Methods: We present MedShapeNet to translate data-driven vision algorithms to medical applications and to adapt state-of-the-art vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. We present use cases in classifying brain tumors, skull reconstructions, multi-class anatomy completion, education, and 3D printing. Results: By now, MedShapeNet includes 23 datasets with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. Conclusions: MedShapeNet contains medical shapes from anatomy and surgical instruments and will continue to collect data for benchmarks and applications. The project page is: https://medshapenet.ikim.nrw/.
Authors: Quattrini, F.; Pippi, V.; Cascianelli, S.; Cucchiara, R.
Published in: LECTURE NOTES IN COMPUTER SCIENCE
Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which recent works have tackled by combining independent diffusion paths over overlapping latent features, an approach referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and the visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
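The basic "merge" step behind joint diffusion can be illustrated by averaging overlapping latent windows into one panorama latent. This simplified 1-D sketch shows only that averaging; the actual MAD operator additionally reprograms self- and cross-attention over the aggregated latent space, which is omitted here.

```python
import numpy as np

def merge_overlapping_latents(windows, offsets, width, canvas_w):
    """Average overlapping latent windows (each of shape (C, width)) into a
    single panorama latent of shape (C, canvas_w). Positions covered by
    several windows receive the mean of their contributions."""
    c = windows[0].shape[0]
    canvas = np.zeros((c, canvas_w))
    counts = np.zeros(canvas_w)
    for win, off in zip(windows, offsets):
        canvas[:, off:off + width] += win
        counts[off:off + width] += 1
    return canvas / np.maximum(counts, 1)   # avoid division by zero
```

Averaging alone yields the perceptual alignment of prior joint-diffusion methods; making attention act on the merged latent is what targets semantic coherence.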
Authors: Pipoli, Vittorio; Saporita, Alessia; Bolelli, Federico; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa
Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate this issue, called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while avoiding the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis.
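A minimal sketch of the multimodal-RAG idea described above: when a modality is missing at inference time, retrieve the most similar stored examples of that modality (via the modalities that are present, embedded in a shared space) and use them as a substitute. The function names and the simple cosine-similarity scheme are illustrative assumptions, not the paper's method.

```python
import numpy as np

def retrieve_substitute(query_emb, datastore, k=1):
    """Retrieve the k most similar datastore entries by cosine similarity
    and return their indices plus an averaged substitute embedding for
    the missing modality."""
    q = query_emb / np.linalg.norm(query_emb)
    D = datastore / np.linalg.norm(datastore, axis=1, keepdims=True)
    scores = D @ q                       # cosine similarity per entry
    top = np.argsort(-scores)[:k]        # indices of the k best matches
    return top, datastore[top].mean(axis=0)
```

The substitute embedding (or the retrieved examples, verbalized in the prompt) then stands in for the absent modality, so the MLLM needs no extra instruction tuning to cope with the gap.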