OXA-MISS: A Robust Multimodal Architecture for Chemotherapy Response Prediction under Data Scarcity
Authors: Miccolis, Francesca; Marinelli, Fabio; Pipoli, Vittorio; Afenteva, Daria; Virtanen, Anni; Lovino, Marta; Ficarra, Elisa
Authors: Bolelli, Federico; Marchesini, Kevin; Van Nistelrooij, Niels; Lumetti, Luca; Pipoli, Vittorio; Ficarra, Elisa; Vinayahalingam, Shankeeth; Grana, Costantino
Cone-beam computed tomography (CBCT) is a standard imaging modality in orofacial and dental practices, providing essential 3D volumetric imaging of anatomical structures, including jawbones, teeth, sinuses, and neurovascular canals. Accurately segmenting these structures is fundamental to numerous clinical applications, such as surgical planning and implant placement. However, manual segmentation of CBCT scans is time-intensive and requires expert input, creating a demand for automated deep learning solutions. Effective development of such algorithms relies on access to large, well-annotated datasets, yet current datasets are often privately held or limited in scope and in the structures they consider, especially with regard to 3D annotations. This paper proposes ToothFairy2, a comprehensive, publicly accessible CBCT dataset with voxel-level 3D annotations of 42 distinct classes corresponding to maxillofacial structures. We validate the dataset by benchmarking state-of-the-art neural network models, including convolutional, transformer-based, and hybrid Mamba-based architectures, to evaluate segmentation performance across complex anatomical regions. Our work also explores adaptations to the nnU-Net framework to optimize multi-class segmentation for maxillofacial anatomy. The proposed dataset provides a fundamental resource for advancing maxillofacial segmentation and supports future research in automated 3D image analysis in digital dentistry.
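The abstract above does not include an evaluation recipe, so the following is only a minimal sketch of how per-class performance on the 42 annotated structures could be scored: a per-class Dice coefficient computed from integer label volumes. Tensor shapes, class indexing, and background handling are assumptions of this sketch, not part of the dataset specification.

```python
import torch

def per_class_dice(pred: torch.Tensor, target: torch.Tensor,
                   num_classes: int = 42, eps: float = 1e-6) -> torch.Tensor:
    """Per-class Dice for integer label volumes of shape (D, H, W).
    The 42 classes follow the dataset description above; whether index 0
    denotes background is an assumption of this sketch."""
    dice = torch.zeros(num_classes)
    for c in range(num_classes):
        p = pred == c
        t = target == c
        inter = (p & t).sum().float()
        dice[c] = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    return dice

# Toy usage with random volumes standing in for real CBCT predictions:
pred = torch.randint(0, 42, (96, 96, 96))
target = torch.randint(0, 42, (96, 96, 96))
print(per_class_dice(pred, target).mean().item())
```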
Authors: Pipoli, Vittorio; Bolelli, Federico; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa
Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION
This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) we introduce a novel prompt learning module designed to produce sample-specific prompts, and (ii) we show that modality-agnostic prompts can effectively adjust to diverse missing modality scenarios. Our model, termed SCP, exploits the semantic representation of the available modalities to query a learnable memory bank, which allows the generation of prompts based on the semantics of the input. Notably, SCP distinguishes itself from existing methodologies through its capacity to self-adjust to both the missing-modality scenario and the semantic context of the input, without prior knowledge of which modality is missing or of the number of modalities. Through extensive experiments, we show the effectiveness of the proposed prompt learning framework and demonstrate enhanced performance and robustness across a spectrum of missing modality cases.
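The exact design of the prompt learning module is not given in this abstract, so the snippet below is only a hypothetical sketch of the underlying idea: pooled features of the available modalities query a learnable memory bank, and the retrieved context is mapped to sample-specific prompt tokens. The dimensions, the mean pooling, and the single attention step are assumptions, not the published SCP architecture.

```python
import torch
import torch.nn as nn

class SemanticPromptBank(nn.Module):
    """Illustrative prompt generator: available-modality features attend over
    a learnable memory bank to produce sample-specific prompt tokens."""

    def __init__(self, dim: int = 768, bank_size: int = 64, num_prompts: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
        self.query_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, num_prompts * dim)
        self.num_prompts = num_prompts

    def forward(self, avail_feats: torch.Tensor) -> torch.Tensor:
        # avail_feats: (batch, tokens, dim) from whichever modalities are present.
        query = self.query_proj(avail_feats.mean(dim=1))                  # (B, dim)
        attn = torch.softmax(query @ self.memory.T / query.shape[-1] ** 0.5, dim=-1)
        context = attn @ self.memory                                      # (B, dim)
        return self.out_proj(context).view(-1, self.num_prompts, query.shape[-1])

# The resulting (B, num_prompts, dim) tokens would be prepended to the
# multimodal Transformer input regardless of which modality is missing.
print(SemanticPromptBank()(torch.randn(4, 50, 768)).shape)  # torch.Size([4, 8, 768])
```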
Authors: Lumetti, Luca; Marchesini, Kevin; Pipoli, Vittorio; Ficarra, Elisa; Grana, Costantino; Bolelli, Federico
Published in: IEEE ACCESS
Recently, the field of 3D medical segmentation has been dominated by deep learning models employing Convolutional Neural Networks (CNNs) and Transformer-based architectures, each with its distinctive strengths and limitations. CNNs are constrained by a local receptive field, whereas Transformers are hindered by their substantial memory requirements and their need for large amounts of training data, making them ill-suited to processing 3D medical volumes at a fine-grained level. For these reasons, fully convolutional neural networks, such as nnU-Net, still dominate the scene when segmenting medical structures in large 3D medical volumes. Despite numerous advancements toward developing Transformer variants with subquadratic time and memory complexity, these models still fall short in content-based reasoning. A recent breakthrough is Mamba, a Recurrent Neural Network (RNN) based on State Space Models (SSMs) that outperforms Transformers on many long-context tasks (million-length sequences) across well-known natural language processing and genomics benchmarks while maintaining linear complexity. In this paper, we evaluate the effectiveness of Mamba-based architectures against state-of-the-art convolutional and Transformer-based models for 3D medical image segmentation across three well-established datasets: Synapse Abdomen, MSD BrainTumor, and ACDC. Additionally, we address the primary limitations of existing Mamba-based architectures by proposing alternative architectural designs, thereby improving segmentation performance. The source code is publicly available to ensure reproducibility and facilitate further research: https://github.com/LucaLumetti/TamingMambas.
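The architectural variants evaluated in the paper are available in the linked repository; as a minimal, hypothetical illustration of the general recipe (flatten a 3D feature map into a token sequence, apply a linear-complexity sequence mixer, fold the tokens back), consider the wrapper below. The inner per-token MLP is a stand-in so the sketch runs anywhere; a real Mamba block (e.g. from the mamba_ssm package, on GPU) could be plugged in instead.

```python
import torch
import torch.nn as nn

class SequenceMixer3D(nn.Module):
    """Flatten (B, C, D, H, W) features into (B, D*H*W, C) tokens, apply a
    sequence layer with a residual connection, and restore the volume."""

    def __init__(self, channels: int, mixer: nn.Module = None):
        super().__init__()
        # Swap in a state-space block such as mamba_ssm.Mamba(d_model=channels);
        # the placeholder MLP below keeps this sketch CPU-runnable.
        self.mixer = mixer or nn.Sequential(
            nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, D*H*W, C)
        tokens = tokens + self.mixer(self.norm(tokens))    # residual sequence mixing
        return tokens.transpose(1, 2).view(b, c, d, h, w)

# Usage on a small encoder-stage feature volume:
print(SequenceMixer3D(32)(torch.randn(1, 32, 8, 16, 16)).shape)
```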
Authors: Capitani, Giacomo; Bonicelli, Lorenzo; Porrello, Angelo; Bolelli, Federico; Calderara, Simone; Ficarra, Elisa
Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION
Authors: Saporita, Alessia; Pipoli, Vittorio; Bolelli, Federico; Baraldi, Lorenzo; Acquaviva, Andrea; Ficarra, Elisa
Multimodal Large Language Models (MLLMs) have recently emerged as a powerful framework for extending the capabilities of Large Language Models (LLMs) to reason over non-textual modalities. However, despite their success, understanding how they integrate visual and textual information remains an open challenge. Among them, LLaMA 3.2-Vision represents a significant milestone in the development of open-source MLLMs, offering a reproducible and efficient architecture that competes with leading proprietary models, such as Claude 3 Haiku and GPT-4o mini. Motivated by these characteristics, we conduct the first systematic analysis of the information flow between vision and language in LLaMA 3.2-Vision. We analyze three visual question answering (VQA) benchmarks, covering the tasks of VQA on natural images (using both open-ended and multiple-choice question formats) as well as document VQA. These tasks require diverse reasoning capabilities, making them well-suited to reveal distinct patterns in multimodal reasoning. Our analysis unveils a four-stage reasoning strategy: an initial semantic interpretation of the question, an early-to-mid-layer multimodal fusion, a task-specific reasoning stage guided by the resulting multimodal embedding, and a final answer prediction stage. Furthermore, we reveal that multimodal fusion is task-dependent: in complex settings such as document VQA, the model postpones cross-modal integration until semantic reasoning over the question has been established. Overall, our findings offer new insights into the internal dynamics of MLLMs and contribute to advancing the interpretability of vision-language architectures. Our source code is available at https://github.com/AImageLab/MLLMs-FlowTracker.
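The full analysis pipeline is in the linked repository; purely to illustrate the kind of layer-wise measurement involved in tracing vision-to-language information flow, the sketch below computes, per layer, how much attention mass the final (answer-predicting) position places on the image tokens. It assumes per-layer attention maps are already available (for Hugging Face models, typically obtained with output_attentions=True); the token positions and tensor shapes are placeholders.

```python
import torch

def image_attention_per_layer(attentions, image_token_slice):
    """attentions: iterable of per-layer tensors shaped (batch, heads, seq, seq).
    image_token_slice: positions of the visual tokens in the input sequence.
    Returns, for each layer, the mean attention mass that the last position
    (the next-token prediction site) sends to the image tokens."""
    per_layer = []
    for layer_attn in attentions:
        last_pos = layer_attn[:, :, -1, :]               # (B, heads, seq)
        mass = last_pos[..., image_token_slice].sum(-1)  # (B, heads)
        per_layer.append(mass.mean().item())
    return per_layer

# Toy example: random maps standing in for a real forward pass of an MLLM.
fake_attn = [torch.rand(1, 8, 40, 40).softmax(dim=-1) for _ in range(4)]
print(image_attention_per_layer(fake_attn, slice(0, 16)))
```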
Authors: Lumetti, Luca; Capitani, Giacomo; Ficarra, Elisa; Grana, Costantino; Calderara, Simone; Porrello, Angelo; Bolelli, Federico
Despite their remarkable success in medical image segmentation, the life cycle of deep neural networks remains a challenge in clinical applications. These models must be regularly updated to integrate new medical data and customized to meet evolving diagnostic standards, regulatory requirements, commercial needs, and privacy constraints. Model merging offers a promising solution, as it allows working with multiple specialized networks that can be created and combined dynamically instead of relying on monolithic models. While extensively studied in standard 2D classification, the potential of model merging for 3D segmentation remains unexplored. This paper presents an efficient framework that allows effective model merging in the domain of 3D image segmentation. Our approach builds upon theoretical analysis and encourages wide minima during pre-training, which we demonstrate to facilitate subsequent model merging. Using U-Net 3D, we evaluate the method on distinct anatomical structures with the ToothFairy2 and BTCV Abdomen datasets. To support further research, we release the source code and all the model weights in a dedicated repository: https://github.com/LucaLumetti/UNetTransplant
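The full recipe, including the wide-minima pre-training that facilitates merging, is in the linked repository; the fragment below only sketches the basic task-vector arithmetic that model merging builds on, combining several specialists fine-tuned from a shared pre-trained U-Net 3D checkpoint. Function names and the scaling coefficient are assumptions.

```python
import torch

def merge_task_vectors(base_state, finetuned_states, alpha: float = 1.0):
    """Add the averaged task vectors (theta_i - theta_base) of several
    fine-tuned checkpoints back onto the shared base weights."""
    merged = {}
    for name, base_param in base_state.items():
        delta = sum(ft[name] - base_param for ft in finetuned_states)
        merged[name] = base_param + alpha * delta / len(finetuned_states)
    return merged

# Toy usage with tiny state dicts; in practice these would come from
# torch.load of the pre-trained backbone and its fine-tuned specialists.
base = {"w": torch.zeros(3)}
specialists = [{"w": torch.tensor([1.0, 0.0, 0.0])},
               {"w": torch.tensor([0.0, 2.0, 0.0])}]
print(merge_task_vectors(base, specialists)["w"])  # tensor([0.5000, 1.0000, 0.0000])
```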
Authors: Rinaldi, Filippo; Capitani, Giacomo; Bonicelli, Lorenzo; Crisostomi, Donato; Bolelli, Federico; Rodolà, Emanuele; Ficarra, Elisa; Calderara, Simone; Porrello, Angelo
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint.
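The paper's two-level spectral procedure for attention heads is not reproduced here; as a minimal illustration of the weight-permutation idea behind re-basin, the sketch below matches the units (rows) of two weight matrices, e.g. corresponding layers of the old and new checkpoints, with the Hungarian algorithm and returns the permutation that aligns them. The shapes and the cosine-similarity matching criterion are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units(w_ref: np.ndarray, w_other: np.ndarray) -> np.ndarray:
    """Return indices such that w_other[indices] best aligns, row by row,
    with w_ref under cosine similarity (solved as an assignment problem)."""
    a = w_ref / (np.linalg.norm(w_ref, axis=1, keepdims=True) + 1e-8)
    b = w_other / (np.linalg.norm(w_other, axis=1, keepdims=True) + 1e-8)
    _, cols = linear_sum_assignment(-(a @ b.T))   # maximise total similarity
    return cols

# Toy check: recover the alignment of a shuffled, slightly noisy copy.
rng = np.random.default_rng(0)
w_ref = rng.normal(size=(16, 64))
w_other = w_ref[rng.permutation(16)] + 0.01 * rng.normal(size=(16, 64))
print(np.allclose(w_other[match_units(w_ref, w_other)], w_ref, atol=0.1))  # True
```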
Authors: Bontempo, Gianpaolo; Bolelli, Federico; Porrello, Angelo; Calderara, Simone; Ficarra, Elisa
Published in: IEEE TRANSACTIONS ON MEDICAL IMAGING
The use of Multiple Instance Learning (MIL) for classifying Whole Slide Images (WSIs) has recently increased. Due to their gigapixel size, pixel-level annotation of such data is extremely expensive and time-consuming, and practically unfeasible. For this reason, multiple automatic approaches have been proposed in recent years to support clinical practice and diagnosis. Unfortunately, most state-of-the-art proposals apply attention mechanisms without considering the spatial correlation between instances and usually work at a single-scale resolution. To leverage the full potential of pyramidal-structured WSIs, we propose a graph-based multi-scale MIL approach, DAS-MIL. Our model comprises three components: i) a self-supervised feature extractor; ii) a graph-based architecture that precedes the MIL mechanism and creates a more contextualized representation of the WSI structure by considering the mutual (spatial) correlation of instances both across and within scales; and iii) a (self-)distillation loss between resolutions, introduced to compensate for their informative gap and significantly improve the final prediction. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform the state of the art on WSI classification, gaining +2.7% AUC and +3.7% accuracy on the popular Camelyon16 benchmark.
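The graph modules of DAS-MIL are not reproduced here; the snippet below only illustrates the last ingredient named in the abstract, a self-distillation term between the bag-level predictions produced at two resolutions, written as a temperature-scaled KL divergence. The temperature, the KL direction, and the use of a detached high-resolution teacher are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def cross_scale_distillation(logits_high: torch.Tensor,
                             logits_low: torch.Tensor,
                             temperature: float = 2.0) -> torch.Tensor:
    """KL divergence pushing the low-resolution bag prediction towards the
    (detached) high-resolution one, softened by `temperature`."""
    teacher = F.softmax(logits_high.detach() / temperature, dim=-1)
    student = F.log_softmax(logits_low / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

# Toy usage with bag-level logits for binary WSI classification:
print(cross_scale_distillation(torch.randn(4, 2), torch.randn(4, 2)).item())
```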
Authors: Capitani, Giacomo; Lucarini, Alice; Bonicelli, Lorenzo; Bolelli, Federico; Calderara, Simone; Vezzali, Loris; Ficarra, Elisa
Implicit biases, subtle and unconscious attitudes, permeate various facets of human decision-making and are similarly pervasive in Artificial Intelligence (AI) systems. These biases can stem from shortcut learning, where models rely on superficial patterns that do not capture the underlying phenomena. Inspired by social psychology studies, we introduce two novel metrics to analyze implicit biases in vision-language models. Our comprehensive analysis of 90 OpenCLIP models reveals widespread anomalies related to ethnicity and gender. The first metric considers the cosine similarity between images and text prompts related to social stereotypes. The second metric adapts the Implicit Association Test (IAT), which evaluates prejudice and hidden discrimination in human behavior. Our findings illustrate that conventional text-based debiasing efforts can inadvertently amplify second-order biases instead of mitigating them. Furthermore, extending our evaluation to multimodal Large Language Models (LLMs), we demonstrate disparities in the tendency to generate semantically positive or negative outputs depending on the ethnicity or gender of the individuals depicted in the input images.
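The paper's prompt sets and its IAT adaptation are specific to the study; the sketch below only illustrates the building block of the first metric, scoring images against stereotype-related text prompts via cosine similarity of OpenCLIP embeddings. The checkpoint name, the prompt wording, and the image path are placeholders.

```python
import torch
import open_clip
from PIL import Image

# One placeholder checkpoint; the analysis above spans 90 OpenCLIP models.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

prompts = ["a photo of a trustworthy person",   # illustrative wording only,
           "a photo of a dangerous person"]     # not the paper's prompt set

image = preprocess(Image.open("portrait.jpg")).unsqueeze(0)  # hypothetical input
with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(prompts))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = img_emb @ txt_emb.T                # (1, num_prompts)
print(cosine)  # per-prompt similarities, compared across demographic groups
```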