OXA-MISS: A Robust Multimodal Architecture for Chemotherapy Response Prediction under Data Scarcity
Authors: Miccolis, Francesca; Marinelli, Fabio; Pipoli, Vittorio; Afenteva, Daria; Virtanen, Anni; Lovino, Marta; Ficarra, Elisa
Explore our research publications: papers, articles, and conference proceedings from AImageLab.
Authors: Amoroso, Roberto; Zhang, Gengyuan; Koner, Rajat; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker
Video Question Answering (Video QA) is a critical and challenging task in video understanding, necessitating models to comprehend entire videos, identify the most pertinent information based on the contextual cues from the question, and reason accurately to provide answers. Initial endeavors in harnessing Multimodal Large Language Models (MLLMs) have cast new light on Visual QA, particularly highlighting their commonsense and temporal reasoning capacities. Models that effectively align visual and textual elements can offer more accurate answers tailored to visual inputs. Nevertheless, an unresolved question persists regarding video content: How can we efficiently extract the most relevant information from videos over time and space for enhanced VQA? In this study, we evaluate the efficacy of various temporal modeling techniques in conjunction with MLLMs and introduce a novel component, T-Former, designed as a question-guided temporal querying transformer. T-Former bridges frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across various VideoQA benchmarks shows that T-Former, with its linear computational complexity, competes favorably with existing temporal modeling approaches and aligns with the latest advancements in Video QA tasks.
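The question-guided temporal querying idea can be illustrated with a minimal numpy sketch: a small bank of learnable queries, conditioned on the question embedding, cross-attends over per-frame features, so cost grows linearly with the number of frames. All names and dimensions below are illustrative assumptions, not the published T-Former implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_attention(frame_feats, question_emb, query_bank):
    """Cross-attend a small query bank (conditioned on the question) over
    per-frame features. Cost is O(num_queries * num_frames): linear in the
    number of frames, unlike full frame-to-frame self-attention."""
    d = frame_feats.shape[-1]
    queries = query_bank + question_emb          # broadcast question into each query
    scores = queries @ frame_feats.T / np.sqrt(d)  # (Q, T) attention logits
    attn = softmax(scores, axis=-1)
    return attn @ frame_feats                    # (Q, d) condensed video tokens

rng = np.random.default_rng(0)
T, Q, d = 32, 4, 8                   # 32 frames, 4 queries, feature dim 8
frames = rng.normal(size=(T, d))
question = rng.normal(size=(d,))
bank = rng.normal(size=(Q, d))
out = temporal_query_attention(frames, question, bank)
print(out.shape)  # (4, 8)
```

The Q condensed tokens would then be handed to the LLM in place of the full frame sequence.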
Authors: Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries -- composed of both an image and a text -- and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.
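The LSTM-inspired gating described above can be sketched as a recurrence over backbone depth: at each layer, sigmoid gates decide how much of the running state to keep and how much of that layer's fused visual+textual features to admit. This is a toy illustration under assumed shapes, not the actual ReT cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion_step(state, vis_l, txt_l, Wf, Wi):
    """One recurrence step over layers: LSTM-style forget/input gates blend
    the running multi-level state with this layer's fused features."""
    fused = vis_l + txt_l                  # naive cross-modal fusion for the sketch
    z = np.concatenate([state, fused])
    f = sigmoid(Wf @ z)                    # forget gate
    i = sigmoid(Wi @ z)                    # input gate
    return f * state + i * fused

rng = np.random.default_rng(1)
d, L = 8, 4                                # feature dim, number of backbone layers
Wf, Wi = rng.normal(size=(2, d, 2 * d))
state = np.zeros(d)
for layer in range(L):                     # iterate over layer-wise features
    vis = rng.normal(size=d)
    txt = rng.normal(size=d)
    state = gated_fusion_step(state, vis, txt, Wf, Wi)
print(state.shape)  # (8,)
```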
Authors: Savarese, Marco; De Blasi, Antonio; Zaccagnino, Carmine; Salici, Giacomo; Cascianelli, Silvia; Vezzani, Roberto; Grazia, Carlo Augusto
Efficient energy provisioning is a fundamental requirement for modern transportation systems, making refueling path optimization a critical challenge. Existing solutions often focus either on inter-vehicle communication or intra-vehicle monitoring, leveraging Intelligent Transportation Systems, Digital Twins, and Software-Defined Internet of Vehicles with Cloud/Fog/Edge infrastructures. However, integrated frameworks that adapt dynamically to driver mobility patterns are still underdeveloped. Building on our previous PIENO framework, we present RI-PIENO (Revised and Improved Petrolfilling Itinerary Estimation aNd Optimization), a system that combines intra-vehicle sensor data with external geospatial and fuel price information, processed via IoT-enabled Cloud/Fog services. RI-PIENO models refueling as a dynamic, time-evolving directed acyclic graph that reflects both habitual daily trips and real-time vehicular inputs, transforming the system from a static recommendation tool into a continuously adaptive decision engine. We validate RI-PIENO in a daily-commute use case through realistic multi-driver, multi-week simulations, showing that it achieves significant cost savings and more efficient routing compared to previous approaches. The framework is designed to leverage emerging roadside infrastructure and V2X communication, supporting scalable deployment within next-generation IoT and vehicular networking ecosystems.
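Optimizing over a time-evolving directed acyclic graph reduces, at each decision instant, to a shortest-path computation on a DAG. A minimal sketch of that core step, with hypothetical edge costs combining travel and fuel price (plain dynamic programming, nodes assumed labeled in topological order; this is not the RI-PIENO system itself):

```python
from collections import defaultdict

def cheapest_refuel_path(edges, source, target):
    """Minimum-cost path over a DAG whose edge weights fold together travel
    cost and the fuel price at a candidate refueling stop. Dynamic
    programming in topological order; here node labels are assumed to
    already be a topological order, as in a time-expanded graph."""
    adj = defaultdict(list)
    nodes = {source, target}
    for u, v, cost in edges:
        adj[u].append((v, cost))
        nodes.update((u, v))
    INF = float("inf")
    best = {n: INF for n in nodes}     # best known cost to reach each node
    best[source] = 0.0
    for u in sorted(nodes):            # topological order by label
        if best[u] == INF:
            continue
        for v, cost in adj[u]:
            best[v] = min(best[v], best[u] + cost)
    return best[target]

# Hypothetical commute: two candidate stations (nodes 1 and 2) between
# home (0) and work (3); edge cost = travel + fuel expense at the stop.
edges = [(0, 1, 2.0), (0, 2, 1.5), (1, 3, 1.0), (2, 3, 2.5)]
print(cheapest_refuel_path(edges, 0, 3))  # 3.0 via station 1
```

In the time-evolving setting, the edge list would be regenerated as fuel prices and vehicular inputs change, and the DP rerun on the updated graph.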
Authors: Bellameche, F.; Modica, F.; Cortiello, M.; Costi, E.; Riccioni, C.; De Marchis, F.; Rubini, A.; Belfiori, B.; Bellucci, M.; Brilli, L.; Sberveglieri, V.; Lovino, M.; Núñez-Carmona, E.; Giovanardi, D.
Published in: JOURNAL OF PLANT PATHOLOGY
The control of soil-borne diseases in hops, such as Verticillium wilt, remains challenging due to the limited effectiveness of fungicides, the perennial nature of hop cultivation, and the long-term persistence of the pathogens in the soil. Microbial biocontrol agents (mBCAs) with plant growth-promoting (PGP) and antagonistic effects offer a sustainable, eco-friendly alternative for hop protection. Two Pseudomonas spp. strains from the UniMORE microbial collection were selected for this study based on their strong antagonistic activity against Verticillium spp. and multiple PGP traits. Rhizospheric and endophytic colonization capacities of the strains DLS1929 and DLS2318 were evaluated in hop plants (cv. Cascade) under controlled conditions at seven and fourteen days post-inoculation (DPI). Both bacterial strains were rhizosphere and endorhiza competent, with slight differences in their abundances. The highest cell density was observed at 7 DPI for the strain DLS2318, reaching log10 6.39 CFU g−1 root fresh weight in the rhizosphere and log10 4.75 CFU g−1 root fresh weight in the endorhiza; at 14 DPI, colonization results were in line with the previous assessment. Confocal laser scanning microscopy visualization of both eGFP-tagged Pseudomonas spp. strains confirmed their rhizosphere competence in hop. Additionally, root colonization by these bacteria enhanced the photosynthetic capacity in hop leaves, supporting their potential as PGP agents observed in vitro. Successful root colonization and PGP effects are key prerequisites for effective biocontrol of soil-borne pathogens. Further studies are required to assess the consistent field efficacy of these beneficial mBCA candidates. This research was funded by the Italian Ministry of University and Research (MUR), under the European Union funding – Next Generation EU – PRIN 2022 (prot. 2022M3HR45), project: “IoHOP: Quality valorization of the Italian hop based on a multi-approach strategy”.
Authors: Di Domenico, N.; Borghi, G.; Franco, A.; Boschetti, M.; Giacomini, F.; Barzaghi, S.; Ferucci, S.; Zambruno, S.; Mularoni, L.; Gao, Q.; Che, C.; Li, G.; Zu, Y.; Hao, J.; Zhang, J.; Ducz, A.; Gego, L.; Imeri, K.; Nemkin, V.; Rakhmatillaev, A.; Szatmari, S.; Rowan, W.
Published in: LECTURE NOTES IN COMPUTER SCIENCE
The sixth-century Basilica of San Vitale in Ravenna, Italy, once featured intricate circular colored glass windows that illuminated its interior. Although these windows are now lost, several fragments were recovered during recent restorations. Unfortunately, reconstructing the original glass windows from these fragments is extremely complex and time-consuming, requiring specialized expertise. The development of automatic reconstruction techniques based on Artificial Intelligence is therefore particularly important, and challenging due to factors such as uniform coloring, damaged glass edges, and numerous fragment outliers. In this direction, the San Vitale Challenge was organized to gather the best methods and algorithms, as described and summarized in this paper. The challenge, split into several sub-tracks of increasing difficulty and realism, received several submissions, ranging from classical computer vision algorithms to purely deep learning-based approaches, whose results are quantitatively evaluated and compared. In the last part of the paper, directions for future developments of such systems are discussed.
Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Papasidero, Marco; Ruozzi, Federico; Cucchiara, Rita
Published in: ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE
We introduce Sanctuaria-Gaze, a multimodal dataset featuring egocentric recordings from 40 visits to four architecturally and culturally significant sanctuaries in Northern Italy. Collected using wearable devices with integrated eye trackers, the dataset offers RGB videos synchronized with streams of gaze coordinates, head motion, and environmental point cloud, resulting in over four hours of recordings. Along with the dataset, we provide a framework for automatic detection and analysis of Areas of Interest (AOIs). This framework fills a critical gap by offering an open-source, flexible tool for gaze-based research that adapts to dynamic settings without requiring manual intervention. Our study analyzes human visual attention to sacred, architectural, and cultural objects, providing insights into how visitors engage with these elements and how their background influences their interactions. By releasing both the dataset and the analysis framework, Sanctuaria-Gaze aims to advance interdisciplinary research on gaze behavior, human-computer interaction, and visual attention in real-world environments. Code and dataset are available at https://github.com/aimagelab/Sanctuaria-Gaze.
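The core of gaze-based AOI analysis is mapping synchronized gaze coordinates onto regions and accumulating dwell time. A minimal sketch under simplifying assumptions (static, axis-aligned AOI boxes in normalized coordinates and a fixed sampling interval; the released framework handles dynamic scenes without such constraints):

```python
def aoi_dwell_times(gaze_samples, aois, dt):
    """Accumulate per-AOI dwell time from gaze coordinates. Each AOI is a
    hypothetical axis-aligned box (x0, y0, x1, y1) in normalized image
    coordinates; dt is the time between consecutive gaze samples."""
    dwell = {name: 0.0 for name in aois}
    for x, y in gaze_samples:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                dwell[name] += dt
    return dwell

# Illustrative AOIs and gaze track (names and coordinates are made up).
aois = {"altar": (0.4, 0.4, 0.6, 0.7), "fresco": (0.0, 0.0, 0.3, 0.3)}
samples = [(0.5, 0.5), (0.5, 0.6), (0.1, 0.2), (0.9, 0.9)]
print(aoi_dwell_times(samples, aois, dt=1 / 30))
```

Dynamic settings would replace the fixed boxes with per-frame AOI detections before the same accumulation step.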