Publications
Explore our research publications: papers, articles, and conference proceedings from AImageLab.
Spotting Culex pipiens from satellite: modeling habitat suitability in central Italy using Sentinel-2 and deep learning techniques
Authors: Ippoliti, Carla; Bonicelli, Lorenzo; De Ascentis, Matteo; Tora, Susanna; Di Lorenzo, Alessio; Gerardo D’Alessio, Silvio; Porrello, Angelo; Bonanni, Americo; Cioci, Daniela; Goffredo, Maria; Calderara, Simone; Conte, Annamaria
Published in: FRONTIERS IN VETERINARY SCIENCE
Culex pipiens, an important vector of many vector-borne diseases, is a species capable of feeding on a wide variety of hosts and adapting to different environments. To predict the potential distribution of Cx. pipiens in central Italy, this study integrated presence/absence data from a four-year entomological survey (2019-2022) carried out in the Abruzzo and Molise regions with a datacube of spectral bands acquired by Sentinel-2 satellites, as patches of 224 x 224 pixels at 20-meter spatial resolution around each site and for each satellite revisit time. We investigated three scenarios: the baseline model, which considers the environmental conditions at the time of collection; the multitemporal model, focusing on conditions in the 2 months preceding the collection; and the MultiAdjacency Graph Attention Network (MAGAT) model, which accounts for similarities in temperature and nearby sites using a graph architecture. For the baseline scenario, a deep convolutional neural network (DCNN) analyzed a single multi-band Sentinel-2 image. The DCNN in the multitemporal model extracted temporal patterns from a sequence of 10 multispectral images; the MAGAT model incorporated spatial and climatic relationships among sites through a graph neural network aggregation method. For all models, we also evaluated temporal lags between the acquisition date of the multi-band Earth Observation datacube and the mosquito collection, from 0 to 50 days. The study encompassed a total of 2,555 entomological collections and 108,064 images (patches) at 20-meter spatial resolution. The baseline model achieved an F1 score higher than 75.8% for any temporal lag, which increased up to 81.4% with the multitemporal model. The MAGAT model recorded the highest F1 score of 80.9%. The study confirms the widespread presence of Cx. pipiens throughout the majority of the surveyed area.
Utilizing only Sentinel-2 spectral bands, the models effectively capture the temporal patterns of the mosquito population well in advance, offering valuable insights for directing surveillance activities during the vector season. The methodology developed in this study can be scaled up to the national territory and extended to other vectors, in order to support the Ministry of Health in surveillance and control strategies for the vectors and the diseases they transmit.
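The MAGAT model described above aggregates information across nearby, climatically similar sites. A minimal sketch of one attention-weighted aggregation step over a site graph is shown below; the function name, features, and adjacency matrix are illustrative, not the paper's code.

```python
import numpy as np

def graph_attention_aggregate(site_feats, adjacency):
    """One simplified attention-weighted aggregation step over neighbouring
    sites, in the spirit of a graph attention network (illustrative only)."""
    # Attention logits from dot-product similarity between site features
    logits = site_feats @ site_feats.T
    # Mask out non-adjacent sites before the softmax
    logits = np.where(adjacency > 0, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each site's new representation is a weighted mix of its neighbours
    return weights @ site_feats

# Three hypothetical sites; sites 0 and 1 are connected, site 2 is isolated
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
adj = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
out = graph_attention_aggregate(feats, adj)
```

Note that the isolated site keeps its own features unchanged, while connected sites blend with their neighbours.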
Sustainable Use of Resources in Hospitals: A Machine Learning-Based Approach to Predict Prolonged Length of Stay at the Time of Admission
Authors: Perliti Scorzoni, Paolo; Giovanetti, Anita; Bolelli, Federico; Grana, Costantino
Introduction. Length of Stay (LOS) and Prolonged Length of Stay (pLOS) are critical indicators of hospital efficiency. Reducing pLOS is crucial for patient safety, autonomy, and bed allocation. This study investigates different machine learning (ML) models to predict LOS and pLOS. Methods. We analyzed a dataset of patients discharged from a northern Italian hospital between 2022 and 2023 as a retrospective cohort study. We compared sixteen regression algorithms and twelve classification methods for predicting LOS as either a continuous or multi-class variable (1-3 days, 4-10 days, >10 days). We also evaluated pLOS prediction using the same models, with pLOS defined as any hospitalization with LOS longer than 8 days. We further analyzed all models using two versions of the same dataset: one containing only structured data (e.g., demographics and clinical information), and one that also contains features extracted from free-text diagnoses. Results. Our results indicate that ensemble models achieved the highest prediction accuracy for both LOS and pLOS, outperforming traditional single-algorithm models, particularly when using both structured and unstructured data extracted from diagnoses. Discussion. The integration of ML, particularly ensemble models, can significantly improve LOS prediction and identify patients at increased risk of pLOS. This information can guide healthcare professionals and bed managers in making informed decisions to enhance patient care and optimize resource allocation.
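The study's target variables can be reproduced directly from the thresholds stated in the abstract. A minimal sketch, assuming only the bin boundaries given above (the helper names are our own):

```python
def los_class(days):
    """Bin a length of stay into the three classes used in the study."""
    if days <= 3:
        return "1-3 days"
    if days <= 10:
        return "4-10 days"
    return ">10 days"

def is_plos(days, threshold=8):
    """Prolonged LOS: any hospitalization longer than the 8-day threshold."""
    return days > threshold
```

These labels would then serve as the targets for the regression, multi-class, and binary pLOS models compared in the paper.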
The Revolution of Multimodal Large Language Models: A Survey
Authors: Caffagni, Davide; Cocchi, Federico; Barsellotti, Luca; Moratelli, Nicholas; Sarto, Sara; Baraldi, Lorenzo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita
Published in: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING
Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.
Towards Federated Learning for Morphing Attack Detection
Authors: Robledo-Moreno, M.; Borghi, G.; Di Domenico, N.; Franco, A.; Raja, K.; Maltoni, D.
The Face Morphing attack makes it possible for two different people to use the same legal document, destroying the unique biometric link between the document and its owner. In other words, a morphed face image has the potential to bypass face verification-based security controls, thus representing a severe security threat. Unfortunately, the lack of public, extensive, and varied training datasets severely hampers the development of effective and robust Morphing Attack Detection (MAD) models, key tools for countering the Face Morphing attack since they can automatically detect morphed images. Indeed, privacy regulations limit the possibility of acquiring, storing, and transferring MAD-related data that contain personal information, such as faces. Therefore, in this paper, we investigate the use of Federated Learning (FL) to train a MAD model on local training samples across multiple sites, eliminating the need for a single centralized training dataset, as is common in Machine Learning, thus overcoming privacy limitations. Experimental results suggest that FL is a viable solution that will need to be considered in future research works in MAD.
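The core of the federated setup above is that only model parameters, never face images, leave each site. A minimal sketch of the standard FedAvg aggregation step, assuming per-site weight arrays and sample counts (not the paper's actual training loop):

```python
import numpy as np

def fed_avg(site_weights, site_sizes):
    """Federated averaging: combine per-site model weights without ever
    pooling the sites' face images. Each site's parameters are weighted
    by its share of the total training samples."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Two hypothetical sites: one with 1 sample, one with 3
avg = fed_avg([np.array([1.0]), np.array([3.0])], [1, 3])
```

In a full FL round, the server would broadcast the averaged weights back to the sites for the next local training epoch.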
Towards Retrieval-Augmented Architectures for Image Captioning
Authors: Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita
Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS
The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach toward developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
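The knowledge retriever described above selects captions from an external memory by visual similarity. A toy sketch of cosine-similarity kNN retrieval over image embeddings, with invented embeddings and captions (not the paper's retriever):

```python
import numpy as np

def knn_retrieve(query, memory_keys, memory_captions, k=2):
    """Return the k captions whose image embeddings are most similar
    (cosine similarity) to the query image embedding."""
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                      # cosine similarity to each memory key
    top = np.argsort(-sims)[:k]          # indices of the k nearest keys
    return [memory_captions[i] for i in top]

# Hypothetical memory of three embedded images with their captions
memory_keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
memory_captions = ["a dog on grass", "a puppy playing", "a cat indoors"]
retrieved = knn_retrieve(np.array([1.0, 0.0]), memory_keys, memory_captions)
```

The retrieved text would then condition the kNN-augmented language model's token predictions alongside the encoded input image.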
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
Authors: Barsellotti, Luca; Amoroso, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Published in: IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION
Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Furthermore, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU, without requiring any training. Our source code is available at https://aimagelab.github.io/freeda/.
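The test-time matching step above assigns each class-agnostic region to the semantic class whose reference prototype it most resembles. A toy sketch with invented embeddings and class names (see the released code at the URL above for the actual method):

```python
import numpy as np

def assign_regions(region_embs, prototypes, class_names):
    """Match each class-agnostic region embedding to the closest class
    prototype by cosine similarity."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = r @ p.T                      # regions x classes similarity matrix
    return [class_names[i] for i in sims.argmax(axis=1)]

# Two hypothetical prototypes and two region embeddings
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = assign_regions(np.array([[0.9, 0.1], [0.1, 0.9]]),
                        prototypes, ["sky", "grass"])
```

FreeDA additionally combines this local matching with global semantic similarities, which the sketch omits.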
Trends, Applications, and Challenges in Human Attention Modelling
Authors: Cartella, Giuseppe; Cornia, Marcella; Cuculo, Vittorio; D'Amelio, Alessandro; Zanca, Dario; Boccignone, Giuseppe; Cucchiara, Rita
Published in: IJCAI
Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges.
Unlearning Vision Transformers without Retaining Data via Low-Rank Decompositions
Authors: Poppi, Samuele; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
The implementation of data protection regulations such as the GDPR and the California Consumer Privacy Act has sparked a growing interest in removing sensitive information from pre-trained models without requiring retraining from scratch, all while maintaining predictive performance on the remaining data. Recent studies on machine unlearning for deep neural networks have produced various attempts that put constraints on the training procedure, are limited to small-scale architectures, and adapt poorly to real-world requirements. In this paper, we develop an approach to delete information on a class from a pre-trained model by injecting a trainable low-rank decomposition into the network parameters, without requiring access to the original training set. Our approach greatly reduces the number of parameters to train as well as time and memory requirements. This allows a painless application to real-life settings where the entire training set is unavailable, and compliance with the requirement of time-bound deletion. We conduct experiments on various Vision Transformer architectures for class forgetting. Extensive empirical analyses demonstrate that our proposed method is efficient, safe to apply, and effective in removing learned information while maintaining accuracy.
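The parameter savings come from training only a low-rank update on top of each frozen weight matrix. A minimal sketch of injecting such a decomposition, with invented class and dimensions; the unlearning objective that would train A and B is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update A @ B (rank r).
    Only A and B are trained, so the parameter count drops from
    d_out * d_in to r * (d_out + d_in)."""
    def __init__(self, weight, rank=2):
        self.weight = weight                           # frozen pre-trained matrix
        d_out, d_in = weight.shape
        self.A = rng.normal(0.0, 0.01, (d_out, rank))  # trainable
        self.B = np.zeros((rank, d_in))                # trainable, zero-init
    def forward(self, x):
        # Output equals the original model at init because B is zero
        return (self.weight + self.A @ self.B) @ x
    def trainable_params(self):
        return self.A.size + self.B.size

adapter = LowRankAdapter(np.eye(4), rank=1)
x = np.arange(4.0)
```

With a 4x4 weight and rank 1, only 8 of 16 parameters are trainable; the gap widens sharply at Transformer scale.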