Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Active filters: Author: Rita Cucchiara

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

Authors: Singh Maharjan, Rahul; Rawal, Niyati; Romeo, Marta; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

Published in: PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING

2025 Conference proceedings paper

One transformer for all time series: representing and training with time-dependent heterogeneous tabular data

Authors: Luetto, S.; Garuti, F.; Sangineto, E.; Forni, L.; Cucchiara, R.

Published in: MACHINE LEARNING

There is a recent growing interest in applying Deep Learning techniques to tabular data in order to replicate the success of other Artificial Intelligence areas in this structured domain. Particularly interesting is the case in which tabular data have a time dependence, such as, for instance, financial transactions. However, the heterogeneity of the tabular values, in which categorical elements are mixed with numerical features, makes this adaptation difficult. In this paper we propose UniTTab, a Transformer based architecture whose goal is to uniformly represent heterogeneous time-dependent tabular data, in which both numerical and categorical features are described using continuous embedding vectors. Moreover, differently from common approaches, which use a combination of different loss functions for training with both numerical and categorical targets, UniTTab is uniformly trained with a unique Masked Token pretext task. Finally, UniTTab can also represent time series in which the individual row components have a variable internal structure with a variable number of fields, which is a common situation in many application domains, such as in real world transactional data. Using extensive experiments with five datasets of variable size and complexity, we empirically show that UniTTab consistently and significantly improves the prediction accuracy over several downstream tasks and with respect to both Deep Learning and more standard Machine Learning approaches. Our code and our models are available at: https://github.com/fabriziogaruti/UniTTab.

2025 Journal article
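
For readers curious how a single Transformer can ingest mixed categorical and numerical fields, the toy sketch below embeds every field of a row into one continuous space and masks random field tokens for a masked-token pretext task. Field counts, cardinalities, and dimensions are invented for illustration; the actual implementation is in the linked repository.

```python
# Toy illustration (not the released UniTTab code): every field of a
# tabular row, categorical or numerical, is mapped to a continuous
# vector of the same width, so one Transformer and one masked-token
# objective can handle both. Field counts and sizes are invented.
import torch
import torch.nn as nn


class HeterogeneousRowEmbedder(nn.Module):
    def __init__(self, cat_cardinalities, num_numeric, d_model=64):
        super().__init__()
        # One embedding table per categorical field.
        self.cat_embeds = nn.ModuleList(
            [nn.Embedding(card, d_model) for card in cat_cardinalities]
        )
        # Each numerical field is projected from a scalar to d_model.
        self.num_projs = nn.ModuleList(
            [nn.Linear(1, d_model) for _ in range(num_numeric)]
        )
        # Learned [MASK] vector used by the masked-token pretext task.
        self.mask_token = nn.Parameter(torch.randn(d_model))

    def forward(self, cat_values, num_values, mask_prob=0.15):
        # cat_values: (batch, n_cat) long; num_values: (batch, n_num) float
        tokens = [emb(cat_values[:, i]) for i, emb in enumerate(self.cat_embeds)]
        tokens += [proj(num_values[:, i:i + 1]) for i, proj in enumerate(self.num_projs)]
        tokens = torch.stack(tokens, dim=1)          # (batch, n_fields, d_model)
        # Randomly replace some field tokens with [MASK] (pretext targets).
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_prob
        tokens[mask] = self.mask_token
        return tokens, mask


embedder = HeterogeneousRowEmbedder(cat_cardinalities=[10, 5], num_numeric=3)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
cats = torch.randint(0, 5, (8, 2))
nums = torch.randn(8, 3)
tokens, mask = embedder(cats, nums)
print(encoder(tokens).shape)   # torch.Size([8, 5, 64])
```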

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Authors: Amoroso, Roberto; Zhang, Gengyuan; Koner, Rajat; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker

Video Question Answering (Video QA) is a critical and challenging task in video understanding, necessitating models to comprehend entire videos, identify the most pertinent information based on the contextual cues from the question, and reason accurately to provide answers. Initial endeavors in harnessing Multimodal Large Language Models (MLLMs) have cast new light on Visual QA, particularly highlighting their commonsense and temporal reasoning capacities. Models that effectively align visual and textual elements can offer more accurate answers tailored to visual inputs. Nevertheless, an unresolved question persists regarding video content: How can we efficiently extract the most relevant information from videos over time and space for enhanced VQA? In this study, we evaluate the efficacy of various temporal modeling techniques in conjunction with MLLMs and introduce a novel component, T-Former, designed as a question-guided temporal querying transformer. T-Former bridges frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across various VideoQA benchmarks shows that T-Former, with its linear computational complexity, competes favorably with existing temporal modeling approaches and aligns with the latest advancements in Video QA tasks.

2025 Conference proceedings paper
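
As a rough illustration of question-guided temporal querying, the sketch below conditions a small set of learnable queries on a pooled question embedding and lets them cross-attend over per-frame features, so cost scales linearly with the number of frames. The module name, the additive conditioning, and all dimensions are assumptions, not the paper's actual design.

```python
# Sketch of a question-guided temporal querying module in the spirit of
# T-Former: a fixed set of learnable queries, conditioned on the
# question, cross-attends over frame-wise features. Hypothetical names
# and sizes; not the implementation from the paper.
import torch
import torch.nn as nn


class QuestionGuidedTemporalQuery(nn.Module):
    def __init__(self, d_model=256, n_queries=8, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_feats, question_emb):
        # frame_feats: (batch, n_frames, d_model) frame-wise visual features
        # question_emb: (batch, d_model) pooled question representation
        b = frame_feats.size(0)
        # Condition the learnable queries on the question by simple addition.
        q = self.queries.unsqueeze(0).expand(b, -1, -1) + question_emb.unsqueeze(1)
        # Each query attends over all frames: O(n_queries * n_frames).
        out, _ = self.cross_attn(query=q, key=frame_feats, value=frame_feats)
        return self.norm(out)   # (batch, n_queries, d_model) tokens for the LLM


module = QuestionGuidedTemporalQuery()
video = torch.randn(2, 32, 256)       # 32 frames
question = torch.randn(2, 256)
print(module(video, question).shape)  # torch.Size([2, 8, 256])
```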

Pixels of Faith: Exploiting Visual Saliency to Detect Religious Image Manipulation

Authors: Cartella, G.; Cuculo, V.; Cornia, M.; Papasidero, M.; Ruozzi, F.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Conference proceedings paper

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Authors: Sarto, Sara; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

2025 Journal article

Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Authors: Caffagni, Davide; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries -- composed of both an image and a text -- and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.

2025 Conference proceedings paper
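
The gated recurrent fusion described in the ReT abstract can be pictured with the toy cell below, which folds layer-wise visual and textual features into a running state through sigmoid input and forget gates. This is a simplified reading of the idea under assumed shapes and a concatenation-based fusion, not the code released at https://github.com/aimagelab/ReT.

```python
# Toy gated recurrent fusion cell, loosely inspired by the abstract
# above: layer-wise visual and textual features are fused and folded
# into a running state through sigmoidal, LSTM-style gates.
import torch
import torch.nn as nn


class GatedLayerFusionCell(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)
        # LSTM-style gates computed from [state, fused features].
        self.input_gate = nn.Linear(2 * d_model, d_model)
        self.forget_gate = nn.Linear(2 * d_model, d_model)
        self.candidate = nn.Linear(2 * d_model, d_model)

    def forward(self, visual_layers, text_layers):
        # visual_layers / text_layers: lists of (batch, d_model) features,
        # one entry per backbone layer, shallow to deep.
        state = torch.zeros_like(visual_layers[0])
        for v, t in zip(visual_layers, text_layers):
            fused = torch.tanh(self.fuse(torch.cat([v, t], dim=-1)))
            gates_in = torch.cat([state, fused], dim=-1)
            i = torch.sigmoid(self.input_gate(gates_in))
            f = torch.sigmoid(self.forget_gate(gates_in))
            state = f * state + i * torch.tanh(self.candidate(gates_in))
        return state   # multi-level, cross-modal query/document embedding


cell = GatedLayerFusionCell()
vis = [torch.randn(4, 256) for _ in range(6)]   # e.g. 6 backbone layers
txt = [torch.randn(4, 256) for _ in range(6)]
print(cell(vis, txt).shape)   # torch.Size([4, 256])
```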

Sanctuaria-Gaze: A Multimodal Egocentric Dataset for Human Attention Analysis in Religious Sites

Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Papasidero, Marco; Ruozzi, Federico; Cucchiara, Rita

Published in: ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE

We introduce Sanctuaria-Gaze, a multimodal dataset featuring egocentric recordings from 40 visits to four architecturally and culturally significant sanctuaries in Northern Italy. Collected using wearable devices with integrated eye trackers, the dataset offers RGB videos synchronized with streams of gaze coordinates, head motion, and environmental point cloud, resulting in over four hours of recordings. Along with the dataset, we provide a framework for automatic detection and analysis of Areas of Interest (AOIs). This framework fills a critical gap by offering an open-source, flexible tool for gaze-based research that adapts to dynamic settings without requiring manual intervention. Our study analyzes human visual attention to sacred, architectural, and cultural objects, providing insights into how visitors engage with these elements and how their background influences their interactions. By releasing both the dataset and the analysis framework, Sanctuaria-Gaze aims to advance interdisciplinary research on gaze behavior, human-computer interaction, and visual attention in real-world environments. Code and dataset are available at https://github.com/aimagelab/Sanctuaria-Gaze.

2025 Journal article
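
As a flavor of the gaze/AOI analysis such a framework supports, the snippet below computes dwell time per Area of Interest from synchronized gaze coordinates, assuming rectangular AOIs and a 30 fps recording; none of this comes from the Sanctuaria-Gaze toolkit itself.

```python
# Illustrative only: a tiny dwell-time computation of the kind a
# gaze/AOI analysis framework performs, counting how long the gaze
# point falls inside each Area of Interest. The AOI box and the 30 fps
# assumption are made up; this is not the Sanctuaria-Gaze code.
import numpy as np

FPS = 30  # assumed frame rate of the egocentric recording

def dwell_times(gaze_xy: np.ndarray, aois: dict[str, tuple]) -> dict[str, float]:
    """gaze_xy: (n_frames, 2) pixel coordinates; aois: name -> (x0, y0, x1, y1)."""
    times = {}
    for name, (x0, y0, x1, y1) in aois.items():
        inside = (
            (gaze_xy[:, 0] >= x0) & (gaze_xy[:, 0] <= x1)
            & (gaze_xy[:, 1] >= y0) & (gaze_xy[:, 1] <= y1)
        )
        times[name] = inside.sum() / FPS  # seconds of gaze inside the AOI
    return times

gaze = np.random.rand(300, 2) * [1920, 1080]        # 10 s of synthetic gaze
print(dwell_times(gaze, {"altar": (800, 200, 1200, 700)}))
```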

Semantically Conditioned Prompts for Visual Recognition under Missing Modality Scenarios

Authors: Pipoli, Vittorio; Bolelli, Federico; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Grana, Costantino; Cucchiara, Rita; Ficarra, Elisa

Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION

This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) we introduce a novel prompt learning module which is designed to produce sample-specific prompts and (ii) we show that modality-agnostic prompts can effectively adjust to diverse missing modality scenarios. Our model, termed SCP, exploits the semantic representation of available modalities to query a learnable memory bank, which allows the generation of prompts based on the semantics of the input. Notably, SCP distinguishes itself from existing methodologies for its capacity of self-adjusting to both the missing modality scenario and the semantic context of the input, without prior knowledge about the specific missing modality and the number of modalities. Through extensive experiments, we show the effectiveness of the proposed prompt learning framework and demonstrate enhanced performance and robustness across a spectrum of missing modality cases.

2025 Conference proceedings paper
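
One way to picture the memory-bank prompting described above: pool whatever modalities are available into a semantic query, attend over a learnable memory, and read out sample-specific prompt tokens. Slot count, pooling, and head design in the sketch below are assumptions, not the SCP implementation.

```python
# Sketch of semantic, sample-specific prompt generation: available
# modality features form a query over a learnable memory bank, and the
# readout is turned into prompt tokens. Shapes are hypothetical.
import torch
import torch.nn as nn


class SemanticPromptMemory(nn.Module):
    def __init__(self, d_model=256, n_slots=16, n_prompts=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.read = nn.Linear(d_model, n_prompts * d_model)
        self.n_prompts, self.d_model = n_prompts, d_model

    def forward(self, modality_feats):
        # modality_feats: list of (batch, d_model) features for the
        # modalities that are actually present for this sample.
        query = torch.stack(modality_feats, dim=1).mean(dim=1)       # (batch, d)
        attn = torch.softmax(query @ self.memory.t(), dim=-1)        # (batch, slots)
        readout = attn @ self.memory                                 # (batch, d)
        prompts = self.read(readout).view(-1, self.n_prompts, self.d_model)
        return prompts   # sample-specific prompts to prepend to the backbone input


scp_like = SemanticPromptMemory()
image_only = [torch.randn(2, 256)]            # text modality missing
print(scp_like(image_only).shape)             # torch.Size([2, 4, 256])
```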

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Barsellotti, Luca; Bianchi, Lorenzo; Messina, Nicola; Carrara, Fabio; Cornia, Marcella; Baraldi, Lorenzo; Falchi, Fabrizio; Cucchiara, Rita

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.

2025 Conference proceedings paper
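
The core alignment step can be illustrated as follows: a small learned mapping projects a frozen CLIP text embedding into the DINOv2 patch space, and per-patch cosine similarity yields a coarse relevance map. The 512/768 dimensions, the two-layer mapping, and the 16x16 grid in the sketch are assumptions, not the released Talk2DINO code.

```python
# Compressed illustration of the alignment idea: map CLIP text features
# into DINOv2 patch space with a learned projection, then score each
# patch by cosine similarity. Dimensions and grid size are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_to_patch = nn.Sequential(       # the only trained component; backbones stay frozen
    nn.Linear(512, 768), nn.GELU(), nn.Linear(768, 768)
)

def patch_relevance(text_emb, patch_feats, grid=(16, 16)):
    # text_emb: (512,) CLIP text embedding; patch_feats: (n_patches, 768)
    t = F.normalize(text_to_patch(text_emb), dim=-1)
    p = F.normalize(patch_feats, dim=-1)
    scores = p @ t                    # cosine similarity per patch
    return scores.view(*grid)         # coarse map, later upsampled / thresholded

text_emb = torch.randn(512)           # stand-in for a CLIP text embedding
patches = torch.randn(256, 768)       # stand-in for DINOv2 patch tokens
print(patch_relevance(text_emb, patches).shape)   # torch.Size([16, 16])
```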

TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

Authors: D'Amelio, Alessandro; Cartella, Giuseppe; Cuculo, Vittorio; Lucchi, Manuele; Cornia, Marcella; Cucchiara, Rita; Boccignone, Giuseppe

Attention guides our gaze to fixate the proper location of the scene and holds it in that location for the deserved amount of time given current processing demands, before shifting to the next one. As such, gaze deployment crucially is a temporal process. Existing computational models have made significant strides in predicting spatial aspects of observer's visual scanpaths (where to look), while often putting on the background the temporal facet of attention dynamics (when). In this paper we present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Process (TPP), that jointly learns the temporal dynamics of fixations position and duration, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches.

2025 Conference proceedings paper
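
To make the temporal point process idea concrete, the toy model below lets a GRU summarize the fixation history and parameterize an exponential density over the timing of the next fixation together with a Gaussian over its location. These distribution choices and all dimensions are illustrative assumptions, not the TPP-Gaze architecture.

```python
# Back-of-the-envelope neural temporal point process for scanpaths: a
# GRU encodes the fixation history and predicts "when" (exponential
# rate) and "where" (Gaussian mean) for the next fixation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScanpathTPP(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.rate_head = nn.Linear(hidden, 1)    # intensity of the next event
        self.loc_head = nn.Linear(hidden, 2)     # mean of the next (x, y)

    def nll(self, fixations):
        # fixations: (batch, n_fix, 3) = (x, y, fixation duration), assumed layout
        h, _ = self.rnn(fixations[:, :-1])       # history up to each step
        rate = F.softplus(self.rate_head(h)).squeeze(-1)
        loc = self.loc_head(h)
        next_dt = fixations[:, 1:, 2]
        next_xy = fixations[:, 1:, :2]
        # Exponential log-likelihood for "when", unit-variance Gaussian for "where".
        ll_time = torch.log(rate) - rate * next_dt
        ll_space = -0.5 * ((next_xy - loc) ** 2).sum(-1)
        return -(ll_time + ll_space).mean()


model = ScanpathTPP()
scanpaths = torch.rand(4, 10, 3)   # 4 synthetic scanpaths, 10 fixations each
print(model.nll(scanpaths))        # scalar training loss
```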

Page 4 of 51 • Total publications: 504