Publications by Giuseppe Cartella

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Authors: Cartella, Giuseppe; Cuculo, Vittorio; D'Amelio, Alessandro; Cornia, Marcella; Boccignone, Giuseppe; Cucchiara, Rita

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research.
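
To give a flavor of the diffusion-based generation at the heart of ScanDiff, the sketch below shows a generic DDPM-style ancestral sampler over fixation sequences. It is an illustration under assumptions, not the paper's released code: `eps_model` (a trained noise predictor), the conditioning tensor `cond`, and all shapes are hypothetical.

```python
# Illustrative DDPM-style sampler for scanpaths (x, y, duration); not ScanDiff's code.
import torch

def sample_scanpaths(eps_model, cond, n_samples=8, seq_len=10, steps=50):
    """Draw diverse scanpaths by ancestral sampling; eps_model is an assumed
    noise predictor taking (noisy sequence, timestep, conditioning)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(n_samples, seq_len, 3)  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.full((n_samples,), t), cond)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # per-step noise yields diverse outputs
    return x
```

The stochastic start and per-step noise are what make repeated calls produce different but plausible trajectories, which is the variability the abstract emphasizes.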

2025 Conference Paper

Pixels of Faith: Exploiting Visual Saliency to Detect Religious Image Manipulation

Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Papasidero, Marco; Ruozzi, Federico; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Conference Paper

Sanctuaria-Gaze: A Multimodal Egocentric Dataset for Human Attention Analysis in Religious Sites

Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Papasidero, Marco; Ruozzi, Federico; Cucchiara, Rita

Published in: ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE

We introduce Sanctuaria-Gaze, a multimodal dataset featuring egocentric recordings from 40 visits to four architecturally and culturally significant sanctuaries in Northern Italy. Collected using wearable devices with integrated eye trackers, the dataset offers RGB videos synchronized with streams of gaze coordinates, head motion, and environmental point clouds, amounting to over four hours of recordings. Along with the dataset, we provide a framework for the automatic detection and analysis of Areas of Interest (AOIs). This framework fills a critical gap by offering an open-source, flexible tool for gaze-based research that adapts to dynamic settings without requiring manual intervention. Our study analyzes human visual attention to sacred, architectural, and cultural objects, providing insights into how visitors engage with these elements and how their backgrounds influence their interactions. By releasing both the dataset and the analysis framework, Sanctuaria-Gaze aims to advance interdisciplinary research on gaze behavior, human-computer interaction, and visual attention in real-world environments. Code and dataset are available at https://github.com/aimagelab/Sanctuaria-Gaze.
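
As a taste of AOI-based analysis, here is a minimal dwell-time aggregation over gaze samples. It is a simplified sketch: the function, data layout, and fixed 30 Hz sampling rate are assumptions, and the released framework detects AOIs automatically rather than taking hand-specified boxes.

```python
# Toy dwell-time computation from normalized gaze coordinates; not the released framework.
from collections import defaultdict

def dwell_time_per_aoi(gaze_samples, aois, dt=1 / 30):
    """gaze_samples: iterable of (x, y); aois: {name: (x0, y0, x1, y1)} boxes.
    Returns seconds of gaze falling inside each AOI at an assumed sampling rate."""
    dwell = defaultdict(float)
    for x, y in gaze_samples:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                dwell[name] += dt
    return dict(dwell)

print(dwell_time_per_aoi(
    [(0.2, 0.3), (0.21, 0.31), (0.8, 0.8)],
    {"altar": (0.1, 0.2, 0.4, 0.5), "fresco": (0.7, 0.7, 0.9, 0.9)},
))
```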

2025 Journal Article

TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

Authors: D'Amelio, Alessandro; Cartella, Giuseppe; Cuculo, Vittorio; Lucchi, Manuele; Cornia, Marcella; Cucchiara, Rita; Boccignone, Giuseppe

Attention guides our gaze to fixate on the appropriate location in a scene and holds it there for as long as current processing demands require, before shifting to the next location. Gaze deployment is therefore crucially a temporal process. Existing computational models have made significant strides in predicting the spatial aspects of observers' visual scanpaths (where to look), while often relegating the temporal facet of attention dynamics (when) to the background. In this paper we present TPP-Gaze, a novel and principled approach to modelling scanpath dynamics, based on Neural Temporal Point Processes (TPPs), that jointly learns the temporal dynamics of fixation positions and durations, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches.
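
The point-process machinery can be grasped from the classic TPP log-likelihood: the sum of log-intensities at the observed event times minus the integrated intensity. The toy below assumes a constant intensity between fixation onsets, a homogeneous simplification of the neural intensity the paper learns.

```python
# Homogeneous TPP negative log-likelihood over fixation onset gaps; a teaching toy.
import torch

def tpp_nll(inter_arrival_times, log_intensity):
    """NLL of events under a constant rate: N * log(lam) - lam * total time."""
    lam = torch.exp(log_intensity)
    ll = inter_arrival_times.numel() * log_intensity - lam * inter_arrival_times.sum()
    return -ll

gaps = torch.tensor([0.25, 0.31, 0.18, 0.42])  # seconds between fixation onsets
log_lam = torch.zeros(1, requires_grad=True)
loss = tpp_nll(gaps, log_lam)
loss.backward()  # the rate can be fitted by gradient descent, as in neural TPPs
```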

2025 Conference Paper

Unravelling Neurodivergent Gaze Behaviour through Visual Attention Causal Graphs

Authors: Cartella, Giuseppe; Cuculo, Vittorio; D'Amelio, Alessandro; Cucchiara, Rita; Boccignone, Giuseppe

Can the very fabric of how we visually explore the world hold the key to distinguishing individuals with Autism Spectrum Disorder (ASD)? While eye tracking has long promised quantifiable insights into neurodevelopmental conditions, the causal underpinnings of gaze behaviour remain largely uncharted territory. Moving beyond traditional descriptive gaze metrics, this study employs cutting-edge causal discovery methods to reconstruct the directed networks that govern the flow of attention across natural scenes. Given the well-documented atypical patterns of visual attention in ASD, particularly regarding socially relevant cues, our central hypothesis is that individuals with ASD exhibit distinct causal signatures in their gaze patterns, significantly different from those of typically developing controls. To our knowledge, this is the first study to explore the diagnostic potential of causal modeling of eye movements in uncovering the cognitive phenotypes of ASD, offering a novel window into the neurocognitive alterations characteristic of the disorder.
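
As a rough, hypothetical stand-in for the causal discovery the paper performs, one can first look at the directed transition structure of attention between semantic regions. The sketch below merely counts such transitions, which is far weaker than a causal graph but conveys the directed-network framing.

```python
# Directed transition counts between fixated regions; a proxy, not causal discovery.
from collections import Counter

def transition_graph(region_sequence):
    """region_sequence: region labels visited by successive fixations.
    Returns directed edge counts, e.g. {('face', 'object'): 1, ...}."""
    return Counter(zip(region_sequence, region_sequence[1:]))

print(transition_graph(["face", "object", "face", "background", "face"]))
```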

2025 Conference Paper

Trends, Applications, and Challenges in Human Attention Modelling

Authors: Cartella, Giuseppe; Cornia, Marcella; Cuculo, Vittorio; D'Amelio, Alessandro; Zanca, Dario; Boccignone, Giuseppe; Cucchiara, Rita

Published in: IJCAI

Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges.

2024 Conference Paper

Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

Authors: Cartella, Giuseppe; Cuculo, Vittorio; Cornia, Marcella; Cucchiara, Rita

Published in: IEEE SIGNAL PROCESSING LETTERS

Creating high-quality and realistic images is now possible thanks to impressive advancements in image generation. A natural language description of the desired output is all that is needed to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, our work leverages human semantic knowledge to investigate whether it can be incorporated into fake image detection frameworks. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while they view real and fake stimuli. A preliminary statistical analysis explores the distinctive patterns in how humans perceive genuine and altered images. The findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed viewing pattern observed with genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.
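
One simple way to quantify "more confined regions" is a dispersion statistic over fixation coordinates. The metric below (mean distance from the fixation centroid) is an illustrative assumption, not necessarily the statistic used in the paper.

```python
# Toy fixation-dispersion metric: mean distance from the fixation centroid.
import numpy as np

def fixation_dispersion(fixations):
    """fixations: (N, 2) array-like of (x, y) coordinates; lower = more confined."""
    pts = np.asarray(fixations, dtype=float)
    centroid = pts.mean(axis=0)
    return float(np.linalg.norm(pts - centroid, axis=1).mean())

dispersed = [(0.2, 0.3), (0.7, 0.6), (0.5, 0.8)]    # genuine-image-like viewing
confined = [(0.48, 0.5), (0.5, 0.52), (0.51, 0.5)]  # fake-image-like viewing
print(fixation_dispersion(dispersed), fixation_dispersion(confined))
```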

2024 Journal Article

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Authors: Morelli, Davide; Baldrati, Alberto; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

The rapidly evolving fields of e-commerce and the metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists of generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that maps the visual features of the garment to the CLIP token embedding space, generating a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on the Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
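
The textual-inversion component can be pictured as a small mapper from garment image features to pseudo-word embeddings that a text encoder can consume. The sketch below is hypothetical: the dimensions, the two-layer MLP, and all names are assumptions, not the released LaDI-VTON implementation.

```python
# Hypothetical garment-to-pseudo-word mapper in the spirit of textual inversion.
import torch
import torch.nn as nn

class GarmentInversion(nn.Module):
    """Maps garment visual features to n_tokens pseudo-word embeddings that can
    be spliced into the text encoder's input to condition generation."""
    def __init__(self, visual_dim=768, token_dim=768, n_tokens=4):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, token_dim * n_tokens),
            nn.GELU(),
            nn.Linear(token_dim * n_tokens, token_dim * n_tokens),
        )
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, garment_features):  # (B, visual_dim) from an image encoder
        out = self.mapper(garment_features)
        return out.view(-1, self.n_tokens, self.token_dim)  # (B, n_tokens, token_dim)

print(GarmentInversion()(torch.randn(2, 768)).shape)  # torch.Size([2, 4, 768])
```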

2023 Conference Paper

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

Fashion illustration is used by designers to communicate their vision and to bring a design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can be used to improve the fashion design process. Unlike previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images with multimodal prompts such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach not previously used in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and in coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer.
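
A common way to hand spatial conditions such as pose maps and sketches to a latent diffusion denoiser is channel-wise concatenation with the noisy latent. The snippet below shows that generic pattern under assumed shapes; it is not claimed to match the paper's exact architecture.

```python
# Generic channel-wise conditioning for a latent diffusion U-Net input.
import torch

def build_denoiser_input(noisy_latent, pose_map, sketch_map):
    """noisy_latent: (B, 4, H, W); pose_map and sketch_map: spatial condition
    maps resized to the latent resolution. Returns the stacked U-Net input."""
    return torch.cat([noisy_latent, pose_map, sketch_map], dim=1)

x = build_denoiser_input(
    torch.randn(1, 4, 64, 64),
    torch.randn(1, 18, 64, 64),  # e.g. body keypoint heatmaps
    torch.randn(1, 1, 64, 64),   # e.g. a garment sketch
)
print(x.shape)  # torch.Size([1, 23, 64, 64])
```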

2023 Conference Paper

OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Authors: Cartella, Giuseppe; Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works have either relied on supervised learning approaches with limited generalization or on more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in terms of both accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
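
The contrastive objective behind CLIP-style training is standard and compact enough to show. The snippet below is the usual symmetric InfoNCE loss on paired image and text embeddings, given for intuition rather than as the exact OpenFashionCLIP training code.

```python
# Symmetric CLIP-style contrastive (InfoNCE) loss on paired embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs."""
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(image_emb.size(0))        # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt).item())
```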

2023 Conference Paper