Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Trajectory Forecasting Through Low-Rank Adaptation of Discrete Latent Codes

Authors: Benaglia, R.; Porrello, A.; Buzzega, P.; Calderara, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g., basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis for the subsequent step: training a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
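
As a rough illustration of the low-rank codebook update described in the abstract, the following NumPy sketch adapts a shared codebook with a rank-r, context-dependent perturbation. This is a hypothetical sketch, not the authors' implementation; the context vector, factor parameterization, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, r = 64, 32, 4                     # codebook entries, embedding dim, rank

codebook = rng.normal(size=(K, D))      # shared base codebook

# Hypothetical context features extracted from the observed past trajectory;
# in practice these would come from a learned motion encoder.
context = rng.normal(size=(D,))

# Low-rank factors: A is (K, r), B is (r, D) and depends on the context,
# so the codebook perturbation A @ B has rank at most r.
A = rng.normal(size=(K, r)) * 0.1
B = np.outer(rng.normal(size=(r,)), context) * 0.1

adapted = codebook + A @ B              # instance-specific codebook, still (K, D)

# Quantization step: snap an encoder output z to its nearest adapted code.
z = rng.normal(size=(D,))
idx = np.argmin(np.linalg.norm(adapted - z, axis=1))
z_q = adapted[idx]
```

Restricting the update to rank r keeps the per-instance customization cheap while leaving the shared codebook as the dominant component.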

2025 Relazione in Atti di Convegno

Unravelling Neurodivergent Gaze Behaviour through Visual Attention Causal Graphs

Authors: Cartella, Giuseppe; Cuculo, Vittorio; D'Amelio, Alessandro; Cucchiara, Rita; Boccignone, Giuseppe

Can the very fabric of how we visually explore the world hold the key to distinguishing individuals with Autism Spectrum Disorder (ASD)? While eye tracking has long promised quantifiable insights into neurodevelopmental conditions, the causal underpinnings of gaze behaviour remain largely uncharted territory. Moving beyond traditional descriptive metrics of gaze, this study employs cutting-edge causal discovery methods to reconstruct the directed networks that govern the flow of attention across natural scenes. Given the well-documented atypical patterns of visual attention in ASD, particularly regarding socially relevant cues, our central hypothesis is that individuals with ASD exhibit distinct causal signatures in their gaze patterns, significantly different from those of typically developing controls. To our knowledge, this is the first study to explore the diagnostic potential of causal modeling of eye movements in uncovering the cognitive phenotypes of ASD and offers a novel window into the neurocognitive alterations characteristic of the disorder.

2025 Relazione in Atti di Convegno

VATr++: Choose Your Words Wisely for Handwritten Text Generation

Authors: Vanherle, B.; Pippi, V.; Cascianelli, S.; Michiels, N.; Van Reeth, F.; Cucchiara, R.

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This work extends the VATr [1] Styled-HTG approach by addressing the pre-processing and training issues that it faces, which are common to many HTG models. In particular, we propose generally applicable strategies for input preparation and training regularization that allow the model to achieve better performance and generalization capabilities. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research - the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.

2025 Articolo su rivista

Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models

Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Zeng, Zifan; Zhang, Chongzhe; Zhang, Qunli; Cornia, Marcella; Baraldi, Lorenzo; Liu, Feng; Hu, Zheng; Cucchiara, Rita

Inference-time scaling has recently gained attention as an effective strategy for improving the performance of generative models without requiring additional training. Although this paradigm has been successfully applied in text and image generation tasks, its extension to video diffusion models remains relatively underexplored. Indeed, video generation presents unique challenges due to its spatiotemporal complexity, particularly in evaluating intermediate generated samples, a procedure that is required by inference-time scaling algorithms. In this work, we systematically investigate the role of the verifier: the scoring mechanism used to guide sampling. We show that current verifiers, when applied at early diffusion steps, face significant reliability challenges due to noisy samples. We further demonstrate that fine-tuning verifiers on partially denoised samples significantly improves early-stage evaluation and leads to gains in generation quality across multiple inference-time scaling algorithms, including Greedy Search, Beam Search, and a novel Successive Halving baseline.
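
The Successive Halving search mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy verifier below is a hypothetical stand-in for a learned scorer that would rate partially denoised video samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def verifier(sample):
    # Hypothetical stand-in scorer: in the paper's setting this would be a
    # learned model rating a partially denoised video sample.
    return -abs(sample - 3.0)

def successive_halving(candidates, n_rounds):
    # Each round, score every surviving candidate and keep the better half,
    # concentrating the sampling budget on the most promising candidates.
    pool = list(candidates)
    for _ in range(n_rounds):
        pool.sort(key=verifier, reverse=True)
        pool = pool[: max(1, len(pool) // 2)]
    return pool[0]

cands = rng.normal(loc=2.0, scale=2.0, size=16)
best = successive_halving(cands, n_rounds=4)
```

The abstract's point about verifier reliability matters here: if the scorer misranks noisy early-stage samples, the halving step can discard the eventual best candidate.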

2025 Relazione in Atti di Convegno

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Authors: Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

2025 Relazione in Atti di Convegno

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Authors: Pippi, Vittorio; Quattrini, Fabio; Cascianelli, Silvia; Tonioni, Alessio; Cucchiara, Rita

2025 Relazione in Atti di Convegno

Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

Authors: Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative models. Such technological progress necessitates the evolution of existing deepfake detection algorithms to counter new threats and protect the integrity of perceived reality. Although the prevailing approach among deepfake detection methodologies relies on large collections of generated and real data, the efficacy of these methods in adapting to scenarios characterized by data scarcity remains uncertain. This obstacle arises due to the introduction of novel generation algorithms and proprietary generative models that impose restrictions on access to large-scale datasets, thereby constraining the availability of generated images. In this paper, we first analyze how current deepfake detection methodologies based on the CLIP embedding space adapt to a few-shot setting across four state-of-the-art generators. Since the CLIP embedding space is not specifically tailored to the task, a fine-tuning stage is desirable, although the amount of data needed is often unavailable in a data-scarcity scenario. To address this issue and limit possible overfitting, we introduce a novel approach through the Low-Rank Adaptation (LoRA) of the CLIP architecture, tailored for few-shot deepfake detection scenarios. Remarkably, the LoRA-modified CLIP, even when fine-tuned with merely 50 pairs of real and fake images, surpasses the performance of all evaluated deepfake detection models across the tested generators. Additionally, when LoRA CLIP is benchmarked against other models trained on 1,000 samples and evaluated on generative models not seen during training, it exhibits superior generalization capabilities.
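
The LoRA modification described above follows the standard low-rank adaptation recipe: freeze a pretrained weight matrix and learn only a small low-rank residual. The NumPy sketch below shows one adapted linear layer; the dimensions, rank, and scaling are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8            # illustrative dims and rank

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight (not updated)

# LoRA trains only the factors B (d_out x r) and A (r x d_in).
# B starts at zero, so the adapted layer initially matches the frozen one.
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))
alpha = 16                              # scaling factor applied as alpha / r

def forward(x):
    # Frozen path plus the low-rank residual B @ A, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
```

Because only A and B carry gradients, the number of trainable parameters is tiny compared to full fine-tuning, which is what limits overfitting in the 50-pair regime the abstract describes.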

2024 Relazione in Atti di Convegno

AIGeN: An Adversarial Approach for Instruction Generation in VLN

Authors: Rawal, Niyati; Bigazzi, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

2024 Relazione in Atti di Convegno

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization

Authors: Moratelli, Nicholas; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTELLIGENT SYSTEMS

Few-shot learning (FSL) requires fine-tuning a pretrained model on a limited set of examples from novel classes. When applied to vision-and-language models, the dominant approach for FSL has been that of learning input prompts which can be concatenated to the input context of the model. Despite the considerable promise they hold, the effectiveness and expressive power of prompts are limited by the fact that they can only lie at the input of the architecture. In this article, we critically question the usage of learnable prompts, and instead leverage the concept of “implicit memory” to directly capture low- and high-level relationships within the attention mechanism at any layer of the architecture, thereby establishing an alternative to prompts in FSL. Our proposed approach, termed MemOp, exhibits superior performance across 11 widely recognized image classification datasets and a benchmark for contextual domain shift evaluation, effectively addressing the challenges associated with learnable prompts.

2024 Articolo su rivista

Binarizing Documents by Leveraging both Space and Frequency

Authors: Quattrini, F.; Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.
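
The global receptive field that Fast Fourier Convolutions provide can be sketched in a simplified, single-channel form: a pointwise operation in the frequency domain mixes information across the whole spatial extent at once, unlike a small local kernel. This is an illustrative sketch under stated assumptions, not the paper's architecture (the real spectral transform operates on multi-channel feature maps with learned weights).

```python
import numpy as np

rng = np.random.default_rng(0)
H, Wd = 32, 32
x = rng.normal(size=(H, Wd))            # one channel of a document patch

# Spectral path of a Fast Fourier Convolution (simplified):
spec = np.fft.rfft2(x)                  # (H, Wd // 2 + 1) half-spectrum
weight = rng.normal(size=spec.shape) * 0.1   # hypothetical learned weights
y = np.fft.irfft2(spec * weight, s=(H, Wd))  # back to the spatial domain
```

Every output pixel of `y` depends on every input pixel of `x`, which is the property that lets the model condition a local binarization decision on page-wide ink and background statistics.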

2024 Relazione in Atti di Convegno

Page 5 of 51 • Total publications: 504