Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

BRUM: Robust 3D Vehicle Reconstruction from 360° Sparse Images

Authors: Di Nucci, Davide; Tomei, Matteo; Borghi, Guido; Ciuffreda, Luca; Vezzani, Roberto; Cucchiara, Rita

2025 Conference paper

Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Parascandolo, Fiorenzo; Moratelli, Nicholas; Sangineto, Enver; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference paper

Continual Facial Features Transfer for Facial Expression Recognition

Authors: Maharjan, R. S.; Bonicelli, L.; Romeo, M.; Calderara, S.; Cangelosi, A.; Cucchiara, R.

Published in: IEEE TRANSACTIONS ON AFFECTIVE COMPUTING

2025 Journal article

Decoding Facial Expressions in Video: A Multiple Instance Learning Perspective on Action Units

Authors: Del Gaudio, Livia; Cuculo, Vittorio; Cucchiara, Rita

Facial expression recognition (FER) in video sequences is a longstanding challenge in affective computing and computer vision, particularly due to the temporal complexity and subtlety of emotional expressions. In this paper, we propose a novel pipeline that leverages facial Action Units (AUs) as structured time series descriptors of facial muscle activity, enabling emotion classification in videos through a Multiple Instance Learning (MIL) framework. Our approach models each video as a bag of AU-based instances, capturing localized temporal patterns, and allows for robust learning even when only coarse video-level emotion labels are available. Crucially, the approach incorporates interpretability mechanisms that highlight the temporal segments most influential to the final prediction, providing informed decision-making and facilitating downstream analysis. Experimental results on benchmark FER video datasets demonstrate that our method achieves competitive performance using only visual data, without requiring multimodal signals or frame-level supervision. This highlights its potential as an interpretable and efficient solution for weakly supervised emotion recognition in real-world scenarios.

2025 Conference paper
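
To make the bag-of-instances formulation in the abstract above more concrete, here is a minimal, hypothetical PyTorch sketch of attention-based MIL over AU time-series segments. It is not the authors' code: the class name, segment length, number of AUs, and number of emotion classes are all illustrative assumptions.

```python
# Hypothetical sketch of an attention-based Multiple Instance Learning (MIL)
# classifier over facial Action Unit (AU) time-series segments; not the
# authors' released code, and all dimensions are illustrative.
import torch
import torch.nn as nn

class AUMILClassifier(nn.Module):
    def __init__(self, num_aus=17, seg_len=16, hidden=128, num_emotions=7):
        super().__init__()
        # Instance encoder: one AU segment (seg_len x num_aus) -> embedding
        self.encoder = nn.Sequential(
            nn.Flatten(start_dim=1),                 # (num_segments, seg_len * num_aus)
            nn.Linear(seg_len * num_aus, hidden),
            nn.ReLU(),
        )
        # Attention scores expose which temporal segments drive the prediction
        self.attention = nn.Sequential(nn.Linear(hidden, 64), nn.Tanh(), nn.Linear(64, 1))
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, bag):
        # bag: (num_segments, seg_len, num_aus) -- one video as a bag of instances
        h = self.encoder(bag)                        # (num_segments, hidden)
        a = torch.softmax(self.attention(h), dim=0)  # (num_segments, 1)
        z = (a * h).sum(dim=0)                       # attention-pooled video embedding
        return self.classifier(z), a.squeeze(-1)     # emotion logits + segment weights

# Toy usage: the coarse video-level label supervises the whole bag, while the
# attention weights point at the most influential temporal segments.
video = torch.randn(24, 16, 17)                      # 24 segments, 16 frames, 17 AUs
logits, segment_weights = AUMILClassifier()(video)
```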

Diffusion Transformers for Tabular Data Time Series Generation

Authors: Garuti, Fabrizio; Sangineto, Enver; Luetto, Simone; Forni, Lorenzo; Cucchiara, Rita

2025 Conference paper

DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection

Authors: Cappellino, Chiara; Mancusi, Gianluca; Mosconi, Matteo; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita

2025 Conference paper

Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Authors: Sanguigni, Fulvio; Morelli, Davide; Cornia, Marcella; Cucchiara, Rita

In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing – which utilizes diverse input modalities such as text, garment sketches, and body poses – have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.

2025 Conference paper
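
As a rough sketch of the retrieval-plus-textual-inversion pipeline described in the Fashion-RAG abstract above, the snippet below retrieves garments by cosine similarity and maps their image features into pseudo-token embeddings for the text encoder. It is not the paper's implementation; module names, feature dimensions, and the number of pseudo-tokens are assumptions.

```python
# Hypothetical sketch of the retrieve-then-invert idea: retrieve garments that
# match a textual query, then project their image features into the text
# encoder's embedding space as pseudo-tokens. Not the released Fashion-RAG
# code; feature dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GarmentProjector(nn.Module):
    """Maps one garment image feature into k pseudo-token embeddings that live
    in the text encoder's embedding space (a textual-inversion-style mapping)."""
    def __init__(self, img_dim=768, txt_dim=768, k=4):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(img_dim, k * txt_dim)

    def forward(self, img_feat):                     # (B, img_dim)
        return self.proj(img_feat).view(img_feat.size(0), self.k, -1)

def retrieve_garments(query_feat, index_feats, top_k=3):
    """Cosine-similarity retrieval over a precomputed garment feature index."""
    scores = F.normalize(query_feat, dim=-1) @ F.normalize(index_feats, dim=-1).T
    return scores.topk(top_k, dim=-1).indices.squeeze(0)

# Toy usage with random tensors standing in for real image/text features.
text_query = torch.randn(1, 768)                     # feature of the textual request
garment_index = torch.randn(1000, 768)               # precomputed garment image features
idx = retrieve_garments(text_query, garment_index)   # indices of the top-3 garments
pseudo_tokens = GarmentProjector()(garment_index[idx])   # (3, 4, 768)
# The pseudo-tokens would then be concatenated with the prompt embeddings
# that condition the diffusion model's denoising process.
```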

Hallucination Early Detection in Diffusion Models

Authors: Betti, Federico; Baraldi, Lorenzo; Baraldi, Lorenzo; Cucchiara, Rita; Sebe, Nicu

Published in: INTERNATIONAL JOURNAL OF COMPUTER VISION

2025 Journal article

Hyperbolic Safety-Aware Vision-Language Models

Authors: Poppi, Tobia; Kasarla, Tejaswi; Mettes, Pascal; Baraldi, Lorenzo; Cucchiara, Rita

2025 Conference paper

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Authors: Sarto, Sara; Cornia, Marcella; Cucchiara, Rita

The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

2025 Conference paper
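
One evaluation dimension mentioned in the abstract above, correlation with human judgment, can be illustrated with a tiny, self-contained rank-correlation example; the scores below are invented for illustration and have no relation to the paper's results.

```python
# Toy illustration of one axis the survey discusses: correlation between an
# automatic caption metric and human judgments. The numbers are made up and
# carry no relation to the paper's results.
from scipy.stats import kendalltau, spearmanr

human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]            # hypothetical human quality scores
metric_scores = [0.81, 0.42, 0.66, 0.90, 0.35]       # hypothetical metric outputs

tau, tau_p = kendalltau(metric_scores, human_ratings)
rho, rho_p = spearmanr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f} (p={tau_p:.3f}); Spearman rho = {rho:.3f} (p={rho_p:.3f})")
```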

Page 2 of 51 • Total publications: 504