Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Depth-based 3D human pose refinement: Evaluating the RefiNet framework

Authors: D'Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In recent years, Human Pose Estimation has achieved impressive results on RGB images. The advent of deep learning architectures and large annotated datasets have contributed to these achievements. However, little has been done towards estimating the human pose using depth maps, and especially towards obtaining a precise 3D body joint localization. To fill this gap, this paper presents RefiNet, a depth-based 3D human pose refinement framework. Given a depth map and an initial coarse 2D human pose, RefiNet regresses a fine 3D pose. The framework is composed of three modules, based on different data representations, i.e. 2D depth patches, 3D human skeletons, and point clouds. An extensive experimental evaluation is carried out to investigate the impact of the model hyper-parameters and to compare RefiNet with off-the-shelf 2D methods and literature approaches. Results confirm the effectiveness of the proposed framework and its limited computational requirements.
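
As a rough illustration of the kind of pipeline the abstract describes, the sketch below lifts coarse 2D joints to 3D through the depth map and applies a residual skeleton refiner. All names, shapes, and the pinhole back-projection setup are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a RefiNet-style step: lift coarse 2D joints to 3D via
# the depth map and camera intrinsics, then refine the skeleton with a small
# residual MLP. Shapes and module design are assumptions, not the paper's code.
import torch
import torch.nn as nn

def lift_to_3d(joints_2d, depth, fx, fy, cx, cy):
    """joints_2d: (J, 2) pixel coordinates; depth: (H, W) depth map in meters."""
    u, v = joints_2d[:, 0].long(), joints_2d[:, 1].long()
    z = depth[v, u]                           # depth sampled at each joint
    x = (joints_2d[:, 0] - cx) * z / fx       # pinhole back-projection
    y = (joints_2d[:, 1] - cy) * z / fy
    return torch.stack([x, y, z], dim=1)      # (J, 3) coarse 3D pose

class SkeletonRefiner(nn.Module):
    """Small MLP regressing a residual correction for the whole skeleton."""
    def __init__(self, num_joints=15, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, pose_3d):               # (J, 3)
        correction = self.net(pose_3d.reshape(1, -1)).reshape(-1, 3)
        return pose_3d + correction           # refined 3D pose
```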

2023 Journal article

Embodied Agents for Efficient Exploration and Smart Scene Description

Authors: Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The development of embodied agents that can communicate with humans in natural language has gained increasing interest in recent years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlation between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.
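
To make the explore-and-describe loop concrete, here is a hedged sketch of how exploration, captioning, and repetition avoidance could be combined; the `env`, `policy`, and `captioner` objects are hypothetical stand-ins for the paper's components.

```python
# Hedged sketch of an explore-and-describe loop in the spirit of the abstract.
# `env`, `policy`, and `captioner` are hypothetical stand-ins, not the paper's code.
from difflib import SequenceMatcher

def is_novel(caption, history, threshold=0.7):
    """Keep a caption only if it differs enough from those already emitted."""
    return all(SequenceMatcher(None, caption, h).ratio() < threshold for h in history)

def explore_and_describe(env, policy, captioner, max_steps=500):
    descriptions = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)               # exploration-driven action selection
        obs = env.step(action)
        caption = captioner.describe(obs.rgb)  # natural-language scene description
        if is_novel(caption, descriptions):    # avoid repeated descriptions
            descriptions.append(caption)
    return descriptions
```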

2023 Conference proceedings paper

Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval

Authors: Barsellotti, Luca; Amoroso, Roberto; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2023 Conference proceedings paper

Evaluating synthetic pre-training for handwriting processing tasks

Authors: Pippi, V.; Cascianelli, S.; Baraldi, L.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objective. The vector representations of the images obtained from the pre-trained ConvNet can then be considered as encodings of the handwriting style. We exploit such representations for Writer Retrieval, Writer Identification, Writer Verification, and Writer Classification and demonstrate that our pre-training strategy allows extracting rich representations of the writers' style that enable the aforementioned tasks with competitive results with respect to task-specific State-of-the-Art approaches.
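
The retrieval use of the pre-trained encoder is straightforward to sketch: freeze the ConvNet, take its penultimate activations as style vectors, and rank a gallery by cosine similarity. The ResNet-18 backbone below is an illustrative assumption; the paper's actual ConvNet and weights are not reproduced here.

```python
# Illustrative sketch: a ConvNet pre-trained on synthetic word images used as a
# fixed handwriting-style encoder for Writer Retrieval. Backbone choice is an
# assumption; in practice the synthetic-pretrained weights would be loaded.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

backbone = resnet18()                  # placeholder; load pre-trained weights here
backbone.fc = torch.nn.Identity()      # drop the classifier, keep 512-d features
backbone.eval()

@torch.no_grad()
def embed(images):                     # images: (N, 3, H, W) word-image batch
    return F.normalize(backbone(images), dim=1)

def retrieve(query_img, gallery_embs, k=5):
    """Indices of the k gallery samples closest in style to the query."""
    q = embed(query_img.unsqueeze(0))          # (1, 512) normalized style vector
    sims = gallery_embs @ q.T                  # cosine similarity (both normalized)
    return sims.squeeze(1).topk(k).indices
```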

2023 Journal article

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Authors: Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: SENSORS

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently-proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
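
A minimal sketch of the core mechanism, assuming illustrative dimensions and gate design (this is not the released model), could look as follows: decoder states cross-attend to kNN-retrieved memory entries, and a sigmoid gate modulates how much retrieved information is injected.

```python
# Hedged sketch: cross-attention over kNN-retrieved textual memory with a
# learned gate controlling the flow of external information into the decoder.
import torch
import torch.nn as nn

class GatedMemoryAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden, memory):
        # hidden: (B, T, d) decoder states; memory: (B, K, d) retrieved entries
        retrieved, _ = self.attn(hidden, memory, memory)   # read external memory
        g = torch.sigmoid(self.gate(torch.cat([hidden, retrieved], dim=-1)))
        return hidden + g * retrieved                      # gated memory injection
```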

2023 Journal article

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
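
The encoder/language-model pipeline the survey covers can be sketched in a few lines; the feature dimensionality, decoder depth, and vocabulary size below are illustrative assumptions rather than any specific surveyed model.

```python
# Minimal sketch of the standard captioning pipeline covered by the survey:
# a visual encoder yields region/grid features, and an autoregressive language
# model decodes the caption while cross-attending to them.
import torch
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)    # e.g. CNN region features
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, features, tokens):
        # features: (B, R, 2048) visual features; tokens: (B, T) previous words
        mem = self.visual_proj(features)
        tgt = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, mem, tgt_mask=mask)    # cross-attend to the image
        return self.lm_head(out)                       # next-word logits
```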

2023 Journal article

Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Authors: Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

Controllable image captioning has recently gained attention as a way to increase the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which needs to be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder which explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.
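
One way to picture region-conditioned decoding (a loose sketch, not the paper's architecture) is a generation loop that focuses on one region of the control sequence at a time and advances when the model emits a hypothetical region-shift token.

```python
# Loose sketch of region-controlled decoding; `decoder` and the shift token are
# hypothetical stand-ins for the paper's attention-based machinery.
SHIFT, EOS = "<shift>", "<eos>"

def controlled_decode(decoder, regions, max_len=40):
    caption, current = [], 0
    for _ in range(max_len):
        word = decoder.next_word(caption, regions[current])  # focus on one region
        if word == SHIFT:
            current = min(current + 1, len(regions) - 1)     # move along the signal
        elif word == EOS:
            break
        else:
            caption.append(word)
    return " ".join(caption)
```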

2023 Journal article

Handwritten Text Generation from Visual Archetypes

Authors: Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more so when the latter contain characters that are rarely encountered during training. While emulating a writer's style has been recently addressed by generative models, the generalization towards rare characters has been disregarded. In this work, we devise a Transformer-based model for Few-Shot styled handwritten text generation and focus on obtaining a robust and informative representation of both the text and the style. In particular, we propose a novel representation of the textual content as a sequence of dense vectors obtained from images of symbols written as standard GNU Unifont glyphs, which can be considered their visual archetypes. This strategy is more suitable for generating characters that, despite having been seen rarely during training, possibly share visual details with the frequently observed ones. As for the style, we obtain a robust representation of unseen writers' calligraphy by exploiting specific pre-training on a large synthetic dataset. Quantitative and qualitative results demonstrate the effectiveness of our proposal in generating words in unseen styles and with rare characters more faithfully than existing approaches relying on independent one-hot encodings of the characters.
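
The visual-archetype idea is concrete enough to sketch: render each character as a GNU Unifont glyph and flatten the bitmap into a dense vector that replaces the usual one-hot encoding. The font path below is an assumption for this example.

```python
# Illustrative sketch of "visual archetypes": characters become flattened
# Unifont glyph bitmaps instead of one-hot vectors. Font path is hypothetical.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FONT = ImageFont.truetype("unifont.ttf", 16)   # assumed local copy of GNU Unifont

def archetype(char, size=16):
    """Render one character as a size x size glyph and flatten it."""
    img = Image.new("L", (size, size), 0)
    ImageDraw.Draw(img).text((0, 0), char, fill=255, font=FONT)
    return (np.asarray(img, dtype=np.float32) / 255.0).reshape(-1)

def encode(text):
    """A string becomes a (len(text), size*size) sequence of dense vectors."""
    return np.stack([archetype(c) for c in text])
```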

2023 Conference proceedings paper

How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning

Authors: Pippi, V.; Cascianelli, S.; Kermorvant, C.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is very relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, unfeasible. To overcome this challenge, a possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, also considering the number of fine-tuning lines, we give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as little as five real fine-tuning lines.
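
The pretrain-then-fine-tune recipe can be sketched as a short adaptation loop; `model` stands in for any pretrained HTR network, and the CTC objective and hyper-parameters below are conventional assumptions rather than the paper's exact setup.

```python
# Hedged sketch: adapting a pretrained HTR model on a handful of target-writer
# lines. `model` is a hypothetical network emitting (T, 1, num_classes) scores.
import torch

def fine_tune(model, lines, transcripts, epochs=50, lr=1e-5):
    """lines: list of line-image tensors; transcripts: matching label tensors."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)  # small LR limits forgetting
    ctc = torch.nn.CTCLoss(blank=0)                     # standard HTR objective
    model.train()
    for _ in range(epochs):
        for img, target in zip(lines, transcripts):
            scores = model(img.unsqueeze(0))            # (T, 1, num_classes)
            loss = ctc(
                scores.log_softmax(-1), target.unsqueeze(0),
                torch.tensor([scores.size(0)]), torch.tensor([target.size(0)]),
            )
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```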

2023 Conference proceedings paper

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

Authors: Pippi, V.; Quattrini, F.; Cascianelli, S.; Cucchiara, R.

2023 Conference proceedings paper

Total publications: 504