Publications by Marcella Cornia

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Active filters: Author: Marcella Cornia

Video Surveillance and Privacy: A Solvable Paradox?

Authors: Cucchiara, Rita; Baraldi, Lorenzo; Cornia, Marcella; Sarto, Sara

Published in: COMPUTER

Video Surveillance started decades ago to remotely monitor specific areas and allow control by human inspectors. Later, Computer Vision gradually replaced human monitoring, first through motion alerts and now with Deep Learning techniques. From the beginning of this journey, people have worried about the risk of privacy violations. This article surveys the main steps of Computer Vision in Video Surveillance, from early approaches for people detection and tracking to action analysis and language description, outlining the most relevant directions on the topic to deal with privacy concerns. We show how the relationship between Video Surveillance and privacy is a biased paradox, since surveillance provides increased safety but does not necessarily require identifying people. Through experiments on action recognition and natural language description, we showcase that the paradox of surveillance and privacy can be solved by Artificial Intelligence and that respecting human rights is not an impossible chimera.

2024 Journal article

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Authors: Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Using this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
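
The hierarchical retrieval described above can be pictured as a two-stage search: first over document-level embeddings, then over passages inside the selected documents. Below is a minimal, illustrative sketch of that idea, assuming precomputed embeddings and plain cosine-similarity ranking; the function names, shapes, and toy data are hypothetical and not taken from the Wiki-LLaVA code.

```python
# Illustrative two-stage (hierarchical) retrieval sketch, assuming precomputed
# embeddings from some multimodal encoder; names and shapes are hypothetical.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def hierarchical_retrieve(query_emb, doc_embs, passage_embs_per_doc, k_docs=2, k_passages=3):
    """Stage 1: pick the k_docs most similar documents.
    Stage 2: rank passages only within those documents."""
    top_docs = np.argsort(-cosine_sim(query_emb, doc_embs))[:k_docs]
    candidates = []
    for d in top_docs:
        sims = cosine_sim(query_emb, passage_embs_per_doc[d])
        for p, s in enumerate(sims):
            candidates.append((float(s), int(d), p))
    candidates.sort(reverse=True)
    return candidates[:k_passages]  # (score, doc_id, passage_id) triples

# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
query = rng.normal(size=256)
docs = rng.normal(size=(10, 256))
passages = [rng.normal(size=(rng.integers(3, 8), 256)) for _ in range(10)]
print(hierarchical_retrieve(query, docs, passages))
```

The retrieved passages would then simply be prepended to the model's prompt as additional textual context.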

2024 Conference proceedings paper

Computer Vision in Human Analysis: From Face and Body to Clothes

Authors: Daoudi, Mohamed; Vezzani, Roberto; Borghi, Guido; Ferrari, Claudio; Cornia, Marcella; Becattini, Federico; Pilzer, Andrea

Published in: SENSORS

For decades, researchers of different areas, ranging from artificial intelligence to computer vision, have intensively investigated human-centered data, i.e., data in which the human plays a significant role, acquired through a non-invasive approach, such as cameras. This interest has been largely supported by the highly informative nature of this kind of data, which provides a variety of information with which it is possible to understand many aspects including, for instance, the human body or the outward appearance. Some of the main tasks related to human analysis are focused on the body (e.g., human pose estimation and anthropocentric measurement estimation), the hands (e.g., gesture detection and recognition), the head (e.g., head pose estimation), or the face (e.g., emotion and expression recognition). Additional tasks are based on non-corporal elements, such as motion (e.g., action recognition and human behavior understanding) and clothes (e.g., garment-based virtual try-on and attribute recognition). Unfortunately, privacy issues severely limit the usage and the diffusion of this kind of data, making the exploitation of learning approaches challenging. In particular, privacy issues behind the acquisition and the use of human-centered data must be addressed by public and private institutions and companies. Thirteen high-quality papers have been published in this Special Issue and are summarized in the following: four of them are focused on the human face (facial geometry, facial landmark detection, and emotion recognition), two on eye image analysis (eye status classification and 3D gaze estimation), five on the body (pose estimation, conversational gesture analysis, and action recognition), and two on the outward appearance (transferring clothing styles and fashion-oriented image captioning). These numbers confirm the high interest in human-centered data and, in particular, the variety of real-world applications that it is possible to develop.

2023 Journal article

Embodied Agents for Efficient Exploration and Smart Scene Description

Authors: Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last few years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlation between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.

2023 Conference proceedings paper

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Authors: Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: SENSORS

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
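
The core mechanism described here, reading from an external memory through cross-attention and gating its contribution, can be sketched in a few lines. The PyTorch module below is an illustrative assumption of how such a gated fusion might look; the layer layout, gate form, and dimensions are not the paper's actual architecture.

```python
# Hedged sketch: cross-attention over retrieved memory entries plus a learned
# sigmoid gate that modulates how much retrieved text flows into the decoder.
import torch
import torch.nn as nn

class GatedMemoryCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate decides, per token and channel, how much retrieved context to let in.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, tokens, visual_feats, memory_feats):
        vis, _ = self.visual_attn(tokens, visual_feats, visual_feats)
        mem, _ = self.memory_attn(tokens, memory_feats, memory_feats)
        g = self.gate(torch.cat([vis, mem], dim=-1))
        return vis + g * mem  # gated fusion of visual and retrieved context

# Toy usage: 2 captions of 10 tokens, 49 image regions, 5 retrieved kNN sentences.
layer = GatedMemoryCrossAttention()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 49, 512), torch.randn(2, 5, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```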

2023 Journal article

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

2023 Journal article

Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Authors: Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

Controllable image captioning has recently gained attention as a way to increase the diversity and the applicability to real-world scenarios of image captioning algorithms. In this task, a captioner is conditioned on an external control signal, which needs to be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder which explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.
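
One way to picture a region-sequence control signal is as an attention mask that lets each part of the caption attend only to the region group it is supposed to describe. The snippet below is a hedged illustration of that single idea, not the iterative architecture proposed in the paper; the chunk-index representation is an assumption.

```python
# Illustrative sketch: build a mask that restricts each caption token to the
# visual regions of its assigned chunk in the control sequence.
import torch

def control_attention_mask(token_chunk_idx: torch.Tensor, region_chunk_idx: torch.Tensor) -> torch.Tensor:
    """token_chunk_idx: (T,) chunk each caption token should describe.
    region_chunk_idx: (R,) chunk each visual region belongs to.
    Returns a (T, R) boolean mask: True where attention is blocked."""
    return token_chunk_idx.unsqueeze(1) != region_chunk_idx.unsqueeze(0)

# Toy usage: 6 caption tokens covering 2 chunks, 4 regions split between them.
tok = torch.tensor([0, 0, 0, 1, 1, 1])
reg = torch.tensor([0, 0, 1, 1])
print(control_attention_mask(tok, reg))
```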

2023 Journal article

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Authors: Morelli, Davide; Baldrati, Alberto; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
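
The textual inversion component can be thought of as a small adapter that turns pooled garment features into a handful of pseudo-word embeddings living in the text encoder's token space. The following PyTorch sketch illustrates that mapping under assumed dimensions and an assumed number of pseudo-tokens; it is not the LaDI-VTON implementation.

```python
# Hedged sketch of the textual-inversion idea: project garment visual features
# into a few pseudo-word token embeddings. Dimensions and depth are assumptions.
import torch
import torch.nn as nn

class TextualInversionAdapter(nn.Module):
    def __init__(self, visual_dim: int = 768, token_dim: int = 768, n_pseudo_tokens: int = 4):
        super().__init__()
        self.n = n_pseudo_tokens
        self.token_dim = token_dim
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, token_dim * n_pseudo_tokens),
            nn.GELU(),
            nn.Linear(token_dim * n_pseudo_tokens, token_dim * n_pseudo_tokens),
        )

    def forward(self, garment_feats: torch.Tensor) -> torch.Tensor:
        # garment_feats: (B, visual_dim) pooled features of the in-shop garment.
        return self.proj(garment_feats).view(-1, self.n, self.token_dim)

# Toy usage: the (B, 4, 768) pseudo-token embeddings would replace placeholder
# tokens in the prompt embeddings before running the frozen text encoder.
adapter = TextualInversionAdapter()
print(adapter(torch.randn(2, 768)).shape)  # torch.Size([2, 4, 768])
```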

2023 Conference proceedings paper

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer.
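
Spatial conditions such as body poses and garment sketches are commonly injected into a latent diffusion denoiser by concatenating them channel-wise with the noisy latent before the first convolution, while text enters through cross-attention. The sketch below illustrates only that generic concatenation pattern, with made-up channel counts; it is not the Multimodal Garment Designer code.

```python
# Hedged sketch of channel-wise spatial conditioning for a latent diffusion
# denoiser; channel sizes and the module layout are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionedDenoiserInput(nn.Module):
    def __init__(self, latent_ch=4, pose_ch=1, sketch_ch=1, mask_ch=1, out_ch=320):
        super().__init__()
        # First convolution of a UNet-like denoiser, widened to accept the conditions.
        self.conv_in = nn.Conv2d(latent_ch + pose_ch + sketch_ch + mask_ch, out_ch, 3, padding=1)

    def forward(self, noisy_latent, pose_map, sketch, mask):
        x = torch.cat([noisy_latent, pose_map, sketch, mask], dim=1)
        return self.conv_in(x)

# Toy usage on 64x64 latents (text conditioning would enter via cross-attention instead).
block = ConditionedDenoiserInput()
out = block(torch.randn(2, 4, 64, 64), torch.randn(2, 1, 64, 64),
            torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 320, 64, 64])
```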

2023 Conference proceedings paper

OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Authors: Cartella, Giuseppe; Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined poorly generalizable supervised learning approaches or more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
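
Vision-and-language contrastive learning of the kind used here typically optimizes a symmetric InfoNCE objective over a batch of image-caption pairs. The snippet below is a standard, minimal version of that CLIP-style loss for reference; the temperature value and variable names are illustrative, and it is not the OpenFashionCLIP training code.

```python
# Minimal sketch of the symmetric image-text contrastive (CLIP-style) objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """image_emb, text_emb: (B, D) already-projected embeddings of paired samples."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))         # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings for a batch of 8 fashion image-caption pairs.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```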

2023 Conference proceedings paper

Page 5 of 11 • Total publications: 107