Publications by Marcella Cornia

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Working Memory Connections for LSTM

Authors: Landi, Federico; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Published in: NEURAL NETWORKS

Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the de facto standard for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influence the gating mechanism directly. In this work, we improve the gate potential by including information coming from the internal cell state. The proposed modification, named Working Memory Connection, consists of adding a learnable nonlinear projection of the cell content into the network gates. This modification can fit into the classical LSTM gates without any assumption on the underlying task, being particularly effective when dealing with longer sequences. Previous research efforts in this direction, which go back to the early 2000s, could not bring a consistent improvement over the vanilla LSTM. As part of this paper, we identify a key issue tied to previous connections that heavily limits their effectiveness, hence preventing a successful integration of the knowledge coming from the internal cell state. We show through extensive experimental evaluation that Working Memory Connections consistently improve the performance of LSTMs on a variety of tasks. Numerical results suggest that the cell state contains useful information that is worth including in the gate structure.
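
For readers who think in code, the gate modification described above can be sketched roughly as follows. This is an illustrative PyTorch snippet, not the authors' implementation: the module name `WMCLSTMCell`, the layer sizes, and the choice of which cell state feeds each gate are assumptions made for clarity.

```python
# Illustrative sketch (not the authors' code): an LSTM cell whose gates also
# receive a learnable nonlinear projection of the cell content, in the spirit
# of Working Memory Connections.
import torch
import torch.nn as nn


class WMCLSTMCell(nn.Module):  # hypothetical name
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.x2g = nn.Linear(input_size, 4 * hidden_size)   # input -> gate pre-activations
        self.h2g = nn.Linear(hidden_size, 4 * hidden_size)  # hidden -> gate pre-activations
        # Projections of the cell content injected into the multiplicative gates.
        self.c2if = nn.Linear(hidden_size, 2 * hidden_size, bias=False)
        self.c2o = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = (self.x2g(x) + self.h2g(h)).chunk(4, dim=-1)
        # Working-memory terms: nonlinear projections of the cell state
        # added to the gate pre-activations before the sigmoid.
        wi, wf = torch.tanh(self.c2if(c)).chunk(2, dim=-1)
        i = torch.sigmoid(i + wi)
        f = torch.sigmoid(f + wf)
        c_new = f * c + i * torch.tanh(g)
        o = torch.sigmoid(o + torch.tanh(self.c2o(c_new)))
        h_new = o * torch.tanh(c_new)
        return h_new, c_new
```

A full model would unroll this cell over time exactly as one would with a standard `nn.LSTMCell`.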

2021 Journal article

A Unified Cycle-Consistent Neural Model for Text and Image Retrieval

Authors: Cornia, Marcella; Baraldi, Lorenzo; Tavakoli, Hamed R.; Cucchiara, Rita

Published in: MULTIMEDIA TOOLS AND APPLICATIONS

Text-image retrieval has recently become a hot research field, thanks to the development of deeply-learnable architectures which can retrieve visual items given textual queries and vice versa. The key idea of many state-of-the-art approaches has been that of learning a joint multi-modal embedding space in which text and images can be projected and compared. Here we take a different approach and reformulate the problem of text-image retrieval as that of learning a translation between the textual and visual domains. Our proposal leverages an end-to-end trainable architecture that can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Experimental evaluations for text-to-image and image-to-text retrieval, conducted on small-, medium- and large-scale datasets, show consistent improvements over the baselines, thus confirming the appropriateness of using a cycle-consistency constraint for the text-image matching task.
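
As a rough illustration of the cycle-consistency idea, the following PyTorch sketch pairs a triplet-style matching loss with a round-trip penalty between two translation networks. The network shapes, the hinge formulation, and the names `txt2img`/`img2txt` are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (assumed shapes and module names, not the paper's code):
# two translation networks map text embeddings to image features and back;
# a cycle-consistency term penalizes the round trip for drifting from the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

txt2img = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(), nn.Linear(2048, 2048))
img2txt = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 1024))


def cycle_consistent_loss(txt_emb, img_feat, margin=0.2, lambda_cyc=1.0):
    """Retrieval (triplet-style) loss plus a cycle-consistency regularizer."""
    pred_img = txt2img(txt_emb)      # text translated into the visual space
    pred_txt = img2txt(img_feat)     # image translated into the textual space

    # Hinge-based matching loss with in-batch negatives (one direction shown).
    sims = F.normalize(pred_img, dim=-1) @ F.normalize(img_feat, dim=-1).t()
    pos = sims.diag().unsqueeze(1)
    match = (margin + sims - pos).clamp(min=0)
    match = match.masked_fill(torch.eye(len(sims), dtype=torch.bool), 0).mean()

    # Cycle terms: text -> image -> text and image -> text -> image.
    cyc = F.mse_loss(img2txt(pred_img), txt_emb) + F.mse_loss(txt2img(pred_txt), img_feat)
    return match + lambda_cyc * cyc
```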

2020 Journal article

Explaining Digital Humanities by Aligning Images and Textual Descriptions

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

Published in: PATTERN RECOGNITION LETTERS

Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and the Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, when realistic images and sentences with simple semantics are employed and when paired training data is provided. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only exhibits more complex visual and semantic structures but also features a significant lack of training data which makes the use of fully-supervised approaches infeasible. With this aim, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision. This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.

2020 Journal article

Learning to Describe Salient Objects in Images through Vision and Language

Authors: Cornia, Marcella

Replicating the human ability to connect vision and language has recently received a great deal of attention in computer vision and artificial intelligence, resulting in new models and architectures capable of automatically describing images with textual sentences. This task, called image captioning, requires not only recognizing the salient objects in an image and understanding their interactions, but also being able to express them in natural language. In this thesis, state-of-the-art solutions to these problems are presented, addressing all the aspects involved in the generation of textual descriptions. Indeed, when humans describe a scene, they look at an object before naming it in the sentence. This happens thanks to selective mechanisms that attract the human gaze towards the salient and relevant parts of the scene. Motivated by the importance of automatically estimating the focus of human attention on images, the first part of this dissertation introduces two different saliency prediction models based on neural networks. In the first model, a combination of visual features extracted at different levels of a convolutional neural network is used to estimate the saliency of an image. In the second model, a recurrent architecture is used together with neural attentive mechanisms that focus on the most salient regions of the image in order to iteratively refine the predicted saliency map. Although saliency prediction identifies the most relevant regions of an image, it had never been incorporated into a captioning architecture. This thesis therefore also shows how to incorporate saliency prediction to improve the quality of image descriptions, and introduces a model that considers both the salient regions and the context of the image while generating the textual description. Inspired by the recent spread of fully-attentive models, the use of the Transformer model in the context of image captioning is also investigated, and a new architecture is proposed in which the recurrent networks previously used in this context are completely abandoned. Classical captioning approaches do not provide any control over which image regions are described and how much importance is given to each of them. This lack of controllability limits the applicability of captioning algorithms to complex scenarios in which some form of control over the generation process is required. To address these issues, a model is presented that can generate diverse natural-language descriptions on the basis of a control signal given in the form of a set of image regions that have to be described. Along a different line, the possibility of naming movie characters with their proper names is also explored, which again requires a certain degree of controllability over the captioning model. In the last part of the thesis, solutions for cross-modal retrieval are presented, another task combining vision and language that consists of finding the images corresponding to a textual query and vice versa. Finally, the application of these retrieval techniques in the context of cultural heritage and digital humanities is shown, obtaining promising results with both supervised and unsupervised models.

2020 Doctoral thesis

Meshed-Memory Transformer for Image Captioning

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
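
The memory component can be pictured with the short sketch below: learnable key/value slots are appended to those computed from the image regions, so attention can also retrieve a priori knowledge. This is a simplified illustration and is not taken from the released repository; the mesh-like decoder connectivity is omitted, and the module name and sizes are assumed.

```python
# Rough sketch of memory-augmented attention (not taken from the released repo):
# learnable "memory" keys/values are appended to the keys/values computed from
# the image regions, letting attention retrieve a priori knowledge as well.
import torch
import torch.nn as nn


class MemoryAugmentedAttention(nn.Module):  # illustrative name
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_memory: int = 40):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model) / d_model ** 0.5)
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model) / d_model ** 0.5)

    def forward(self, regions):                 # regions: (B, N, d_model)
        b = regions.size(0)
        k = torch.cat([regions, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([regions, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(regions, k, v)       # queries are the regions themselves
        return out
```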

2020 Conference paper

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

Authors: Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION

The ability to generate natural language explanations conditioned on visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performance on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.
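
To make the "shallow" point concrete, a model of comparable depth could be declared as in the sketch below. The sizes and the use of `nn.Transformer` are assumptions for illustration, not the paper's implementation, and the memory-aware encoding of regions is not shown.

```python
# Sketch of a shallow captioner (assumed sizes; not the paper's code):
# only two Transformer layers are used for both encoding the region features
# and decoding the caption, keeping the computational footprint small.
import torch.nn as nn

shallow_captioner = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=2,   # image-region encoder
    num_decoder_layers=2,   # caption decoder
    dim_feedforward=2048,
    batch_first=True,
)
# Region features (projected to d_model) go in as the source sequence,
# previously generated word embeddings as the target sequence.
```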

2020 Conference paper

Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS - IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real.
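
The similarity-matching idea can be sketched as a per-class retrieval loss like the one below. This is an assumed formulation for illustration only; the function name, the flattened patch representation, and the loss shape are not taken from the released code.

```python
# Illustrative sketch (not the released code): generated patches are pushed
# towards their nearest real patch taken from a memory bank of the same
# (weakly-supervised) semantic class, e.g. "sky", "tree", "person".
import torch.nn.functional as F


def patch_retrieval_loss(fake_patches, memory_bank):
    """fake_patches: (N, D) flattened patches from the generated image;
    memory_bank: (M, D) flattened patches cropped from real photos,
    both assumed to belong to the same semantic class."""
    fake = F.normalize(fake_patches, dim=-1)
    real = F.normalize(memory_bank, dim=-1)
    sims = fake @ real.t()                 # cosine similarity to every real patch
    nearest = sims.argmax(dim=-1)          # retrieve the most similar real patch
    # Encourage each generated patch to resemble its retrieved counterpart.
    return (1.0 - sims.gather(1, nearest.unsqueeze(1))).mean()
```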

2019 Conference paper

Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Corsini, Massimiliano; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.
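
One simple way to picture the latent alignment between visual and textual chunks is the score below, where each word is matched to its best image region. The exact formulation here is an assumption for illustration, not the model described in the paper.

```python
# Sketch of a region-word latent alignment score (assumed formulation): each
# word attends to its best-matching image region and the sentence-image score
# aggregates these word-level maxima. A low score can be taken as evidence
# that a sentence is contextual rather than visual.
import torch.nn.functional as F


def alignment_score(regions, words):
    """regions: (R, D) region embeddings; words: (W, D) word embeddings,
    both already projected into a shared space."""
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    sims = w @ r.t()                        # (W, R) word-to-region similarities
    return sims.max(dim=1).values.mean()    # best region per word, averaged
```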

2019 Conference paper

Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis

Authors: Tomei, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

State-of-the-art Computer Vision pipelines show poor performance on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic and real images, on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results in the case of the translation of landscapes, portraits, and paintings from four different styles using automatic distance metrics. Also, we analyze the response of pre-trained architectures for classification, detection and segmentation both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we also provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.
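
As an example of the kind of measurement mentioned above, the entropy of a pre-trained classifier's predictions can be computed as in the following sketch. The setup is assumed for illustration and is not the paper's evaluation code.

```python
# Small sketch (assumed setup): the average entropy of a pre-trained
# classifier's softmax output is one simple proxy for domain shift;
# predictions on translated, more photo-realistic paintings are expected
# to be more confident (lower entropy) than on the original artworks.
import torch
import torch.nn.functional as F


def mean_prediction_entropy(logits):
    """logits: (N, num_classes) raw outputs of a pre-trained classifier."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    return entropy.mean()
```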

2019 Conference paper

M-VAD Names: a Dataset for Video Captioning with Naming

Authors: Pini, Stefano; Cornia, Marcella; Bolelli, Federico; Baraldi, Lorenzo; Cucchiara, Rita

Published in: MULTIMEDIA TOOLS AND APPLICATIONS

Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag. The lack of movie description datasets with characters' visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.

2019 Journal article
