Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Gender recognition in the wild with small sample size: A dictionary learning approach

Authors: D'Amelio, A.; Cuculo, V.; Bursic, S.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

In this work we address the problem of gender recognition from facial images acquired in the wild. This problem is particularly difficult due to the presence of variations in pose, ethnicity, age and image quality. Moreover, we consider the special case in which only a small sample size is available for the training phase. We rely on a feature representation obtained from the well-known VGG-Face Deep Convolutional Neural Network (DCNN) and exploit the effectiveness of a sparse-driven sub-dictionary learning strategy which has proven able to represent both local and global characteristics of the training and probe faces. Results on the publicly available LFW dataset are provided in order to demonstrate the effectiveness of the proposed method.
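
As a rough illustration of the sparse-representation idea behind this abstract (not the authors' exact sub-dictionary scheme), the sketch below learns one small dictionary per gender class and assigns a probe face to the class whose dictionary reconstructs its feature vector best; the feature vectors here are random placeholders standing in for VGG-Face descriptors.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Hypothetical placeholder data: in the paper, features come from the VGG-Face DCNN.
# Here we use random vectors of a plausible dimensionality for illustration only.
rng = np.random.default_rng(0)
feat_dim = 512
X_train = {"male": rng.normal(size=(40, feat_dim)),
           "female": rng.normal(size=(40, feat_dim))}
x_probe = rng.normal(size=(1, feat_dim))

# Learn one small dictionary per class (a generic stand-in for the
# sub-dictionary learning strategy described in the abstract).
dictionaries = {}
for label, feats in X_train.items():
    dl = DictionaryLearning(n_components=16,
                            transform_algorithm="lasso_lars",
                            transform_alpha=0.1,
                            random_state=0)
    dl.fit(feats)
    dictionaries[label] = dl

def reconstruction_error(dl, x):
    codes = dl.transform(x)           # sparse codes over the class dictionary
    recon = codes @ dl.components_    # reconstruction from the dictionary atoms
    return float(np.linalg.norm(x - recon))

# Classify the probe face as the class whose dictionary reconstructs it best.
errors = {label: reconstruction_error(dl, x_probe) for label, dl in dictionaries.items()}
print(min(errors, key=errors.get), errors)
```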

2020 Conference proceedings

How to look next? A data-driven approach for scanpath prediction

Authors: Boccignone, G.; Cuculo, V.; D'Amelio, A.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

By and large, current visual attention models mostly rely, when considering static stimuli, on the following procedure. Given an image, a saliency map is computed, which, in turn, might serve the purpose of predicting a sequence of gaze shifts, namely a scanpath instantiating the dynamics of visual attention deployment. The temporal pattern of attention unfolding is thus confined to the scanpath generation stage, whilst salience is conceived as a static map, at best conflating a number of factors (bottom-up information, top-down cues, spatial biases, etc.). In this note we propose a novel sequential scheme consisting of three processing stages, relying on a center-bias model, a context/layout model, and an object-based model, respectively. Each stage contributes, at different times, to the sequential sampling of the final scanpath. We compare the method against classic scanpath generation that exploits a state-of-the-art static saliency model. Results show that accounting for the structure of the temporal unfolding leads to gaze dynamics close to human gaze behaviour.
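
A toy sketch of the sequential-sampling idea described above, under assumed stage weights: fixations are drawn from a combination of a centre-bias map, a context/layout map and an object-based map whose relative influence shifts over time. The maps and the weighting schedule are illustrative placeholders, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 64

# Hypothetical stand-ins for the three maps used at different stages.
yy, xx = np.mgrid[0:H, 0:W]
center_bias = np.exp(-((yy - H / 2) ** 2 + (xx - W / 2) ** 2) / (2 * (H / 4) ** 2))
context_map = rng.random((H, W))    # would come from a context/layout model
object_map = rng.random((H, W))     # would come from an object-based model

def sample_fixation(prob_map, rng):
    p = prob_map.ravel() / prob_map.sum()
    idx = rng.choice(p.size, p=p)
    return divmod(idx, W)           # (row, col) of the sampled fixation

# Assumed schedule: early fixations follow the centre bias, later ones
# increasingly follow context and then objects.
scanpath = []
for t in range(8):
    w_center = max(0.0, 1.0 - 0.3 * t)
    w_context = min(1.0, 0.2 * t)
    w_object = min(1.0, 0.1 * t ** 1.5)
    combined = w_center * center_bias + w_context * context_map + w_object * object_map
    scanpath.append(sample_fixation(combined, rng))

print(scanpath)
```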

2020 Conference proceedings

Imparare a descrivere gli oggetti salienti presenti nelle immagini tramite la visione e il linguaggio (Learning to describe salient objects in images through vision and language)

Authors: Cornia, Marcella

Replicating the human ability to connect vision and language has recently received much attention in computer vision and artificial intelligence, resulting in new models and architectures capable of automatically describing images with textual sentences. This task, called image captioning, requires not only recognizing the salient objects in an image and understanding their interactions, but also expressing them in natural language. This thesis presents state-of-the-art solutions to these problems, addressing all the aspects involved in the generation of textual descriptions. Indeed, when humans describe a scene, they look at an object before naming it in the sentence. This happens thanks to selective mechanisms that attract the human gaze to the salient and relevant parts of the scene.

Motivated by the importance of automatically estimating the focus of human attention on images, the first part of this dissertation introduces two saliency prediction models based on neural networks. The first model combines visual features extracted at different levels of a convolutional neural network to estimate the saliency of an image. The second model uses a recurrent architecture together with neural attention mechanisms that focus on the most salient regions of the image in order to iteratively refine the predicted saliency map. Although saliency prediction identifies the most relevant regions of an image, it had never been incorporated into a captioning architecture. This thesis therefore also shows how to incorporate saliency prediction to improve the quality of image descriptions, and introduces a model that considers both the salient regions and the context of the image when generating the textual description. Inspired by the recent spread of fully-attentive models, the use of the Transformer model for image captioning is also investigated, and a new architecture is proposed that completely abandons the recurrent networks previously used in this context.

Classic captioning approaches provide no control over which image regions are described and what importance is given to each of them. This lack of controllability limits the applicability of captioning algorithms to complex scenarios in which some form of control over the generation process is needed. To address these problems, a model is presented that can generate diverse natural-language descriptions conditioned on a control signal given as a set of image regions to be described. Along a different line, the possibility of naming movie characters with their proper names is also explored, which again requires a certain degree of controllability over the captioning model. The last part of the thesis presents solutions for cross-modal retrieval, another task combining vision and language that consists of finding the images corresponding to a textual query and vice versa. Finally, the application of these retrieval techniques to cultural heritage and digital humanities is shown, obtaining promising results with both supervised and unsupervised models.

2020 Doctoral thesis

Learn to See by Events: Color Frame Synthesis from Event and RGB Cameras

Authors: Pini, Stefano; Borghi, Guido; Vezzani, Roberto

Event cameras are biologically-inspired sensors that gather the temporal evolution of the scene. They capture pixel-wise brightness variations and output a corresponding stream of asynchronous events. Despite having multiple advantages with respect to traditional cameras, their use is partially prevented by the limited applicability of traditional data processing and vision algorithms. To address this, we present a framework which exploits the output stream of event cameras to synthesize RGB frames, relying on an initial or a periodic set of color key-frames and the sequence of intermediate events. Unlike existing work, we propose a deep learning-based frame synthesis method, consisting of an adversarial architecture combined with a recurrent module. Qualitative results and a quantitative per-pixel, perceptual, and semantic evaluation on four public datasets confirm the quality of the synthesized images.
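
A minimal, hypothetical sketch of the kind of recurrent, adversarial frame-synthesis setup the abstract describes (not the authors' network): a generator consumes a colour key-frame plus a voxelised event tensor while carrying a hidden state across time steps, and a small patch discriminator supplies the adversarial signal.

```python
import torch
from torch import nn

class RecurrentGenerator(nn.Module):
    """Illustrative generator: key-frame + event voxel grid + hidden state -> RGB frame."""
    def __init__(self, event_bins=5, hidden=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3 + event_bins + hidden, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, padding=1), nn.ReLU(),
        )
        self.to_rgb = nn.Conv2d(hidden, 3, 3, padding=1)
        self.hidden_ch = hidden

    def forward(self, key_frame, events, state=None):
        if state is None:
            state = key_frame.new_zeros(key_frame.size(0), self.hidden_ch,
                                        key_frame.size(2), key_frame.size(3))
        state = self.encode(torch.cat([key_frame, events, state], dim=1))
        return torch.sigmoid(self.to_rgb(state)), state

class PatchDiscriminator(nn.Module):
    """Tiny patch discriminator providing the adversarial loss signal."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# Toy forward pass with random tensors standing in for real data.
gen, disc = RecurrentGenerator(), PatchDiscriminator()
key = torch.rand(1, 3, 64, 64)        # colour key-frame
events = torch.rand(1, 5, 64, 64)     # voxelised event stream between frames
frame, state = gen(key, events)
print(frame.shape, disc(frame).shape)
```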

2020 Conference proceedings

Mercury: a vision-based framework for Driver Monitoring

Authors: Borghi, Guido; Pini, Stefano; Vezzani, Roberto; Cucchiara, Rita

In this paper, we propose a complete framework, namely Mercury, that combines Computer Vision and Deep Learning algorithms to continuously monitor the driver during the driving activity. The proposed solution complies with the requirements imposed by the challenging automotive context. First, light invariance is needed, in order to have a system able to work regardless of the time of day and the weather conditions: therefore, infrared-based images, i.e. depth maps (in which each pixel corresponds to the distance between the sensor and that point in the scene), have been exploited in conjunction with traditional intensity images. Second, the non-invasiveness of the system is required, since the driver's movements must not be impeded during the driving activity: in this context, the use of cameras and vision-based algorithms is one of the best solutions. Finally, real-time performance is needed, since a monitoring system must react immediately as soon as a situation of potential danger is detected.
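
A minimal two-stream sketch of the depth-plus-intensity input described above, with hypothetical layer sizes and output classes; it is not the Mercury architecture, only an illustration of fusing the two modalities.

```python
import torch
from torch import nn

class TwoStreamMonitor(nn.Module):
    """Illustrative fusion of a depth map and an intensity image for driver-state classes."""
    def __init__(self, n_classes=4):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.depth_stream = stream()
        self.intensity_stream = stream()
        self.head = nn.Linear(64, n_classes)

    def forward(self, depth, intensity):
        feats = torch.cat([self.depth_stream(depth),
                           self.intensity_stream(intensity)], dim=1)
        return self.head(feats)

model = TwoStreamMonitor()
depth = torch.rand(1, 1, 96, 96)       # depth map: each pixel is a distance from the sensor
intensity = torch.rand(1, 1, 96, 96)   # infrared intensity image
print(model(depth, intensity).shape)   # torch.Size([1, 4])
```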

2020 Conference proceedings

Meshed-Memory Transformer for Image Captioning

Authors: Cornia, Marcella; Stefanini, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit both low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
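
The snippet below sketches the memory-augmented attention idea in a simplified form, with assumed dimensions and slot counts: learned memory keys and values are appended to the image-region keys and values before standard multi-head attention. It is not the official code, which is available at the repository linked above.

```python
import torch
from torch import nn

class MemoryAugmentedAttention(nn.Module):
    """Simplified illustration: region features attend over regions plus learned memory slots."""
    def __init__(self, d_model=512, n_heads=8, n_memory=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)

    def forward(self, regions):
        # regions: (batch, n_regions, d_model) features of detected image regions
        b = regions.size(0)
        keys = torch.cat([regions, self.mem_k.expand(b, -1, -1)], dim=1)
        values = torch.cat([regions, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(regions, keys, values)
        return out

layer = MemoryAugmentedAttention()
regions = torch.rand(2, 36, 512)      # e.g. 36 region features per image
print(layer(regions).shape)           # torch.Size([2, 36, 512])
```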

2020 Conference proceedings

Multi-omics Classification on Kidney Samples Exploiting Uncertainty-Aware Models

Authors: Lovino, Marta; Bontempo, Gianpaolo; Cirrincione, Giansalvo; Ficarra, Elisa

Due to the huge amount of available omic data, classifying samples according to various omics is a complex process. One of the most common approaches consists of creating a classifier for each omic and subsequently making a consensus among the classifiers that assigns to each sample the most voted class among the outputs on the individual omics. However, this approach does not consider the confidence in the prediction, ignoring that biological information coming from a certain omic may be more reliable than others. Therefore, we propose a method consisting of a tree-based multi-layer perceptron (MLP), which estimates the class-membership probabilities for classification. In this way, it is not only possible to give relevance to all the omics, but also to label as Unknown those samples for which the classifier is uncertain in its prediction. The method was applied to a dataset composed of 909 kidney cancer samples for which three omics were available: gene expression (mRNA), microRNA expression (miRNA), and methylation profiles (meth). The method is also valid for other tissues and other omics (e.g. proteomics, copy number alterations, single nucleotide polymorphism data). The accuracy and weighted average F1-score of the model are both higher than 95%. This tool can therefore be particularly useful in clinical practice, allowing physicians to focus on the most interesting and challenging samples.
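
A simplified sketch of the uncertainty-aware consensus idea (not the paper's tree-based MLP): one classifier per omic, averaged class-membership probabilities, and an "Unknown" label when the top class falls below a confidence threshold. The data and hyper-parameters below are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_train, n_classes = 120, 3

# Placeholder feature matrices standing in for mRNA, miRNA and methylation data.
omics = {
    "mRNA": rng.normal(size=(n_train, 50)),
    "miRNA": rng.normal(size=(n_train, 20)),
    "meth": rng.normal(size=(n_train, 30)),
}
y = rng.integers(0, n_classes, size=n_train)

# One classifier per omic, each outputting class-membership probabilities.
models = {name: MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                              random_state=0).fit(X, y)
          for name, X in omics.items()}

def classify(sample, threshold=0.7):
    # sample: dict mapping omic name -> single-row feature array
    probs = np.mean([models[name].predict_proba(x) for name, x in sample.items()],
                    axis=0)[0]
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else "Unknown"

probe = {name: rng.normal(size=(1, X.shape[1])) for name, X in omics.items()}
print(classify(probe))
```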

2020 Conference proceedings

Multimodal Hand Gesture Classification for the Human-Car Interaction

Authors: D’Eusanio, Andrea; Simoni, Alessandro; Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Published in: INFORMATICS

2020 Journal article

On Gaze Deployment to Audio-Visual Cues of Social Interactions

Authors: Boccignone, G.; Cuculo, V.; D'Amelio, A.; Grossi, G.; Lanzarotti, R.

Published in: IEEE ACCESS

Attention supports our urge to forage on social cues. Under certain circumstances, we spend the majority of time scrutinising people, notably their eyes and faces, and spotting persons that are talking. To account for such behaviour, this article develops a computational model for the deployment of gaze within a multimodal landscape, namely a conversational scene. Gaze dynamics is derived in a principled way by reformulating attention deployment as a stochastic foraging problem. Model simulation experiments on a publicly available dataset of eye-tracked subjects are presented. Results show that the simulated scan paths exhibit trends similar to the eye movements of human observers watching and listening to conversational clips in a free-viewing condition.
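
A toy illustration of casting gaze deployment as foraging over patches, with a hypothetical turn-taking sequence: the simulated gaze prefers the face of the current speaker and draws a stochastic dwell time per visit. This is only a caricature of the foraging idea, not the article's model.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = ["face_A", "face_B", "background"]

def patch_weights(speaker):
    # Speaking faces are more valuable forage; the background is rarely visited.
    return np.array([3.0 if p == f"face_{speaker}" else
                     1.0 if p.startswith("face") else 0.3 for p in patches])

speakers = ["A", "A", "B", "B", "B", "A"]    # hypothetical turn-taking sequence
scanpath = []
for speaker in speakers:
    w = patch_weights(speaker)
    patch = rng.choice(patches, p=w / w.sum())
    dwell_ms = float(rng.gamma(shape=2.0, scale=150.0))   # stochastic dwell time
    scanpath.append((patch, round(dwell_ms)))

print(scanpath)
```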

2020 Journal article

Online Continual Learning under Extreme Memory Constraints

Authors: Fini, Enrico; Lathuilière, Stéphane; Sangineto, Enver; Nabi, Moin; Ricci, Elisa

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2020 Conference proceedings
