Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities

Authors: Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita

While several approaches to bringing vision and language together are emerging, none of them has yet addressed the digital humanities domain, which, nevertheless, is a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts of the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and outline future work along the same line.
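
As a rough illustration of how such a joint embedding can be trained, the sketch below implements a standard bidirectional hinge-based ranking loss over image and text embeddings, a common choice for visual-semantic spaces; the encoders, embedding dimension, and margin are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vse_hinge_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional max-margin ranking loss over L2-normalized embeddings.

    Matching image/text pairs sit on the diagonal of the similarity matrix;
    every off-diagonal entry acts as a negative in both directions.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                        # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                       # positive pair scores
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> wrong texts
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text -> wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# Toy usage: 4 illustration/commentary pairs in a hypothetical 128-d joint space.
img, txt = torch.randn(4, 128), torch.randn(4, 128)
print(vse_hinge_loss(img, txt))
```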

2018 Paper in Conference Proceedings

Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

Authors: Cornia, Marcella; Abati, Davide; Baraldi, Lorenzo; Palazzi, Andrea; Calderara, Simone; Cucchiara, Rita

Published in: INTELLIGENZA ARTIFICIALE

Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e., which parts of an image pop out of a visual scene. This process has been studied for decades both in neuroscience and in terms of computational models that reproduce the human cortical process. In the last few years, early models have been replaced by deep learning architectures, which outperform any early approach on public datasets. In this paper, we discuss the effectiveness of convolutional neural network (CNN) models in saliency prediction. We present a set of deep learning architectures developed by us, which can combine both bottom-up cues and higher-level semantics, and extract spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. We show how these deep networks closely recall the early saliency models, although improved with the semantics learned from human ground-truth. Finally, we present a use case in which saliency prediction is used to improve the automatic description of images.
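
To make the setting concrete, here is a minimal fully convolutional encoder-decoder that maps an RGB image to a single-channel saliency map. It is a toy stand-in under assumed layer sizes, not one of the architectures presented in the paper (which combine bottom-up cues, higher-level semantics, and 3D convolutions).

```python
import torch
import torch.nn as nn

class TinySaliencyNet(nn.Module):
    """Fully convolutional encoder-decoder: RGB image -> 1-channel saliency map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),  # saliency values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = TinySaliencyNet()
saliency = net(torch.randn(1, 3, 224, 224))  # -> (1, 1, 224, 224)
# Typical objectives compare the prediction against human fixation maps,
# e.g. with a binary cross-entropy or KL-divergence loss.
```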

2018 Journal Article

Automatic Image Cropping and Selection using Saliency: an Application to Historical Manuscripts

Authors: Cornia, Marcella; Pini, Stefano; Baraldi, Lorenzo; Cucchiara, Rita

Published in: COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE

Automatic image cropping techniques are particularly important to improve the visual quality of cropped images and can be applied to a wide range of applications, such as photo editing, image compression, and thumbnail selection. In this paper, we propose a saliency-based image cropping method which produces meaningful cropped images by relying only on the corresponding saliency maps. Experiments on standard image cropping datasets demonstrate the benefit of the proposed solution with respect to other cropping methods. Moreover, we present an image selection method that can be effectively applied to automatically select the most representative pages of historical manuscripts, thus improving the navigation of historical digital libraries.
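
A saliency-driven crop can be searched exhaustively yet cheaply with a summed-area table; the sketch below returns the fixed-size window with the highest total saliency. This is a generic baseline under assumed inputs, not necessarily the paper's exact selection criterion.

```python
import numpy as np

def best_crop(saliency, crop_h, crop_w):
    """Return (top, left) of the crop_h x crop_w window with maximum total saliency.

    A summed-area table makes every window sum an O(1) lookup.
    """
    # Integral image padded with a zero row/column for clean indexing.
    sat = np.zeros((saliency.shape[0] + 1, saliency.shape[1] + 1))
    sat[1:, 1:] = saliency.cumsum(0).cumsum(1)
    # Window sums for every valid top-left corner, fully vectorized.
    sums = (sat[crop_h:, crop_w:] - sat[:-crop_h, crop_w:]
            - sat[crop_h:, :-crop_w] + sat[:-crop_h, :-crop_w])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return top, left

sal = np.random.rand(480, 640)   # stand-in for a predicted saliency map
print(best_crop(sal, 240, 320))  # top-left of the most salient 240x320 crop
```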

2018 Paper in Conference Proceedings

Colorectal Cancer Classification using Deep Convolutional Networks. An Experimental Study

Authors: Ponzio, Francesco; Macii, Enrico; Ficarra, Elisa; Di Cataldo, Santa

The analysis of histological samples is of paramount importance for the early diagnosis of colorectal cancer (CRC). Traditional visual assessment is time-consuming and highly unreliable because of the subjectivity of the evaluation. On the other hand, automated analysis is extremely challenging due to the variability of the architectural and colouring characteristics of histological images. In this work, we propose a deep learning technique based on Convolutional Neural Networks (CNNs) to differentiate adenocarcinomas from healthy tissues and benign lesions. Fully training the CNN on a large set of annotated CRC samples provides good classification accuracy (around 90% in our tests), but has the drawback of a very computationally intensive training procedure. Hence, we also investigate transfer learning approaches based on CNN models pre-trained on a completely different dataset (i.e., ImageNet). In our results, transfer learning considerably outperforms the CNN fully trained on CRC samples, obtaining an accuracy of about 96% on the same test dataset.
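
The transfer-learning recipe can be sketched as follows: freeze an ImageNet-pretrained backbone and retrain only a small classification head on labeled CRC patches. The ResNet-18 backbone and the three-class head are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Reuse ImageNet features; train only the final classification layer.
backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False  # freeze the pre-trained feature extractor
# Hypothetical 3-way head: adenocarcinoma / benign lesion / healthy tissue.
backbone.fc = nn.Linear(backbone.fc.in_features, 3)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
# Train with nn.CrossEntropyLoss() on CRC patches as in any classification loop.
```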

2018 Paper in Conference Proceedings

Comportamento non verbale intergruppi “oggettivo”: una replica dello studio di Dovidio, Kawakami e Gaertner (2002)

Authors: Di Bernardo, Gian Antonio; Vezzali, Loris; Giovannini, Dino; Palazzi, Andrea; Calderara, Simone; Bicocchi, Nicola; Zambonelli, Franco; Cucchiara, Rita; Cadamuro, Alessia; Cocco, Veronica Margherita

There is a long tradition of research analyzing non-verbal behavior, including in intergroup relations. Such studies typically rely on ratings by external coders, which, however, are subjective and open to bias. We conducted a study modeled on the well-known work of Dovidio, Kawakami and Gaertner (2002), with some modifications, considering the relationship between White and Black people. White participants, after completing measures of explicit and implicit prejudice, met (in counterbalanced order) a White and a Black confederate. With each of them, they talked for three minutes about a neutral topic and about a topic salient to the group distinction (in counterbalanced order). These interactions were recorded with a Kinect camera, which can capture the three-dimensional component of movement. The results revealed several elements of interest. First of all, objective indices were constructed, starting from an analysis of the literature, some of which cannot be detected by external coders, such as interpersonal distance and the volume of space between people. The results highlighted some relevant aspects: (1) implicit attitude is associated with various indices of non-verbal behavior, which in turn mediate the evaluations of the participants given by the confederates; (2) interactions should be considered dynamically, taking into account that they develop over time; (3) what may matter is global non-verbal behavior, rather than a few specific indices pre-determined by the experimenters.

2018 Abstract in Conference Proceedings

Connected Components Labeling on DRAGs

Authors: Bolelli, Federico; Baraldi, Lorenzo; Cancilla, Michele; Grana, Costantino

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

In this paper we introduce a new Connected Components Labeling (CCL) algorithm which exploits a novel approach to model decision problems as Directed Acyclic Graphs with a root, which we call Directed Rooted Acyclic Graphs (DRAGs). This structure supports the use of sets of equivalent actions, as required by CCL, and optimally leverages these equivalences to reduce the number of nodes (decision points). The advantage of this representation is that a DRAG, unlike the decision trees usually exploited by state-of-the-art algorithms, contains only the minimum number of nodes required to reach the leaf corresponding to a set of condition values. This combines the benefits of using binary decision trees with a reduction of the machine code size. Experiments show a consistent improvement in execution time when the model is applied to CCL.
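
The core DRAG idea, collapsing structurally identical subtrees of a decision tree into shared nodes, can be sketched via hash-consing. The snippet below shows only this structural-merging half with a toy node encoding; the paper additionally exploits sets of equivalent actions.

```python
def to_drag(node, cache=None):
    """Collapse identical subtrees of a binary decision tree into shared nodes,
    yielding a rooted DAG with no duplicated decision points.

    A node is ('leaf', action) or ('test', condition, low_child, high_child).
    """
    if cache is None:
        cache = {}
    if node[0] == 'leaf':
        key = node
    else:
        _, cond, lo, hi = node
        lo, hi = to_drag(lo, cache), to_drag(hi, cache)
        if lo is hi:  # the test is irrelevant: skip the node entirely
            return lo
        key = ('test', cond, id(lo), id(hi))
        node = ('test', cond, lo, hi)
    return cache.setdefault(key, node)  # reuse an existing identical node

# The redundant test on 'q' collapses, since both branches share one leaf.
tree = ('test', 'p', ('leaf', 'new_label'),
                     ('test', 'q', ('leaf', 'merge'), ('leaf', 'merge')))
drag = to_drag(tree)  # -> ('test', 'p', ('leaf', 'new_label'), ('leaf', 'merge'))
```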

2018 Paper in Conference Proceedings

Deep construction of an affective latent space via multimodal enactment

Authors: Boccignone, Giuseppe; Conte, Donatello; Cuculo, Vittorio; D'Amelio, Alessandro; Grossi, Giuliano; Lanzarotti, Raffaella

Published in: IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS

We draw on a simulationist approach to the analysis of facially displayed emotions, e.g., in the course of a face-to-face interaction between an expresser and an observer. At the heart of this perspective lies the enactment of the perceived emotion in the observer. We propose a novel probabilistic framework based on a deep latent representation of a continuous affect space, which can be exploited for both the estimation and the enactment of affective states in a multimodal space (visible facial expressions and physiological signals). The rationale behind the approach lies in the large body of evidence from affective neuroscience showing that when we observe emotional facial expressions, we react with congruent facial mimicry. Further, in more complex situations, affect understanding is likely to rely on a comprehensive representation grounding the reconstruction of the state of the body associated with the displayed emotion. We show that our approach can address such problems from a unified and principled perspective, thus avoiding ad hoc heuristics while minimizing learning efforts.
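
As a very loose sketch of a deep latent affect space, the toy variational autoencoder below encodes concatenated facial and physiological features into a shared latent code and reconstructs both modalities from it. The paper's probabilistic framework is considerably more elaborate, so every dimension and layer here is an assumption.

```python
import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    """Toy VAE over concatenated face + physiological features.

    Encoding estimates a latent affective state; decoding "enacts" it by
    reconstructing both modalities from the shared latent code.
    """
    def __init__(self, face_dim=64, physio_dim=8, z_dim=2):
        super().__init__()
        d = face_dim + physio_dim
        self.enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, d))

    def forward(self, face, physio):
        mu, logvar = self.enc(torch.cat([face, physio], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

model = MultimodalVAE()
recon, mu, logvar = model(torch.randn(5, 64), torch.randn(5, 8))
```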

2018 Journal Article

Deep Head Pose Estimation from Depth Data for In-car Automotive Applications

Authors: Venturelli, Marco; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

Published in: LECTURE NOTES IN ARTIFICIAL INTELLIGENCE

Recently, deep learning approaches have achieved promising results in various fields of computer vision. In this paper, we tackle the problem of head pose estimation through a Convolutional Neural Network (CNN). Differently from other proposals in the literature, the described system is able to work directly on raw depth data alone. Moreover, head pose estimation is solved as a regression problem and does not rely on visual facial features such as facial landmarks. We tested our system on a well-known public dataset, Biwi Kinect Head Pose, showing that our approach achieves state-of-the-art results and is able to meet real-time performance requirements.
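
A minimal version of this setup, a CNN regressing continuous head-pose angles directly from a raw depth crop, might look like the sketch below; layer sizes and the implied L2 training loss are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DepthPoseNet(nn.Module):
    """Small CNN regressing (yaw, pitch, roll) from a 1-channel depth crop."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 3)  # continuous regression, no facial landmarks

    def forward(self, depth):
        return self.head(self.features(depth).flatten(1))

net = DepthPoseNet()
angles = net(torch.randn(1, 1, 64, 64))  # e.g. trained with an L2 loss on Biwi
```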

2018 Paper in Conference Proceedings

Deep Learning-Based Method for Vision-Guided Robotic Grasping of Unknown Objects

Authors: Bergamini, Luca; Sposato, Mario; Peruzzini, Margherita; Vezzani, Roberto; Pellicciari, Marcello

Published in: ADVANCES IN TRANSDISCIPLINARY ENGINEERING

Collaborative robots must operate safely and efficiently in ever-changing unstructured environments, grasping and manipulating many different objects. Artificial vision has proved to be the ideal sensing technology for collaborative robots and is widely used for identifying the objects to manipulate and for detecting their optimal grasps. One of the main drawbacks of state-of-the-art robotic vision systems is the long training needed to teach the identification and optimal grasps of each object, which leads to a strong reduction of the robot's productivity and overall operating flexibility. To overcome this limit, we propose an engineering method, based on deep learning techniques, for the detection of robotic grasps of unknown objects in an unstructured environment, which should enable collaborative robots to autonomously generate grasping strategies without the need for training and programming. A novel loss function for the training of the grasp prediction network has been developed and proved to work well even with low-resolution 2-D images, thus allowing the use of a single, smaller and low-cost camera that can be better integrated in robotic end-effectors. Despite the availability of less information (resolution and depth), an accuracy of 75% has been achieved on the Cornell dataset, and we show that our implementation of the loss function does not suffer from the common problems reported in the literature. The system has been implemented using the ROS framework and tested on a Baxter collaborative robot.
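
The abstract does not spell out the novel loss, so the sketch below shows only a generic grasp-rectangle regression loss with a rotation-invariant angle term, one plausible ingredient of such a system; the (x, y, theta, w, h) parameterization is an assumption borrowed from common grasp-detection work.

```python
import torch
import torch.nn.functional as F

def grasp_loss(pred, target):
    """Illustrative regression loss on 5-D grasp rectangles (x, y, theta, w, h).

    The angle enters as (cos 2t, sin 2t), so a gripper rotated by 180 degrees,
    which grasps identically, incurs no penalty.
    """
    loss_box = F.smooth_l1_loss(pred[:, [0, 1, 3, 4]], target[:, [0, 1, 3, 4]])
    ang_p = torch.stack([torch.cos(2 * pred[:, 2]), torch.sin(2 * pred[:, 2])], dim=1)
    ang_t = torch.stack([torch.cos(2 * target[:, 2]), torch.sin(2 * target[:, 2])], dim=1)
    return loss_box + F.mse_loss(ang_p, ang_t)

pred, target = torch.rand(8, 5), torch.rand(8, 5)  # a batch of 8 grasp candidates
print(grasp_loss(pred, target))
```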

2018 Paper in Conference Proceedings

Deep Metric and Hash-Code Learning for Content-Based Retrieval of Remote Sensing Images

Authors: Roy, Subhankar; Sangineto, Enrico; Demir, Begüm; Sebe, Nicu

The growing volume of Remote Sensing (RS) image archives demands feature learning techniques and hashing functions which can: (1) accurately represent the semantics in RS images; and (2) offer quasi real-time performance during retrieval. This paper aims to address both challenges at the same time, by learning a semantic-based metric space for content-based RS image retrieval while simultaneously producing binary hash codes for an efficient archive search. This double goal is achieved by training a deep network using a combination of different loss functions which, on the one hand, aim at clustering semantically similar samples (i.e., images) and, on the other hand, encourage the network to produce final activation values (i.e., descriptors) that can be easily binarized. Moreover, since annotated RS training images are too few to train a deep network from scratch, we propose to split the image representation problem into two different phases. In the first, we use a general-purpose, pre-trained network to produce an intermediate representation; in the second, we train our hashing network on a relatively small set of training images. Experiments on two aerial benchmark archives show that the proposed method outperforms previous state-of-the-art hashing approaches by up to 5.4% using the same number of hash bits per image.
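
The two objectives described above can be sketched as a triplet metric loss combined with a penalty pushing activations toward ±1, so that thresholding the descriptors at zero yields usable hash codes; the exact losses and weighting here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def metric_hash_loss(anchor, positive, negative, margin=1.0, lam=0.1):
    """Triplet metric loss plus a binarization penalty on the activations."""
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    acts = torch.cat([anchor, positive, negative])
    binarization = ((acts.abs() - 1.0) ** 2).mean()  # pull |activation| toward 1
    return triplet + lam * binarization

# Toy usage with hypothetical 32-d descriptors (32 hash bits per image).
a, p, n = (torch.randn(16, 32) for _ in range(3))
loss = metric_hash_loss(a, p, n)
codes = (a > 0).int()  # threshold at zero -> binary hash codes
```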

2018 Paper in Conference Proceedings
