Publications by Enver Sangineto

Explore our research publications: papers, articles, and conference proceedings from AImageLab.


Deep Metric and Hash-Code Learning for Content-Based Retrieval of Remote Sensing Images

Authors: Roy, S.; Sangineto, E.; Demir, B.; Sebe, N.

The growing volume of Remote Sensing (RS) image archives demands feature learning techniques and hashing functions which can: (1) accurately represent the semantics in the RS images; and (2) have quasi real-time performance during retrieval. This paper aims to address both challenges at the same time by learning a semantic-based metric space for content-based RS image retrieval while simultaneously producing binary hash codes for an efficient archive search. This dual goal is achieved by training a deep network using a combination of different loss functions which, on the one hand, aim at clustering semantically similar samples (i.e., images) and, on the other hand, encourage the network to produce final activation values (i.e., descriptors) that can be easily binarized. Moreover, since annotated RS training images are too few to train a deep network from scratch, we propose to split the image representation problem into two phases. In the first, we use a general-purpose, pre-trained network to produce an intermediate representation; in the second, we train our hashing network using a relatively small set of training images. Experiments on two aerial benchmark archives show that the proposed method outperforms previous state-of-the-art hashing approaches by up to 5.4% using the same number of hash bits per image.
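
Below is a minimal PyTorch sketch of the loss combination described in the abstract: a metric term that clusters same-class descriptors and separates different classes, plus a term that pushes activations toward ±1 so they binarize cleanly. The contrastive form, margin, and weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def metric_loss(desc, labels, margin=1.0):
    """Contrastive-style loss over all pairs in the batch."""
    dists = torch.cdist(desc, desc)                    # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    pos = dists[same].pow(2).mean()                    # pull similar pairs together
    neg = F.relu(margin - dists[~same]).pow(2).mean()  # push dissimilar pairs apart
    return pos + neg

def binarization_loss(desc):
    """Encourage tanh-like activations to saturate near -1/+1."""
    return (1.0 - desc.abs()).clamp(min=0).mean()

def hashing_loss(desc, labels, lam=0.1):
    # desc: (B, K) final activations in [-1, 1]; labels: (B,) class ids
    return metric_loss(desc, labels) + lam * binarization_loss(desc)

# At retrieval time the hash codes are just the sign of the activations:
# codes = desc.sign()  # (B, K) entries in {-1, +1}
```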

2018 Conference paper

Deformable GANs for Pose-Based Human Image Generation

Authors: Siarohin, Aliaksandr; Sangineto, Enver; Lathuilière, Stéphane; Sebe, Nicu

In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image. We test our approach using photos of persons in different poses and compare our method with previous work in this area, showing state-of-the-art results on two benchmarks. Our method can be applied to the wider field of deformable object generation, provided that the pose of the articulated object can be extracted using a keypoint detector.
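
A minimal PyTorch sketch of a nearest-neighbour reconstruction loss of the kind described above: each generated pixel is matched against the best-matching target pixel inside a small window, which makes the loss tolerant to small spatial misalignments. The 3×3 window and L1 point distance are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def nearest_neighbour_loss(generated, target, window=3):
    # generated, target: (B, C, H, W) images in the same value range
    B, C, H, W = target.shape
    pad = window // 2
    # Every shifted copy of the target inside the window:
    patches = F.unfold(F.pad(target, [pad] * 4), kernel_size=window)
    patches = patches.view(B, C, window * window, H, W)
    # L1 distance from each generated pixel to each shifted target pixel,
    # keeping only the best (minimum) match within the window.
    diffs = (generated.unsqueeze(2) - patches).abs().sum(dim=1)  # (B, w*w, H, W)
    return diffs.min(dim=1).values.mean()
```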

2018 Conference paper

Plug-and-play CNN for crowd motion analysis: An application in abnormal event detection

Authors: Ravanbakhsh, M.; Nabi, M.; Mousavi, H.; Sangineto, E.; Sebe, N.

Most crowd abnormal event detection methods rely on complex hand-crafted features to represent crowd motion and appearance. Convolutional Neural Networks (CNNs) have been shown to be a powerful instrument with excellent representational capacity, which can alleviate the need for hand-crafted features. In this paper, we show that keeping track of the changes in CNN features across time can be used to effectively detect local anomalies. Specifically, we propose to measure local abnormality by combining semantic information (inherited from existing CNN models) with low-level optical flow. One of the advantages of this method is that it can be used without a fine-tuning phase. The proposed method is validated on challenging abnormality detection datasets, and the results show the superiority of our approach compared with state-of-the-art methods.
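
As a rough illustration of the plug-and-play idea, the sketch below takes feature maps from a frozen, pre-trained CNN (no fine-tuning), measures how much each spatial cell changes between consecutive frames, and weights that change by the optical-flow magnitude. The VGG backbone, the fusion by element-wise product, and the resolution handling are assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.vgg16(weights="DEFAULT").features.eval()  # frozen, pre-trained CNN

@torch.no_grad()
def abnormality_map(frame_t, frame_t1, flow_mag):
    # frame_t, frame_t1: (1, 3, H, W) normalised consecutive frames
    # flow_mag: (1, 1, H, W) low-level optical-flow magnitude
    f_t, f_t1 = backbone(frame_t), backbone(frame_t1)       # (1, C, h, w) semantics
    change = (f_t1 - f_t).abs().mean(dim=1, keepdim=True)   # temporal feature change
    flow = F.interpolate(flow_mag, size=change.shape[-2:])  # match feature resolution
    return change * flow  # high where semantics and motion change together
```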

2018 Conference paper

Semantic-Fusion GANs for Semi-Supervised Satellite Image Classification

Authors: Roy, Subhankar; Sangineto, E.; Demir, B.; Sebe, N.

Published in: PROCEEDINGS - INTERNATIONAL CONFERENCE ON IMAGE PROCESSING

Most public satellite image datasets contain only a small number of annotated images. The lack of a sufficient quantity of labeled training data is a bottleneck for the use of modern deep-learning-based classification approaches in this domain. In this paper we propose a semi-supervised approach to deal with this problem. We use the discriminator (D) of a Generative Adversarial Network (GAN) as the final classifier, and we train D using both labeled and unlabeled data. The main novelty we introduce is the representation of the visual information fed to D by means of two different channels: the original image and its “semantic” representation, the latter being obtained by means of an external network trained on ImageNet. The two channels are fused in D and jointly used to classify fake images, real labeled images, and real unlabeled images. We show that, using only 100 labeled images, the proposed approach achieves an accuracy close to 69% and a significant improvement with respect to other GAN-based semi-supervised methods. Although we have tested our approach only on satellite images, we do not use any domain-specific knowledge; thus, our method can be applied to other semi-supervised domains.
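
A minimal sketch of a two-channel discriminator in the spirit of the abstract: one stream sees the raw image, the other a pre-computed "semantic" feature map from a frozen ImageNet-trained network, and the fused streams feed a (K+1)-way head (K real classes plus one "fake" class, as is common in GAN-based semi-supervised learning). Layer sizes, channel counts, and the fusion point are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDiscriminator(nn.Module):
    def __init__(self, sem_channels=256, n_classes=10):
        super().__init__()
        self.img_stream = nn.Sequential(   # low-level stream: the raw image
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.sem_stream = nn.Sequential(   # semantic stream: external features
            nn.Conv2d(sem_channels, 128, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.head = nn.Sequential(         # (K+1)-way classifier after fusion
            nn.Conv2d(256, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, n_classes + 1),
        )

    def forward(self, image, semantic):
        a = self.img_stream(image)                                # (B, 128, H/4, W/4)
        b = F.interpolate(self.sem_stream(semantic), a.shape[-2:])
        return self.head(torch.cat([a, b], dim=1))                # fused channels
```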

2018 Conference paper

Abnormal event detection in videos using generative adversarial nets

Authors: Ravanbakhsh, M.; Nabi, M.; Sangineto, E.; Marcenaro, L.; Regazzoni, C.; Sebe, N.

Published in: PROCEEDINGS - INTERNATIONAL CONFERENCE ON IMAGE PROCESSING

In this paper we address the abnormality detection problem in crowded scenes. We propose to use Generative Adversarial Nets (GANs), which are trained using normal frames and corresponding optical-flow images in order to learn an internal representation of the scene normality. Since our GANs are trained with only normal data, they are not able to generate abnormal events. At testing time, the real data are compared with both the appearance and the motion representations reconstructed by our GANs, and abnormal areas are detected by computing local differences. Experimental results on challenging abnormality detection datasets show the superiority of the proposed method compared to the state of the art in both frame-level and pixel-level abnormality detection tasks.
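
The test-time step can be sketched as follows: two cross-channel generators, trained only on normal data, reconstruct motion from appearance and appearance from motion, and local reconstruction errors flag abnormal regions. The generator interfaces, the L1 error, and the fusion weight are assumptions.

```python
import torch

@torch.no_grad()
def anomaly_map(frame, flow, gen_flow_from_frame, gen_frame_from_flow, alpha=0.5):
    # frame: (1, 3, H, W) appearance; flow: (1, 2, H, W) optical flow
    flow_hat = gen_flow_from_frame(frame)    # "normal" motion for this appearance
    frame_hat = gen_frame_from_flow(flow)    # "normal" appearance for this motion
    motion_err = (flow - flow_hat).abs().mean(dim=1, keepdim=True)
    appear_err = (frame - frame_hat).abs().mean(dim=1, keepdim=True)
    # Trained only on normal data, the generators reconstruct abnormal
    # regions poorly, so large local errors flag anomalies.
    return alpha * motion_err + (1 - alpha) * appear_err  # (1, 1, H, W)
```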

2017 Conference paper

FOIL it! Find One mismatch between Image and Language caption

Authors: Shekhar, Ravi; Pezzelle, Sandro; Klimovich, Yauhen; Herbelot, Aurélie; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
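
As a rough illustration, a FOIL-COCO-style item and the first task (correct-vs-foil classification) might look like the sketch below; the field names and the pairwise scoring rule are hypothetical, not the released format.

```python
example = {
    "image_id": 123456,
    "caption": "A man is riding a bicycle down the street.",
    "foil_caption": "A man is riding a motorcycle down the street.",
    "foil_word": "motorcycle",
    "target_word": "bicycle",
}

def classify(model_score, entry):
    """model_score(image_id, caption) -> image-text compatibility score
    (a hypothetical interface). A caption is judged 'correct' if it
    outscores its foil for the same image."""
    ok = model_score(entry["image_id"], entry["caption"])
    foil = model_score(entry["image_id"], entry["foil_caption"])
    return "correct" if ok > foil else "foil"
```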

2017 Conference paper

Vision and language integration: Moving beyond objects

Authors: Shekhar, R.; Pezzelle, S.; Herbelot, A.; Nabi, M.; Sangineto, E.; Bernardi, R.

Recent years have seen an explosion of work on the integration of vision and language data. New tasks like Image Captioning and Visual Question Answering have been proposed, and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of those models. To this end, several datasets have been proposed to challenge the state of the art. Those datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations fare somewhat better, but only in attention models.
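
A toy sketch of extending foil generation beyond nouns: swap a single word of a chosen part of speech with a confusable same-POS alternative. The word lists are purely illustrative; the actual datasets use controlled vocabularies and similarity constraints.

```python
CONFUSABLE = {                      # toy, hand-picked same-POS alternatives
    "ADJ":  {"red": "blue", "large": "small"},
    "VERB": {"riding": "pushing", "eating": "holding"},
    "ADV":  {"quickly": "slowly"},
    "PREP": {"on": "under", "behind": "beside"},
}

def make_foil(caption, pos):
    """Replace the first confusable word of the given part of speech."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in CONFUSABLE[pos]:
            words[i] = CONFUSABLE[pos][w]
            return " ".join(words), w   # foil caption and the replaced word
    return None                         # no replaceable word of this POS

print(make_foil("A man riding a red bike on the street", "ADJ"))
# -> ('A man riding a blue bike on the street', 'red')
```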

2017 Conference paper

Bad teacher or unruly student: Can deep learning say something in Image Forensics analysis?

Authors: Rota, P.; Sangineto, E.; Conotter, V.; Pramerdorfer, C.

Published in: INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION

The pervasive availability of the Internet, coupled with the development of increasingly powerful technologies, has made digital images the primary source of visual information in today's society. However, their reliability as a true representation of reality cannot be taken for granted: affordable and powerful graphics-editing software can easily alter the original content while leaving no visual trace of the modification, making such images potentially dangerous. This motivates the development of technological solutions able to detect media manipulations without prior knowledge or extra information about the given image. At the same time, the huge amount of available data has led to tremendous advances in data-hungry learning models, which in the last few years have proven successful in image classification. In this work we propose a deep learning approach to tampered-image classification. To the best of our knowledge, this is the first attempt to use the deep learning paradigm in an image-forensics scenario. In particular, we propose a new blind deep learning approach based on Convolutional Neural Networks (CNNs), able to learn invisible discriminative artifacts from manipulated images that can be exploited to automatically discriminate between forged and authentic images. The proposed approach not only detects forged images but can also be extended to localize the tampered regions within an image. The method outperforms the state of the art in terms of accuracy on the CASIA TIDE v2.0 dataset. The capability of automatically crafting discriminative features can lead to surprising results, such as detecting the image compression filters used to create the dataset; this issue is also discussed in the paper.
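
A minimal sketch of a blind CNN tamper classifier, together with the patch-based extension the abstract mentions for localizing tampered regions: score overlapping patches and return a grid of forgery probabilities. The architecture, patch size, and stride are assumptions.

```python
import torch
import torch.nn as nn

tamper_net = nn.Sequential(                      # binary: authentic vs forged
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 2),
)

@torch.no_grad()
def localise(image, patch=64, stride=32):
    # image: (1, 3, H, W); returns a grid of "forged" probabilities
    _, _, H, W = image.shape
    rows = []
    for y in range(0, H - patch + 1, stride):
        row = []
        for x in range(0, W - patch + 1, stride):
            logits = tamper_net(image[:, :, y:y + patch, x:x + patch])
            row.append(logits.softmax(dim=1)[0, 1].item())  # P(forged)
        rows.append(row)
    return torch.tensor(rows)
```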

2016 Conference paper

Learning Personalized Models for Facial Expression Analysis and Gesture Recognition

Authors: Zen, Gloria; Porzi, Lorenzo; Sangineto, Enver; Ricci, Elisa; Sebe, Niculae

Published in: IEEE TRANSACTIONS ON MULTIMEDIA

Facial expression and gesture recognition algorithms are key enabling technologies for human-computer interaction (HCI) systems. State-of-the-art approaches for the automatic detection of body movements and for analyzing emotions from facial features heavily rely on advanced machine learning algorithms. Most of these methods are designed for the average user, but the “one-size-fits-all” assumption ignores diversity in cultural background, gender, ethnicity, and personal behavior, and limits their applicability in real-world scenarios. A possible solution is to build personalized interfaces, which practically implies learning person-specific classifiers and usually collecting a significant amount of labeled samples for each novel user. As data annotation is a tedious and time-consuming process, in this paper we present a framework for personalizing classification models which does not require labeled target data. Personalization is achieved by devising a novel transfer learning approach. Specifically, we propose a regression framework which exploits auxiliary (source) annotated data to learn the relation between person-specific sample distributions and the parameters of the corresponding classifiers. Then, when considering a new target user, the classification model is computed by simply feeding the associated (unlabeled) sample distribution into the learned regression function. We evaluate the proposed approach in different applications: pain recognition and action-unit detection using visual data, and gesture classification using inertial measurements, demonstrating the generality of our method with respect to different input data types and basic classifiers. We also show the advantages of our approach in terms of accuracy and computational time, both with respect to user-independent approaches and to previous personalization techniques.
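
The core idea can be sketched as a regression from a user's unlabeled feature-distribution summary to classifier parameters: fit per-user classifiers on annotated source users, learn the distribution-to-parameters mapping, then predict a classifier for a new user from unlabeled samples alone. The mean/std summary, logistic base classifier, and ridge regression below are simplifying assumptions, not the paper's exact choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def summarise(X):
    """Fixed-length summary of a user's sample distribution."""
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])

def fit_param_regressor(source_users):
    """source_users: list of (X, y) pairs, one per annotated source user."""
    descs, params = [], []
    for X, y in source_users:
        clf = LogisticRegression(max_iter=1000).fit(X, y)  # person-specific classifier
        descs.append(summarise(X))
        params.append(np.concatenate([clf.coef_.ravel(), clf.intercept_]))
    return Ridge().fit(np.array(descs), np.array(params))  # distribution -> parameters

def personalise(regressor, X_target):
    """Predict a linear classifier for a new user from unlabeled samples only."""
    n = X_target.shape[1]
    p = regressor.predict(summarise(X_target)[None, :])[0]
    w, b = p[:n], p[n:]
    return lambda X: (X @ w + b > 0).astype(int)           # personalised decision rule
```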

2016 Journal article

FaceCept3D: Real Time 3D Face Tracking and Analysis

Authors: Tulyakov, S.; Vieriu, R.; Sangineto, E.; Sebe, N.

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

2015 Conference paper
