Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

Authors: Amoroso, Roberto; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Del Bimbo, Alberto; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and put new pressure on fake image detection. In this work, we pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and from ResNet or Vision Transformer (ViT) architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized from different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and of low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2 million images generated from the original COCO image–caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
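As a purely illustrative sketch of the kind of frozen-feature comparison the abstract describes, the snippet below trains a logistic-regression probe on synthetic stand-in features. Everything here is an assumption for illustration: the Gaussian mean shift is a hypothetical model of the shared low-level cues the authors report, standing in for actual CLIP, ResNet, or ViT activations.

```python
import math
import random

random.seed(0)
DIM = 16  # toy feature dimensionality; real backbones emit hundreds of dims

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_features(n, shift):
    # Synthetic stand-in features: generated images are modeled as sharing
    # a consistent low-level cue, i.e. a mean shift in feature space.
    return [[random.gauss(shift, 1.0) for _ in range(DIM)] for _ in range(n)]

X = make_features(150, 0.0) + make_features(150, 1.0)
y = [0.0] * 150 + [1.0] * 150  # 0 = real, 1 = generated

# Logistic-regression probe on the frozen features, full-batch gradient descent.
w, b, lr = [0.0] * DIM, 0.0, 0.5
for _ in range(300):
    gw, gb = [0.0] * DIM, 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j in range(DIM):
            gw[j] += err * xi[j]
        gb += err
    w = [wj - lr * gj / len(y) for wj, gj in zip(w, gw)]
    b -= lr * gb / len(y)

correct = sum(
    (sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1.0)
    for xi, yi in zip(X, y)
)
accuracy = correct / len(y)
```

With separable stand-in features like these the probe reaches high training accuracy; on real data the same probe would be fit on activations extracted from real and generated images.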

2024 Journal article

Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Authors: Barsellotti, Luca; Bigazzi, Roberto; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS

In recent years, research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. They also fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task, termed Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent through two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering both modular and end-to-end agents.

2024 Conference paper

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Authors: Bucciarelli, Davide; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs, like GPT-4V and Gemini, which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while preserving their generalization capabilities remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.
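The low-rank adaptation mentioned among the fine-tuning methods can be illustrated with a minimal, dependency-free sketch. The sizes, the scaling constant, and the single hand-set update below are hypothetical; real implementations attach such factors to a model's attention and MLP weight matrices and train them by backpropagation.

```python
import random

random.seed(0)

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d_out, d_in, r = 6, 4, 2   # toy sizes; real layers are thousands wide
alpha = 4.0                # LoRA scaling factor

# Frozen pretrained weight W: never updated during fine-tuning.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially computes exactly the same function as the pretrained one.
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]

def lora_forward(x):
    # W x + (alpha / r) * B (A x): only A and B would receive gradients.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [bi + (alpha / r) * di for bi, di in zip(base, delta)]

x = [1.0, -0.5, 0.3, 2.0]
assert lora_forward(x) == matvec(W, x)  # zero-initialized B: no drift yet

# A single (hypothetical) optimizer step touching only B...
B[0][0] = 0.25
adapted = lora_forward(x)
assert adapted != matvec(W, x)  # ...now shifts the layer's output
```

The appeal for the domain-adaptation setting the paper studies is the parameter count: only r × (d_in + d_out) values are trained instead of d_in × d_out, while the pretrained weights stay untouched.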

2024 Conference paper

PIK3R1 fusion drives chemoresistance in ovarian cancer by activating ERK1/2 and inducing rod and ring-like structures

Authors: Rausio, H.; Cervera, A.; Heuser, V. D.; West, G.; Oikkonen, J.; Pianfetti, E.; Lovino, M.; Ficarra, E.; Taimen, P.; Hynninen, J.; Lehtonen, R.; Hautaniemi, S.; Carpen, O.; Huhtinen, K.

Published in: NEOPLASIA

Gene fusions are common in high-grade serous ovarian cancer (HGSC). Such genetic lesions may promote tumorigenesis, but the pathogenic mechanisms are currently poorly understood. Here, we investigated the role of a PIK3R1-CCDC178 fusion identified in a patient with advanced HGSC. We show that the fusion induces HGSC cell migration by regulating ERK1/2 and increases resistance to platinum treatment. Platinum resistance was associated with rod and ring-like cellular structure formation. These structures contained, in addition to the fusion protein, CIN85, a key regulator of PI3K-AKT-mTOR signaling. Our data suggest that the fusion-driven structure formation induces a previously unrecognized cell survival and resistance mechanism, which depends on ERK1/2 activation.

2024 Journal article

Predicting engagement of older people’s virtual teams from video call analysis

Authors: Noceti, Nicoletta; Campisi, Simone; Chirico, Alice; Cuculo, Vittorio; Grossi, Giuliano; Michelotto, Monica; Odone, Francesca; Gaggioli, Andrea; Lanzarotti, Raffaella

Published in: INTERNATIONAL JOURNAL OF HUMAN-COMPUTER INTERACTION

2024 Journal article

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Authors: Moratelli, Nicholas; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics.

2024 Conference paper

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Authors: Poppi, Samuele; Poppi, Tobia; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns about their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can also be employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

2024 Conference paper

Saliency-driven Experience Replay for Continual Learning

Authors: Bellitto, Giovanni; Proietto Salanitri, Federica; Pennisi, Matteo; Boschini, Matteo; Bonicelli, Lorenzo; Porrello, Angelo; Calderara, Simone; Palazzo, Simone; Spampinato, Concetto

Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS

2024 Conference paper

SDFR: Synthetic Data for Face Recognition Competition

Authors: Shahreza, H. O.; Ecabert, C.; George, A.; Unnervik, A.; Marcel, S.; Di Domenico, N.; Borghi, G.; Maltoni, D.; Boutros, F.; Vogel, J.; Damer, N.; Sanchez-Perez, A.; Mas-Candela, E.; Calvo-Zaragoza, J.; Biesseck, B.; Vidal, P.; Granada, R.; Menotti, D.; Deandres-Tame, I.; La Cava, S. M.; Concas, S.; Melzi, P.; Tolosana, R.; Vera-Rodriguez, R.; Perelli, G.; Orru, G.; Marcialis, G. L.; Fierrez, J.

Large-scale face recognition datasets are collected by crawling the Internet without individuals' consent, raising legal, ethical, and privacy concerns. With the recent advances in generative models, several works have proposed generating synthetic face recognition datasets to mitigate the concerns raised by web-crawled data. This paper presents a summary of the Synthetic Data for Face Recognition (SDFR) Competition, held in conjunction with the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024) and established to investigate the use of synthetic data for training face recognition models. The SDFR competition was split into two tasks, allowing participants to train face recognition systems using new synthetic datasets and/or existing ones. In the first task, the face recognition backbone was fixed and the dataset size was limited, while the second task provided almost complete freedom over the model backbone, the dataset, and the training pipeline. The submitted models were trained on existing as well as new synthetic datasets and used clever methods to improve training with synthetic data. The submissions were evaluated and ranked on a diverse set of seven benchmarking datasets. The paper gives an overview of the submitted face recognition models and reports the achieved performance compared to baseline models trained on real and synthetic datasets. Furthermore, the evaluation of submissions is extended to bias assessment across different demographic groups. Lastly, an outlook on the current state of research in training face recognition models with synthetic data is presented, and existing problems as well as potential future directions are discussed.

2024 Conference paper

Self-Labeling the Job Shop Scheduling Problem

Authors: Corsini, Andrea; Porrello, Angelo; Calderara, Simone; Dell'Amico, Mauro

Published in: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS

This work proposes a self-supervised training strategy designed for combinatorial problems. An obstacle to applying supervised paradigms to such problems is the need for costly target solutions, often produced with exact solvers. Inspired by semi- and self-supervised learning, we show that generative models can be trained by sampling multiple solutions and using the best one, according to the problem objective, as a pseudo-label. In this way, we iteratively improve the model's generation capability by relying only on its self-supervision, eliminating the need for optimality information. We validate this Self-Labeling Improvement Method (SLIM) on the Job Shop Scheduling Problem (JSP), a complex combinatorial problem that is receiving much attention from the neural combinatorial optimization community. We propose a generative model based on the well-known Pointer Network and train it with SLIM. Experiments on popular benchmarks demonstrate the potential of this approach, as the resulting models outperform constructive heuristics and state-of-the-art learning proposals for the JSP. Lastly, we demonstrate the robustness of SLIM to various parameters and its generality by applying it to the Traveling Salesman Problem.
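The self-labeling loop described in the abstract can be sketched on a toy problem. Everything below is a deliberately simplified stand-in: a single-machine scheduling objective instead of the JSP, a score table instead of the Pointer Network, and a heuristic score nudge instead of a cross-entropy gradient step on the pseudo-label.

```python
import math
import random

random.seed(0)

# Toy stand-in problem: minimize the sum of job completion times on one
# machine (the optimum is the shortest-processing-time order).
times = [5.0, 1.0, 3.0, 2.0, 4.0]

def objective(order):
    t = total = 0.0
    for j in order:
        t += times[j]
        total += t
    return total

# "Model": one preference score per job; decoding samples jobs without
# replacement from a softmax over the remaining ones.
scores = [0.0] * len(times)

def sample_order():
    remaining = list(range(len(times)))
    order = []
    while remaining:
        weights = [math.exp(scores[j]) for j in remaining]
        pick = random.choices(remaining, weights=weights)[0]
        remaining.remove(pick)
        order.append(pick)
    return order

incumbent = float("inf")
for _ in range(200):
    # Self-labeling step: sample several solutions and keep the best one
    # (by the problem objective) as the pseudo-label.
    pseudo_label = min((sample_order() for _ in range(8)), key=objective)
    incumbent = min(incumbent, objective(pseudo_label))
    # "Training": nudge the model toward reproducing the pseudo-label,
    # a crude surrogate for a gradient step on the cross-entropy loss.
    n = len(times)
    for rank, j in enumerate(pseudo_label):
        scores[j] += 0.1 * (n - 1 - 2 * rank) / n
```

The loop needs no target solutions from an exact solver: the sampler supervises itself through its own best sample, which is the core idea the paper scales up with a Pointer Network on the JSP.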

2024 Conference paper
