Publications by Federico Cocchi

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Authors: Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.
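
The distillation step described in the abstract (fine-tuning a sentence embedding model on LLM-generated pairs) can be sketched with the sentence-transformers library. This is a minimal illustration, not the authors' pipeline: the pairs file, base checkpoint, and hyperparameters are placeholders, and the snippet assumes the synthetic anchor/positive pairs have already been produced by an LLM.

```python
# Minimal sketch: fine-tune a sentence embedding model on LLM-generated pairs
# with in-batch negatives. File name, checkpoint, and hyperparameters are
# placeholders, not the configuration used in the paper.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

train_examples = []
with open("synthetic_pairs.tsv", encoding="utf-8") as f:  # "anchor<TAB>positive" per line
    for line in f:
        anchor, positive = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[anchor, positive]))

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Every other positive in the batch serves as a negative for a given anchor.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="domain-retrieval-model",
)
```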

2026 Conference Proceedings Paper

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Authors: Cocchi, Federico; Moratelli, Nicholas; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.
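
As a rough illustration of the reflective-token idea, the sketch below adds special control tokens to a Hugging Face tokenizer and resizes the model's embedding matrix so they can be predicted like ordinary vocabulary items. The token names and backbone are assumptions made for illustration; the released ReflectiVA code at the linked repository is the authoritative reference.

```python
# Sketch: add "reflective" control tokens to an LLM vocabulary so the model
# can emit them to signal retrieval need and passage relevance.
# Token names and the backbone checkpoint are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

reflective_tokens = ["<RET>", "<NORET>", "<REL>", "<NOREL>"]
tokenizer.add_special_tokens({"additional_special_tokens": reflective_tokens})
model.resize_token_embeddings(len(tokenizer))

# During fine-tuning, target sequences would include these tokens, so that at
# inference time generating <RET> triggers a lookup in the external database
# and <REL>/<NOREL> marks whether a retrieved passage is kept as context.
```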

2025 Conference Proceedings Paper

Benchmarking BERT-based Models for Latin: A Case Study on Biblical References in Ancient Christian Literature

Authors: Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita

Published in: CEUR WORKSHOP PROCEEDINGS

Transformer-based language models like BERT have revolutionized Natural Language Processing (NLP) research, but their application to historical languages remains underexplored. This paper investigates the adaptation of BERT-based embedding models for Latin, a language central to the study of the sacred texts of Christianity. Focusing on Jerome’s Vulgate, pre-Vulgate Latin translations of the Bible, and patristic commentaries such as Augustine’s De Genesi ad litteram, we address the challenges posed by Latin’s complex syntax, specialized vocabulary, and historical variations at the orthographic, morphological, and semantic levels. In particular, we propose fine-tuning existing BERT-based embedding models on annotated Latin corpora, using self-generated hard negatives to improve performance in detecting biblical references in early Christian literature in Latin. Experimental results demonstrate the ability of BERT-based models to identify citations of and allusions to the Bible(s) in ancient Christian commentaries while highlighting the complexities and challenges of this field. By integrating NLP techniques with humanistic expertise, this work provides a case study on intertextual analysis in Latin patristic works. It underscores the transformative potential of interdisciplinary approaches, advancing computational tools for sacred text studies and bridging the gap between philology and computational analysis.
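
A simple way to picture the self-generated hard negatives mentioned above: embed the corpus with the current model and treat high-scoring sentences that are not annotated references as negatives for further fine-tuning. The snippet below is a toy sketch under that assumption; the encoder name, corpus, and annotations are placeholders rather than the paper's data.

```python
# Toy sketch of hard-negative mining: sentences the model ranks highly but
# that are not gold references become negatives. Encoder, corpus, queries,
# and gold annotations are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
corpus = ["In principio creavit Deus caelum et terram.", "Fiat lux."]
queries = ["In principio erat Verbum."]
gold = {0: {0}}  # query index -> set of annotated reference indices (toy)

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(queries, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)

hard_negatives = []
for qi, row in enumerate(scores):
    ranked = row.argsort(descending=True).tolist()
    # Highest-scoring corpus sentences that are NOT gold references.
    negs = [ci for ci in ranked if ci not in gold.get(qi, set())][:5]
    hard_negatives.append(negs)
```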

2025 Conference Proceedings Paper

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors: Cocchi, Federico; Moratelli, Nicholas; Caffagni, Davide; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

2025 Conference Proceedings Paper

Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

Authors: Cappelletti, Silvia; Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

The boundary between AI-generated images and real photographs is becoming increasingly narrow, thanks to the realism provided by contemporary generative models. Such technological progress necessitates the evolution of existing deepfake detection algorithms to counter new threats and protect the integrity of perceived reality. Although the prevailing approach among deepfake detection methodologies relies on large collections of generated and real data, the efficacy of these methods in adapting to scenarios characterized by data scarcity remains uncertain. This obstacle arises due to the introduction of novel generation algorithms and proprietary generative models that impose restrictions on access to large-scale datasets, thereby constraining the availability of generated images. In this paper, we first analyze how the performance of current deepfake methodologies, based on the CLIP embedding space, adapts in a few-shot setting over four state-of-the-art generators. Since the CLIP embedding space is not specifically tailored to the task, a fine-tuning stage is desirable, although the amount of data it requires is often unavailable in a data-scarcity scenario. To address this issue and limit possible overfitting, we introduce a novel approach through the Low-Rank Adaptation (LoRA) of the CLIP architecture, tailored for few-shot deepfake detection scenarios. Remarkably, the LoRA-modified CLIP, even when fine-tuned with merely 50 pairs of real and fake images, surpasses the performance of all evaluated deepfake detection models across the tested generators. Additionally, when LoRA CLIP is benchmarked against other models trained on 1,000 samples and evaluated on generative models not seen during training, it exhibits superior generalization capabilities.
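
For readers unfamiliar with Low-Rank Adaptation, the sketch below shows how LoRA adapters can be attached to a CLIP vision encoder with the peft library, together with a small binary head for real-versus-fake classification. The rank, target modules, and learning rate are illustrative choices, not the exact setup of the paper.

```python
# Sketch: LoRA adapters on a CLIP vision encoder plus a binary head for
# real-vs-fake classification. Hyperparameters are illustrative placeholders.
import torch
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
hidden_size = backbone.config.hidden_size

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in HF CLIP
    lora_dropout=0.1,
)
backbone = get_peft_model(backbone, lora_cfg)
head = torch.nn.Linear(hidden_size, 2)    # real vs. fake logits

def classify(pixel_values):
    # Pooled image representation from the adapted encoder, then the head.
    feats = backbone(pixel_values=pixel_values).pooler_output
    return head(feats)

# Only the LoRA matrices and the head are trainable, which keeps the parameter
# count small enough for a ~50-pair few-shot regime.
trainable = [p for p in backbone.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```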

2024 Conference Proceedings Paper

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Authors: Baraldi, Lorenzo; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Nicolosi, Alessandro; Cucchiara, Rita

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
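
To make the "global-local similarities" idea concrete, the snippet below sketches a contrastive objective that combines an InfoNCE term on global image embeddings with an analogous term on mean-pooled local patch embeddings. It is a schematic reading of the abstract with assumed tensor shapes and weighting, not the released CoDE training code.

```python
# Schematic global + local contrastive loss in PyTorch. Shapes, pooling, and
# the loss weighting are assumptions made for illustration only.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two aligned batches of embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def global_local_loss(global_a, global_b, local_a, local_b, w_local=0.5):
    """global_*: (N, D) image-level embeddings of two corresponding views;
    local_*: (N, P, D) patch-level embeddings, mean-pooled here for simplicity."""
    loss_global = info_nce(global_a, global_b)
    loss_local = info_nce(local_a.mean(dim=1), local_b.mean(dim=1))
    return loss_global + w_local * loss_local
```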

2024 Conference Proceedings Paper

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Authors: Poppi, Samuele; Poppi, Tobia; Cocchi, Federico; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
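
As an intuition for how "severing" unsafe concepts can be phrased as a fine-tuning objective, the sketch below pulls the embedding of an unsafe sentence toward that of its sanitized counterpart while keeping safe sentences close to their original, frozen-encoder embeddings. The loss terms and weighting are assumptions made for illustration; the released safe-clip code is the authoritative reference.

```python
# Schematic objective: redirect unsafe prompts toward their sanitized versions
# while preserving the embeddings of safe prompts. The safe/unsafe sentence
# pairs are assumed to come from an LLM, as described in the abstract.
import torch.nn.functional as F

def redirection_loss(ft_unsafe_emb, ref_safe_emb, ft_safe_emb, orig_safe_emb,
                     w_preserve=1.0):
    """ft_*: embeddings from the text encoder being fine-tuned;
    ref_safe_emb / orig_safe_emb: frozen-encoder embeddings of the sanitized
    and the already-safe sentences, respectively."""
    # Pull unsafe sentences toward their sanitized counterparts.
    redirect = 1 - F.cosine_similarity(ft_unsafe_emb, ref_safe_emb, dim=-1).mean()
    # Keep the safe region of the embedding space where it was.
    preserve = 1 - F.cosine_similarity(ft_safe_emb, orig_safe_emb, dim=-1).mean()
    return redirect + w_preserve * preserve
```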

2024 Conference Proceedings Paper

The Revolution of Multimodal Large Language Models: A Survey

Authors: Caffagni, Davide; Cocchi, Federico; Barsellotti, Luca; Moratelli, Nicholas; Sarto, Sara; Baraldi, Lorenzo; Baraldi, Lorenzo; Cornia, Marcella; Cucchiara, Rita

Published in: PROCEEDINGS OF THE CONFERENCE - ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.

2024 Conference Proceedings Paper

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Authors: Caffagni, Davide; Cocchi, Federico; Moratelli, Nicholas; Sarto, Sara; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS

Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Using this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of the generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
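
The hierarchical retrieval idea can be pictured as a two-stage lookup: first select the most relevant document from the knowledge base, then rank its passages and prepend the best ones to the prompt. Everything in the sketch below (the toy knowledge base, the encoder, the prompt format) is a placeholder, not the Wiki-LLaVA implementation.

```python
# Toy two-stage (hierarchical) retrieval: pick a document first, then rank its
# passages. Knowledge base, encoder, and prompt template are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
knowledge_base = {
    "Eiffel Tower": ["It was completed in 1889.", "It is about 330 metres tall."],
    "Colosseum": ["It is an amphitheatre in Rome.", "Construction began under Vespasian."],
}

def retrieve(question, top_passages=2):
    # Stage 1: pick the document whose title best matches the question.
    titles = list(knowledge_base.keys())
    q_emb = encoder.encode(question, convert_to_tensor=True)
    title_scores = util.cos_sim(q_emb, encoder.encode(titles, convert_to_tensor=True))[0]
    best_doc = titles[int(title_scores.argmax())]

    # Stage 2: rank passages inside the selected document.
    passages = knowledge_base[best_doc]
    passage_scores = util.cos_sim(q_emb, encoder.encode(passages, convert_to_tensor=True))[0]
    top = passage_scores.argsort(descending=True)[:top_passages].tolist()
    return best_doc, [passages[i] for i in top]

doc, context = retrieve("How tall is the Eiffel Tower?")
prompt = "Context: " + " ".join(context) + "\nQuestion: How tall is the Eiffel Tower?"
```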

2024 Conference Proceedings Paper