Publications by Marcella Cornia

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Authors: Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita

Published in: Lecture Notes in Computer Science

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.
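
As a rough sketch of the recipe described above (not the authors' released code), one can fine-tune an off-the-shelf sentence embedding model on LLM-generated pairs with in-batch negatives; the encoder name and the example pair below are placeholders.

```python
# Hedged sketch: fine-tune a sentence embedding model on synthetic pairs.
# Encoder choice and pair content are illustrative, not the paper's setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (anchor, related) pairs produced by prompting an LLM in the target domain,
# mixed with the limited amount of real annotated data.
synthetic_pairs = [
    ("in principio creavit Deus caelum et terram",
     "in principio erat Verbum, et Verbum erat apud Deum"),
    # ... many more generated pairs
]

model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder
examples = [InputExample(texts=[a, b]) for a, b in synthetic_pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)

# In-batch negatives: other positives in the batch serve as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```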

2026 Conference proceedings paper

GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

Authors: D'Oronzio, Fabio; Putamorsi, Federico; Zini, Leonardo; Cornia, Marcella; Baraldi, Lorenzo

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using L2 loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism.
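
The texture-level stage is described as enforcing feature-correlation consistency through a Gram matrix loss on DINOv3 features; the snippet below is a minimal, generic PyTorch sketch of such a loss, with feature shapes and normalization chosen for illustration rather than taken from the paper.

```python
# Hedged sketch of a Gram-matrix (feature-correlation) loss; the feature
# shapes and the normalization are illustrative assumptions.
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, N, C) patch tokens from a frozen encoder such as DINOv3."""
    _, n, _ = feats.shape
    gram = torch.bmm(feats.transpose(1, 2), feats)  # (B, C, C) channel correlations
    return gram / n                                  # normalize by token count

def gram_loss(sr_feats: torch.Tensor, hr_feats: torch.Tensor) -> torch.Tensor:
    """Match second-order feature statistics of the SR output to the HR target."""
    return F.mse_loss(gram_matrix(sr_feats), gram_matrix(hr_feats))
```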

2026 Conference proceedings paper

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling

Authors: Cappelletti, Silvia; Poppi, Tobia; Poppi, Samuele; Yong, Zheng-Xin; Garcia-Olano, Diego; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Large Language Models (LLMs) are traditionally evaluated on multiple-choice question answering (MCQA) tasks using First-Token Probability (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (misalignment) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (misinterpretation), undermining the reliability of symbolic evaluation. We propose a simple solution: output prefilling, a structured natural-language prefix (e.g., 'The correct option is:') prepended to the model output. Originally explored in AI safety as an attack strategy, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Through extensive evaluation, we find that the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our analysis suggests that prefilling is a simple, robust, and zero-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
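
A minimal sketch of first-token scoring with an output prefill is shown below; the model, prompt template, and option tokenization are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
# Hedged sketch: First-Token Probability with an output prefill.
# Model name and prompt format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"   # any causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

question = ("Which planet is known as the Red Planet?\n"
            "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n")
prefill = "The correct option is:"            # structured prefix in the model's output turn
prompt = question + "\nAnswer: " + prefill

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]    # next-token distribution after the prefill

# Compare the option letters as candidate first tokens.
option_ids = [tok.encode(" " + c, add_special_tokens=False)[0] for c in "ABCD"]
probs = torch.softmax(logits[option_ids], dim=-1)
prediction = "ABCD"[int(probs.argmax())]
```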

2026 Conference proceedings paper

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Authors: Lobba, Davide; Sanguigni, Fulvio; Ren, Bin; Cornia, Marcella; Cucchiara, Rita; Sebe, Nicu

Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category framework for VTOFF. Our architecture is built on a dual DiT-based backbone equipped with a multimodal attention mechanism that jointly exploits image, text, and mask information to resolve visual ambiguities and enable robust feature learning across garment categories. To explicitly mitigate detail degradation, we further design an alignment module that refines garment structures and textures, ensuring high-quality outputs. Extensive experiments on VITON-HD and Dress Code show that TEMU-VTOFF achieves new state-of-the-art performance, substantially improving both visual realism and consistency with target garments. Code and models are available at: https://temu-vtoff-page.github.io/.
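
As a toy illustration (not the actual dual-DiT TEMU-VTOFF block), jointly attending over image, text, and mask tokens can be sketched by concatenating the three streams and running a standard attention layer over them; the dimensions below are arbitrary.

```python
# Hedged sketch of joint attention over image, text, and mask tokens.
# This is a generic stand-in, not the TEMU-VTOFF architecture.
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens, mask_tokens):
        # Concatenate all modalities so garment regions can be disambiguated
        # using both the textual description and the garment mask.
        seq = torch.cat([img_tokens, txt_tokens, mask_tokens], dim=1)
        out, _ = self.attn(seq, seq, seq)
        return out[:, : img_tokens.size(1)]   # return the updated image tokens

block = MultimodalAttention()
img, txt, msk = torch.randn(1, 256, 768), torch.randn(1, 77, 768), torch.randn(1, 64, 768)
updated_img_tokens = block(img, txt, msk)
```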

2026 Conference proceedings paper

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: ACM Transactions on Multimedia Computing, Communications and Applications

Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.

2026 Journal article

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Authors: Mattioli, Gabriele; Turri, Evelyn; Sarto, Sara; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources — such as APIs, computational utilities, and specialized models — to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
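
The retrieval-based formulation can be sketched as embedding the structured task description and the machine-readable tool descriptions in a shared space and selecting the nearest tool; the encoder and tool cards below are illustrative, and in the paper the task description is produced by an MLLM from the multimodal query.

```python
# Hedged sketch of retrieval-based tool selection; encoder and tool cards
# are illustrative, not the RaTA-Tool models or dataset.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Machine-readable tool descriptions (e.g. derived from model cards).
tools = {
    "image-captioning": "Generates a natural-language caption for an input image.",
    "object-detection": "Localizes and labels the objects visible in an image.",
    "speech-to-text": "Transcribes spoken audio into written text.",
}

# Structured task description derived from the multimodal user query.
task = "Describe what is happening in this photo in one sentence."

names = list(tools)
tool_embs = encoder.encode(list(tools.values()), convert_to_tensor=True)
task_emb = encoder.encode(task, convert_to_tensor=True)

scores = util.cos_sim(task_emb, tool_embs)[0]   # similarity to every tool description
selected = names[int(scores.argmax())]          # new tools extend the index, no retraining
```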

2026 Conference proceedings paper

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Compagnoni, Alberto; Morini, Marco; Sarto, Sara; Cocchi, Federico; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
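
The critic-filtering step can be sketched as re-scoring retrieved passages and keeping only those above a quality threshold; the retriever and critic callables and the threshold below are hypothetical placeholders, not components released with the paper.

```python
# Hedged sketch of critic-filtered retrieval augmentation; `retrieve`,
# `critic_score`, and the threshold are hypothetical placeholders.
from typing import Callable, List

def build_context(question: str,
                  retrieve: Callable[[str, int], List[str]],
                  critic_score: Callable[[str, str], float],
                  k: int = 20,
                  threshold: float = 0.5) -> str:
    candidates = retrieve(question, k)                            # coarse-grained retrieval
    scored = [(critic_score(question, p), p) for p in candidates]
    kept = [p for s, p in sorted(scored, reverse=True) if s >= threshold]
    return "\n\n".join(kept)                                      # context for answer generation
```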

2026 Conference proceedings paper

Sketch2Stitch: GANs for Abstract Sketch-Based Dress Synthesis

Authors: Farooq Khan, Faizan; Mohamed Bakr, Eslam; Morelli, Davide; Cornia, Marcella; Cucchiara, Rita; Elhoseiny, Mohamed

In the realm of creative expression, not everyone possesses the gift of effortlessly translating their imaginative visions into flawless sketches. More often than not, the outcome resembles an abstract, perhaps even slightly distorted representation. The art of producing impeccable sketches is not only challenging but also a time-consuming process. Our work is the first of its kind in transforming abstract, sometimes deformed garment sketches into photorealistic catalog images, empowering the everyday individual to become their own fashion designer. We create Sketch2Stitch, a dataset featuring over 65,000 abstract sketch images generated from garments of DressCode and VITONHD, two benchmark datasets in the virtual try-on task. Sketch2Stitch is the first dataset in the literature to provide abstract sketches in the fashion domain. We propose a StyleGAN-based generative framework that bridges freehand sketching with photorealistic garment synthesis. We demonstrate that our framework allows users to sketch rough outlines and optionally provide color hints, producing realistic designs in seconds. Experimental results demonstrate, both quantitatively and qualitatively, that the proposed framework achieves superior performance against various baselines and existing methods on both subsets of our dataset. Our work highlights a pathway toward AI-assisted fashion design tools, democratizing garment ideation for students, independent designers, and casual creators.

2026 Conference proceedings paper

Tiny Inference-Time Scaling with Latent Verifiers

Authors: Bucciarelli, Davide; Turri, Evelyn; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
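
A verifier over generator hidden states can be sketched as a small scoring head on pooled DiT features, used for best-of-N selection without decoding to pixels; the head design and feature dimensions below are illustrative, not the paper's architecture.

```python
# Hedged sketch of a verifier on hidden states; head design and feature
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LatentVerifier(nn.Module):
    def __init__(self, feat_dim: int = 1536):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 512), nn.GELU(), nn.Linear(512, 1))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (num_candidates, num_tokens, feat_dim) intermediate DiT features
        pooled = hidden.mean(dim=1)             # average-pool tokens per candidate
        return self.head(pooled).squeeze(-1)    # one scalar score per candidate

# Best-of-N selection under a tiny inference budget (e.g. 4 candidates).
verifier = LatentVerifier()
hidden_states = torch.randn(4, 256, 1536)       # stand-in for single-step DiT features
best = int(verifier(hidden_states).argmax())    # index of the candidate to keep
```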

2026 Conference proceedings paper

Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: Image and Vision Computing

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.
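
The augmentation step can be sketched as interpolating a real image with its Stable Diffusion counterpart via Mixup before it is fed to the captioning model; the Beta parameter and tensor shapes below are illustrative, not the paper's exact configuration.

```python
# Hedged sketch of Mixup between real and synthetic images; alpha and
# shapes are illustrative assumptions.
import torch

def mixup_real_synthetic(real: torch.Tensor,
                         synthetic: torch.Tensor,
                         alpha: float = 0.2) -> torch.Tensor:
    """real, synthetic: (B, 3, H, W) batches depicting the same captions."""
    lam = torch.distributions.Beta(alpha, alpha).sample()    # mixing coefficient
    return lam * real + (1.0 - lam) * synthetic              # pixel-space interpolation

real = torch.rand(8, 3, 224, 224)
synthetic = torch.rand(8, 3, 224, 224)
mixed = mixup_real_synthetic(real, synthetic)   # extra training examples for captioning
```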

2025 Journal article

Page 1 of 12 • Total publications: 113