Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Tip: type @ to pick an author and # to pick a keyword.

Active filters (Clear): Author: Rita Cucchiara

Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Authors: Caffagni, Davide.; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella.; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual … (Read full abstract)

Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.

2026 Relazione in Atti di Convegno

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Authors: Lobba, Davide; Sanguigni, Fulvio; Ren, Bin; Cornia, Marcella; Cucchiara, Rita; Sebe, Nicu

Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), … (Read full abstract)

Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category framework for VTOFF. Our architecture is built on a dual DiT-based backbone equipped with a multimodal attention mechanism that jointly exploits image, text, and mask information to resolve visual ambiguities and enable robust feature learning across garment categories. To explicitly mitigate detail degradation, we further design an alignment module that refines garment structures and textures, ensuring high-quality outputs. Extensive experiments on VITON-HD and Dress Code show that TEMU-VTOFF achieves new state-of-the-art performance, substantially improving both visual realism and consistency with target garments. Code and models are available at: https://temu-vtoff-page.github.io/.

2026 Relazione in Atti di Convegno

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Authors: Baldrati, Alberto; Morelli, Davide; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

Published in: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS

Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations … (Read full abstract)

Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.

2026 Articolo su rivista

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Compagnoni, Alberto; Morini, Marco; Sarto, Sara; Cocchi, Federico; Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual … (Read full abstract)

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.

2026 Relazione in Atti di Convegno

Sketch2Stitch: GANs for Abstract Sketch-Based Dress Synthesis

Authors: Farooq Khan, Faizan; Mohamed Bakr, Eslam; Morelli, Davide; Cornia, Marcella; Cucchiara, Rita; Elhoseiny, Mohamed

In the realm of creative expression, not everyone possesses the gift of effortlessly translating their imaginative visions into flawless sketches. … (Read full abstract)

In the realm of creative expression, not everyone possesses the gift of effortlessly translating their imaginative visions into flawless sketches. More often than not, the outcome resembles an abstract, perhaps even slightly distorted representation. The art of producing impeccable sketches is not only challenging but also a time-consuming process. Our work is the first of this kind in transforming abstract, sometimes deformed garment sketches into photorealistic catalog images, to empower the everyday individual to become their own fashion designer. We create Sketch2Stitch, a dataset featuring over 65,000 abstract sketch images generated from garments of DressCode and VITONHD, two benchmark datasets in the virtual try-on task. Sketch2Stitch is the first dataset in the literature to provide abstract sketches in the fashion domain. We propose a StyleGAN-based generative framework that bridges freehand sketching with photorealistic garment synthesis. We demonstrate that our framework allows users to sketch rough outlines and optionally provide color hints, producing realistic designs in seconds. Experimental results demonstrate, both quantitatively and qualitatively, that the proposed framework achieves superior performance against various baselines and existing methods on both subsets of our dataset. Our work highlights a pathway toward AI-assisted fashion design tools, democratizing garment ideation for students, independent designers, and casual creators.

2026 Relazione in Atti di Convegno

The aporetic dialogs of Modena on gender differences: Is it all about testosterone? Episode III: Mathematics

Authors: Brigante, G.; Costantino, F.; Bellelli, A.; Boni, S.; Furini, C.; Cucchiara, R.; Simoni, M.

Published in: ANDROLOGY

This report is the transcript of what was discussed in a convention at the Endocrinology Unit in Modena, Italy, in … (Read full abstract)

This report is the transcript of what was discussed in a convention at the Endocrinology Unit in Modena, Italy, in the form of the aporetic dialogs of ancient Greece. It is the third episode of a series of four discussions on the differences between males and females, with a multidisciplinary approach. In this work, the role of testosterone in gender differences in the aptitude for mathematics is explored. First, the definitions of mathematical abilities were provided together with any gender difference in the distribution of females and males in science, technology, engineering, and mathematics subjects. A clear predominance of males is evident at most science, technology, engineering, and mathematics education levels, especially in advanced academic careers. Then, the discussants were divided into two groups: group 1, which illustrated the thesis that testosterone promotes the development of logical‒mathematical skills, and group 2, which, in contrast, asserted the inconsistency of a direct role of testosterone in improving cognitive abilities and that socio-cultural factors should be considered on the basis of this gender gap. In the end, an expert referee (a female engineer) tried to resolve the aporia: are the two theories equivalent or is one superior?.

2026 Articolo su rivista

A Second-Order Perspective on Model Compositionality and Incremental Learning

Authors: Porrello, Angelo; Bonicelli, Lorenzo; Buzzega, Pietro; Millunzi, Monica; Calderara, Simone; Cucchiara, Rita

2025 Relazione in Atti di Convegno

AIGeN-Llama: An Adversarial Approach for Instruction Generation in VLN using Llama2 Model

Authors: Rawal, Niyati; Baraldi, Lorenzo; Cucchiara, Rita

Published in: CEUR WORKSHOP PROCEEDINGS

2025 Relazione in Atti di Convegno

Alfie: Democratising RGBA image generation with no $$$

Authors: Quattrini, Fabio; Pippi, Vittorio; Cascianelli, Silvia; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Relazione in Atti di Convegno

Augmenting and Mixing Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: IMAGE AND VISION COMPUTING

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective … (Read full abstract)

Image captioning has attracted significant attention within the Computer Vision and Multimedia research domains, resulting in the development of effective methods for generating natural language descriptions of images. Concurrently, the rise of generative models has facilitated the production of highly realistic and high-quality images, particularly through recent advancements in latent diffusion models. In this paper, we propose to leverage the recent advances in Generative AI and create additional training data that can be effectively used to boost the performance of an image captioning model. Specifically, we combine real images with their synthetic counterparts generated by Stable Diffusion using a Mixup data augmentation technique to create novel training examples. Extensive experiments on the COCO dataset demonstrate the effectiveness of our solution in comparison to different baselines and state-of-the-art methods and validate the benefits of using synthetic data to augment the training stage of an image captioning model and improve the quality of the generated captions. Source code and trained models are publicly available at: https://github.com/aimagelab/synthcap_pp.

2025 Articolo su rivista
2 3 »

Page 1 of 51 • Total publications: 509