Publications by Rita Cucchiara

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Evaluating synthetic pre-training for handwriting processing tasks

Authors: Pippi, V.; Cascianelli, S.; Baraldi, L.; Cucchiara, R.

Published in: PATTERN RECOGNITION LETTERS

In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objective. The vector representations of the images obtained from the pre-trained ConvNet can then be considered as encodings of the handwriting style. We exploit such representations for Writer Retrieval, Writer Identification, Writer Verification, and Writer Classification and demonstrate that our pre-training strategy allows extracting rich representations of the writers' style that enable the aforementioned tasks with competitive results with respect to task-specific State-of-the-Art approaches.
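As a rough illustration of how such style encodings can support the Writer Retrieval task described in the abstract, the sketch below ranks gallery samples by cosine similarity to a query encoding. The function name, argument shapes, and the choice of cosine similarity are illustrative assumptions, not the authors' code.

```python
import numpy as np

def retrieve_writers(query_feats, gallery_feats, gallery_writer_ids, k=5):
    """Rank gallery writers by style similarity (minimal sketch).

    `query_feats` (n_queries x d) and `gallery_feats` (n_gallery x d) stand
    in for the vectors produced by a pre-trained ConvNet; images by the same
    writer are expected to lie close in this space.
    """
    # L2-normalize so the dot product equals cosine similarity
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                                # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]      # highest similarity first
    return [[gallery_writer_ids[j] for j in row] for row in top_k]
```

The same encodings can serve Writer Identification or Verification by thresholding or majority-voting over the retrieved neighbours.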

2023 Journal article

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Authors: Moratelli, Nicholas; Barraco, Manuele; Morelli, Davide; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: SENSORS

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of an external textual memory that can be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
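The two ingredients named in the abstract, kNN retrieval from an external textual memory and a gate that tunes how much retrieved information flows into the decoder, can be sketched as follows. All names, shapes, and the scalar gate are illustrative assumptions; in the paper the gate is a learned, fully attentive module and retrieval operates over learned embeddings.

```python
import numpy as np

def knn_memory_lookup(query, memory_keys, memory_texts, k=3):
    """Retrieve the k entries of an external textual memory whose key
    embeddings are nearest (Euclidean) to the query embedding."""
    d = np.linalg.norm(memory_keys - query, axis=1)   # distance to each key
    idx = np.argsort(d)[:k]                           # k nearest entries
    return [memory_texts[i] for i in idx]

def attentive_gate(hidden, memory_summary, gate_logit):
    """Blend decoder states with memory-derived features. A sigmoid of
    the (learned, here scalar) logit weights the memory contribution."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))             # sigmoid gate in (0, 1)
    return g * memory_summary + (1.0 - g) * hidden
```

With `gate_logit` near large negative values the model falls back to purely visual captioning; near large positive values it leans on the retrieved text.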

2023 Journal article

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

Published in: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

2023 Journal article

Fully-Attentive Iterative Networks for Region-based Controllable Image and Video Captioning

Authors: Cornia, Marcella; Baraldi, Lorenzo; Tal, Ayellet; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

Controllable image captioning has recently gained attention as a way to increase the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which needs to be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.

2023 Journal article

Handwritten Text Generation from Visual Archetypes

Authors: Pippi, V.; Cascianelli, S.; Cucchiara, R.

Published in: PROCEEDINGS IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION

Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more so when the latter contain characters that are rarely encountered during training. While emulating a writer's style has been recently addressed by generative models, the generalization towards rare characters has been disregarded. In this work, we devise a Transformer-based model for Few-Shot styled handwritten text generation and focus on obtaining a robust and informative representation of both the text and the style. In particular, we propose a novel representation of the textual content as a sequence of dense vectors obtained from images of symbols written as standard GNU Unifont glyphs, which can be considered their visual archetypes. This strategy is more suitable for generating characters that, despite having been seen rarely during training, possibly share visual details with the frequently observed ones. As for the style, we obtain a robust representation of unseen writers' calligraphy by exploiting specific pre-training on a large synthetic dataset. Quantitative and qualitative results demonstrate the effectiveness of our proposal in generating words in unseen styles and with rare characters more faithfully than existing approaches relying on independent one-hot encodings of the characters.
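The core idea of replacing one-hot character encodings with dense glyph vectors can be illustrated in a few lines. Here `glyphs` is a toy dictionary mapping each character to a 16x16 bitmap; loading real GNU Unifont glyphs (which are 16x16 or 8x16 bitmaps) is out of scope, so this is a sketch of the representation, not the authors' pipeline.

```python
import numpy as np

def text_to_archetypes(text, glyphs):
    """Encode a string as a sequence of dense glyph vectors.

    Each character is looked up as a 16x16 binary bitmap and flattened
    into a 256-dimensional vector. Unlike independent one-hot codes,
    visually similar characters yield similar vectors, which is what
    lets rare characters borrow detail from frequent ones.
    """
    return np.stack([glyphs[c].astype(np.float32).ravel() for c in text])
```

A downstream Transformer can then consume this (sequence length x 256) matrix in place of a one-hot embedding table.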

2023 Conference proceedings paper

How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning

Authors: Pippi, V.; Cascianelli, S.; Kermorvant, C.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is very relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, unfeasible. To overcome this challenge, a possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, also considering the number of fine-tuning lines, we give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as little as five real fine-tuning lines.

2023 Conference proceedings paper

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

Authors: Pippi, V.; Quattrini, F.; Cascianelli, S.; Cucchiara, R.

2023 Conference proceedings paper

Input Perturbation Reduces Exposure Bias in Diffusion Models

Authors: Ning, M.; Sangineto, E.; Porrello, A.; Calderara, S.; Cucchiara, R.

Published in: PROCEEDINGS OF MACHINE LEARNING RESEARCH

Denoising Diffusion Probabilistic Models have shown impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting of perturbing the ground truth samples to simulate inference-time prediction errors. We empirically show that, without affecting recall and precision, the proposed input perturbation leads to a significant improvement in sample quality while reducing both the training and inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time. The code is available at https://github.com/forever208/DDPM-IP.
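The perturbation described above is a one-line change to the standard DDPM forward noising: the input noise is perturbed, while the regression target stays the clean noise. Below is a minimal NumPy sketch under that reading of the abstract; the function name and argument layout are illustrative (the official implementation is at the repository linked above), and the gamma value is a small constant (the authors report using values around 0.1).

```python
import numpy as np

def ddpm_ip_noised_input(x0, alpha_bar_t, gamma=0.1, rng=None):
    """Build a training input x_t with input perturbation (DDPM-IP sketch).

    Standard DDPM: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps, regress eps.
    DDPM-IP perturbs only the input noise (eps + gamma*xi) to simulate
    inference-time prediction errors; the target remains the clean eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)        # regression target
    xi = rng.standard_normal(x0.shape)         # extra perturbation, input only
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * (eps + gamma * xi)
    return x_t, eps                            # train the network to predict eps
```

Setting `gamma=0` recovers the ordinary DDPM training input, which makes the regularization easy to ablate.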

2023 Conference proceedings paper

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Authors: Morelli, Davide; Baldrati, Alberto; Cartella, Giuseppe; Cornia, Marcella; Bertini, Marco; Cucchiara, Rita

The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists of generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process while preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
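The textual inversion component can be pictured as a learned map from garment visual features to a handful of vectors living in the CLIP token embedding space. The sketch below stands in for that map with a plain projection matrix; all names, shapes, and the linear form are assumptions for illustration, not the LaDI-VTON implementation (in the model the mapping is a trained network).

```python
import numpy as np

def garment_pseudo_tokens(visual_feats, proj, n_tokens=4):
    """Map garment visual features to pseudo-word token embeddings.

    `visual_feats` is a (d,) garment feature vector and `proj` a
    (n_tokens * token_dim, d) matrix standing in for the learned mapper.
    Each output row is one pseudo-token that conditions the text encoder
    of the diffusion model, carrying the garment's texture and details.
    """
    out = proj @ visual_feats            # project into the token space
    return out.reshape(n_tokens, -1)     # one pseudo-token per row
```

The resulting rows are concatenated with ordinary prompt token embeddings, so the garment conditions generation through the same textual pathway as words do.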

2023 Conference proceedings paper

Let's stay close: An examination of the effects of imagined contact on behavior toward children with disability

Authors: Cocco, V. M.; Bisagno, E.; Bernardo, G. A. D.; Bicocchi, N.; Calderara, S.; Palazzi, A.; Cucchiara, R.; Zambonelli, F.; Cadamuro, A.; Stathi, S.; Crisp, R.; Vezzali, L.

Published in: SOCIAL DEVELOPMENT

In line with current developments in indirect intergroup contact literature, we conducted a field study using the imagined contact paradigm among high-status (Italian children) and low-status (children with foreign origins) group members (N = 122; 53 females, mean age = 7.52 years). The experiment aimed to improve attitudes and behavior toward a different low-status group, children with disability. To assess behavior, we focused on an objective measure that captures the physical distance between participants and a child with disability over the course of a five-minute interaction (i.e., while playing together). Results from a 3-week intervention revealed that in the case of high-status children imagined contact, relative to a no-intervention control condition, improved outgroup attitudes and behavior, and strengthened helping and contact intentions. These effects however did not emerge among low-status children. The results are discussed in the context of intergroup contact literature, with emphasis on the implications of imagined contact for educational settings.

2023 Journal article

Page 10 of 52 • Total publications: 517