Publications

Explore our research publications: papers, articles, and conference proceedings from AImageLab.

Predicting gene and protein expression levels from DNA and protein sequences with Perceiver

Authors: Stefanini, Matteo; Lovino, Marta; Cucchiara, Rita; Ficarra, Elisa

Published in: COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE

Background and objective: The functions of an organism and its biological processes result from the expression of genes and proteins. Therefore, quantifying and predicting mRNA and protein levels is a crucial aspect of scientific research. Concerning the prediction of mRNA levels, the available approaches use the sequence upstream and downstream of the Transcription Start Site (TSS) as input to neural networks. State-of-the-art models (e.g., Xpresso and Basenji) predict mRNA levels using Convolutional (CNN) or Long Short-Term Memory (LSTM) networks. However, CNN predictions depend on the convolutional kernel size, and LSTMs struggle to capture long-range dependencies in the sequence. Concerning the prediction of protein levels, to the best of our knowledge, there is no model that predicts protein levels from gene or protein sequences.

Methods: Here, we apply a new model type (called Perceiver) to mRNA and protein level prediction, exploiting a Transformer-based architecture with an attention module that attends to long-range interactions in the sequences. In addition, the Perceiver model overcomes the quadratic complexity of standard Transformer architectures. This work's contributions are: (1) the DNAPerceiver model, which predicts mRNA levels from the sequence upstream and downstream of the TSS; (2) the ProteinPerceiver model, which predicts protein levels from the protein sequence; (3) the Protein&DNAPerceiver model, which predicts protein levels from the TSS and protein sequences.

Results: The models are evaluated on cell lines, mice, glioblastoma, and lung cancer tissues. The results show the effectiveness of Perceiver-type models in predicting mRNA and protein levels.

Conclusions: This paper presents a Perceiver architecture for mRNA and protein level prediction. In the future, incorporating regulatory and epigenetic information into the model could improve mRNA and protein level predictions. The source code is freely available at https://github.com/MatteoStefanini/DNAPerceiver.
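A back-of-the-envelope sketch of why the Perceiver sidesteps the quadratic attention cost mentioned above: a small, fixed set of latent vectors cross-attends to the long input sequence, so the score matrix grows linearly rather than quadratically with sequence length. The sequence length and latent count below are hypothetical, chosen only to illustrate the scaling.

```python
# Illustrative cost comparison: standard Transformer self-attention
# (quadratic in sequence length) vs Perceiver-style cross-attention
# (linear in sequence length via a small fixed latent array).

def self_attention_ops(seq_len: int) -> int:
    """Pairwise attention scores: every token attends to every token."""
    return seq_len * seq_len

def perceiver_cross_attention_ops(seq_len: int, n_latents: int) -> int:
    """Each latent attends to every input token: n_latents x seq_len scores."""
    return n_latents * seq_len

# A hypothetical promoter window of 10,500 bp around the TSS,
# compressed into 256 latents.
seq_len, n_latents = 10_500, 256
print(self_attention_ops(seq_len))                         # 110_250_000
print(perceiver_cross_attention_ops(seq_len, n_latents))   # 2_688_000
```

With these (invented) numbers, cross-attention needs roughly 40x fewer score computations, which is the property that makes very long DNA windows tractable.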

2023 Journal article

Revelio: A Modular and Effective Framework for Reproducible Training and Evaluation of Morphing Attack Detectors

Authors: Borghi, Guido; Di Domenico, Nicolò; Franco, Annalisa; Ferrara, Matteo; Maltoni, Davide

Published in: IEEE ACCESS

A morphing attack, i.e., the evasion of a face verification system through a facial morphing operation between a criminal and an accomplice, has recently emerged as a serious security threat. Despite the importance of this kind of attack, the development and comparison of Morphing Attack Detection (MAD) methods is still a challenging task, especially with deep learning approaches. Specifically, the lack of public datasets, the absence of common training and validation protocols, and the limited release of public source code hamper the reproducibility and objective comparison of new MAD systems. These limitations are mainly due to privacy concerns, which restrict data transfer and storage, and to the recent introduction of the MAD task. Therefore, in this paper, we propose and publicly release Revelio, a modular framework for the reproducible development and evaluation of MAD systems. We include an overview of the modules and describe the plugin system, which provides the possibility of extending native components with new functionalities. An extensive cross-dataset experimental evaluation is conducted to validate the framework and the performance of trained models on several publicly released datasets, and to analyze in depth the main challenges of the MAD task based on single input images. We also propose a new metric, namely WAED, that summarizes in a single value the error-based metrics commonly used in the MAD task, computed over different datasets, thus facilitating the comparative evaluation of different approaches. Finally, by exploiting Revelio, a new state-of-the-art MAD model (on the SOTAMD single-image benchmark) is proposed and released.
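The abstract states only that WAED collapses per-dataset error metrics into a single value; the exact definition is in the paper. A plausible minimal sketch of that kind of aggregation is a weighted average of per-dataset error rates, with weights reflecting, e.g., dataset size. All names, rates, and weights below are invented for illustration and are not the actual WAED formula.

```python
# Hypothetical sketch of a WAED-style aggregation: combine error-based
# metrics computed over different datasets into one scalar via weights.

def weighted_average_error(errors: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Weighted average of per-dataset error rates."""
    total_w = sum(weights[d] for d in errors)
    return sum(errors[d] * weights[d] for d in errors) / total_w

# Invented per-dataset error rates, weighted by (invented) dataset sizes.
errors  = {"dataset_A": 0.08, "dataset_B": 0.15, "dataset_C": 0.05}
weights = {"dataset_A": 1000, "dataset_B": 500, "dataset_C": 2000}
print(round(weighted_average_error(errors, weights), 4))  # 0.0729
```

A single summary value like this makes cross-dataset comparison of MAD systems a one-number ranking rather than a table-by-table inspection, which is the motivation the abstract gives.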

2023 Journal article

Scoring Enzootic Pneumonia-like Lesions in Slaughtered Pigs: Traditional vs. Artificial-Intelligence-Based Methods

Authors: Hattab, Jasmine; Porrello, Angelo; Romano, Anastasia; Rosamilia, Alfonso; Ghidini, Sergio; Bernabò, Nicola; Capobianco Dondona, Andrea; Corradi, Attilio; Marruchella, Giuseppe

Published in: PATHOGENS

Artificial-intelligence-based methods are regularly used in the biomedical sciences, mainly in the field of diagnostic imaging. Recently, convolutional neural networks have been trained to score pleurisy and pneumonia in slaughtered pigs. The aim of this study is to further evaluate the performance of a convolutional neural network compared with the gold standard (i.e., scores provided by a skilled operator along the slaughter chain through visual inspection and palpation). In total, 441 lungs (180 healthy and 261 diseased) were included in this study. Each lung was scored according to traditional methods, which represent the gold standard (Madec’s and Christensen’s grids). The same lungs were then photographed and scored by a trained convolutional neural network. Overall, the results show that the convolutional neural network is very specific (95.55%) and quite sensitive (85.05%), with a rather high correlation with the scores provided by a skilled veterinarian (Spearman’s coefficient = 0.831, p < 0.01). In summary, this study suggests that convolutional neural networks could be effectively used at slaughterhouses and stimulates further investigation in this field of research.
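The study's headline agreement figure is Spearman's rank correlation (0.831) between network scores and the veterinarian's scores. For readers unfamiliar with the statistic, a minimal implementation follows; tie handling is omitted for brevity, and the example scores are invented, not the study's data.

```python
# Minimal Spearman rank correlation (no-ties case), the statistic used to
# compare CNN lesion scores against the gold-standard operator scores.

def rank(values):
    """1-based ranks, assuming no ties among the values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman's rho via the classic formula 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly monotone score pairs give rho = 1.0.
print(spearman([1, 4, 7, 9], [10, 20, 30, 40]))  # 1.0
```

A rho of 0.831, as reported, indicates a strong but imperfect monotone agreement: the network tends to rank lesion severity the same way the veterinarian does.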

2023 Journal article

Sharing Cultural Heritage—The Case of the Lodovico Media Library

Authors: Al Kalak, Matteo; Baraldi, Lorenzo

Published in: MULTIMODAL TECHNOLOGIES AND INTERACTION

2023 Journal article

Spotting Virus from Satellites: Modeling the Circulation of West Nile Virus Through Graph Neural Networks

Authors: Bonicelli, Lorenzo; Porrello, Angelo; Vincenzi, Stefano; Ippoliti, Carla; Iapaolo, Federica; Conte, Annamaria; Calderara, Simone

Published in: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

2023 Journal article

StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model

Authors: Xu, Z.; Sangineto, E.; Sebe, N.

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

Despite the progress made on the style transfer task, most previous works focus on transferring only relatively simple features like color or texture, while missing more abstract concepts such as overall art expression or painter-specific traits. However, these abstract semantics can be captured by models like DALL-E or CLIP, which have been trained on huge datasets of images and textual documents. In this paper, we propose StylerDALLE, a style transfer method that exploits both of these models and uses natural language to describe abstract art styles. Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation, i.e., from an input content image to an output stylized image, in the discrete latent space of a large-scale pretrained vector-quantized tokenizer, e.g., the discrete variational auto-encoder (dVAE) of DALL-E. To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision that ensures stylization and content preservation simultaneously. Experimental results demonstrate the superiority of our method, which can effectively transfer art styles using language instructions at different granularities. Code is available at https://github.com/zipengxuc/StylerDALLE.

2023 Conference paper

Superpixel Positional Encoding to Improve ViT-based Semantic Segmentation Models

Authors: Amoroso, Roberto; Tomei, Matteo; Baraldi, Lorenzo; Cucchiara, Rita

2023 Conference paper

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning

Authors: Caffagni, Davide; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Image captioning is a challenging task that combines Computer Vision and Natural Language Processing to generate accurate and descriptive captions for input images. Research efforts in this field mainly focus on developing novel architectural components to extend image captioning models and on using large-scale image-text datasets crawled from the web to boost final performance. In this work, we explore an alternative to web-crawled data and augment the training dataset with synthetic images generated by a latent diffusion model. In particular, we propose a simple yet effective synthetic data augmentation framework that is capable of significantly improving the quality of captions generated by a standard Transformer-based model, leading to competitive results on the COCO dataset.

2023 Conference paper

Towards Explainable Navigation and Recounting

Authors: Poppi, Samuele; Rawal, Niyati; Bigazzi, Roberto; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Explainability and interpretability of deep neural networks have become of crucial importance over the years in Computer Vision, concurrently with the need to understand increasingly complex models. This necessity has fostered research on approaches that facilitate human comprehension of neural methods. In this work, we propose an explainable setting for visual navigation, in which an autonomous agent needs to explore an unseen indoor environment while portraying and explaining interesting scenes with natural language descriptions. We combine recent advances in ongoing research fields, employing an explainability method on images generated through agent-environment interaction. Our approach uses explainable maps to visualize model predictions and highlight the correlation between the observed entities and the generated words, to focus on prominent objects encountered during the environment exploration. The experimental section demonstrates that our approach can identify the regions of the images that the agent concentrates on to describe its point of view, improving explainability.

2023 Conference paper

TrackFlow: Multi-Object Tracking with Normalizing Flows

Authors: Mancusi, Gianluca; Panariello, Aniello; Porrello, Angelo; Fabbri, Matteo; Calderara, Simone; Cucchiara, Rita

Published in: PROCEEDINGS IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION

The field of multi-object tracking has recently seen renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim to extend tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information, e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve this, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, (i) they require careful tuning of tailored hyperparameters on a hold-out set, and (ii) they assume these costs to be independent, which does not hold in reality. We address these issues by building on an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms.
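The core idea in the abstract, cost(association) = -log p(features | correct association), can be sketched without the learned flow. Below, a fixed 1-D Gaussian over a single compatibility score stands in for TrackFlow's deep density estimator; the mean, standard deviation, track names, and scores are all illustrative, not the paper's model.

```python
# Sketch of a probabilistic association cost: the negative log-likelihood
# of a detection-track compatibility score under a density model of
# correct associations (here, a hand-set 1-D Gaussian as a stand-in
# for the learned normalizing flow).
import math

def gaussian_nll(x: float, mean: float, std: float) -> float:
    """Negative log-likelihood of x under N(mean, std^2)."""
    return 0.5 * math.log(2 * math.pi * std**2) + (x - mean)**2 / (2 * std**2)

def association_cost(score: float) -> float:
    # Correct associations tend to have high compatibility (mean 0.9).
    return gaussian_nll(score, mean=0.9, std=0.1)

# Lower cost = more plausible match; pick the cheapest candidate track.
candidates = {"track_1": 0.85, "track_2": 0.30}
best = min(candidates, key=lambda t: association_cost(candidates[t]))
print(best)  # track_1
```

Because every cue contributes through one jointly modeled density rather than hand-weighted terms, the approach avoids both the hyperparameter tuning and the independence assumption criticized in the abstract.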

2023 Conference paper

Page 24 of 106 • Total publications: 1059