Publications - AImageLab

Monocular per-object distance estimation with Masked Object Modeling

Authors: Panariello, Aniello; Mancusi, Gianluca; Haj Ali, Fedy; Porrello, Angelo; Calderara, Simone; Cucchiara, Rita

Published in: COMPUTER VISION AND IMAGE UNDERSTANDING

2025 Journal article

Mosaic-SR: An Adaptive Multi-step Super-Resolution Method for Low-Resolution 2D Barcodes

Authors: Vezzali, Enrico; Vorabbi, Lorenzo; Grana, Costantino; Bolelli, Federico

QR and Datamatrix codes are widely used in warehouse logistics and high-speed production pipelines. Still, distant or small barcodes often yield low-pixel-density images that are hard to read. Conventional solutions rely on costly hardware or enhanced lighting, raising expenses and potentially reducing depth of field. We propose Mosaic-SR, a multi-step, adaptive super-resolution (SR) method that devotes more computation to barcode regions than uniform backgrounds. For each patch, it predicts an uncertainty value to decide how many refinement steps are required. Our experiments show that Mosaic-SR surpasses state-of-the-art SR models on 2D barcode images, achieving higher PSNR and decoding rates in less time. All code and trained models are publicly available at https://github.com/Henvezz95/mosaic-sr.
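The adaptive refinement idea in this abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: the `step` callable, the `threshold` value and the `max_steps` budget are all hypothetical, and the real model may instead predict the number of steps once up front.

```python
from typing import Callable, List, Tuple

Patch = List[float]  # placeholder type for a patch of pixel values (assumption)


def refine_patch(
    patch: Patch,
    step: Callable[[Patch], Tuple[Patch, float]],
    threshold: float = 0.1,  # hypothetical stopping threshold
    max_steps: int = 4,      # hypothetical compute budget per patch
) -> Tuple[Patch, int]:
    """Apply `step` until its reported uncertainty is at most `threshold`
    or the step budget is spent. Returns the patch and the steps used."""
    used = 0
    while used < max_steps:
        # `step` stands in for one refinement pass that also reports
        # how uncertain the model still is about this patch.
        patch, uncertainty = step(patch)
        used += 1
        if uncertainty <= threshold:
            break
    return patch, used
```

With a rule like this, a uniform background patch whose uncertainty falls quickly stops after one step, while a barcode patch keeps being refined until the budget runs out.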

2025 Conference proceedings paper

Multimodal Dialogue for Empathetic Human-Robot Interaction

Authors: Rawal, Niyati; Singh Maharjan, Rahul; Salici, Giacomo; Catalini, Riccardo; Romeo, Marta; Bigazzi, Roberto; Baraldi, Lorenzo; Vezzani, Roberto; Cucchiara, Rita; Cangelosi, Angelo

2025 Conference proceedings paper

Multimodal Emotion Recognition in Conversation via Possible Speaker's Audio and Visual Sequence Selection

Authors: Singh Maharjan, Rahul; Rawal, Niyati; Romeo, Marta; Baraldi, Lorenzo; Cucchiara, Rita; Cangelosi, Angelo

Published in: PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING

2025 Conference proceedings paper

No More Slice Wars: Towards Harmonized Brain MRI Synthesis for the BraSyn Challenge

Authors: Carpentiero, Omar; Marchesini, Kevin; Grana, Costantino; Bolelli, Federico

The synthesis of missing MRI modalities has emerged as a critical solution to address incomplete multi-parametric imaging in brain tumor diagnosis and treatment planning. While recent advances in generative models, especially GANs and diffusion-based approaches, have demonstrated promising results in cross-modality MRI generation, challenges remain in preserving anatomical fidelity and minimizing synthesis artifacts. In this work, we build upon the Hybrid Fusion GAN (HF-GAN) framework, introducing several enhancements aimed at improving synthesis quality and generalization across tumor types. Specifically, we incorporate z-score normalization, optimize network components for faster and more stable training, and extend the pipeline to support multi-view generation across various brain tumor categories, including gliomas, metastases, and meningiomas. Our approach focuses on refining 2D slice-based generation to ensure intra-slice coherence and reduce intensity inconsistencies, ultimately supporting more accurate and robust tumor segmentation in scenarios with missing imaging modalities. Our source code is available at https://github.com/AImageLab-zip/BraSyn25.
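As a small illustration of the z-score normalization mentioned in this abstract, the sketch below rescales a list of intensities to zero mean and unit standard deviation. The abstract does not say whether statistics are computed per volume, per slice, or over brain voxels only, so the function simply uses whatever values it is given; the name `zscore_normalize` and the `eps` guard are assumptions.

```python
from statistics import fmean, pstdev


def zscore_normalize(intensities, eps=1e-8):
    """Shift intensities to zero mean and scale them to unit standard deviation.

    `eps` (an assumption) avoids division by zero for constant inputs.
    """
    mean = fmean(intensities)
    std = pstdev(intensities, mu=mean)  # population standard deviation
    return [(v - mean) / (std + eps) for v in intensities]
```

For example, `zscore_normalize([0.0, 2.0, 4.0])` returns three values centred on zero, with the middle one exactly `0.0`.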

2025 Conference proceedings paper

One transformer for all time series: representing and training with time-dependent heterogeneous tabular data

Authors: Luetto, S.; Garuti, F.; Sangineto, E.; Forni, L.; Cucchiara, R.

Published in: MACHINE LEARNING

There is growing interest in applying Deep Learning techniques to tabular data in order to replicate the success of other Artificial Intelligence areas in this structured domain. Particularly interesting is the case in which tabular data have a time dependence, as in financial transactions. However, the heterogeneity of the tabular values, in which categorical elements are mixed with numerical features, makes this adaptation difficult. In this paper we propose UniTTab, a Transformer-based architecture whose goal is to uniformly represent heterogeneous time-dependent tabular data, in which both numerical and categorical features are described using continuous embedding vectors. Moreover, unlike common approaches, which combine different loss functions to train with both numerical and categorical targets, UniTTab is uniformly trained with a single Masked Token pretext task. Finally, UniTTab can also represent time series in which the individual row components have a variable internal structure with a variable number of fields, a common situation in many application domains, such as real-world transactional data. Using extensive experiments with five datasets of variable size and complexity, we empirically show that UniTTab consistently and significantly improves the prediction accuracy over several downstream tasks and with respect to both Deep Learning and more standard Machine Learning approaches. Our code and our models are available at: https://github.com/fabriziogaruti/UniTTab.
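The Masked Token pretext task named in this abstract can be pictured with a toy data-preparation step. This is a generic sketch rather than UniTTab's actual code: the `MASK` placeholder, the `mask_fields` helper and the masking probability are hypothetical. It only shows that numerical and categorical fields of a row can be hidden in the same way, with the hidden values kept as training targets.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden field


def mask_fields(row: dict, mask_prob: float, rng: random.Random):
    """Hide each field of `row` with probability `mask_prob`.

    Returns the masked row and a dict of the original values that a
    model would be asked to reconstruct.
    """
    masked, targets = {}, {}
    for name, value in row.items():
        if rng.random() < mask_prob:
            masked[name] = MASK
            targets[name] = value  # numerical or categorical, treated alike
        else:
            masked[name] = value
    return masked, targets
```

A call such as `mask_fields({"amount": 12.5, "merchant": "grocery"}, 0.15, random.Random(0))` returns the row with some fields possibly replaced by `MASK`, plus the values to predict.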

2025 Journal article

Optimizing Resource Allocation in Public Healthcare: A Machine Learning Approach for Length-of-Stay Prediction

Authors: Perliti Scorzoni, Paolo; Giovanetti, Anita; Bolelli, Federico; Grana, Costantino

Published in: LECTURE NOTES IN COMPUTER SCIENCE

Effective hospital resource management hinges on established metrics such as Length of Stay (LOS) and Prolonged Length of Stay (pLOS). Reducing pLOS is associated with improved patient outcomes and optimized resource utilization (e.g., bed allocation). This study investigates several Machine Learning (ML) models for both LOS and pLOS prediction. We conducted a retrospective study analyzing data from general inpatients discharged between 2022 and 2023 at a northern Italian hospital. Sixteen regression and twelve classification algorithms were compared in forecasting LOS as either a continuous or multi-class variable (1-3 days, 4-10 days, >10 days). Additionally, the same models were assessed for pLOS prediction (defined as LOS exceeding 8 days). All models were evaluated using two variants of the same dataset: one containing only structured data (e.g., demographics and clinical information), and a second one also containing features extracted from free-text diagnosis. Ensemble models, leveraging the combined strengths of multiple ML algorithms, demonstrated superior accuracy in predicting both LOS and pLOS compared to single-algorithm models, particularly when utilizing both structured and unstructured data extracted from diagnoses. Integration of ML, particularly ensemble models, has the potential to significantly improve LOS prediction and identify patients at high risk of pLOS. Such insights can empower healthcare professionals and bed managers to optimize patient care and resource allocation, promoting overall healthcare efficiency and sustainability.
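The target definitions in this abstract translate directly into labelling code. The sketch below assumes LOS is recorded in whole days with a minimum of one day; the function names are illustrative and not taken from the study.

```python
PLOS_THRESHOLD_DAYS = 8  # from the abstract: pLOS is LOS exceeding 8 days


def los_class(days: int) -> str:
    """Map a length of stay in whole days to the three classes in the abstract."""
    if days < 1:
        # Assumption: a recorded stay lasts at least one day.
        raise ValueError("length of stay must be at least 1 day")
    if days <= 3:
        return "1-3 days"
    if days <= 10:
        return "4-10 days"
    return ">10 days"


def is_plos(days: int) -> bool:
    """Binary prolonged-stay label."""
    return days > PLOS_THRESHOLD_DAYS
```

Note that the two labellings do not align: a stay of 9 or 10 days falls in the "4-10 days" class but already counts as prolonged.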

2025 Conference proceedings paper

OXA-MISS: A Robust Multimodal Architecture for Chemotherapy Response Prediction under Data Scarcity

Authors: Miccolis, Francesca; Marinelli, Fabio; Pipoli, Vittorio; Afenteva, Daria; Virtanen, Anni; Lovino, Marta; Ficarra, Elisa

2025 Conference proceedings paper

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Authors: Amoroso, Roberto; Zhang, Gengyuan; Koner, Rajat; Baraldi, Lorenzo; Cucchiara, Rita; Tresp, Volker

Published in: IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION

Video Question Answering (Video QA) is a critical and challenging task in video understanding, requiring models to comprehend entire videos, identify the most pertinent information based on the contextual cues from the question, and reason accurately to provide answers. Initial endeavors in harnessing Multimodal Large Language Models (MLLMs) have cast new light on Visual QA, particularly highlighting their commonsense and temporal reasoning capacities. Models that effectively align visual and textual elements can offer more accurate answers tailored to visual inputs. Nevertheless, an unresolved question persists regarding video content: How can we efficiently extract the most relevant information from videos over time and space for enhanced VQA? In this study, we evaluate the efficacy of various temporal modeling techniques in conjunction with MLLMs and introduce a novel component, T-Former, designed as a question-guided temporal querying transformer. T-Former bridges frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across various Video QA benchmarks shows that T-Former, with its linear computational complexity, competes favorably with existing temporal modeling approaches and aligns with the latest advancements in Video QA tasks.

2025 Conference proceedings paper

Pixels of Faith: Exploiting Visual Saliency to Detect Religious Image Manipulation

Authors: Cartella, G.; Cuculo, V.; Cornia, M.; Papasidero, M.; Ruozzi, F.; Cucchiara, R.

Published in: LECTURE NOTES IN COMPUTER SCIENCE

2025 Conference proceedings paper
