Rita Cucchiara's talk at ELLIS Sofia: Rethinking Multimodal Foundation Models

Prof. Rita Cucchiara, Head of AImageLab and coordinator of the Modena ELLIS Unit, and Sara Sarto, post-doctoral researcher at AImageLab, participated in the ELLIS Workshop on Computer Vision and Machine Learning, held in Sofia, Bulgaria, on April 27, 2024. Rita Cucchiara's talk, titled "Rethinking Multimodal Foundation Models: From Retrieval to Reflection and Reasoning," outlined AImageLab's vision and research agenda for next-generation multimodal large language models (MLLMs).

The presentation opened with a critical reflection on the current foundation-model paradigm, in which knowledge is encoded in parameters through scaling: an approach that has been successful but is costly in environmental, economic, and scientific terms. The AImageLab perspective, framed as a distinctly European one, argues that efficiency, precision, and trustworthiness, rather than scale alone, should drive the next wave of progress.

The talk then walked through the lab's four-stage research roadmap: developing multimodal LLMs; retrieval-augmented generation (RAG), which connects models to external knowledge; self-reflective architectures, which let models decide whether and when to retrieve; and reasoning-augmented generation, which structures decision-making over retrieved evidence. Highlighted contributions included the Retrieval-Augmented Transformer for image captioning (CBMI 2022 Best Paper), the Wiki-LLaVA hierarchical RAG model (CVPR Workshops 2024), the ReflectiVA self-reflective model (CVPR 2025), and the newly introduced ReAG framework (CVPR 2026). ReAG integrates Popper-inspired reasoning (conjecture, refutation, and corroboration) into the generation process and achieves state-of-the-art results on knowledge-based visual question answering benchmarks.

The talk also touched on the lab's involvement in major EU-funded initiatives (ELLIOT, ELIAS, MINERVA, ELSA, and the newly launched ELLE MSCA Doctoral Network on Trustworthy and Agentic Multimodal LLMs) and on the recently inaugurated UNIMORE AI Center, which hosts the Modena ELLIS Unit.
