Enhance Multimodal Retrieval-Augmented Generation Using Multimodal Knowledge Graph

Citation

How, Shue Kei and Ong, Lee Yeng and Leow, Meng Chew (2025) Enhance Multimodal Retrieval-Augmented Generation Using Multimodal Knowledge Graph. Emerging Science Journal, 9 (6). pp. 3349-3361. ISSN 2610-9182

Text: 25.pdf - Published Version
Restricted to Repository staff only

Abstract

Large Language Models (LLMs) have shown impressive capabilities in natural language understanding and generation tasks. However, their reliance on text-only input limits their ability to handle tasks that require multimodal reasoning. To overcome this, Multimodal Large Language Models (MLLMs) have been introduced, accepting inputs such as images, text, video and audio. While MLLMs address some of these limitations, they often hallucinate because they over-rely on internal knowledge, and they incur high computational costs. Traditional vector-based multimodal Retrieval-Augmented Generation (RAG) systems attempt to mitigate these issues by retrieving supporting information, but they often suffer from cross-modal misalignment, where independently retrieved text and image content fail to align meaningfully. Motivated by the structured retrieval capabilities of text-based knowledge graph RAG, this paper proposes VisGraphRAG, which addresses this challenge by modelling structured relationships between images and text within a unified Multimodal Knowledge Graph (MMKG). This structure enables more accurate retrieval and better alignment across modalities, resulting in more relevant and complete responses. The experimental results show that VisGraphRAG significantly outperforms the vector database-based baseline RAG, achieving an answer accuracy of 0.7629 compared to 0.6743. Beyond accuracy, VisGraphRAG also performs better on key RAGAS metrics such as multimodal relevance (0.8802 vs 0.7912), indicating a stronger ability to retrieve relevant information across modalities. These results underscore the effectiveness of the proposed MMKG method in enhancing cross-modal alignment and supporting more accurate, context-aware generation in complex multimodal tasks.
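
The abstract does not describe how VisGraphRAG is implemented, so the following is only a minimal sketch of the general idea of retrieving text and images through shared entities in one graph, rather than from separate vector indexes. Everything in it is an assumption made for illustration: the `MultimodalKG` class, the relation names `described_by` and `depicted_in`, the example nodes, and the use of plain substring matching in place of real entity linking.

```python
from collections import defaultdict


class MultimodalKG:
    """Toy multimodal knowledge graph (hypothetical, not the paper's code).

    Entity nodes are linked to text nodes and image nodes, so one lookup
    returns content from both modalities that is tied to the same entity.
    """

    def __init__(self):
        self.nodes = {}                 # node_id -> {"modality", "content"}
        self.edges = defaultdict(list)  # node_id -> [(relation, neighbor_id)]

    def add_node(self, node_id, modality, content):
        self.nodes[node_id] = {"modality": modality, "content": content}

    def add_edge(self, src, relation, dst):
        # Store both directions so retrieval can walk from either side.
        self.edges[src].append((relation, dst))
        self.edges[dst].append((relation, src))

    def retrieve(self, query):
        """Return text and image node ids attached to entities named in the query.

        Assumption: entity linking is approximated by case-insensitive
        substring matching on the entity name.
        """
        q = query.lower()
        hits = {"text": [], "image": []}
        for node_id, node in self.nodes.items():
            if node["modality"] != "entity" or node["content"].lower() not in q:
                continue
            for _relation, neighbor in self.edges[node_id]:
                modality = self.nodes[neighbor]["modality"]
                if modality in hits and neighbor not in hits[modality]:
                    hits[modality].append(neighbor)
        return hits


# Illustrative data only; file names and ids are made up.
kg = MultimodalKG()
kg.add_node("e1", "entity", "Eiffel Tower")
kg.add_node("t1", "text", "The Eiffel Tower is a wrought-iron lattice tower in Paris.")
kg.add_node("i1", "image", "eiffel_tower.jpg")
kg.add_node("e2", "entity", "Louvre")
kg.add_node("i2", "image", "louvre.jpg")
kg.add_edge("e1", "described_by", "t1")
kg.add_edge("e1", "depicted_in", "i1")
kg.add_edge("e2", "depicted_in", "i2")

result = kg.retrieve("How tall is the Eiffel Tower?")
print(result)
```

Because the text passage and the image hang off the same entity node, they are returned together and cannot drift apart the way two independent nearest-neighbour searches might; this is the kind of cross-modal alignment the abstract attributes to the graph structure.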

Item Type: Article
Uncontrolled Keywords: Multimodal Retrieval-Augmented Generation, Multimodal Knowledge Graph, Multimodal Large Language Models, Cross-Modality Alignment
Subjects: L Education > L Education (General)
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 07 Jan 2026 06:43
Last Modified: 07 Jan 2026 08:43
URI: http://shdl.mmu.edu.my/id/eprint/15181
