Development of Deep Learning Methods for Visual Document Classification Using Hybrid Vision Transformer–EfficientNet Architecture

Citation

Ashimgaliyev, Medet and Emerson Raja, Joseph and Zhumadillayeva, Ainur and Baimakhanova, Aigerim (2026) Development of Deep Learning Methods for Visual Document Classification Using Hybrid Vision Transformer–EfficientNet Architecture. IEEE Access. p. 1. ISSN 2169-3536

Full text: 1.pdf (Published Version, 1MB). Restricted to Repository staff only.

Abstract

The rapid expansion of digital archives and scanned document collections has underscored the importance of reliable and efficient document classification techniques. Traditional methods that combine optical character recognition (OCR) with classical machine learning often fall short when processing diverse, low-quality, or unstructured archival documents. In response, this study introduces a hybrid deep learning framework, HybridViT, that merges a Vision Transformer (ViT) with EfficientNet for classifying visual documents. The EfficientNet component captures detailed local features, while the ViT component captures broader contextual information. These complementary representations are combined through a feature fusion mechanism, resulting in improved classification accuracy. Evaluated on a dataset of archival materials, the HybridViT model reached an overall accuracy of 98.2%, surpassing standard CNN (92.3%) and standalone ViT (94.1%) models. Additionally, both precision and recall improved by around 3–5%, and the model demonstrated enhanced resilience to noise and distortions. A prototype information system was also created to incorporate the classification engine into a user-friendly interface backed by a structured database. These findings highlight the promise of hybrid transformer-CNN models for advancing the automation of document classification in digital repositories and improving access to extensive archival datasets. Unlike earlier YOLO-based methods that concentrated on natural imagery or artificial document datasets, this research specifically addresses manually scanned archival documents. It conducts a structured comparison of YOLOv4, YOLOv5, and YOLOv8 using a unified training setup, evaluating both detection metrics and deployment-relevant factors on real archival scan data.
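
The abstract does not say how the feature fusion mechanism is implemented. As a rough illustration only, the sketch below assumes the simplest common choice: concatenating a local feature vector (standing in for the EfficientNet embedding) with a global one (standing in for the ViT embedding) and passing the result through a linear layer with softmax. The function names, dimensions, and random values are all hypothetical and are not taken from the paper.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row of coefficients per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(local_feat, global_feat, weights, bias):
    """Concatenation fusion (an assumption) followed by a linear classifier."""
    fused = list(local_feat) + list(global_feat)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("classifier width must equal the fused feature length")
    return softmax(linear(weights, bias, fused))


# Hypothetical sizes; real embeddings would be far larger.
LOCAL_DIM, GLOBAL_DIM, NUM_CLASSES = 4, 3, 5
rng = random.Random(0)

# Random stand-ins for the two backbone embeddings of one scanned page.
local_feat = [rng.gauss(0, 1) for _ in range(LOCAL_DIM)]
global_feat = [rng.gauss(0, 1) for _ in range(GLOBAL_DIM)]

# Untrained classifier head, randomly initialised for illustration.
weights = [[rng.gauss(0, 0.1) for _ in range(LOCAL_DIM + GLOBAL_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(local_feat, global_feat, weights, bias)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

In a trained system the two input vectors would come from the respective backbones and the classifier weights would be learned. Other fusion schemes, such as weighted sums or attention-based fusion, would replace only the concatenation step.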

Item Type: Article
Uncontrolled Keywords: Transformers, EfficientNet, multi-headed neural network, optical character reader, complex document classification
Subjects: T Technology > T Technology (General)
Divisions: Faculty of Engineering and Technology (FET)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 07 Jan 2026 06:26
Last Modified: 06 Apr 2026 03:53
URI: http://shdl.mmu.edu.my/id/eprint/15179
