Text to image synthesis using generative adversarial network


Tan, Yong Xuan (2022) Text to image synthesis using generative adversarial network. Masters thesis, Multimedia University.

Full text not available from this repository.
Official URL: http://erep.mmu.edu.my/


Text-to-image synthesis generates images from a given text description, such that the content of each synthesised image matches the description. Existing text-to-image synthesis approaches are mainly built on Generative Adversarial Networks (GANs) to achieve the best performance. However, synthesising realistic and semantically consistent images from text remains challenging. This thesis proposes GAN-based text-to-image synthesis frameworks that use advanced model architectures to synthesise large-scale, highly realistic images conditioned on the text description. Moreover, this work is the first to investigate self-supervised learning in text-to-image synthesis, using it to exploit high-level structural information when synthesising complex objects. Three novel text-to-image synthesis frameworks are designed: (1) Self-supervised Residual Generative Adversarial Network (SResGAN), (2) Multi-scale Self-supervised Residual Generative Adversarial Network (MSResGAN), and (3) Multi-scale Refined Self-supervised Residual Generative Adversarial Network (MRSResGAN). SResGAN adopts a residual network architecture to produce large-scale, realistic images. For better visual realism, MSResGAN and MRSResGAN employ a multi-scale GAN architecture that synthesises images progressively from a lower scale to a larger scale. All three frameworks synthesise highly realistic images from the received text description. In addition, the frameworks integrate self-supervised learning via a rotation task to mitigate the low-data regime and diversify the learned representations. As a result, the models make fuller use of high-level structural information throughout the network and synthesise more diverse image content.
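The rotation task can be illustrated with a minimal sketch, assuming the common rotation-based self-supervised GAN setup: each image is rotated by 0, 90, 180 and 270 degrees, and an auxiliary classifier (for example, an extra head on the discriminator) is trained to predict which rotation was applied. The nested-list image representation and all helper names below are hypothetical; the thesis's exact head design and loss weighting are not described in this record.

```python
import math
from typing import List, Tuple

# Hypothetical representation: a single-channel H x W image as nested lists.
Image = List[List[float]]


def rotate90(img: Image) -> Image:
    """Rotate an image 90 degrees counter-clockwise."""
    h, w = len(img), len(img[0])
    # Output row c is input column (w - 1 - c), read from top to bottom.
    return [[img[r][w - 1 - c] for r in range(h)] for c in range(w)]


def make_rotation_batch(images: List[Image]) -> Tuple[List[Image], List[int]]:
    """Expand each image into four rotated copies labelled 0..3 (0, 90, 180, 270 degrees)."""
    rotated, labels = [], []
    for img in images:
        current = img
        for k in range(4):
            rotated.append(current)
            labels.append(k)
            current = rotate90(current)
    return rotated, labels


def rotation_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy of a 4-way rotation classifier (numerically stable log-softmax)."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


if __name__ == "__main__":
    imgs, ys = make_rotation_batch([[[1.0, 2.0], [3.0, 4.0]]])
    print(ys)       # [0, 1, 2, 3]
    print(imgs[1])  # [[2.0, 4.0], [1.0, 3.0]]
    # An uninformative classifier (equal logits) has loss log(4).
    print(rotation_loss([[0.0] * 4] * 4, ys))
```

Because the rotation labels are derived from the images themselves, this auxiliary task adds a training signal without extra annotation, which fits the abstract's motivation of countering a low-data regime.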
Furthermore, the proposed frameworks incorporate feature matching, an L1 distance loss, and one-sided label smoothing to stabilise model training. All three frameworks are evaluated on two benchmark text-to-image synthesis datasets, Oxford-102 and CUB-200-2011, using three evaluation metrics: Inception Score, Fréchet Inception Distance, and Structural Similarity Index. The experimental results show that all three frameworks outperform several existing text-to-image synthesis approaches on both datasets.
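The stabilisation terms and the Inception Score named above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the smoothing value 0.9, matching the mean of intermediate discriminator features, and all function names are hypothetical choices, not the thesis's settings, which this record does not give. In practice, the probabilities passed to the Inception Score would be Inception-v3 softmax outputs for generated images.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # guards against log(0)


def mean_vector(batch: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a batch of equal-length vectors."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]


def d_loss_one_sided(d_real: Sequence[float], d_fake: Sequence[float],
                     smooth: float = 0.9) -> float:
    """Discriminator binary cross-entropy with one-sided label smoothing.

    Real targets are softened from 1 to `smooth` (hypothetical default 0.9);
    fake targets stay at 0.
    """
    real = -sum(smooth * math.log(p + EPS) + (1 - smooth) * math.log(1 - p + EPS)
                for p in d_real) / len(d_real)
    fake = -sum(math.log(1 - p + EPS) for p in d_fake) / len(d_fake)
    return real + fake


def feature_matching_loss(real_feats: Sequence[Sequence[float]],
                          fake_feats: Sequence[Sequence[float]]) -> float:
    """Squared L2 distance between mean discriminator features of real and fake batches."""
    return sum((a - b) ** 2
               for a, b in zip(mean_vector(real_feats), mean_vector(fake_feats)))


def l1_loss(real_img: Sequence[float], fake_img: Sequence[float]) -> float:
    """Mean absolute pixel difference between a real and a generated (flattened) image."""
    return sum(abs(a - b) for a, b in zip(real_img, fake_img)) / len(real_img)


def inception_score(probs: Sequence[Sequence[float]]) -> float:
    """IS = exp(mean over images of KL(p(y|x) || p(y))) from per-image class probabilities."""
    marginal = mean_vector(probs)
    kl_total = 0.0
    for p in probs:
        # Terms with p_i == 0 contribute nothing; m_i > 0 whenever p_i > 0.
        kl_total += sum(pi * math.log(pi / mi) for pi, mi in zip(p, marginal) if pi > 0)
    return math.exp(kl_total / len(probs))
```

Only the real targets are softened, which is what makes the smoothing one-sided. A higher Inception Score rewards images whose class predictions are individually confident yet diverse across the set; Fréchet Inception Distance and SSIM are not sketched here because they need matrix square roots and windowed image statistics.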

Item Type: Thesis (Masters)
Additional Information: Call No.: Q325.5 .T36 2022
Uncontrolled Keywords: Machine learning
Subjects: Q Science > Q Science (General) > Q300-390 Cybernetics
Divisions: Faculty of Information Science and Technology (FIST)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 18 Jul 2023 05:23
Last Modified: 18 Jul 2023 05:23
URI: http://shdl.mmu.edu.my/id/eprint/11543

