Leveraging LLMs for optimised feature selection and embedding in structured data: A case study on graduate employment classification

Citation

Haque, Radiah and Goh, Hui-Ngo and Ting, Choo-Yee and Quek, Albert and Hasan, Md. Rakibul (2025) Leveraging LLMs for optimised feature selection and embedding in structured data: A case study on graduate employment classification. Computers and Education: Artificial Intelligence, 8. p. 100356. ISSN 2666-920X

1-s2.0-S2666920X24001590-main.pdf - Published Version (restricted to repository staff only)

Abstract

The application of Machine Learning (ML) for predicting graduate student employability is a growing area of research, driven by the need to align educational outcomes with job market requirements. In this context, this paper investigates the application of Large Language Models (LLMs) for tabular data transformation and embedding, specifically using Bidirectional Encoder Representations from Transformers (BERT), to enhance the performance of ML models in binary classification tasks for student employability prediction. The primary objective is to determine whether converting structured data into text format improves model accuracy. The study involves several ML models, including Artificial Neural Networks (ANN), CatBoost, and a BERT classifier. The focus is on predicting the employment status of graduate students based on demographic, academic, and graduate tracer study data, collected from over 4000 university graduates. Feature selection methods, including Boruta and the Extra Trees Classifier (ETC), are employed to identify the optimal feature set, guided by a sliding window algorithm for automatic feature selection. The models are trained in four stages: 1) original dataset without feature selection or word embedding, 2) dataset with selected optimal features, 3) transformed data with word embedding, and 4) transformed data with feature selection applied both before and after word embedding. The baseline model (without feature selection and embedding) achieved the highest accuracy with the ANN model (79%). Subsequently, applying ETC for feature selection improved accuracy, with CatBoost achieving 83%. Further transformation with BERT-based embeddings raised the highest accuracy to 85% using the BERT classifier. Finally, the optimal accuracy of 88% was obtained by applying feature selection before and after embedding, with the BERT-Boruta model.
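The tabular-to-text transformation described above can be illustrated with a minimal serialization sketch. The paper does not publish its exact template, so the field names and the "key is value" sentence format below are assumptions; in the study, the resulting strings would then be fed to a BERT tokenizer to produce embeddings.

```python
def row_to_text(row: dict) -> str:
    """Serialize one structured record into a sentence-like string
    suitable for a text encoder such as BERT.

    The field names and phrasing here are hypothetical examples, not
    the paper's actual schema or template.
    """
    return ". ".join(
        f"{key.replace('_', ' ')} is {value}" for key, value in row.items()
    ) + "."

# Hypothetical graduate record combining demographic and academic fields.
sample = {"gender": "female", "cgpa": 3.4, "faculty": "FCI", "internship": "yes"}
print(row_to_text(sample))
# → gender is female. cgpa is 3.4. faculty is FCI. internship is yes.
```

One design point worth noting: serializing every column into a single sentence lets a pretrained language model attend jointly over categorical and numeric attributes, which is what the abstract credits for the accuracy gains over the raw tabular baseline.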
The findings from this study demonstrate that combining the dual-stage feature selection approach with BERT embeddings significantly increases classification accuracy. This highlights the potential of LLMs in transforming tabular data for enhanced graduate employment prediction.
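The sliding-window selection that guides the feature-selection stage can be sketched as follows. This is a plausible reading of "sliding window algorithm for automatic feature selection", not the authors' published code: features are ranked by importance (e.g. from ETC or Boruta), every contiguous window of a fixed size is scored, and the best-scoring window becomes the selected subset. The `score_fn` and the toy weights are stand-ins for training and validating a real model such as CatBoost.

```python
def sliding_window_select(ranked_features, score_fn, window_size=3):
    """Score every contiguous window of importance-ranked features and
    return (best_subset, best_score).

    `score_fn(subset)` stands in for fitting a classifier on the subset
    and returning its validation accuracy; the paper's exact procedure
    may differ in detail.
    """
    best_subset, best_score = None, float("-inf")
    for start in range(len(ranked_features) - window_size + 1):
        subset = ranked_features[start:start + window_size]
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy demo with made-up per-feature "usefulness" weights.
weights = {"cgpa": 0.9, "internship": 0.8, "faculty": 0.4,
           "gender": 0.2, "age": 0.1, "state": 0.05}
ranked = sorted(weights, key=weights.get, reverse=True)
subset, score = sliding_window_select(
    ranked, lambda s: sum(weights[f] for f in s), window_size=3
)
print(subset)  # → ['cgpa', 'internship', 'faculty']
```

Because the features are pre-ranked, the windows near the front of the list dominate; in practice the window size itself could also be searched over, which is one way the procedure can run "automatically" as the abstract describes.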

Item Type: Article
Uncontrolled Keywords: Machine learning, Student employability prediction, Feature selection, Large language models, BERT, Tabular data
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Suzilawati Abu Samah
Date Deposited: 18 Feb 2025 01:35
Last Modified: 18 Feb 2025 01:35
URI: http://shdl.mmu.edu.my/id/eprint/13460

