Data wrangling framework on clickstream for enhancing seat sales prediction


Alauddin, Md (2021) Data wrangling framework on clickstream for enhancing seat sales prediction. Masters thesis, Multimedia University.

Full text not available from this repository.


Revenue management is one of the essential functions in every airline business, and the seat (ticket) is the main product of an airline. The purpose of revenue management is to maximize the revenue of each airline route based on demand. This demand, measured by seat sales, depends on factors such as historical transaction data, seasonality, ticket pricing based on advance purchase trends, competitors' pricing, and customer behaviour. Predicting passenger seat sales helps to estimate revenue on future flights and allows the airline to generate optimal prices for the corresponding flights. Current prediction models use structured transactional and operational data to predict airline seat sales or passenger demand. As airlines have undergone a digital transformation over the past two decades, large volumes of user activity data have become available to them. This study presents the efficacy of a third data source, namely digital clickstream data, in improving airline seat sales prediction. Digital data has thus far been ignored in most research because of the lack of a proper extraction and processing pipeline for this massive volume of available but unstructured data. This study developed a suitable ETL framework for data wrangling and identified 191 features from transactional, operational and digital data to create the analytical dataset. The wrapper-based Boruta algorithm, chosen through experimentation, selected 22 features as input to the prediction models (10, 10 and 2 features from the transactional, digital and operational data sources, respectively).
Ten models, namely, 1) Linear regression, 2) Support Vector Machine (SVM), 3) Generalized Linear Model (GLM), 4) CART, 5) GBRT, 6) Random Forest (RF), 7) Histogram Gradient Boosting Regressor, 8) Extreme Gradient Boosting Regressor (XGBRegressor), 9) LightGBM Regressor (LGBMRegressor), and 10) Categorical Boosting Regressor (CatBoostRegressor), were studied and evaluated on the analytical dataset. With hyperparameter tuning, the tree-based CatBoostRegressor, LGBMRegressor, and XGBRegressor models were found to be the most effective in predicting airline seat sales 30 and 60 days prior to departure, with 91-94% accuracy. The contribution of the digital data source can be observed as a 2-6% improvement in MAPE compared with the same models trained without digital data.
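
For readers unfamiliar with the metric, the following is a minimal sketch of how MAPE (mean absolute percentage error) is commonly computed and how two feature sets might be compared with it. It is not taken from the thesis: the function name `mape`, the variable names, and all seat counts are hypothetical and chosen only for illustration. The abstract does not define "accuracy"; reading it as 100% minus MAPE is an assumption made here.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must be non-empty")
    if any(a == 0 for a in actual):
        # The percentage error is undefined for a zero actual value.
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical seats sold on four flights (illustrative numbers only).
actual_seats = [100, 200, 150, 120]

# Hypothetical predictions from a model without and with clickstream features.
pred_without_digital = [88, 224, 165, 108]
pred_with_digital = [95, 210, 156, 114]

mape_without = mape(actual_seats, pred_without_digital)
mape_with = mape(actual_seats, pred_with_digital)

# Assumed reading of "accuracy": 100% minus MAPE.
accuracy_with = 100.0 - mape_with

print(f"MAPE without digital data: {mape_without:.2f}%")
print(f"MAPE with digital data:    {mape_with:.2f}%")
print(f"Improvement in MAPE:       {mape_without - mape_with:.2f} points")
```

A lower MAPE means a smaller average relative error, so an improvement appears as a reduction in the value. MAPE cannot be computed for flights with zero actual sales, which is why the sketch rejects such inputs rather than skipping them silently.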

Item Type: Thesis (Masters)
Additional Information: Call No.: QA76.9.Q36 M33 2021
Uncontrolled Keywords: Predictive analytics
Subjects: Q Science > QA Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science > QA76.75-76.765 Computer software
Divisions: Faculty of Computing and Informatics (FCI)
Depositing User: Ms Nurul Iqtiani Ahmad
Date Deposited: 16 Jan 2023 06:17
Last Modified: 17 Apr 2023 07:03

