Predicting software reuse using machine learning techniques—A case study on open-source Java software systems

Citation

Konys, Agnieszka and Yeow, Matthew Yit Hang and Chong, Chun Yong and Lim, Mei Kuan and Yuen, Yee Yen (2025) Predicting software reuse using machine learning techniques—A case study on open-source Java software systems. PLOS ONE, 20 (2). e0314512. ISSN 1932-6203

Abstract

Software reuse is an essential practice for increasing efficiency and reducing costs in software production. Reuse practices include reusing artifacts, libraries, components, packages, and APIs. Identifying suitable software for reuse requires pinpointing potential candidates. However, there are no objective methods in place to measure software reuse, which makes it challenging to identify highly reusable software. Software reuse research mainly addresses two hurdles: 1) identifying reusable candidates effectively and efficiently, and 2) selecting high-quality software components that improve maintainability and extensibility. This paper proposes automating software reuse prediction with machine learning (ML) algorithms, enabling researchers and practitioners to better identify highly reusable software. Our approach uses cross-project code clone detection to establish the ground truth for software reuse, treating code clones found across popular GitHub projects as indicators of potential reuse candidates. Software metrics were extracted from Maven artifacts and used to train classification and regression models to predict and estimate software reuse. The average F1-score of the ML classification models is 77.19%. The best-performing model, Ridge Regression, achieved an F1-score of 79.17%. Additionally, this research aims to assist developers by identifying key metrics that significantly impact software reuse. Our findings suggest that the file-level PUA (Public Undocumented API) metric is the most important factor influencing software reuse. We also present suitable value ranges for the five most important metrics, which developers can follow to create highly reusable software. Furthermore, we developed a tool that uses the trained models to predict the reuse potential of existing GitHub projects and rank Maven artifacts by their domain.
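
The abstract reports classifier quality as an F1-score. As a reading aid, the sketch below shows how a binary F1-score is computed from precision and recall. It is not the authors' evaluation code: the function name `f1_score`, the label encoding (1 = reused, 0 = not reused), and the sample labels are all hypothetical, and the abstract does not state which averaging scheme the paper uses.

```python
def f1_score(actual, predicted, positive=1):
    """Return the F1-score for the `positive` class (binary case)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    # True positives, false positives, false negatives for the positive class.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        # No correct positive prediction: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = artifact counted as reused, 0 = not reused.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
score = f1_score(actual, predicted)
print(f"F1 = {score:.4f}")
```

With these made-up labels there are four true positives, one false positive, and one false negative, so precision and recall are both 0.8 and so is the F1-score.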

Item Type: Article
Uncontrolled Keywords: Machine learning
Subjects: Q Science > Q Science (General) > Q300-390 Cybernetics
Divisions: Faculty of Business (FOB)
Depositing User: Ms Rosnani Abd Wahab
Date Deposited: 05 Mar 2025 23:56
Last Modified: 05 Mar 2025 23:56
URI: http://shdl.mmu.edu.my/id/eprint/13565
