Software fault prediction of unlabeled program modules

Experiments reveal that unsupervised software fault prediction can be fully automated and that effective results can be produced using X-means clustering with software metrics thresholds. Three datasets, collected from a Turkish white-goods manufacturer developing embedded controller software, were used for validation.

Index Terms — Clustering, metrics thresholds, software fault prediction, and X-means clustering.


In this section, we discuss the different measurements used for software defect prediction, such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

TP denotes the number of defective software instances that are correctly classified as defective, while TN is the number of clean software instances that are correctly classified as clean. FP denotes the number of clean software instances that are wrongly classified as defective, and FN denotes the number of defective software instances that are mistakenly classified as clean.

One of the simplest metrics for evaluating the performance of predictive models is classification accuracy, also called the correct classification rate. It quantifies the proportion of correctly classified instances among all instances.
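
Expressed in terms of the counts defined above, this is the standard formulation (included here for reference):

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```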

The F-score is the harmonic mean of precision and recall, and many studies in the literature have used it [56] [57]. The G-measure is another measure used in software defect prediction; it is defined as the harmonic mean of recall and specificity.

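For reference, the standard definitions of these measures, written in terms of the same counts, are:

```latex
\begin{aligned}
\mathrm{Precision} &= \frac{TP}{TP + FP}, &
\mathrm{Recall} &= \frac{TP}{TP + FN}, &
\mathrm{Specificity} &= \frac{TN}{TN + FP}, \\[4pt]
\text{F-score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, &
\text{G-measure} &= \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Specificity}}{\mathrm{Recall} + \mathrm{Specificity}}
\end{aligned}
```
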
For the experiments, 10 well-known software defect datasets [62] were selected. The majority of related works use these datasets to evaluate the performance of their SDP techniques, which is the reason they were chosen here for further comparison.

Table 1 reports the datasets used in the experiments along with their statistics. Boosting and bagging ensembles of all the base classifiers were also considered. The experiments were conducted in a Python environment. It is important to highlight that the metrics were computed using the weighted average: each metric is calculated per class label and the per-class values are averaged with weights proportional to class support, which takes the label imbalance into account.
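
For illustration, the weighted-average computation can be performed with scikit-learn as in the sketch below; the study does not show its exact code, so the helper name and structure here are assumptions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def weighted_metrics(y_true, y_pred):
    """Accuracy plus weighted-average precision, recall, and F-score.

    With average="weighted", each metric is computed per class label and the
    per-class values are averaged weighted by class support, which accounts
    for label imbalance.
    """
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f_score, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}
```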

Table 1. Dataset summaries. The performance of the classifiers was evaluated using 10-fold cross-validation, which splits each dataset into 10 consecutive folds: one fold is used for testing and the remaining folds for training. Afterwards, the features were standardized and scaled using the StandardScaler function in Python, which works by removing the mean and scaling the features to unit variance.
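
A minimal sketch of this splitting and scaling step, assuming NumPy-array inputs and scikit-learn's KFold and StandardScaler (names and details are illustrative; the study's own code is not shown):

```python
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, n_splits=10, seed=42):
    """Yield standardized train/test folds from 10-fold cross-validation.

    The scaler is fit on the training fold only and then applied to the test
    fold, a common precaution against data leakage (the study does not state
    this detail explicitly).
    """
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(X):
        scaler = StandardScaler()  # removes the mean, scales to unit variance
        X_train = scaler.fit_transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        yield X_train, X_test, y[train_idx], y[test_idx]
```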

Since the datasets were highly imbalanced, oversampling with SMOTE was applied to the training data only, as this approach has been widely used in the literature to mitigate imbalance issues in training data for SDP. Algorithm 1 below was used for the experiments. It begins by taking a list of datasets and a list of classifiers and then iterates over all datasets, as shown in Line 8.
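
A minimal sketch of this step using the imbalanced-learn library, which is a common (assumed, not confirmed by the study) implementation choice for SMOTE in Python:

```python
from imblearn.over_sampling import SMOTE

def oversample_training_fold(X_train, y_train, seed=42):
    """Oversample the minority (defective) class in the training fold only.

    SMOTE synthesizes new minority-class instances by interpolating between
    existing minority instances; the test fold is left untouched so that
    evaluation reflects the original class distribution.
    """
    smote = SMOTE(random_state=seed)
    return smote.fit_resample(X_train, y_train)
```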

Each dataset was split into training and testing data based on 10-fold cross-validation, with the data shuffled before splitting, as shown in Line 9.

Once the dataset was split, the performStandardScaler function was used to standardize and scale the features.

Once the features were standardized, the training data for each fold were re-sampled using the SMOTE technique. The loop in Lines 12 - 25 trained the classifiers, obtained predictions, and computed the evaluation metrics. The average metrics were computed in Lines 27 - 31, since each dataset was split into 10 folds. The process from Lines 9 - 31 was repeated for all provided datasets. Table 2 summarizes the performance of the different classifiers in terms of classification accuracy.
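
Assuming estimators that follow the scikit-learn fit/predict interface, the training and averaging steps described in Lines 12 - 31 could be sketched as follows (function and variable names are illustrative, not the study's code):

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_classifiers(folds, classifiers):
    """Train every classifier on every fold and average the metrics.

    `folds` is a list of (X_train, X_test, y_train, y_test) tuples that have
    already been scaled and SMOTE-resampled; `classifiers` maps a name to an
    unfitted scikit-learn estimator. Returns per-classifier averages of
    (accuracy, precision, recall, F-score) over the folds.
    """
    scores = {name: [] for name in classifiers}
    for X_train, X_test, y_train, y_test in folds:
        for name, estimator in classifiers.items():
            model = clone(estimator)            # fresh model for each fold
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            acc = accuracy_score(y_test, y_pred)
            prec, rec, f1, _ = precision_recall_fscore_support(
                y_test, y_pred, average="weighted", zero_division=0
            )
            scores[name].append((acc, prec, rec, f1))
    # Average each metric over the folds, mirroring Lines 27 - 31.
    return {name: tuple(np.mean(vals, axis=0)) for name, vals in scores.items()}
```

The averaged values returned by such a loop correspond to the kind of per-dataset scores summarized in Table 2 and Table 3.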

The RF classifier achieved accuracies of 0. Moreover, the reported scores in Table 2 show that the bagging classifier with DS as a base learner performed well on the PC5, KC3, and MC2 datasets as compared to the other classifiers. Furthermore, the maximum accuracy attained for PC1 was 0.

Figure 1. Classification accuracy scores of the different classifiers. Table 2. Accuracy scores obtained using the different classifiers. Table 3 reports the F-scores attained by the different classifiers. In general, it is apparent that the RF classifier was the best-performing classifier on six of the datasets, as illustrated in Table 2 and Table 3.

Figure 2 illustrates bar plots of the F-scores attained using classifiers for all datasets. Figure 2. F-scores attained by different classifiers.

Table 3. The F-scores obtained using the different classifiers. For JM1, the highest F-score was 0. Additionally, the lowest score was 0. It is clear that there is no single dominant classifier, which may be due to the nature of the datasets.

Similar to other studies [6] [22] [58], our results recommend using ensembles as predictive models to detect software defects.
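
As a concrete illustration of such ensemble models, a minimal scikit-learn sketch is shown below; the hyperparameters and the exact base learners used in the study are not reported here, so these values are illustrative assumptions (including reading "DS" as a decision stump):

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# A depth-1 decision tree, i.e. a decision stump, as a weak base learner.
stump = DecisionTreeClassifier(max_depth=1, random_state=42)

ensembles = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    # `estimator=` is the parameter name in recent scikit-learn releases;
    # older releases use `base_estimator=` instead.
    "Bagging-DS": BaggingClassifier(estimator=stump, n_estimators=100,
                                    random_state=42),
    "Boosting-DS": AdaBoostClassifier(estimator=stump, n_estimators=100,
                                      random_state=42),
}
```

In this form, the dictionary of estimators can be passed directly to an evaluation loop such as the one sketched after the description of Algorithm 1 above.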

Additionally, the findings of [6] [22] [58] agree with our outcome that RF performed well. However, the experiments conducted by Hammouri et al. Table 4. Figure 3. In this section, we list some potential threats to the validity of our study, including construct validity, and our responses to them.

One potential threat to validity is the selection of datasets, which might not be representative. In our study, this threat is mitigated by evaluating the performance of the classifiers on ten well-known datasets that are commonly used in the literature. We further attempted to mitigate this threat by measuring the performance of the base learners and of the boosting and bagging classifiers on diverse datasets of different sizes.

Kalai Magal et al. performed feature selection based on correlation computation, aiming to choose an ideal subset of features. The features selected using correlation-based feature selection were then used with RF to predict software defects. The outcome showed clear improvements of the improved RF over the traditional RF. Venkata et al. investigated the impact of attribute reduction on the performance of SDP models and attempted to combine PCA with different classification models, which did not show any improvements.

However, their experimental results demonstrated that combining the correlation-based feature selection technique with the 1-rule classifier led to improvements in classification accuracy. Various datasets were selected to assess the classifiers' efficiency at detecting defective modules.

The conducted experiments demonstrated that the RF classifier outperformed the others. Moreover, Ge et al. reported experimental results showing that LC outperformed the other classifiers. Aleem et al. found that bagging and SVM performed well on the majority of datasets. Meanwhile, Wang et al. and Perreault et al. carried out comparable studies; the outcomes of the conducted experiments did not reveal a classifier that was clearly superior at identifying software defects.

Hussain et al. reported experimental results showing that StackingC performed well compared to the other classifiers. Hammouri et al. used three real debugging datasets in their study.

Measurements such as accuracy, precision, recall, F-measure, and RMSE were used to analyze the results. The results of their study showed that DS performed well. The approaches mentioned above differ from the approach proposed in this paper in two ways.

Firstly, we compared the performance of different supervised and ensemble methods on the oversampled training data, while other works, such as that of Kalai Magal et al.
