Skewness based Outlier Elimination Algorithm for Pima Diabetes Dataset

Journal: GRENZE International Journal of Engineering and Technology
Authors: Pushkar Joglekar, Tejaswini Katale, Surabhi Deshpande, Saurabh Kelgandre, Aishwarya Katale
Volume: 10 Issue: 1
Grenze ID: 01.GIJET.10.1.158 Pages: 2529-2535

Abstract

An Outlier is a data point which appears to be distinct from the majority of the other data points within a dataset. Various factors, such as measurement errors, data entry problems, natural variability, and extreme events, can lead to outliers. They may lead to erroneous evaluation measures, biased data analysis, and incorrect model performance resulting in in-suitable outcomes. Hence, identification and elimination of outliers is a crucial pre-processing step. This research paper suggests an algorithm for correctly removing outliers from the Pima Diabetes Dataset. It takes into consideration the correlation values of the features with the outcome column, further arranging them in descending order. The normal distribution graphs for each feature in the dataset are plotted for data analysis. For the Diabetes dataset, the features are either Positive (Right) skewed or Symmetrical. For the Symmetrical curve features, it eliminates data points which are lying beyond three standard deviations i.e (μ - 3σ) and (μ + 3σ), whereas for Positively skewed features, it eliminates those data points which are greater than (μ + 2.5σ). Starting from the feature having the highest correlation value, based on the skewness it eliminates the outliers by iterating the features one by one based on the order of correlation values. The proposed algorithm is successful in elimination of 108 rows from Pima Diabetes Dataset. To check the efficacy of the algorithm, it is compared with four techniques - Local Outlier Factor, Mahalanobis Distance, Multivariate Normal Distribution (N Dimensional) and DBSCAN. Also, for better evaluation the number of outliers eliminated by each technique lies in the same range. Furthermore, after elimination of outliers, three Machine Learning are implemented - Logistic Regression, Random Forest and Decision Tree. Among all the techniques, logistic regression on the proposed algorithm gives the highest accuracy of 84%, highest precision of 0.88 and highest recall of 0.65

Download Now << BACK

GIJET