Big Data Analytics for Diabetes Prediction on Apache Spark

Document Type : Original Article

Authors

1 Department of Computer Science and Engineering, Faculty of Electronic Engineering, Menofia, Egypt

2 Department of Computer Science and Engineering, Faculty of Electronic Engineering, Menouf, Egypt

Abstract

For dangerous diseases such as diabetes, in which blood glucose levels are too high, machine learning models have been used to classify or predict the patient's state. The size of the collected datasets is now increasing dramatically, so big data analytics technology is an essential factor in building an efficient healthcare system that is fit for the future. This paper discusses the effect of applying big data analytics to different dataset sizes using different numbers of processing cores on Apache Spark. The system is evaluated using several performance metrics, including accuracy, recall, precision, and time. A comparative study is made among Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), and Random Forest (RF) algorithms. The experimental results show that RF and SVM produced the most accurate models, while NB required the least time.
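
As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes accuracy, precision, and recall for a binary classifier in plain Python. It is not the authors' implementation: the function name binary_metrics and the toy labels are assumptions made for illustration, with 1 standing for a diabetic patient and 0 for a non-diabetic one.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels invented for illustration only; they are not from the paper.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a comparison like the one described, the same function could be applied to the predictions of each algorithm (SVM, NB, DT, RF) on a held-out test set, with running time measured separately.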

Keywords


Volume 28, ICEEM2019-Special Issue
ICEEM2019-Special Issue: 1st International Conference on Electronic Eng., Faculty of Electronic Eng., Menouf, Egypt, 7-8 Dec. 2019
Pages 355-360