Machine Learning Model for Cancer Diagnosis based on RNAseq Microarray

Document Type: Original Article

Authors

1 Dept. of Computer Science and Engineering, Faculty of Electronic Engineering, Menoufia University, Menoufia, Menouf

2 Communications and Computers Engineering Department, Faculty of Engineering, Delta University for Science and Technology, Gamasa, Egypt.

Abstract

Microarray technology is one of the most important recent breakthroughs in experimental molecular biology. It allows the expression levels of thousands of genes in cells to be monitored concurrently, and it has been increasingly used in cancer research to better understand the molecular variations among tumors so that a more reliable classification becomes attainable. Machine learning techniques are widely used to build robust and accurate classification models. In this paper, a function called Feature Reduction Classification Optimization (FeRCO) is proposed. FeRCO applies machine learning techniques to RNAseq microarray data to predict whether a patient is diseased or not. Its main purpose is to determine the minimum number of features, together with the reduction technique and the classification technique, that gives the highest classification accuracy. The classification techniques include Support Vector Machine (SVM), both linear (LSVM) and kernel, Decision Trees (DT), Random Forest (RF), K-Nearest Neighbours (KNN) and Naïve Bayes (NB). The reduction techniques, namely Principal Component Analysis (PCA), both linear (LPCA) and kernel (KPCA), Linear Discriminant Analysis (LDA) and Factor Analysis (FA), were combined with the different classifiers to find a lower-dimensional subspace with more discriminative features for better classification. The major outcomes of this research can serve as a roadmap for researchers interested in this field, helping them choose the most suitable machine learning algorithm, whether for classification or for reduction. The results show that FA and LPCA are the best reduction techniques across the three datasets, providing an accuracy of up to 100% on the TCGA and simulation datasets and up to 97.86% on the WDBC dataset. LSVM is the best classification technique to use with LPCA, FA and LDA, while RF is the best classification technique to use with KPCA.
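
The search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation of FeRCO: the function name `ferco_sketch`, the use of scikit-learn, the feature standardization step, the RBF kernel, and scoring by k-fold cross-validation are all assumptions made for illustration, since the abstract does not state the evaluation protocol. The sketch loops over the number of reduced features, the four reduction techniques and the six classifiers, and keeps the combination with the highest mean accuracy, preferring fewer features on ties.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA, KernelPCA, FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def ferco_sketch(X, y, max_components=3, cv=5):
    """Hypothetical helper: search reducer x classifier x component count.

    Returns (accuracy, n_components, reducer_name, classifier_name) for the
    combination with the highest mean cross-validated accuracy.
    """
    # Factories so each reducer can be rebuilt for every component count k.
    reducers = {
        "LPCA": lambda k: PCA(n_components=k),
        "KPCA": lambda k: KernelPCA(n_components=k, kernel="rbf", random_state=0),
        "FA": lambda k: FactorAnalysis(n_components=k, random_state=0),
        "LDA": lambda k: LinearDiscriminantAnalysis(n_components=k),
    }
    classifiers = {
        "LSVM": SVC(kernel="linear"),
        "KSVM": SVC(kernel="rbf"),
        "DT": DecisionTreeClassifier(random_state=0),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "NB": GaussianNB(),
    }
    n_classes = len(np.unique(y))
    best = None
    # k ascends and the comparison is strict, so ties keep the smaller k.
    for k in range(1, max_components + 1):
        for rname, make_reducer in reducers.items():
            # LDA can produce at most (n_classes - 1) components.
            if rname == "LDA" and k > n_classes - 1:
                continue
            for cname, clf in classifiers.items():
                model = make_pipeline(StandardScaler(), make_reducer(k), clf)
                acc = cross_val_score(model, X, y, cv=cv).mean()
                if best is None or acc > best[0]:
                    best = (acc, k, rname, cname)
    return best


if __name__ == "__main__":
    # scikit-learn bundles the Wisconsin Diagnostic Breast Cancer (WDBC) data.
    X, y = load_breast_cancer(return_X_y=True)
    acc, k, rname, cname = ferco_sketch(X, y)
    print(f"best: {rname} + {cname} with {k} component(s), accuracy {acc:.4f}")
```

The numbers this sketch prints depend on the assumed preprocessing and cross-validation settings, so they should not be expected to reproduce the accuracies reported in the abstract.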
