DNA Sequences Classification with Deep Learning: A Survey

Document Type: Original Article

Authors

1 Department of Electronics and Electrical Communications Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf

2 Faculty of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia

3 Department of Computer Science and Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf

Abstract

Deep learning (DL) methods have achieved strong results
on a wide variety of problems in many fields, especially
in the area of big data. With the advance of the big data
era in bioinformatics, DL techniques can classify DNA
sequences accurately and scalably. The strength of DL
methods comes from developments in both hardware and
software: the processing power of graphics processing
units (GPUs) on the hardware side, and new learning and
inference algorithms on the software side, which together
reduce the main difficulties of the training process. In
this work, we start from earlier classification methods,
such as alignment-based methods, and point out the
problems encountered in using them. We then present deep
learning, from artificial neural networks to
hyperparameter tuning, and the most recent
state-of-the-art DL architectures used in DNA
classification. The paper ends with limitations and
suggestions.
