Towards the Conceptual Retrieval of Multimedia Documentary: A Survey

Document Type: Original Article

Authors

Dept. of Computer Science and Engineering, Faculty of Electronic Engineering, Menoufia University, Egypt.

Abstract

Billions of active online users continuously feed the world with multimedia Big Data through their smartphones and PCs. These heterogeneous productions exist on different social media platforms, such as Facebook and Twitter, delivering composite messages in the form of audio, visual, and textual signals. Analyzing multimedia Big Data to understand the intended message has been a challenge for audio, image, video, and text processing researchers. Thanks to recent advances in deep learning algorithms, researchers have been able to improve the performance of multimedia Big Data analytics and understanding techniques. This paper presents a survey on how a multimedia file is analyzed, the key challenges facing multimedia analysis, and how deep learning is helping to overcome and advance beyond those challenges. Future directions of multimedia analysis are also addressed. The aim is to remain objective throughout this study, presenting both the empowering enhancements and the inevitable shortcomings, in the hope of raising fresh questions and stimulating new research frontiers for the reader.

Keywords

