Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

Document Type : Original Article

Authors

1 Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt

2 Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt.

Abstract

Hadoop is an open-source framework written in Java and used for big
data processing. It consists of two main components: the Hadoop
Distributed File System (HDFS) and MapReduce. HDFS stores the
data, while MapReduce distributes an application's tasks across the
cluster and processes them in parallel. Recently, several researchers
have employed Hadoop for processing big data. Their results
indicate that Hadoop performs well with large files (files larger than
the HDFS block size). However, Hadoop performance decreases
with small files, that is, files smaller than the block size. This is
because small files consume the memory of both the DataNode and
the NameNode and increase the execution time of applications (i.e.,
they decrease MapReduce performance). In this paper, the small-files
problem in Hadoop is defined, and the existing approaches to solving
it are classified and discussed. In addition, some open points are
presented that must be considered when designing a better approach
to improving Hadoop performance when processing small files.
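
The memory effect described above can be illustrated with a rough calculation. The sketch below is not taken from the paper: it assumes the commonly quoted rule of thumb that every file and every block is one NameNode object of roughly 150 bytes, and a 128 MiB block size. The function names are hypothetical and the figures are order-of-magnitude estimates only.

```python
import math

# Assumptions (not from the paper): ~150 bytes of NameNode heap per
# object, and a 128 MiB HDFS block size.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_size, block_size=BLOCK_SIZE):
    """One object for the file itself plus one per block it occupies."""
    return 1 + math.ceil(file_size / block_size)


def namenode_bytes(file_sizes, block_size=BLOCK_SIZE):
    """Estimated NameNode heap used by the metadata of the given files."""
    objects = sum(namenode_objects(size, block_size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


# The same 10 GiB of data stored two ways.
total = 10 * 1024**3
large_files = [BLOCK_SIZE] * (total // BLOCK_SIZE)    # 80 files of 128 MiB
small_files = [1024 * 1024] * (total // 1024**2)      # 10240 files of 1 MiB

print(len(large_files), namenode_bytes(large_files))
print(len(small_files), namenode_bytes(small_files))
```

Under these assumptions the same volume of data needs 128 times more NameNode metadata when it is stored as 1 MiB files instead of block-sized files. The number of files grows by the same factor, which is why the count of map tasks, and with it the job execution time, also tends to grow.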

Keywords

