Replica Synchronization in Distributed File System

ABSTRACT – The MapReduce framework provides a scalable, fault-tolerant model for large-scale, data-intensive computing. In this paper, we propose an algorithm that improves the I/O performance of distributed file systems by reducing communication bandwidth. The algorithm uses adaptive replica synchronization among storage servers, each of which keeps a chunk list holding information about the relevant chunk replicas. The proposed algorithm improves the I/O data rate for write-intensive workloads. Experimental results show that the algorithm achieves good I/O performance with less synchronization overhead.
Index terms – Big data, distributed file system, MapReduce, adaptive replica synchronization


A distributed file system is a file system deployed in a distributed environment to improve performance and scalability [1]. It consists of many I/O devices that store the chunks of each data file across the nodes. A client first sends a request to the metadata server (MDS), which manages the whole system, to obtain permission to access the file. The client then accesses the corresponding storage server (SS), which handles data management, to perform the actual operation authorized by the MDS.
The MDS of a distributed file system manages all the information about the chunk replicas, and replica synchronization is triggered whenever any one of the replicas is updated [2]. When data are updated in the file system, the newly written data must be stored on disk, which becomes the bottleneck. To solve this problem, we use adaptive replica synchronization.
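The two-step access path described above (ask the MDS, then perform the I/O on a storage server) can be sketched as follows. This is a minimal in-memory illustration, not the system's actual interface: the class names `MetadataServer` and `StorageServer`, the function `client_read`, and the file name `log.dat` are all hypothetical.

```python
class MetadataServer:
    """Hypothetical MDS: maps (file, chunk index) to the servers holding a replica."""

    def __init__(self):
        self.chunk_locations = {}

    def register(self, file_name, chunk_index, server_ids):
        self.chunk_locations[(file_name, chunk_index)] = list(server_ids)

    def locate(self, file_name, chunk_index):
        # The MDS answers with locations only; it never touches file data.
        return self.chunk_locations[(file_name, chunk_index)]


class StorageServer:
    """Hypothetical SS: stores chunk contents and serves the actual I/O."""

    def __init__(self):
        self.chunks = {}

    def write(self, file_name, chunk_index, data):
        self.chunks[(file_name, chunk_index)] = data

    def read(self, file_name, chunk_index):
        return self.chunks[(file_name, chunk_index)]


def client_read(mds, servers, file_name, chunk_index):
    """Step 1: ask the MDS where the chunk lives. Step 2: read it from an SS."""
    server_id = mds.locate(file_name, chunk_index)[0]
    return servers[server_id].read(file_name, chunk_index)


mds = MetadataServer()
servers = {"ss1": StorageServer(), "ss2": StorageServer()}
for server_id in ("ss1", "ss2"):
    servers[server_id].write("log.dat", 0, b"hello")
mds.register("log.dat", 0, ["ss1", "ss2"])
data = client_read(mds, servers, "log.dat", 0)
```

In this sketch the client always picks the first replica returned by the MDS; a real client could choose any replica, which is exactly why the replicas have to be kept consistent.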

MapReduce is a programming primitive: the programmer supplies a map function that transforms the input set into intermediate output, and that output is sent to the reducer, which produces the final result. Map and reduce functions are written as if they run on a single node, and the MapReduce framework handles their synchronization [3]. The framework takes over the work that distributed programming models otherwise leave to the programmer: data splitting, synchronization and fault tolerance. MapReduce is thus a programming model, with an associated implementation, for processing large data sets with distributed and parallel algorithms on a cluster of nodes.
Hadoop MapReduce is a framework for developing applications that process large data sets, up to multiple terabytes, in parallel on large clusters of thousands of commodity nodes in a highly fault-tolerant and reliable manner. The input and output of a MapReduce job are stored in the Hadoop Distributed File System (HDFS).
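To make the map and reduce roles concrete, the following single-process sketch counts words. It only imitates the programming model; it is not Hadoop code, and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, group by key, then reduce each group."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["big data file", "big file system"])
```

The programmer writes only `map_phase` and `reduce_phase`; in a real framework the splitting of records, the shuffle and the recovery from node failures are handled by the runtime.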


GPFS [4] supports chunk replication by allocating space for multiple copies of the data on different storage servers, and it writes updates to all locations. GPFS keeps track of the files whose chunk replicas have been updated on the primary storage server. Ceph [5] has a similar replica synchronization scheme: newly written data must be sent to all replicas, which are stored on different storage servers, before the system responds to the client. In the Hadoop File System [6], large data are split into chunks that are replicated and stored on storage servers. The copies of any stripe are maintained by the MDS, so replica synchronization is handled by the MDS and is performed when new data are written to the replicas. In GFS [7], there are various chunk servers, and the MDS manages the location and data layout. For reliability, chunks are replicated on multiple chunk servers, and replica synchronization is done in the MDS. The Lustre file system [8], a well-known parallel file system, also has a replication mechanism.
For better performance, MosaStore [9] uses dynamic replication for data reliability. When an application creates a new data block, the MosaStore client stores the block at one of the SSs, and the MDS replicates the new block to the other SSs, which avoids a bottleneck at block creation. Replica synchronization is done in the MDS of MosaStore.
The Gfarm file system [10] uses a replication mechanism for data reliability and availability. In distributed and parallel file systems, the MDS controls data replication and sends the data to the storage servers, which puts pressure on the MDS. Data replication supports better data access where the data are required and provides data consistency. In the parallel file system of [11], data replication improves I/O throughput, data durability and availability. In that mechanism, data access patterns are analysed and replication is carried out according to a cost analysis, but replica synchronization is still done in the MDS.
In the PARTE file system, metadata file parts can be replicated to the storage servers to improve metadata availability and keep the service highly available [12]. In detail, the metadata file parts are divided into chunks that are distributed and replicated on the storage servers, and the file system client keeps a backup of some of the metadata requests it has sent to the server. If the active MDS crashes for any reason, the standby MDS uses these client-side backup requests to restore the metadata lost during the crash.
The adaptive replica synchronization mechanism is used to improve I/O throughput, communication bandwidth and performance in the distributed file system. The MDS manages the information in the distributed file system, in which large data are split into chunks and each chunk has several replicas.
The main reason for using adaptive replica synchronization is that a single storage server cannot sustain a large number of concurrent read requests to a specific replica. Adaptive replica synchronization is therefore triggered to bring the chunk data on the other related SSs up to date in the Hadoop distributed file system [13][5]. It is performed, to satisfy heavy concurrent reads, when the access frequency of the target replica reaches a predefined threshold. The adaptive replica synchronization mechanism among SSs is intended to enhance the performance of the I/O subsystem.

Fig 1: Architecture of replica synchronization mechanism
A. Big data Preparation and Distributed data Storage
Configure the storage servers in the distributed storage environment. The Hadoop distributed file system setup consists of the big data set, the metadata server (MDS), a number of replicas and the storage servers (SSs). Configure the file system from these components with proper communication between them. Then prepare the social network big data set, which consists of the user ID, name, status and updates of each user. After the data set is prepared, store it on the distributed storage servers.
B. Data update in distributed storage
The user communicates with the distributed storage servers to access the big data, and accesses it through a storage server (SS). Based on the user's query, the big data in the distributed storage database are updated, and the updated data are stored on the storage server.
C. Chunk list replication to storage servers
The chunk list holds all the information about the replicas that belong to the same chunk file, and it is stored on the SSs. The primary storage server, which holds the newly updated chunk replica, conducts the adaptive replica synchronization when a large number of read requests arrive concurrently within a short time. The mechanism is used to satisfy these requests with minimum overhead.
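One possible in-memory layout for such a chunk list is sketched below. The per-replica fields mirror the names [Replica Location], [Dirty], [cnt] and [ver] used in the algorithm; the class names, the `mark_updated` and `stale_replicas` helpers, and the chunk identifier `file.dat#0` are assumptions made for this illustration, not part of the original design.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaEntry:
    location: str        # [Replica Location]: storage server holding this replica
    dirty: bool = False  # [Dirty]: True when another replica holds newer data
    ver: int = 0         # [ver]: ID of the last write request applied
    cnt: int = 0         # [cnt]: reads counted since the last write


@dataclass
class ChunkList:
    chunk_id: str
    replicas: dict = field(default_factory=dict)

    def add_replica(self, location):
        self.replicas[location] = ReplicaEntry(location)

    def mark_updated(self, primary, request_id):
        """Record that `primary` holds the newest data; other replicas become stale."""
        for location, entry in self.replicas.items():
            if location == primary:
                entry.ver, entry.dirty, entry.cnt = request_id, False, 1
            else:
                entry.dirty = True

    def stale_replicas(self):
        """Locations that still need to be synchronized with the primary."""
        return sorted(loc for loc, e in self.replicas.items() if e.dirty)


chunk_list = ChunkList("file.dat#0")
for location in ("ss1", "ss2", "ss3"):
    chunk_list.add_replica(location)
chunk_list.mark_updated("ss1", request_id=42)
```

After the call to `mark_updated`, `chunk_list.stale_replicas()` names the replicas that a later synchronization would have to refresh.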
D. Adaptive replica synchronization
The mechanism does not perform replica synchronization at the moment one of the replicas is modified. The proposed adaptive replica synchronization improves the performance of the I/O subsystem by reducing write latency, and it makes replica synchronization more effective: because the target chunk might be written again in the near future, the other replicas need not be updated until adaptive replica synchronization has been triggered by the primary storage server.
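A small worked calculation shows why deferring synchronization can pay off when the same chunk is written repeatedly. It rests on a simplifying assumption that is not from the measurements in this paper: every replica update costs one write, and in the deferred case synchronization is triggered exactly once after all the writes. The function names are hypothetical.

```python
def eager_replica_writes(num_writes, num_replicas):
    """Every client write is propagated to every replica immediately."""
    return num_writes * num_replicas


def deferred_replica_writes(num_writes, num_replicas):
    """Writes land on the primary only; the other replicas are refreshed once."""
    return num_writes + (num_replicas - 1)


eager = eager_replica_writes(num_writes=10, num_replicas=3)        # 10 * 3 = 30
deferred = deferred_replica_writes(num_writes=10, num_replicas=3)  # 10 + 2 = 12
```

Under these assumptions, ten successive writes to a chunk with three replicas cost 30 replica writes when synchronized eagerly and 12 when synchronization is deferred.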
In the distributed file system, adaptive replica synchronization is used to increase performance and reduce communication bandwidth when there is a large number of concurrent read requests. It works as follows. In the first step, the chunk is saved on the storage servers and the chunk list is initialized. In the second step, a write request is sent to one of the replicas, after which the version and count are updated. The other SSs update the corresponding flag in their chunk lists and reply with an ACK to the SS that handled the write. In the next step, read/write requests sent to the other, overdue replicas are answered with the status of the latest replica. Meanwhile, the updated SS handles all requests to the target chunk: the count is incremented on every read operation and the access frequency is computed. In addition, the remaining replica synchronization for updated chunks that are not hot-spot objects after data modification is conducted while the SSs are less busy than in working hours. As a result, better I/O bandwidth can be obtained with minimum synchronization overhead. The proposed algorithm is given below.
ALGORITHM: Adaptive replica synchronization
Precondition and Initialization:
1) MDS handles replica management without synchronization, such as creating a new replica;
2) Initialize [Replica Location], [Dirty], [cnt], and [ver] in Chunk List when the relevant chunk replicas have been created.
while Storage server is active do
    if An access request to the chunk then
        /* Other replica has been updated */
        if [Dirty] == 1 then
            Return the latest Replica Status;
            break;
        end if
        if Write request received then
            [ver] ← I/O request ID;
            Broadcast Update Chunk List Request;
            Conduct write operation;
            if Receiving ACK to Update Request then
                /* Initialize read count */
                [cnt] ← 1;
            else
                /* Revoke content updates */
                Undo the write operation;
                Recover its own Chunk List;
            end if
            break;
        end if
        if Read request received then
            Conduct read operation;
            if [cnt] > 0 then
                [cnt] ← [cnt] + 1;
                Compute [Freq];
                if [Freq] >= Configured Threshold then
                    Issue adaptive replica synchronization;
                end if
            end if
        end if
    else
        if Update Chunk List Request received then
            Update Chunk List and ACK;
            [Dirty] ← 1; break;
        end if
        if Synchronization Request received then
            Conduct replica synchronization;
        end if
    end if
end while
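The algorithm can be approximated by the following single-process sketch, in which each `ReplicaServer` object stands for one storage server holding one replica of a single chunk. Several details are assumptions made for illustration and are not specified by the algorithm: [Freq] is taken to be the read count divided by the seconds elapsed since the last write, time is passed in explicitly as `now`, the ACK is a plain return value, and the read count is reset to zero once synchronization has been issued.

```python
class ReplicaServer:
    """One storage server holding one replica of a single chunk (illustrative only)."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold  # reads per second that trigger synchronization (assumed unit)
        self.peers = []             # other servers holding a replica of the same chunk
        self.data = b""
        self.dirty = False          # [Dirty]: another replica holds newer data
        self.ver = 0                # [ver]: ID of the last write request applied
        self.cnt = 0                # [cnt]: reads since the last local write (0 = nothing to sync)
        self.last_write = 0.0
        self.latest_holder = name   # server known to hold the newest replica

    def handle_write(self, request_id, data, now):
        if self.dirty:              # stale replica: point the client at the latest one
            return ("redirect", self.latest_holder)
        old = (self.data, self.ver)
        self.ver = request_id
        acks = [p.on_update_chunk_list(self.name) for p in self.peers]
        self.data = data
        if all(acks):
            self.cnt, self.last_write = 1, now
            return ("ok", self.name)
        self.data, self.ver = old   # undo the write and restore the chunk list
        return ("failed", self.name)

    def handle_read(self, now):
        if self.dirty:
            return ("redirect", self.latest_holder)
        if self.cnt > 0:
            self.cnt += 1
            elapsed = max(now - self.last_write, 1e-9)
            if self.cnt / elapsed >= self.threshold:  # assumed definition of [Freq]
                for p in self.peers:
                    p.on_synchronize(self.data, self.ver)
                self.cnt = 0        # assumption: stop counting once replicas agree
        return ("ok", self.data)

    def on_update_chunk_list(self, holder):
        self.dirty, self.latest_holder = True, holder
        return True                 # ACK

    def on_synchronize(self, data, ver):
        self.data, self.ver, self.dirty = data, ver, False
        self.latest_holder = self.name


a, b = ReplicaServer("ss1", threshold=3.0), ReplicaServer("ss2", threshold=3.0)
a.peers, b.peers = [b], [a]
write_result = a.handle_write(request_id=7, data=b"v2", now=0.0)
stale_read = b.handle_read(now=0.5)   # ss2 is dirty, so the client is redirected
reads = [a.handle_read(now=t) for t in (1.0, 1.1, 1.2)]
fresh_read = b.handle_read(now=3.0)   # served locally after synchronization
```

In this run the third read on `ss1` pushes the assumed frequency (4 reads in 1.2 s) past the threshold of 3.0, so `ss1` synchronizes `ss2`, which can then serve reads itself.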
When the replica of the target chunk has been modified, the primary SS retransmits the updates to the other relevant replicas. Write latency is the time required for each write. For the proposed adaptive replica synchronization mechanism, write latency is measured while varying the size of the data written.

Fig 2: Write latency
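As a rough illustration of measuring write latency against data size, the sketch below times one flushed write per size. It writes to a local temporary file, which is only a stand-in for a distributed file system client, and it is not the benchmark used for the figure; the function name and the chosen sizes are assumptions.

```python
import os
import tempfile
import time


def measure_write_latency(sizes_in_bytes):
    """Return {size: seconds} for one flushed write of each size to a temporary file."""
    latencies = {}
    for size in sizes_in_bytes:
        payload = os.urandom(size)
        with tempfile.NamedTemporaryFile() as f:
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data to the device before stopping the clock
            latencies[size] = time.perf_counter() - start
    return latencies


latencies = measure_write_latency([4 * 1024, 64 * 1024, 1024 * 1024])
```

The absolute numbers depend entirely on the local disk and operating system cache, so only the trend across sizes is meaningful.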
With adaptive replica synchronization, we measure the read and write bandwidth of the file system as its throughput. We evaluate both the I/O data rate and the processing time of metadata operations.

Fig 3: I/O data throughput
In this paper we have presented an efficient algorithm for processing a large number of concurrent requests in a distributed file system, increasing performance and reducing I/O communication bandwidth. Our approach, adaptive replica synchronization, is applicable to distributed file systems. It enhances performance and improves I/O data bandwidth with less synchronization overhead. Furthermore, its main contribution is improved feasibility, efficiency and applicability compared with other synchronization algorithms. In future work, we can extend the analysis by enhancing the robustness of the chunk list.
[1] E. Dede, Z. Fadika, Madhusudhan, and L. Ramakrishnan, “Benchmarking MapReduce implementations under different application scenarios,” Grid and Cloud Computing Research Laboratory, Department of Computer Science, State University of New York (SUNY) at Binghamton, and Lawrence Berkeley National Laboratory.
[2] N. Nieuwejaar and D. Kotz, “The galley parallel file system,” Parallel Comput., vol. 23, no. 4/5, pp. 447–476, Jun. 1997.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proc. 26th IEEE Symp. MSST, 2010, pp. 1–10.
[4] Message Passing Interface Forum, “MPI: A message-passing interface standard,” 1994.
[5] F. Schmuck and R. Haskin, “GPFS: A shared-disk file system for large computing clusters,” in Proc. Conf. FAST, 2002, pp. 231–244, USENIX Association.
[6] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in Proc. 7th Symp. OSDI, 2006, pp. 307–320, USENIX Association.
[7] W. Tantisiriroj, S. Patil, G. Gibson, S. Son, and S. J. Lang, “On the duality of data-intensive file system design: Reconciling HDFS and PVFS,” in Proc. SC, 2011, p. 67.
[8] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” in Proc. 19th ACM SOSP, 2003, pp. 29–43.
[9] The Lustre file system. [Online]. Available:
[10] E. Vairavanathan, S. AlKiswany, L. Costa, Z. Zhang, D. S. Katz, M. Wilde, and M. Ripeanu, “A workflow-aware storage system: An opportunity study,” in Proc. Int. Symp. CCGrid, Ottawa, ON, Canada, 2012, pp. 326–334.
[12] A. Gharaibeh and M. Ripeanu, “Exploring data reliability tradeoffs in replicated storage systems,” in Proc. HPDC, 2009, pp. 217–226.
[13] J. Liao and Y. Ishikawa, “Partial replication of metadata to achieve high metadata availability in parallel file systems,” in Proc. 41st ICPP, 2012, pp. 168–1.