Google File System. Anyone who has interacted with the virtual world almost certainly knows the name Google. Google is thought to operate more than one million servers distributed around the world. But who would guess that those servers are not sophisticated, latest-release machines, but rather a fleet of ordinary Linux computers that rely on GFS (the Google File System)?
Google uses GFS to organize and manipulate huge files and to give its application developers the resources they need. GFS is used only within Google and has never been released to the public, but we can still learn about its advantages and how it works.
Scalability Demands
Google has very large files that are difficult to manage with a regular file system, so it needs a special file system to ease the work of the company's application developers. Another important concern is scalability, which refers to how easily capacity can be added to the system. A system is said to be scalable if capacity can be added easily.

The performance of the system also must not degrade as it grows. Scalability is non-negotiable for Google, because the company requires a very large network to handle all its files.
Monitoring and maintaining such a large network is no trivial matter. The programmers must automate as much administrative work as possible to keep the servers continuously online. GFS embodies autonomic computing, a concept that allows computers to diagnose problems and solve them in real time without human intervention.
GFS Architecture
Google’s GFS is organized into clusters of computers. Simply put, a cluster is a network of computers. Each cluster can consist of hundreds, even thousands, of machines. A GFS cluster contains three kinds of entities: the client, the master, and the chunkservers.
The client is the entity that requests files. A request can range from retrieving or manipulating an existing file to creating a new one. A client can be another computer or an application.
The master acts as the coordinator of the cluster. One of the master's tasks is to maintain the operation log, a record of the activities performed by the master itself. The operation log helps minimize interruptions: if the master fails at runtime, another server that monitors the operation log can take over directly.
The master also maintains metadata records, information that describes the chunkservers. From the metadata, the master knows which files are held by which chunkservers. At startup, the master polls all chunkservers in the cluster, and the chunkservers respond by reporting their inventory.
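The master's metadata can be pictured as two small in-memory tables: one mapping file names to chunk handles, and one mapping chunk handles to the chunkservers that reported holding a replica. Below is a minimal sketch of that idea; all file paths, handle names, and server names are hypothetical, and the real master keeps far richer state.

```python
# Hypothetical sketch of the master's in-memory metadata tables.
# File namespace: file path -> ordered list of chunk handles.
file_to_chunks = {
    "/logs/web-001": ["chunk-a1", "chunk-a2"],
    "/data/index":   ["chunk-b7"],
}

# Chunk locations: chunk handle -> set of chunkservers holding a replica.
# This table is rebuilt at startup by polling every chunkserver.
chunk_locations = {}

def register_inventory(server, chunks):
    """A chunkserver reports the chunks it holds (at startup or heartbeat)."""
    for handle in chunks:
        chunk_locations.setdefault(handle, set()).add(server)

# At startup the master polls; each chunkserver answers with its inventory.
register_inventory("cs-1", ["chunk-a1", "chunk-b7"])
register_inventory("cs-2", ["chunk-a1", "chunk-a2"])
register_inventory("cs-3", ["chunk-a2", "chunk-b7"])

print(sorted(chunk_locations["chunk-a1"]))  # ['cs-1', 'cs-2']
```

Note that the master never stores chunk data itself, only these small lookup tables, which is what keeps its traffic light.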
There is only one active master per cluster at any time, although each cluster has master backups that keep the same metadata records. Doesn't this create a bottleneck when requests arrive from clients? This is where GFS excels at handling data traffic congestion: the master transmits and receives only very small amounts of data. In fact, the master does not handle file data at all; that work is entrusted to the chunkservers.
Chunkservers are the workhorses of GFS. Their job is to store pieces of data, called chunks, of 64 MB each. Chunkservers do not send chunks through the master; they deliver them directly to the client that requested them. GFS copies each chunk several times and stores the copies on different chunkservers. Each copy is called a replica; by default, GFS makes three replicas of each chunk.
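The 64 MB chunk size and three-way replication can be sketched with a toy model. The placement policy here (random distinct servers) and the server names are assumptions for illustration, not Google's actual placement algorithm.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk
REPLICAS = 3                   # default replication factor

def split_into_chunks(file_size):
    """Number of 64 MB chunks needed to hold a file of file_size bytes."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunkservers, k=REPLICAS):
    """Pick k distinct chunkservers for one chunk (toy placement policy)."""
    return random.sample(chunkservers, k)

servers = ["cs-1", "cs-2", "cs-3", "cs-4", "cs-5"]
print(split_into_chunks(200 * 1024 * 1024))   # a 200 MB file needs 4 chunks
print(len(set(place_replicas(servers))))      # 3 distinct chunkservers
```

Storing replicas on distinct machines is what lets the cluster survive the failure of any single chunkserver without losing data.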
How GFS Works
There is nothing special about a read request. The client sends a request to the master asking where a particular file can be found in the system. The master responds by sending the client the location of a replica.

The master chooses which replica to point the client to based on the chunkserver's address and its proximity to the client. The client then contacts the chunkserver appointed by the master directly.
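The read path described above — ask the master for a location, then fetch the chunk directly from a chunkserver — can be sketched as follows. The classes and names are hypothetical; the real system uses RPC over the network and clients cache chunk locations.

```python
class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.chunks = {}          # chunk handle -> bytes

    def read(self, handle):
        return self.chunks[handle]

class Master:
    def __init__(self):
        self.file_to_chunk = {}   # file path -> chunk handle
        self.locations = {}       # chunk handle -> list of chunkservers

    def lookup(self, path):
        """Return (handle, chosen replica). The master sends only this
        small amount of metadata, never the file data itself."""
        handle = self.file_to_chunk[path]
        return handle, self.locations[handle][0]  # toy "nearest" choice

def client_read(master, path):
    handle, server = master.lookup(path)   # 1. ask the master for a location
    return server.read(handle)             # 2. fetch the data directly

# Wire up a one-chunk example.
cs = Chunkserver("cs-1")
cs.chunks["chunk-a1"] = b"hello gfs"
m = Master()
m.file_to_chunk["/logs/web-001"] = "chunk-a1"
m.locations["chunk-a1"] = [cs]

print(client_read(m, "/logs/web-001"))  # b'hello gfs'
```

The key design point is visible in `client_read`: the master appears only in step 1, so file bytes never flow through it.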
A write request is a bit more complicated. The client still sends a request to the master, which replies with the locations of the primary replica and the secondary replicas. If the primary is no longer available when the data is to be changed, the client must consult the master again before contacting a chunkserver. The client then sends the data changes to all replicas, starting with the nearest replica and ending with the furthest, regardless of whether the nearest replica is the primary or a secondary.
Once the replicas have received the data, the primary replica begins processing the data changes in sequence; such a change is known as a "mutation". Once the primary replica completes a mutation, it sends the same write request to the secondary replicas, which apply the same mutation.
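The write path — push the data to every replica first, then let the primary apply the mutation and forward the same mutation to the secondaries — might be sketched like this. It is a simplified model under stated assumptions: the real system pipelines data along a chain of chunkservers, and the class and variable names here are invented for illustration.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.staged = None   # data pushed to the replica but not yet applied
        self.log = []        # applied mutations, in order

    def push(self, data):
        self.staged = data   # step 1: data flows to every replica

    def apply(self, seq):
        # step 2: apply the staged data as mutation number `seq`
        self.log.append((seq, self.staged))
        self.staged = None

def write(primary, secondaries, data, seq):
    # Push the data to all replicas (nearest first; order here is a toy).
    for r in [primary] + secondaries:
        r.push(data)
    # The primary applies the mutation, then tells the secondaries to
    # apply the same mutation with the same sequence number.
    primary.apply(seq)
    for r in secondaries:
        r.apply(seq)

p, s1, s2 = Replica("primary"), Replica("sec-1"), Replica("sec-2")
write(p, [s1, s2], b"append-1", seq=1)
write(p, [s1, s2], b"append-2", seq=2)
print(p.log == s1.log == s2.log)  # True: every replica applied the
                                  # mutations in the same order
```

Having the primary pick the sequence number is what keeps all three replicas consistent even when many clients write at once.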