In this paper, the Google File System team introduces their system. GFS was designed to meet Google's enormous data processing needs, but it is now used broadly for different purposes. It is necessary because traditional distributed file systems suffer from drawbacks such as an inability to cope with component failure. This paper provides a thorough overview of the GFS design and performance. First, it introduces the architecture and model design of the system. Then, it shows how the system handles interactions among clients, the master, and the chunkservers. The next section analyzes fault-tolerance performance and some bottlenecks in the GFS architecture and implementation. Finally, the team shares their experience developing GFS, the problems they faced, and how they dealt with them.

Some of the strengths of this paper are:
1. GFS has high availability: data remains available even if some nodes in the file system fail. In other words, component failures are treated as the norm rather than the exception.
2. By running many nodes in parallel, GFS delivers high aggregate throughput to many concurrent readers and writers.
3. GFS storage is reliable: when data is corrupted, the corruption can be detected and the data recovered.
4. Early GFS had a workload bottleneck at the master. Later versions alleviated the problem by changing the master's data structures to allow efficient binary searches.

Some of the drawbacks of this paper are:
1. GFS is not optimized for small files (under roughly 100 MB).
2. GFS cannot handle random writes or in-place modification of existing files efficiently because of its append-oriented design.
3. GFS optimizes for a high aggregate data processing rate, not for the latency of any single read or write.
The paper presents the design overview of the Google File System, explains its mechanisms for supporting large distributed data-intensive applications, and reports measurements from both micro-benchmarks and real-world use. As data processing demands keep growing rapidly, distributed file systems require better performance, scalability, reliability, and availability. Meanwhile, observations of application workloads and the technological environment have shifted: component failures are common, files are large, appending dominates overwriting, and co-designing applications with the file system API increases flexibility. The paper's redesigned model can be summarized as follows:

1. Design Overview. The whole design rests on four assumptions: 1) components often fail; 2) stored files are large; 3) streaming reads and sequential writes outnumber random reads and random writes; 4) concurrent appending and high bandwidth are important. The architecture of GFS is a single master with multiple chunkservers, accessed by multiple clients. The master maintains all metadata and system-wide activities, while chunkservers store file chunks, with replicas on other chunkservers. Clients cache metadata, but neither clients nor chunkservers cache file data. The metadata comprises the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas; the first two are also kept in an operation log, so the master can recover its file system state by replaying the log. GFS has a relaxed consistency model, which guarantees atomicity, correctness, definedness, fast recovery, and no undetected data corruption, and GFS applications can accommodate this model.

2. System Interaction. The system is designed to minimize the master's involvement in all operations. 1) Leases are used to maintain a consistent mutation order across replicas while minimizing overhead at the master. 2) The flow of data is decoupled from the flow of control to use the network efficiently: control flows from the client to the primary and then to all secondaries, while data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. 3) GFS provides an atomic append operation called record append. 4) GFS uses standard copy-on-write techniques to implement snapshots.

3. Master Operation. 1) GFS allows multiple operations to be active at once and uses locks over regions of the namespace to ensure proper serialization. 2) GFS manages chunk replicas throughout the system by spreading chunks across machines and racks. 3) The master makes placement decisions, creates new chunks (and hence replicas), and coordinates system-wide activities to keep chunks fully replicated and to balance load across all the chunkservers. 4) GFS reclaims storage through lazy garbage collection, which is simple and reliable, merges storage reclamation into the master's regular background activities, and provides a safety net against accidental, irreversible deletion. 5) GFS maintains chunk version numbers to detect stale replicas.

4. High Availability. GFS stays highly available through fast recovery and replication. It achieves data integrity with checksumming and uses extensive, detailed diagnostic logging to help problem isolation, debugging, and performance analysis.

5. Measurement. The paper presents micro-benchmarks that illustrate bottlenecks inherent in the GFS architecture and implementation, along with some numbers from real clusters in use at Google.
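To make the client/master/chunkserver interaction above concrete, here is a minimal sketch of the read path. It only illustrates the flow the paper describes, not Google's actual client library; `master.lookup` and the chunkserver `read` call are hypothetical stubs.

```python
# Minimal sketch of the GFS read path: metadata from the master,
# file data directly from a chunkserver. RPC stubs are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses a fixed 64 MB chunk size

def gfs_read(filename, offset, length, master, chunkservers):
    """Read `length` bytes of `filename` starting at byte `offset`."""
    parts = []
    while length > 0:
        chunk_index = offset // CHUNK_SIZE   # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE   # position within that chunk
        # Step 1: ask the master for the chunk handle and replica locations.
        # Clients cache this so the master stays out of the data path.
        handle, replicas = master.lookup(filename, chunk_index)
        # Step 2: fetch the bytes directly from a (closest) replica.
        n = min(length, CHUNK_SIZE - chunk_offset)
        parts.append(chunkservers[replicas[0]].read(handle, chunk_offset, n))
        offset += n
        length -= n
    return b"".join(parts)
```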
The paper presents a distributed file system with high availability, high throughput, and reliable storage, which was pioneering in industry at the time. Instead of giving only a high-level idea of how to design such a file system, the paper gives a detailed description of the design and the reasoning behind it, which makes clear to the reader both what the design is and why it should be close to optimal. However, the design also has the following problems: 1. It wastes storage for small files. 2. The single-master structure restricts the scalability of the system. 3. It is not suitable for a large number of random read/write operations.
Problem & Motivations: The engineers want to design a scalable distributed file system for large distributed data-intensive applications that runs on inexpensive commodity hardware and suits Google's workloads. The system differs from traditional designs in four ways: 1. It treats fault tolerance as a first-class concern. 2. It handles files with very large sizes. 3. It targets different file operation patterns (appends and large streaming reads). 4. It gains flexibility by co-designing the API with applications. The authors propose the Google File System, which meets these requirements. Contributions: It proposes the Google File System. GFS contains many clever design details, such as the chunk-based file structure. However, the most important contribution is that it views faults as common rather than exceptional, and by introducing replicas it successfully builds a reliable system on top of inexpensive commodity hardware. Drawback: The paper contains many details yet lacks a sense of the whole. An end-to-end example that walked step by step from an application request through how GFS locates the data and sends it back would have been excellent!
This paper details the Google File System, developed in house to be a “scalable distributed file system for large distributed data-intensive applications.” The motivation for developing this system was the distinctive workloads faced by Google, such as a large amount of sequential reads of very large files for data analysis, as well as other factors unique to the internal Google ecosystem. With that in mind, the Google file system aims to provide performance, scalability, reliability, and availability, much like typical file systems. The paper starts by detailing the main assumptions behind the design of the system. First, components are assumed to have relatively high failure rates (since the system is composed of large numbers of inexpensive commodity items). Second, the file sizes that the system is expected to work with are orders of magnitude larger than traditional file sizes. Third, appending to the end of the file rather than random writes is the norm. Finally, designing the applications and file system API together increases flexibility. The paper continues by describing the key implementation details of the Google file system. Each system comprises a single master and multiple chunkservers, which are potentially accessed by multiple clients. Clients are directed to a particular chunkserver by the master, which is also responsible for maintaining the system metadata. The chunk size was chosen to be 64 MB, much larger than that on typical systems, which brings the advantage of less client interaction with the master, among other advantages. The paper continues the discussion at a similar level of detail about how chunk locations are stored, replication (since it is a distributed system by design), fault tolerance and recovery methods, and ways to ensure data integrity. It then follows with experimental results based on benchmarks run on the Google file system, which are compared with performance data from real clusters in use at the time. The main strength of this paper is that it introduces a system well specialized to Google's unique use case. By identifying the typical workloads, the system can be well tailored to their needs. Also, the way that they essentially assumed a component could fail at any time, treating normal and abnormal terminations as the same and taking steps to verify everything, helps in ensuring high reliability and data availability. In general, the paper was well written and easy to follow. The primary weakness probably goes hand in hand with its greatest strength, in that the system is greatly specialized to a particular type of use case (e.g., mostly file appends rather than random writes). This undoubtedly brings greater gains than a more generalized system, but there is always a risk that usage patterns may change in the future (though that probably will not be the case in this age of big data). Beyond that, the presence of just one master responsible for managing many chunkservers presents a single point of failure, as well as a potential bottleneck that might make it more difficult to scale in the future.
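The 64 MB chunk size deserves a quick back-of-the-envelope check. The numbers below are illustrative, not from the paper, but they show why large chunks shrink the master's metadata and the number of client-master interactions:

```python
# Chunks per 1 TB file under a traditional 64 KB block size vs GFS's 64 MB.
TB = 1024 ** 4

for block in (64 * 1024, 64 * 1024 ** 2):
    print(f"{block // 1024:>6} KB blocks -> {TB // block:>10,} chunks per TB")
# Output:
#     64 KB blocks -> 16,777,216 chunks per TB
#  65536 KB blocks ->     16,384 chunks per TB
```

Three orders of magnitude fewer chunks means correspondingly fewer metadata entries on the master and fewer master lookups for the same data.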
This paper’s purpose is to outline the Google file system (GFS), which boasts scalability and reliability in serving many clients and offering distributed services. It mentions the challenge of handling very large, ever-changing files. This matters because of how widespread Google’s services are in the average person’s daily life, and the system is quite robust. The paper then describes some of the transactions that the file system interface supports. We get to see the architecture: a single master and many chunkservers, which hold all of the data in the system in fragments. We look into chunk size and how it affects query performance, since larger chunks reduce the need for interaction with the master because more of the data can be found on one chunkserver. GFS deals with failures by keeping multiple replicas of each chunk; data is lost only if all replicas of a chunk fail before the master has performed its heartbeat “handshake” with the chunkservers. We also look into mutations, the operations that modify data, and how they are propagated to replica chunkservers whenever data changes, and into how GFS deletes files by renaming them and garbage-collecting after the file has sat under the new name for three days. Data integrity was also covered, as we have previously seen, using checksumming to detect corrupted data. I liked how thorough this paper was; this was one of the few times that the assumptions were explicitly outlined before going into the main contribution of the paper. We got to see what the authors assumed about the resources, performance, and challenges the file system would face, like recovering data from one of the cheap distributed computers, or dealing with a lot of rapidly changing files. The paper was very well organized as well; everything flowed well and was very digestible. I did not like how dense Figure 1 was. Figure 2 was great and very digestible, but Figure 1 was hard for me to understand because of how densely the information was packed into it, even though the text of that section covered it. Perhaps if this visual were broken up into multiple subgraphs and explained incrementally in a bit more detail, I would have received this information better.
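The rename-then-collect deletion scheme mentioned above is easy to sketch. The following is a toy model under assumed names (the real master keeps its namespace in a proper structure, and the three-day grace period is configurable):

```python
# Toy model of GFS-style lazy deletion: deleting renames the file to a
# hidden, timestamped name; a periodic master scan reclaims old entries.
import time

GRACE_PERIOD = 3 * 24 * 3600  # three days, per the paper's default

def delete_file(namespace, path, now=None):
    now = int(now if now is not None else time.time())
    # The file stays readable (and restorable) under its hidden name.
    namespace[f".deleted.{now}.{path}"] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    now = now if now is not None else time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            ts = int(name.split(".")[2])
            if now - ts > GRACE_PERIOD:
                # Chunk storage is reclaimed later, when chunkservers
                # report chunks that no longer belong to any file.
                del namespace[name]
```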
GFS is a file system created by Google to fit its own needs. It leans heavily on distributed-systems ideas, and the whole file system runs on many commodity servers so that it costs less. Its difficulties and assumptions are: 1) single-component failures are very common given the large number of servers; 2) files are huge by traditional standards; 3) there are two kinds of reads, large streaming reads and small random reads; 4) it favors appending data over overwriting; 5) it must handle many simultaneous accesses. The basic structure of GFS is a master-chunkservers architecture accessed by clients. It adopts a relaxed consistency model that is easy to implement yet realizes the desired distributed properties. The master maintains the metadata of files and communicates with the chunkservers mainly through HeartBeat messages. The master keeps three major types of metadata: 1) the file and chunk namespaces; 2) the mapping from files to chunks; 3) the locations of each chunk's replicas. The master is designed to minimize its involvement in reads and writes to avoid becoming a bottleneck. The chunk size is relatively large, which makes space allocation easy, and files are divided into fixed-size chunks for storage. GFS uses a lease mechanism, again designed to minimize the master's involvement in operations. GFS also pushes data in a way that uses each machine's full network bandwidth, which prevents network bottlenecks: data is pushed linearly along a chain of chunkservers rather than distributed in some other topology. Latency is minimized by pipelining the data transfer over TCP connections. GFS provides record append, an atomic operation that supports many clients on different machines appending to the same file concurrently. GFS also uses a smart copy-on-write technique to implement snapshots. The master server handles many tasks: 1) executing all namespace operations; 2) making placement decisions; 3) creating new chunks and replicas; 4) load balancing and reclaiming unused storage; 5) garbage collection (not deleting immediately, which is simpler and more reliable but sometimes constrains things when storage is tight); 6) stale replica detection. GFS achieves high availability by fast recovery and replication of both chunks and the master; notably, a shadow master provides read-only access to the file system in case the primary master is down. Also, each chunkserver uses checksumming to detect corruption of stored data, with little impact on performance, and the diagnostic logs are quite useful while also costing little. The contribution of this paper is that it uses commodity hardware to handle large-scale data processing workloads, which is really amazing, and many of its design ideas bring easy implementation yet good performance. It meets real-world needs for scalability, stability, concurrency, and integration. I think one flaw of GFS is its master, which may become a bottleneck for the system. My idea is to build a small group of masters (3-4), fully connected, to stabilize the master's work and ensure its performance.
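The pipelined push along a chain of chunkservers has a simple cost model that the paper works out: sending B bytes through R replicas takes roughly B/T + R*L, where T is the per-link throughput and L the per-hop latency, because every machine's outbound link runs at full rate. A small sketch with the paper's example numbers:

```python
# The paper's cost model for the pipelined, linear data push.
def push_time(bits, replicas, throughput_bps, hop_latency_s):
    return bits / throughput_bps + replicas * hop_latency_s

# 1 MB pushed through 3 replicas over 100 Mbps links with 1 ms hops:
print(push_time(1e6 * 8, 3, 100e6, 1e-3))  # ~0.083 s, i.e. about 80 ms
```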
The paper presents the Google File System (GFS), a novel distributed file system. The system is built on top of commodity hardware, and takes the approach of assuming that components *will fail* and that the system should be resilient. GFS uses a single-master approach, and stores each “chunk” of data on three chunkservers by default. A novel method, “record append,” is used to write data in an atomic fashion. Through a clever set of techniques involving caching metadata on client machines and circumventing larger network hops, the system manages to be quite efficient despite having many machines and only one master. Overall, this paper presents a system that gives up storage space and latency in order to achieve durability and scalability. The true strength of this paper in my opinion was that it covered almost all of the bases when it came to potential flaws. I constantly found myself marking that I thought something was a potential problem, only to read half a page later that the authors had a clean solution for that problem. One of the first things that I was concerned about was that the master seemed like a single point of failure; then on page 3, I read that the operation log was used to prevent problems in the case of master crashes, and on page 9, I read that the master’s state is indeed replicated. Another similar concern I had was that it might be possible or likely to have problems with multiple replicas, especially with the clear statement that machines are expected to fail. However, the decision to spread replicas of each chunk over multiple racks helps with at least some issues that could take down the data in multiple places. The way that the paper was written gave me confidence that the authors had built a reliable system. The empirical data provided was also useful in this regard. While most of my concerns were not ignored in the paper, I did still have some reservations about the system when I finished reading: 1. The system is highly inefficient when it comes to the number of machines used because of the (configurable) 3x replication factor. The authors briefly discussed using parity or erasure codes, and I’d like to know if this was ever implemented. If not, this might not be a great system for companies looking to save on infrastructure costs, although it does allow for some automation that may save money in the long run. Machines are cheap. 2. The system is clearly built for specific workloads at Google, but might not fit workloads at other companies as well. For instance, there is a clear emphasis put on overall bandwidth as opposed to latency seen by one client; other applications might not have the same priorities. 3. Simply put, this system sounds like a pain to set up. Google probably has some great automation tools, but I have managed HDFS servers on AWS and it wasn’t fun. My concerns aren’t a suggestion not to use GFS; they are merely a list of reasons why it shouldn’t be considered as the only option.
This paper proposed the Google File System, a scalable distributed file system for large distributed data-intensive applications. The motivation for this system is that Google observed a marked departure from original file system design assumptions. More specifically, Google found that component failures are common, files are huge, and most files are mutated by appending new data. A GFS cluster is designed as follows: it consists of a single master and multiple chunkservers. Files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Each chunk is stored and replicated multiple times (by default three) on multiple chunkservers. When a client needs to read or write data, it first communicates with the master to get metadata such as the chunk handle and chunk locations. The master is responsible for storing metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. Some of this information is also kept persistent so that these states can be retrieved when the master restarts. There is also a shadow master that can serve read operations when the primary master is down. Besides the high-level architecture, the paper also explains lots of details, for example, the consistency model, leases, mutation order, the locking scheme, garbage collection, etc. Together, they form a complete description of the Google file system. The main goal of GFS is to build a file system that meets the needs of Google’s workload. The paper presented measurements from their research & development cluster as well as a production data processing cluster, and GFS seems to work extremely well on these workloads. I think besides the specific design of GFS, another lesson we should learn from this paper is that commodity hardware is capable of supporting large-scale data processing workloads under the right design decisions and workload assumptions.
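The kinds of master metadata this review lists can be pictured with a small data model. Field names below are assumptions for illustration; the paper does not publish the actual structures:

```python
# Illustrative model of the GFS master's metadata.
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int          # globally unique 64-bit chunk handle
    version: int         # bumped on each new lease; detects stale replicas
    # Replica locations are NOT persisted; the master rebuilds them by
    # polling chunkservers at startup and via HeartBeat messages.
    locations: list = field(default_factory=list)

@dataclass
class MasterMetadata:
    # The namespace and the file-to-chunk mapping are also written to the
    # operation log, so the master can recover them after a restart.
    file_chunks: dict = field(default_factory=dict)  # path -> [handles]
    chunks: dict = field(default_factory=dict)       # handle -> ChunkInfo
```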
In this paper, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung discuss the implementation approach and considerations behind the Google File System (GFS). Specifically, this paper covers the design overview, system interactions, master operations, fault tolerance and diagnosis, and measurements, both on small data benchmarks and at industry scale. When developing GFS, the design was driven by Google's workloads and technological environment, both current and anticipated. Thus, some traditional choices were abandoned in order to suit their needs. As a result, they developed a scalable distributed file system that runs on inexpensive commodity hardware and serves high aggregate performance to a large number of clients. In one instance, their largest cluster involves hundreds of terabytes of data, across thousands of disks, concurrently accessed by hundreds of clients. Scaling out rather than scaling up enables Google to meet their research and development needs at a lower price. The decision to make this public knowledge is important as it allows any smaller company to gain insight and more room for growth. The authors laid out core assumptions that allowed them to choose the architecture carefully. These assumptions include commodity hardware that fails frequently, storage of multi-GB files, workloads with large streaming reads and small random reads, workloads with large sequential writes that append to files, and a prioritization of high sustained bandwidth over low latency. The chosen architecture consists of a single master and multiple chunkservers that are accessed by clients. The master mediates interactions with these chunks and oversees metadata storage. Interaction between clients and the master is minimized in order to reduce the overhead for the master: the master responds only with the chunk handle and chunk locations, and applications then use the appropriate chunkservers to extract their data. In addition, the master creates and manages chunk replicas to balance load across the chunkservers. This enables fast recovery of data, data integrity, and easy diagnosis of problems (since machines are not to be trusted). Even though there are many great technical contributions, there are drawbacks as well. The first and most obvious drawback is a security concern in their file system: GFS runs in user space, which in practice is arguably not a good thing to do. Using the kernel space would be more appropriate, since the Linux operating system then has a better chance of staving off unwanted viruses and protecting against attacks. Furthermore, having a single master might create a bottleneck in the system under constant random reads or writes to data. I would have appreciated graphs detailing the impact on performance such a workload might have.
This paper outlines the implementation of the Google File System. GFS is built with large workloads in mind; it uses many servers, and is optimized for multiple clients reading from and writing to large files. Files are broken up into fixed-size chunks, which are replicated and stored on a large number of chunkservers. Clients contact a single master server, which redirects them to the appropriate chunkserver. As little work as possible is done on the master server, so that it doesn’t get overloaded by client requests. The chunkservers communicate with each other to synchronize changes, although each chunk replica need not be in exactly the same state after a change. Because replicas don’t need to be identical, each must keep track of its data integrity individually. Replicas of chunks are created whenever needed, or whenever the number of replicas falls below the required threshold. Copies of files can be made cheaply from the user's perspective: by using copy-on-write, the actual copying can be deferred until changes are made to either copy (see the sketch below). GFS is optimized for large data loads, and is built with all of its components being replaceable without a significant negative effect on the system. This allows it to scale up to much larger sizes much more easily than other storage systems. Being built around fault tolerance also helps the system stay consistent. The general usage is mostly similar to other file systems as well, which should help portability. The focus on exclusively large datasets does limit its usage, however. The bulk of the optimization targets serial as opposed to random access, which makes GFS less useful for applications that don't read much data serially.
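The copy-on-write trick can be sketched briefly. Assume, for illustration, that the master keeps a `file_chunks` map and a per-chunk `refcount`; a snapshot then only touches metadata, and a chunk is physically copied the first time either copy writes to it:

```python
# Sketch of copy-on-write snapshots (illustrative structures).

def snapshot(file_chunks, refcount, src, dst):
    file_chunks[dst] = list(file_chunks[src])      # share all chunks
    for handle in file_chunks[src]:
        refcount[handle] = refcount.get(handle, 1) + 1

def chunk_for_write(file_chunks, refcount, path, index, clone_chunk):
    handle = file_chunks[path][index]
    if refcount.get(handle, 1) > 1:                # shared: copy first
        new_handle = clone_chunk(handle)           # hypothetical clone RPC
        refcount[handle] -= 1
        refcount[new_handle] = 1
        file_chunks[path][index] = new_handle
        handle = new_handle
    return handle                                  # now safe to mutate
```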
The Google File System was developed to meet industry demands in scenarios with more machines involved and more clients participating. It is crucial because it successfully solves the problem by building fault tolerance across multiple inexpensive machines and optimizing aggregate performance for clients. GFS differs from traditional file systems in its assumptions. It treats fault tolerance and recovery as routine. It emphasizes the append operation. The chunk size is reconsidered to manage big files efficiently while still supporting small files. The workloads consist of large streaming reads, small random reads, and large sequential writes that append data to files. The system must efficiently support multiple clients concurrently appending to the same file, and high sustained bandwidth is more important than low latency. The architecture of a GFS cluster consists of one master and multiple chunkservers, accessed by multiple clients. Files are divided into several chunks, delivered to and stored on different chunkservers, and accessed by unique chunk handles. For reliability, chunks are replicated across multiple chunkservers. The master manages metadata, which is mainly information about chunks rather than file data. The master and clients exchange only metadata; reads and writes are processed by the various chunkservers, and clients only need to cache metadata. One master, big chunks, metadata, and concurrency control are the crucial characteristics of GFS in my opinion. A single master simplifies operation and global scheduling. Big chunks decrease the number of chunks, which reduces interaction between master and client, relieves metadata storage pressure, and decreases the cost of client transfers. This lets the master focus on managing chunks instead of the heavy work of reading and writing. The concurrency control mainly addresses scenarios where multiple clients concurrently write to the same file; GFS implements record append to realize this multiple-producer, single-consumer pattern and to save synchronization cost compared with traditional file systems. One more thing to mention is the backup strategy of GFS. The data is stored on chunkservers and the metadata on the master, and fault tolerance and recovery must always be considered: data is replicated on chunkservers on different racks, and the master's metadata (its operation log) is replicated on multiple machines. There is no doubt that GFS achieved great success, but there may be some drawbacks. GFS assumes more big files than small ones and therefore uses a bigger chunk size, so too many small files (if any) would greatly decrease the efficiency of GFS. Another thing to mention is that clients cache metadata: if chunkservers crash frequently, the metadata cached by clients will also frequently become invalid.
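Record append's contract is worth spelling out: GFS, not the client, chooses the offset, and the record is appended atomically at least once, so producers simply retry on failure and consumers must tolerate occasional duplicates (for example by embedding record IDs). A minimal sketch of the client side, with `append` as a hypothetical RPC stub:

```python
# Client-side use of record append: retry until the append succeeds;
# the offset comes back from GFS, which serializes concurrent appenders.
def record_append(file_handle, record, append, max_retries=5):
    for _ in range(max_retries):
        ok, offset = append(file_handle, record)  # hypothetical GFS RPC
        if ok:
            return offset
    raise IOError("record append failed after retries")
```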
This paper introduces how the Google File System delivers high aggregate performance to a large number of clients with inexpensive hardware. GFS is essentially a distributed file system that shares goals like performance, scalability, reliability, and availability with other distributed systems. What makes GFS different is that it is application specific. The authors provide a design overview before going into technical details, which is very helpful for readers to understand the whole picture of the distributed system. The overview section provides assumptions that are specific to the application, a high-level explanation of the interface and architecture, how GFS keeps its metadata, and how it deals with consistency. The following sections go into the details of each aspect mentioned in the overview. Essentially, GFS has a single master that maintains all file system metadata. Files are divided into fixed-size 64 MB chunks, and the chunks have multiple replicas spread across racks for recovery. Clients interact with the master to get chunk location information and then send requests to the closest replica for chunk data. The master controls all chunk placement and monitors chunkserver status with HeartBeat messages. There is also an operation log that contains a historical record of metadata changes. The paper explains why this structure was adopted for the application, why a certain chunk size was chosen, why replicas are distributed in a certain way, how the key components interact to keep records consistent, what special features GFS has, how to detect stale replicas, and many more aspects. The last section analyzes read, write, and append times, recovery time, and the workload of real-world clusters. This paper flows very well and provides reasoning for almost all design decisions. It also points out potential problems of the system, like hot spots, and provides possible solutions. One thing that might improve the paper is a summary of Google's unique setting: the paper mentions here and there what types of services clients use the most, and a consolidated summary of the specific applications would be very helpful.
“The Google File System” by Sanjay Ghemawat et al. describes a new file system approach created at Google that aims to support many clients with large reads and writes on an architecture built from (many) commodity machines. Since there is a high likelihood that some of the (many) commodity machines will fail in a given time range, GFS must utilize a few approaches to ensure data is not lost and that the system does not go offline: data replication across multiple chunkservers, replicated master metadata, checksumming for confirming data integrity, and fast recovery. To support high aggregate throughput, GFS uses a master-chunkserver-client architecture where clients communicate read/write requests (but not data) to the master in order to get routed to an appropriate chunkserver. The master is not a bottleneck, as the master does not perform the read/write operations; it only tells clients which chunkserver they should read/write data to/from. The authors position the paper as questioning traditional file system standards and considering whether assumptions in prior research apply to their use case at Google: commodity hardware clusters running large-scale data processing workloads. In addition to their novel architecture and boldness in questioning the standard, I really appreciate that the paper outlines the assumptions (of scenarios to support and not support) that they made when designing their system. This makes it clearer for the reader to understand where and how the approach could generalize, and how the work compares to related work. It is clear that the authors were actively aware of the assumptions during their research process, and actively considering what their research contributions were. It is also nice that the authors evaluate their approach on real-world systems at Google, systems that would be expected to handle large and data-intensive workloads. The resulting file system architecture appears realistic, and the approach promising. The tradeoff of making many assumptions is that the approach likely would not work for a wide range of hardware and data workloads, but there is not always a one-size-fits-all solution. As another critique, I think the related work could have included a table or other easy-to-read summary of the different kinds of file system architectures and workloads, and which architectures are and are not effective for each kind of workload.
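Of the fault-tolerance measures listed above, checksumming is the easiest to illustrate. GFS keeps a 32-bit checksum for each 64 KB block of a chunk and verifies blocks on every read; the sketch below uses CRC-32 as a stand-in, since the paper does not name the exact checksum function:

```python
# Block-level checksum verification on the read path (illustrative).
import zlib

BLOCK = 64 * 1024  # GFS checksums each 64 KB block of a chunk

def checksum_blocks(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, checksums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            # The chunkserver reports the mismatch; the master re-replicates
            # the chunk from a good replica and discards this one.
            raise IOError(f"corrupt block {b}")
    return chunk_data[offset:offset + length]
```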
This paper describes the design and implementation details of the Google File System. It pursues the same goals as other distributed systems, such as performance, scalability, reliability, and availability, along with key observations about Google's application workloads and technological environment. In Section 2, the paper gives its assumptions for GFS. This is important because only by considering the assumptions can the system design be judged correct. For example, "the system is built from inexpensive commodity components that often fail"; that is why they need several replicas and fast recovery. The paper gives the architecture of the system, telling us why they chose a single-master, multi-chunkserver structure and the reasoning behind choosing 64 MB as the chunk size, which is very different from the Linux file system. Also, the paper gives a detailed description of how the client, master, and chunkservers interact with each other to perform different data operations while keeping the system consistent and available, all from a systems and engineering perspective. The paper also describes how GFS achieves availability through fault tolerance and diagnosis. At last, the paper shows its experimental results on a test cluster and real-world clusters. Overall, the paper presents the design of GFS in detail, and the authors fully considered the real-world situations and limitations of building such a large distributed file system. Besides, illustrating the data flow in a flow chart is very demonstrative. However, after reading the paper, I am still confused by one problem: how will files be stored if I upload lots of files much smaller than the chunk size, for example 1K photos of around 1 MB each? If the system just leaves the rest of each chunk empty, it will be a large waste.
This paper is one of the three most famous papers published by Google; the other two are MapReduce and Bigtable. The idea of GFS is a milestone in the area of distributed storage systems and was a big success in the market. The famous open-source Hadoop Distributed File System (HDFS) is designed around many ideas from GFS. It was a great pleasure for me to spend time reading this wonderful paper. With the coming of the Internet era, the volume of data grows at a crazy speed. How to effectively and efficiently manage this data became a question for every internet company, including Google. They needed to build a storage system providing high reliability, availability, scalability, and performance for the rapidly growing demands of Google's data processing needs. How to build such a system and apply business logic on top of it is a significant problem for every internet company, because for an internet company, data is the most important thing. It needs to keep a high availability rate for its websites, guarantee the reliability of the storage so that user data is not lost, and make sure it can handle accesses from many users at the same time. Based on these demands, the Google File System was introduced. GFS uses the master-slave pattern, consisting of a single master and multiple chunkservers, and achieves high performance as well as scalability, reliability, and availability. Next, I will summarize the crux of GFS as I understand it. In GFS, all the servers are commodity machines, and it is very flexible to add or remove chunkservers. Since commodity devices are subject to failure, GFS introduces several mechanisms for reliability, including monitoring, failure detection, fault tolerance, and recovery. GFS supports files of different sizes, and files are divided into fixed-size 64 MB chunks. GFS focuses on workloads with large streaming reads, small random reads, and large sequential (append) writes. For GFS, large bandwidth is required while latency is not a big problem. GFS supports the normal file operations create, delete, open, close, read, and write; new features like snapshot and record append are also introduced. One of the key ideas for maintaining the reliability of GFS is keeping chunk replicas on multiple chunkservers. The master server coordinates the operation of the system, including metadata management, chunk lease management, garbage collection, chunk migration, etc. The master uses periodic HeartBeat messages to control chunkservers and collect their state. However, the master is not involved in reads and writes, and the design minimizes its interaction in all operations; I think this is a good design that keeps the master from becoming a bottleneck. Besides, the design decouples the data flow from the control flow, which makes it easier to schedule the expensive data flow efficiently. The whole system design is driven by observations of workloads and the technological environment at Google, which is also something I learned from this paper: when we try to design or create something new, we should start from real-world practice, identify the requirements clearly, and then apply our knowledge to solve the problem. This is a pioneering paper in the area of distributed file systems, and it made a great contribution to their development. The master-slave mode and the usage of commodity hardware have had a great impact on modern distributed file systems.
In their design, they don’t introduce too many complicated mechanisms; they try to keep the design as simple as possible, which I think is something we still need to follow nowadays. Also, some parts of the design are quite innovative, like the HeartBeat protocol, the snapshot utility, chunk management, etc. Although this paper was presented in 2003, it already includes many important ideas for big data. GFS is a successful product that is still in use (maybe as Colossus) as an important piece of infrastructure for other products at Google. Overall, it is a great paper, and within its assumptions I do not find any major drawbacks. Still, this paper was written 15 years ago; nowadays it would be impossible to rely on a single master, as a single master would definitely become the bottleneck of the system. By the way, GFS is not open source like HDFS; it would be better if Google were willing to open-source it.
This paper introduces a distributed file system developed and used by Google for “large distributed data-intensive applications.” Its main architecture consists of a master node and many non-master “chunkservers” which host “chunks” containing the stored data. Clients access these chunks by asking the master node for the mapping that tells them which chunkserver to request data from; this mapping can be cached by the client to mitigate bottlenecking at the master node. In order to reduce complexity, GFS guarantees a weaker form of consistency (compared to serializability) and optimizes append operations over rewrite operations. In order to guarantee durability, a log file (called the “operation log”) is maintained for when things go wrong. Several specific feature optimizations are also described, like lazy garbage collection and stale replica detection, which allow for even better performance. The chief advantages/benefits of GFS are as follows: 1) It runs on a distributed network of cheap commodity servers. This is its biggest strength in my opinion, as it allows for high performance at a low cost. 2) It is robust under failures, as its distributed protocol and operation log ensure that the consistency guarantees always hold. 3) It is optimized for the environment assumed by the paper, namely data-intensive applications that use append operations far more than rewrites. On the flip side, although GFS is optimized for the environment assumed, that environment IS based on a number of assumptions, which can be treated as weaknesses. For example, big assumptions are made about the workloads, particularly that they will mostly be either large streaming reads, small random reads, or large appending writes. For workloads that do not conform to this assumption, GFS will not perform as well. Also, although the authors addressed this with a short-term fix, small files that occupy a single chunk can become hot spots when many clients try to access the same chunk at once. Higher replication was offered as a short-term solution, but the paper did not mention an algorithm to determine what level of replication is necessary for a given chunk at any given time.
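Stale replica detection, one of the optimizations this review mentions, hinges on a per-chunk version number: the master bumps it whenever it grants a new lease, so any replica that missed a mutation while down reports an old version. A minimal sketch under assumed structures:

```python
# Chunk version numbers for stale-replica detection (illustrative).

def grant_lease(versions, handle):
    versions[handle] += 1          # new version precedes any mutation
    return versions[handle]

def check_replica(versions, handle, reported):
    current = versions[handle]
    if reported < current:
        return "stale"               # schedule replica for garbage collection
    if reported > current:
        versions[handle] = reported  # master failed after granting a lease
        return "adopted newer version"
    return "up to date"
```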
A discussion between Kirk McKusick and Sean Quinlan about the origin and evolution of the Google File System.
During the early stages of development at Google, the initial thinking did not include plans for building a new file system. While work was still being done on one of the earliest versions of the company's crawl and indexing system, however, it became quite clear to the core engineers that they really had no other choice, and GFS (Google File System) was born.
First, given that Google's goal was to build a vast storage network out of inexpensive commodity hardware, it had to be assumed that component failures would be the norm—meaning that constant monitoring, error detection, fault tolerance, and automatic recovery would have to be an integral part of the file system. Also, even by Google's earliest estimates, the system's throughput requirements were going to be daunting by anybody's standards—featuring multi-gigabyte files and data sets containing terabytes of information and millions of objects. Clearly, this meant traditional assumptions about I/O operations and block sizes would have to be revisited. There was also the matter of scalability. This was a file system that would surely need to scale like no other. Of course, back in those earliest days, no one could have possibly imagined just how much scalability would be required. They would learn about that soon enough.
Still, nearly a decade later, most of Google's mind-boggling store of data and its ever-growing array of applications continue to rely upon GFS. Many adjustments have been made to the file system along the way, and—together with a fair number of accommodations implemented within the applications that use GFS—they have made the journey possible.
To explore the reasoning behind a few of the more crucial initial design decisions as well as some of the incremental adaptations that have been made since then, ACM asked Sean Quinlan to pull back the covers on the changing file-system requirements and the evolving thinking at Google. Since Quinlan served as the GFS tech leader for a couple of years and continues now as a principal engineer at Google, he's in a good position to offer that perspective. As a grounding point beyond the Googleplex, ACM asked Kirk McKusick to lead the discussion. He is best known for his work on BSD (Berkeley Software Distribution) Unix, including the original design of the Berkeley FFS (Fast File System).
The discussion starts, appropriately enough, at the beginning—with the unorthodox decision to base the initial GFS implementation on a single-master design. At first blush, the risk of a single centralized master becoming a bandwidth bottleneck—or, worse, a single point of failure—seems fairly obvious, but it turns out Google's engineers had their reasons for making this choice.
MCKUSICK One of the more interesting—and significant—aspects of the original GFS architecture was the decision to base it on a single master. Can you walk us through what led to that decision?
QUINLAN The decision to go with a single master was actually one of the very first decisions, mostly just to simplify the overall design problem. That is, building a distributed master right from the outset was deemed too difficult and would take too much time. Also, by going with the single-master approach, the engineers were able to simplify a lot of problems. Having a central place to control replication and garbage collection and many other activities was definitely simpler than handling it all on a distributed basis. So the decision was made to centralize that in one machine.
MCKUSICK Was this mostly about being able to roll out something within a reasonably short time frame?
QUINLAN Yes. In fact, some of the engineers who were involved in that early effort later went on to build BigTable, a distributed storage system, but that effort took many years. The decision to build the original GFS around the single master really helped get something out into the hands of users much more rapidly than would have otherwise been possible.
Also, in sketching out the use cases they anticipated, it didn't seem the single-master design would cause much of a problem. The scale they were thinking about back then was framed in terms of hundreds of terabytes and a few million files. In fact, the system worked just fine to start with.
MCKUSICK But then what?
QUINLAN Problems started to occur once the size of the underlying storage increased. Going from a few hundred terabytes up to petabytes, and then up to tens of petabytes... that really required a proportionate increase in the amount of metadata the master had to maintain. Also, operations such as scanning the metadata to look for recoveries all scaled linearly with the volume of data. So the amount of work required of the master grew substantially. The amount of storage needed to retain all that information grew as well.
In addition, this proved to be a bottleneck for the clients, even though the clients issue few metadata operations themselves—for example, a client talks to the master whenever it does an open. When you have thousands of clients all talking to the master at the same time, given that the master is capable of doing only a few thousand operations a second, the average client isn't able to command all that many operations per second. Also bear in mind that there are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests, and the master would be under a fair amount of duress.
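The arithmetic behind that duress is simple to check with illustrative numbers (nothing below comes from the interview beyond "a few thousand operations a second"):

```python
# A thousand MapReduce tasks each opening a few files vs a master
# capable of a few thousand metadata operations per second.
master_ops_per_sec = 2_000            # "a few thousand" per Quinlan
tasks, files_per_task = 1_000, 10     # hypothetical job shape
print(tasks * files_per_task / master_ops_per_sec)  # ~5 s of pure opens
```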
MCKUSICK Now, under the current schema for GFS, you have one master per cell, right?
QUINLAN That's correct.
MCKUSICK And historically you've had one cell per data center, right?
QUINLAN That was initially the goal, but it didn't work out like that to a large extent—partly because of the limitations of the single-master design and partly because isolation proved to be difficult. As a consequence, people generally ended up with more than one cell per data center. We also ended up doing what we call a "multi-cell" approach, which basically made it possible to put multiple GFS masters on top of a pool of chunkservers. That way, the chunkservers could be configured to have, say, eight GFS masters assigned to them, and that would give you at least one pool of underlying storage—with multiple master heads on it, if you will. Then the application was responsible for partitioning data across those different cells.
MCKUSICK Presumably each application would then essentially have its own master that would be responsible for managing its own little file system. Was that basically the idea?
QUINLAN Well, yes and no. Applications would tend to use either one master or a small set of the masters. We also have something we called Name Spaces, which are just a very static way of partitioning a namespace that people can use to hide all of this from the actual application. The Logs Processing System offers an example of this approach: once logs exhaust their ability to use just one cell, they move to multiple GFS cells; a namespace file describes how the log data is partitioned across those different cells and basically serves to hide the exact partitioning from the application. But this is all fairly static.
MCKUSICK What's the performance like, in light of all that?
QUINLAN We ended up putting a fair amount of effort into tuning master performance, and it's atypical of Google to put a lot of work into tuning any one particular binary. Generally, our approach is just to get things working reasonably well and then turn our focus to scalability—which usually works well in that you can generally get your performance back by scaling things. Because in this instance we had a single bottleneck that was starting to have an impact on operations, however, we felt that investing a bit of additional effort into making the master lighter weight would be really worthwhile. In the course of scaling from thousands of operations to tens of thousands and beyond, the single master had become somewhat less of a bottleneck. That was a case where paying more attention to the efficiency of that one binary definitely helped keep GFS going for quite a bit longer than would have otherwise been possible.
It could be argued that managing to get GFS ready for production in record time constituted a victory in its own right and that, by speeding Google to market, this ultimately contributed mightily to the company's success. A team of three was responsible for all of that—for the core of GFS—and for the system being readied for deployment in less than a year.
But then came the price that so often befalls any successful system—that is, once the scale and use cases have had time to expand far beyond what anyone could have possibly imagined. In Google's case, those pressures proved to be particularly intense.
Although organizations don't make a habit of exchanging file-system statistics, it's safe to assume that GFS is the largest file system in operation (in fact, that was probably true even before Google's acquisition of YouTube). Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders of magnitude of growth, Google quickly zoomed right past that.
In addition, the number of applications GFS was called upon to support soon ballooned. In an interview with one of the original GFS architects, Howard Gobioff (conducted just prior to his surprising death in early 2008), he recalled, "The original consumer of all our earliest GFS versions was basically this tremendously large crawling and indexing system. The second wave came when our quality team and research groups started using GFS rather aggressively—and basically, they were all looking to use GFS to store large data sets. And then, before long, we had 50 users, all of whom required a little support from time to time so they'd all keep playing nicely with each other."
One thing that helped tremendously was that Google built not only the file system but also all of the applications running on top of it. While adjustments were continually made in GFS to make it more accommodating to all the new use cases, the applications themselves were also developed with the various strengths and weaknesses of GFS in mind. "Because we built everything, we were free to cheat whenever we wanted to," Gobioff neatly summarized. "We could push problems back and forth between the application space and the file-system space, and then work out accommodations between the two."
The matter of sheer scale, however, called for some more substantial adjustments. One coping strategy had to do with the use of multiple "cells" across the network, functioning essentially as related but distinct file systems. Besides helping to deal with the immediate problem of scale, this proved to be a more efficient arrangement for the operations of widely dispersed data centers.
Rapid growth also put pressure on another key parameter of the original GFS design: the choice to establish 64 MB as the standard chunk size. That, of course, was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large. As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files requiring far less than 64 MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
MCKUSICK I gather from the original GFS paper [Ghemawat, S., Gobioff, H., Leung, S-T. 2003. The Google File System. SOSP (ACM Symposium on Operating Systems Principles)] that file counts have been a significant issue for you right along. Can you go into that a little bit?
QUINLAN The file-count issue came up fairly early because of the way people ended up designing their systems around GFS. Let me cite a specific example. Early in my time at Google, I was involved in the design of the Logs Processing system. We initially had a model where a front-end server would write a log, which we would then basically copy into GFS for processing and archival. That was fine to start with, but then the number of front-end servers increased, each rolling logs every day. At the same time, the number of log types was going up, and then you'd have front-end servers that would go through crash loops and generate lots more logs. So we ended up with a lot more files than we had anticipated based on our initial back-of-the-envelope estimates.
This became an area we really had to keep an eye on. Finally, we just had to concede there was no way we were going to survive a continuation of the sort of file-count growth we had been experiencing.
MCKUSICK Let me make sure I'm following this correctly: your issue with file-count growth is a result of your needing to have a piece of metadata on the master for each file, and that metadata has to fit in the master's memory.
MCKUSICK And there are only a finite number of files you can accommodate before the master runs out of memory?
QUINLAN Exactly. And there are two bits of metadata. One identifies the file, and the other points out the chunks that back that file. If you had a chunk that contained only 1 MB, it would take up only 1 MB of disk space, but it still would require those two bits of metadata on the master. If your average file size ends up dipping below 64 MB, the ratio of the number of objects on your master to what you have in storage starts to go down. That's where you run into problems.
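Quinlan's ratio argument can be made concrete with assumed numbers (the per-entry byte count below is a guess for illustration, not a figure from the interview):

```python
# Master memory is proportional to object count, not bytes stored.
PER_ENTRY = 100  # assumed bytes of master metadata per file and per chunk

def master_bytes(num_files, avg_file_mb, chunk_mb=64):
    chunks_per_file = max(1, -(-avg_file_mb // chunk_mb))  # ceil division
    return num_files * PER_ENTRY * (1 + chunks_per_file)

# 100 million files at a 64 MB average vs a 1 MB average:
print(master_bytes(100_000_000, 64) / 2**30)  # ~18.6 GB, ~6.4 PB stored
print(master_bytes(100_000_000, 1) / 2**30)   # ~18.6 GB, only ~100 TB stored
```

Same master footprint, 64 times less data served per byte of metadata; that is the ratio going the wrong way.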
Going back to that logs example, it quickly became apparent that the natural mapping we had thought of—and which seemed to make perfect sense back when we were doing our back-of-the-envelope estimates—turned out not to be acceptable at all. We needed to find a way to work around this by figuring out how we could combine some number of underlying objects into larger files. In the case of the logs, that wasn't exactly rocket science, but it did require a lot of effort.
MCKUSICK That sounds like the old days when IBM had only a minimum disk allocation, so it provided you with a utility that let you pack a bunch of files together and then create a table of contents for that.
QUINLAN Exactly. For us, each application essentially ended up doing that to varying degrees. That proved to be less burdensome for some applications than for others. In the case of our logs, we hadn't really been planning to delete individual log files. It was more likely that we would end up rewriting the logs to anonymize them or do something else along those lines. That way, you don't get the garbage-collection problems that can come up if you delete only some of the files within a bundle.
For some other applications, however, the file-count problem was more acute. Many times, the most natural design for some application just wouldn't fit into GFS—even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem. When we started using more shared cells, we put quotas on both file counts and storage space. The limit that people have ended up running into most has been, by far, the file-count quota. In comparison, the underlying storage quota rarely proves to be a problem.
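The pack-and-index workaround lends itself to a simple illustration. The sketch below packs many small files into one bundle and appends a table of contents so individual members can still be fetched with a seek. The bundle format here is invented for illustration; it is not the scheme any particular Google application used.

    # Minimal sketch of packing small files into one large file with a
    # table of contents, in the spirit of the workaround described above.
    import json

    def pack(bundle_path, files):
        """files: dict of name -> bytes. Writes data, then a JSON TOC."""
        toc = {}
        with open(bundle_path, "wb") as out:
            for name, data in files.items():
                toc[name] = (out.tell(), len(data))   # (offset, length)
                out.write(data)
            toc_bytes = json.dumps(toc).encode()
            out.write(toc_bytes)
            # Fixed-width footer records how long the TOC is.
            out.write(len(toc_bytes).to_bytes(8, "big"))

    def read_member(bundle_path, name):
        with open(bundle_path, "rb") as f:
            f.seek(-8, 2)                             # 2 = os.SEEK_END
            toc_len = int.from_bytes(f.read(8), "big")
            f.seek(-(8 + toc_len), 2)
            toc = json.loads(f.read(toc_len))
            offset, length = toc[name]
            f.seek(offset)
            return f.read(length)

From the master's point of view, a thousand logs stored this way cost the metadata of one file rather than a thousand.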
MCKUSICK What longer-term strategy have you come up with for dealing with the file-count issue? Certainly, it doesn't seem that a distributed master is really going to help with that—not if the master still has to keep all the metadata in memory, that is.
QUINLAN The distributed master certainly allows you to grow file counts, in line with the number of machines you're willing to throw at it. That certainly helps.
One of the appeals of the distributed multimaster model is that if you scale everything up by two orders of magnitude, then getting down to a 1-MB average file size is going to be a lot different from having a 64-MB average file size. If you end up going below 1 MB, then you're also going to run into other issues that you really need to be careful about. For example, if you end up having to read 10,000 10-KB files, you're going to be doing a lot more seeking than if you're just reading 100 1-MB files.
My gut feeling is that if you design for an average 1-MB file size, then that should provide for a much larger class of things than does a design that assumes a 64-MB average file size. Ideally, you would like to imagine a system that goes all the way down to much smaller file sizes, but 1 MB seems a reasonable compromise in our environment.
MCKUSICK What have you been doing to design GFS to work with 1-MB files?
QUINLAN We haven't been doing anything with the existing GFS design. Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters.
MCKUSICK So, essentially no single master would have all this data on it?
QUINLAN That's the idea.
With the recent emergence within Google of BigTable, a distributed storage system for managing structured data, one potential remedy for the file-count problem—albeit perhaps not the very best one—is now available.
The significance of BigTable goes far beyond file counts, however. Specifically, it was designed to scale into the petabyte range across hundreds or thousands of machines, as well as to make it easy to add more machines to the system and automatically start taking advantage of those resources without reconfiguration. For a company predicated on the notion of employing the collective power, potential redundancy, and economies of scale inherent in a massive deployment of commodity hardware, these rate as significant advantages indeed.
Accordingly, BigTable is now used in conjunction with a growing number of Google applications. Although it represents a departure of sorts from the past, it also must be said that BigTable was built on GFS, runs on GFS, and was consciously designed to remain consistent with most GFS principles. Consider it, therefore, as one of the major adaptations made along the way to help keep GFS viable in the face of rapid and widespread change.
MCKUSICK You now have this thing called BigTable. Do you view that as an application in its own right?
QUINLAN From the GFS point of view, it's an application, but it's clearly more of an infrastructure piece.
MCKUSICK If I understand this correctly, BigTable is essentially a lightweight relational database.
QUINLAN It's not really a relational database. I mean, we're not doing SQL and it doesn't really support joins and such. But BigTable is a structured storage system that lets you have lots of key-value pairs and a schema.
MCKUSICK Who are the real clients of BigTable?
QUINLAN BigTable is increasingly being used within Google for crawling and indexing systems, and we use it a lot within many of our client-facing applications. The truth of the matter is that there are tons of BigTable clients. Basically, any app with lots of small data items tends to use BigTable. That's especially true wherever there's fairly structured data.
MCKUSICK I guess the question I'm really trying to pose here is: Did BigTable just get stuck into a lot of these applications as an attempt to deal with the small-file problem, basically by taking a whole bunch of small things and then aggregating them together?
QUINLAN That has certainly been one use case for BigTable, but it was actually intended for a much more general sort of problem. If you're using BigTable in that way—that is, as a way of fighting the file-count problem where you might have otherwise used a file system to handle that—then you would not end up employing all of BigTable's functionality by any means. BigTable isn't really ideal for that purpose in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that's not super-aggressive, so that might not be the most efficient way to use your space. I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy, but there's no question that it is one way for people to handle that problem.
MCKUSICK What I've read about GFS seems to suggest that the idea was to have only two basic data structures: logs and SSTables (Sorted String Tables). Since I'm guessing the SSTables must be used to handle key-value pairs and that sort of thing, how is that different from BigTable?
QUINLAN The main difference is that SSTables are immutable, while BigTable provides mutable key-value storage, and a whole lot more. BigTable itself is actually built on top of logs and SSTables. Initially, it stores incoming data into transaction log files. Then it gets compacted, as we call it, into a series of SSTables, which in turn get compacted together over time. In some respects, it's reminiscent of a log-structured file system. Anyway, as you've observed, logs and SSTables do seem to be the two data structures underlying the way we structure most of our data. We have log files for mutable stuff as it's being recorded. Then, once you have enough of that, you sort it and put it into this structure that has an index.
Even though GFS does not provide a POSIX interface, it still has a pretty generic file-system interface, so people are essentially free to write any sort of data they like. It's just that, over time, the majority of our users have ended up using these two data structures. We also have something called protocol buffers, which is our data description language. The majority of data ends up being protocol buffers in these two structures.
Both provide for compression and checksums. Even though there are some people internally who end up reinventing these things, most people are content just to use those two basic building blocks.
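Quinlan's description suggests a simple mental model of the log-then-SSTable pattern. The toy Python sketch below captures its shape: mutations are buffered and logged, then frozen into immutable sorted tables that get merged over time. Real SSTables are indexed, compressed, and checksummed, none of which is modeled here, and this is an illustration of the idea rather than BigTable's actual code.

    # Toy sketch of the log + SSTable pattern described above.
    class ToyTablet:
        def __init__(self):
            self.log = []          # append-only record of recent mutations
            self.memtable = {}     # mutable view of those mutations
            self.sstables = []     # immutable sorted (key, value) lists

        def put(self, key, value):
            self.log.append((key, value))   # durable first, in a real system
            self.memtable[key] = value

        def compact(self):
            # Freeze the memtable into an immutable sorted table.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable, self.log = {}, []

        def merge_sstables(self):
            # Periodic merge: newer tables win on key collisions.
            merged = {}
            for table in self.sstables:     # oldest to newest
                merged.update(table)
            self.sstables = [sorted(merged.items())]

        def get(self, key, default=None):
            if key in self.memtable:
                return self.memtable[key]
            for table in reversed(self.sstables):    # newest first
                for k, v in table:                   # real SSTables use an index
                    if k == key:
                        return v
            return default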
Because GFS was designed initially to enable a crawling and indexing system, throughput was everything. In fact, the original paper written about the system makes this quite explicit: "High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write."
But then Google either developed or embraced many user-facing Internet services for which this is most definitely not the case.
One GFS shortcoming that this immediately exposed had to do with the original single-master design. A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving. The later addition of automated failover capabilities helped, but even then service could be out for up to a minute.
The other major challenge for GFS, of course, has revolved around finding ways to build latency-sensitive applications on top of a file system designed around an entirely different set of priorities.
MCKUSICK It's well documented that the initial emphasis in designing GFS was on batch efficiency as opposed to low latency. Now that has come back to cause you trouble, particularly in terms of handling things such as videos. How are you handling that?
QUINLAN The GFS design model from the get-go was all about achieving throughput, not about the latency at which that might be achieved. To give you a concrete example, if you're writing a file, it will typically be written in triplicate—meaning you'll actually be writing to three chunkservers. Should one of those chunkservers die or hiccup for a long period of time, the GFS master will notice the problem and schedule what we call a pullchunk, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.
When we do a pullchunk we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you're talking about 10 seconds for this recovery to take place. There are lots of other things like this that might take 10 seconds to a minute, which works just fine for batch-type operations. If you're doing a large MapReduce operation, you're OK just so long as one of the items is not a real straggler, in which case you've got yourself a different sort of problem. Still, generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.
We've had similar issues with our master failover. Initially, GFS had no provision for automatic master failover. It was a manual process. Although it didn't happen a lot, whenever it did, the cell might be down for an hour. Even our initial master-failover implementation required on the order of minutes. Over the past year, however, we've taken that down to something on the order of tens of seconds.
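The pullchunk arithmetic Quinlan cites is easy to check:

    # Back-of-the-envelope recovery latency for a pullchunk, using the
    # figures cited above: a 64-MB chunk re-replicated at 5-10 MB/s.
    chunk_mb = 64
    for rate in (5, 10):    # MB per second
        print(f"{rate} MB/s -> {chunk_mb / rate:.1f} s to restore a replica")
    # Prints 12.8 s and 6.4 s: about 10 seconds, invisible in an hour-long
    # batch job, but a disaster in the path of an interactive mutation.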
MCKUSICK Still, for user-facing applications, that's not acceptable.
QUINLAN Right. While these instances—where you have to provide for failover and error recovery—may have been acceptable in the batch situation, they're definitely not OK from a latency point of view for a user-facing application. Another issue here is that there are places in the design where we've tried to optimize for throughput by dumping thousands of operations into a queue and then just processing through them. That leads to fine throughput, but it's not great for latency. You can easily get into situations where you might be stuck for seconds at a time in a queue just waiting to get to the head of the queue.
Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren't quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.
MCKUSICK How exactly have you managed to deal with that?
QUINLAN Within GFS, we've managed to improve things to a certain degree, mostly by designing the applications to deal with the problems that come up. Take BigTable as a good concrete example. The BigTable transaction log is actually the biggest bottleneck for getting a transaction logged. In effect, we decided, "Well, we're going to see hiccups in these writes, so what we'll do is to have two logs open at any one time. Then we'll just basically merge the two. We'll write to one and if that gets stuck, we'll write to the other. We'll merge those logs once we do a replay—if we need to do a replay, that is." We tended to design our applications to function like that—which is to say they basically try to hide that latency since they know the system underneath isn't really all that great.
The guys who built Gmail went to a multihomed model, so if one instance of your Gmail account got stuck, you would basically just get moved to another data center. Actually, that capability was needed anyway just to ensure availability. Still, part of the motivation was that they wanted to hide the GFS problems.
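The two-logs trick can be sketched in a few lines. In the hypothetical Python below, a writer fails over between two open logs when the active one hiccups, and a replay step merges the two, deduplicating by sequence number. The timeout handling, record format, and names are invented for illustration; this is not BigTable's implementation.

    # Sketch of the "two logs open at any one time" technique.
    import time

    class DualLog:
        def __init__(self, log_a, log_b, stuck_after=1.0):
            self.logs = [log_a, log_b]      # two writable file-like objects
            self.active = 0
            self.stuck_after = stuck_after  # seconds before failing over
            self.seq = 0

        def append(self, record):
            self.seq += 1
            entry = f"{self.seq}\t{record}\n"
            start = time.monotonic()
            try:
                self.logs[self.active].write(entry)
                if time.monotonic() - start > self.stuck_after:
                    raise TimeoutError      # write finished, but too slowly
            except (OSError, TimeoutError):
                # Hiccup on the active log: switch and write there instead.
                # This may leave a duplicate entry; replay dedups by seq.
                self.active ^= 1
                self.logs[self.active].write(entry)

    def replay(log_a_lines, log_b_lines):
        # Merge both logs, keeping one copy of each sequence number.
        seen = {}
        for line in list(log_a_lines) + list(log_b_lines):
            seq, record = line.rstrip("\n").split("\t", 1)
            seen[int(seq)] = record
        return [seen[s] for s in sorted(seen)]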
MCKUSICK I think it's fair to say that, by moving to a distributed-master file system, you're definitely going to be able to attack some of those latency issues.
QUINLAN That was certainly one of our design goals. Also, BigTable itself is a very failure-aware system that tries to respond to failures far more rapidly than we were able to before. Using that as our metadata storage helps with some of those latency issues as well.
The engineers who worked on the earliest versions of GFS weren't particularly shy about departing from traditional choices in file-system design whenever they felt the need to do so. It just so happens that the approach taken to consistency is one of the aspects of the system where this is particularly evident.
Part of this, of course, was driven by necessity. Since Google's plans rested largely on massive deployments of commodity hardware, failures and hardware-related faults were a given. Beyond that, according to the original GFS paper, there were a few compatibility issues. "Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked but occasionally the mismatches would cause the drive and the kernel to disagree about the drive's state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption."
That didn't mean just any checksumming, however, but instead rigorous end-to-end checksumming, with an eye to everything from disk corruption to TCP/IP corruption to machine backplane corruption.
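The flavor of this block-level checksumming can be conveyed with a small sketch: checksum fixed-size blocks on write, verify every block on read. The original GFS paper describes 32-bit checksums over 64-KB blocks, and those parameters are borrowed below, but the code itself is an illustration, not GFS's implementation.

    # Sketch of per-block checksumming in the end-to-end spirit described.
    import zlib

    BLOCK = 64 * 1024    # 64-KB blocks, per the original paper

    def checksum_blocks(data: bytes):
        return [zlib.crc32(data[i:i + BLOCK])
                for i in range(0, len(data), BLOCK)]

    def verified_read(data: bytes, checksums):
        for i, expected in enumerate(checksums):
            block = data[i * BLOCK:(i + 1) * BLOCK]
            if zlib.crc32(block) != expected:
                # In GFS, a mismatch triggers re-replication from a good copy.
                raise IOError(f"corruption detected in block {i}")
        return data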
Interestingly, for all that checksumming vigilance, the GFS engineering team also opted for an approach to consistency that's relatively loose by file-system standards. Basically, GFS simply accepts that there will be times when people will end up reading slightly stale data. Since GFS is used mostly as an append-only system as opposed to an overwriting system, this generally means those people might end up missing something that was appended to the end of the file after they'd already opened it. To the GFS designers, this seemed an acceptable cost (although it turns out that there are applications for which this proves problematic).
Also, as Gobioff explained, "The risk of stale data in certain circumstances is just inherent to a highly distributed architecture that doesn't ask the master to maintain all that much information. We definitely could have made things a lot tighter if we were willing to dump a lot more data into the master and then have it maintain more state. But that just really wasn't all that critical to us."
Perhaps an even more important issue here is that the engineers making this decision owned not just the file system but also the applications intended to run on the file system. According to Gobioff, "The thing is that we controlled both the horizontal and the vertical—the file system and the application. So we could be sure our applications would know what to expect from the file system. And we just decided to push some of the complexity out to the applications to let them deal with it."
Still, there are some at Google who wonder whether that was the right call if only because people can sometimes obtain different data in the course of reading a given file multiple times, which tends to be so strongly at odds with their whole notion of how data storage is supposed to work.
MCKUSICK Let's talk about consistency. The issue seems to be that it presumably takes some amount of time to get everything fully written to all the replicas. I think you said something earlier to the effect that GFS essentially requires that this all be fully written before you can continue. If that's the case, then how can you possibly end up with things that aren't consistent?
QUINLAN Client failures have a way of fouling things up. Basically, the model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state.
Early on, that was sort of considered to be OK, but over time, we tightened the window for how long that inconsistency could be tolerated, and then we slowly continued to reduce that. Otherwise, whenever the data is in that inconsistent state, you may get different lengths for the file. That can lead to some confusion. We had to have some backdoor interfaces for checking the consistency of the file data in those instances. We also have something called RecordAppend, which is an interface designed for multiple writers to append to a log concurrently. There the consistency was designed to be very loose. In retrospect, that turned out to be a lot more painful than anyone expected.
MCKUSICK What exactly was loose? If the primary replica picks what the offset is for each write and then makes sure that actually occurs, I don't see where the inconsistencies are going to come up.
QUINLAN What happens is that the primary will try. It will pick an offset, it will do the writes, but then one of them won't actually get written. Then the primary might change, at which point it can pick a different offset. RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in the file.
There were even situations where you could get the data in a different order. It might appear multiple times in one chunk replica, but not necessarily in all of them. If you were reading the file, you could discover the data in different ways at different times. At the record level, you could discover the records in different orders depending on which chunks you happened to be reading.
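The at-least-once behavior Quinlan describes implies a particular discipline for readers. The sketch below shows the usual coping pattern: write a unique ID with each record, retry the append until it succeeds, and deduplicate on the read path. The helper names here are hypothetical, but the pattern matches the semantics described above.

    # Sketch of living with RecordAppend's loose consistency.
    import uuid

    def append_with_retry(append_once, payload, max_tries=5):
        record = (uuid.uuid4().hex, payload)   # ID is stable across retries
        for _ in range(max_tries):
            try:
                append_once(record)    # may succeed yet report failure,
                return record          # which is how duplicates arise
            except IOError:
                continue               # retry -> possible duplicate record
        raise IOError("append failed")

    def read_records(records):
        seen = set()
        for record_id, payload in records:
            if record_id in seen:
                continue               # skip duplicate appends
            seen.add(record_id)
            yield payload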
MCKUSICK Was this done by design?
QUINLAN At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
MCKUSICK In retrospect, how would you handle this differently?
QUINLAN I think it makes more sense to have a single writer per file.
MCKUSICK All right, but what happens when you have multiple people wanting to append to a log?
QUINLAN You serialize the writes through a single process that can ensure the replicas are consistent.
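That fix is easy to picture: many producers funnel their appends through one writer, which alone touches the file, so every replica sees the same order. A minimal sketch, with a thread and a queue standing in for a dedicated process:

    # Sketch of serializing appends through a single writer.
    import queue
    import threading

    def start_serialized_appender(write_record):
        """Returns (append, shutdown). All writes happen on one thread."""
        q = queue.Queue()

        def run():
            while True:
                record = q.get()
                if record is None:          # shutdown sentinel
                    break
                write_record(record)        # the only code path that writes

        worker = threading.Thread(target=run, daemon=True)
        worker.start()

        def shutdown():
            q.put(None)
            worker.join()

        return q.put, shutdown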
MCKUSICK There's also this business where you essentially snapshot a chunk. Presumably, that's something you use when you're essentially replacing a replica, or whenever some chunkserver goes down and you need to replace some of its files.
QUINLAN Actually, two things are going on there. One, as you suggest, is the recovery mechanism, which definitely involves copying around replicas of the file. The way that works in GFS is that we basically revoke the lease so that the client can't write it anymore, and this is part of that latency issue we were talking about.
There's also a separate issue, which is to support the snapshot feature of GFS. GFS has the most general-purpose snapshot capability you can imagine. You could snapshot any directory somewhere, and then both copies would be entirely equivalent. They would share the unchanged data. You could change either one and you could further snapshot either one. So it was really more of a clone than what most people think of as a snapshot. It's an interesting thing, but it makes for difficulties—especially as you try to build more distributed systems and you want potentially to snapshot larger chunks of the file tree.
I also think it's interesting that the snapshot feature hasn't been used more since it's actually a very powerful feature. That is, from a file-system point of view, it really offers a pretty nice piece of functionality. But putting snapshots into file systems, as I'm sure you know, is a real pain.
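The clone-like semantics Quinlan describes map naturally onto reference-counted, copy-on-write chunks: a snapshot shares every chunk with the original until one copy is modified. The sketch below is a toy model of that idea, not GFS's implementation; garbage collection of unreferenced chunks is omitted.

    # Toy copy-on-write model of GFS-style snapshots (clones).
    import collections

    class CowStore:
        def __init__(self):
            self.chunks = {}                      # chunk_id -> bytes
            self.refs = collections.Counter()     # chunk_id -> reference count
            self.files = {}                       # path -> list of chunk_ids
            self._next = 0

        def _new_chunk(self, data):
            cid = self._next
            self._next += 1
            self.chunks[cid] = data
            self.refs[cid] = 1
            return cid

        def write(self, path, index, data):
            ids = self.files.setdefault(path, [])
            while len(ids) <= index:
                ids.append(self._new_chunk(b""))
            cid = ids[index]
            if self.refs[cid] > 1:                # shared with a snapshot:
                self.refs[cid] -= 1               # copy on write
                ids[index] = self._new_chunk(data)
            else:
                self.chunks[cid] = data           # exclusive: write in place

        def snapshot(self, src, dst):
            # Both copies are fully equivalent and share unchanged chunks;
            # either can be written or snapshotted again.
            self.files[dst] = list(self.files[src])
            for cid in self.files[dst]:
                self.refs[cid] += 1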
MCKUSICK I know. I've done it. It's excruciating—especially in an overwriting file system.
QUINLAN Exactly. This is a case where we didn't cheat, but from an implementation perspective, it's hard to create true snapshots. Still, it seems that in this case, going the full deal was the right decision. Just the same, it's an interesting contrast to some of the other decisions that were made early on in terms of the semantics.
All in all, the report card on GFS nearly 10 years later seems positive. There have been problems and shortcomings, to be sure, but there's surely no arguing with Google's success and GFS has without a doubt played an important role in that. What's more, its staying power has been nothing short of remarkable given that Google's operations have scaled orders of magnitude beyond anything the system had been designed to handle, while the application mix Google currently supports is not one that anyone could have possibly imagined back in the late '90s.
Still, there's no question that GFS faces many challenges now. For one thing, the awkwardness of supporting an ever-growing fleet of user-facing, latency-sensitive applications on top of a system initially designed for batch-system throughput is something that's obvious to all.
The advent of BigTable has helped somewhat in this regard. As it turns out, however, BigTable isn't actually all that great a fit for GFS. In fact, it just makes the bottleneck limitations of the system's single-master design more apparent than would otherwise be the case.
For these and other reasons, engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS.
Accordingly, it now seems that beyond all the adjustments made to ensure the continued survival of GFS, the newest branch on the evolutionary tree will continue to grow in significance over the years to come.
© 2009 ACM 1542-7730/09/0800 $10.00