In this paper, the Google File System team introduces their system. GFS was designed to meet Google's enormous data processing needs, but it is now used broadly for different purposes. It is necessary because traditional distributed file systems suffer from drawbacks such as an inability to cope with component failure. This paper provides a thorough overview of the GFS design and performance. First, it introduces the architecture and model design of the system. Then, it shows how the system handles interactions among clients, the master, and the chunkservers. The next section analyzes fault-tolerance performance and some bottlenecks in the GFS architecture and implementation. Finally, the team shares their experience developing GFS, the problems they faced, and how they dealt with them.

Some of the strengths of this paper are:
1. GFS has high availability: data remains available even if some nodes in the file system fail. In other words, component failures are treated as the norm rather than the exception.
2. By running many nodes in parallel, GFS delivers high aggregate throughput to many concurrent readers and writers.
3. GFS storage is reliable: when data is corrupted, the corruption can be detected and the data recovered.
4. Early GFS had a workload bottleneck at the master. Later versions alleviated the problem by changing the master's data structures to allow efficient binary searches.

Some of the drawbacks of this paper are:
1. GFS is not optimized for small files (under roughly 100 MB).
2. GFS cannot handle random writes or in-place modification of existing files efficiently because of its append-oriented design.
3. GFS optimizes for a high aggregate data processing rate, not for the latency of any single read or write.
The paper presents the design overview of the Google File System, explains its mechanisms for supporting large distributed data-intensive applications, and reports measurements from both micro-benchmarks and real-world use. As data processing demands keep growing rapidly, distributed file systems require better performance, scalability, reliability, and availability. Meanwhile, observations of application workloads and the technological environment have shifted: component failures are common, files are large, appending dominates overwriting, and co-designing applications with the file system API increases flexibility. The paper's redesigned model can be summarized as follows:

1. Design Overview. The whole design rests on four assumptions: 1) components often fail; 2) stored files are large; 3) streaming reads and sequential writes outnumber random reads and random writes; 4) concurrent appending and high bandwidth are important. The architecture of GFS is a single master with multiple chunkservers, accessed by multiple clients. The master maintains all metadata and system-wide activities, while chunkservers store file chunks, with replicas on other chunkservers. Clients cache metadata, but neither clients nor chunkservers cache file data. The metadata comprises the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas; the first two are also kept in an operation log, so the master can recover its file system state by replaying the log. GFS has a relaxed consistency model, which guarantees atomicity, correctness, definedness, fast recovery, and no undetected data corruption, and GFS applications can accommodate this model.

2. System Interaction. The system is designed to minimize the master's involvement in all operations. 1) Leases are used to maintain a consistent mutation order across replicas while minimizing overhead at the master. 2) The flow of data is decoupled from the flow of control to use the network efficiently: control flows from the client to the primary and then to all secondaries, while data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. 3) GFS provides an atomic append operation called record append. 4) GFS uses standard copy-on-write techniques to implement snapshots.

3. Master Operation. 1) GFS allows multiple operations to be active at once and uses locks over regions of the namespace to ensure proper serialization. 2) GFS manages chunk replicas throughout the system by spreading chunks across machines and racks. 3) The master makes placement decisions, creates new chunks (and hence replicas), and coordinates system-wide activities to keep chunks fully replicated and to balance load across all the chunkservers. 4) GFS reclaims storage through lazy garbage collection, which is simple and reliable, merges storage reclamation into the master's regular background activities, and provides a safety net against accidental, irreversible deletion. 5) GFS maintains chunk version numbers to detect stale replicas.

4. High Availability. GFS stays highly available through fast recovery and replication. It achieves data integrity with checksumming and uses extensive, detailed diagnostic logging to help problem isolation, debugging, and performance analysis.

5. Measurement. The paper presents micro-benchmarks that illustrate bottlenecks inherent in the GFS architecture and implementation, along with some numbers from real clusters in use at Google.
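To make the client/master/chunkserver interaction above concrete, here is a minimal sketch of the read path. It only illustrates the flow the paper describes, not Google's actual client library; `master.lookup` and the chunkserver `read` call are hypothetical stubs.

```python
# Minimal sketch of the GFS read path: metadata from the master,
# file data directly from a chunkserver. RPC stubs are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses a fixed 64 MB chunk size

def gfs_read(filename, offset, length, master, chunkservers):
    """Read `length` bytes of `filename` starting at byte `offset`."""
    parts = []
    while length > 0:
        chunk_index = offset // CHUNK_SIZE   # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE   # position within that chunk
        # Step 1: ask the master for the chunk handle and replica locations.
        # Clients cache this so the master stays out of the data path.
        handle, replicas = master.lookup(filename, chunk_index)
        # Step 2: fetch the bytes directly from a (closest) replica.
        n = min(length, CHUNK_SIZE - chunk_offset)
        parts.append(chunkservers[replicas[0]].read(handle, chunk_offset, n))
        offset += n
        length -= n
    return b"".join(parts)
```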
The paper presents a distributed file system with high availability, high throughput, and reliable storage, which was pioneering in industry at the time. Instead of giving only a high-level idea of how to design such a file system, the paper gives a detailed description of the design and the reasoning behind it, which makes clear to the reader both what the design is and why it should be close to optimal. However, the design also has the following problems: 1. It wastes storage for small files. 2. The single-master structure restricts the scalability of the system. 3. It is not suitable for a large number of random read/write operations.
Problem & Motivations: The engineers want to design a scalable distributed file system for large distributed data-intensive applications that runs on inexpensive commodity hardware and suits Google's workloads. The system differs from traditional designs in four ways: 1. It treats fault tolerance as a first-class concern. 2. It handles files with very large sizes. 3. It targets different file operation patterns (appends and large streaming reads). 4. It gains flexibility by co-designing the API with applications. The authors propose the Google File System, which meets these requirements. Contributions: It proposes the Google File System. GFS contains many clever design details, such as the chunk-based file structure. However, the most important contribution is that it views faults as common rather than exceptional, and by introducing replicas it successfully builds a reliable system on top of inexpensive commodity hardware. Drawback: The paper contains many details yet lacks a sense of the whole. An end-to-end example that walked step by step from an application request through how GFS locates the data and sends it back would have been excellent!
This paper details the Google File System, developed in house to be a “scalable distributed file system for large distributed data-intensive applications.” The motivation for developing this system was the distinctive workloads faced by Google, such as a large amount of sequential reads of very large files for data analysis, as well as other factors unique to the internal Google ecosystem. With that in mind, the Google file system aims to provide performance, scalability, reliability, and availability, much like typical file systems. The paper starts by detailing the main assumptions behind the design of the system. First, components are assumed to have relatively high failure rates (since the system is composed of large numbers of inexpensive commodity items). Second, the file sizes that the system is expected to work with are orders of magnitude larger than traditional file sizes. Third, appending to the end of the file rather than random writes is the norm. Finally, designing the applications and file system API together increases flexibility. The paper continues by describing the key implementation details of the Google file system. Each system comprises a single master and multiple chunkservers, which are potentially accessed by multiple clients. Clients are directed to a particular chunkserver by the master, which is also responsible for maintaining the system metadata. The chunk size was chosen to be 64 MB, much larger than that on typical systems, which brings the advantage of less client interaction with the master, among other advantages. The paper continues the discussion at a similar level of detail about how chunk locations are stored, replication (since it is a distributed system by design), fault tolerance and recovery methods, and ways to ensure data integrity. It then follows with experimental results based on benchmarks run on the Google file system, which are compared with performance data from real clusters in use at the time. The main strength of this paper is that it introduces a system well specialized to Google's unique use case. By identifying the typical workloads, the system can be well tailored to their needs. Also, the way that they essentially assumed a component could fail at any time, treating normal and abnormal terminations as the same and taking steps to verify everything, helps in ensuring high reliability and data availability. In general, the paper was well written and easy to follow. The primary weakness probably goes hand in hand with its greatest strength, in that the system is greatly specialized to a particular type of use case (e.g., mostly file appends rather than random writes). This undoubtedly brings greater gains than a more generalized system, but there is always a risk that usage patterns may change in the future (though that probably will not be the case in this age of big data). Beyond that, the presence of just one master responsible for managing many chunkservers presents a single point of failure, as well as a potential bottleneck that might make it more difficult to scale in the future.
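The 64 MB chunk size deserves a quick back-of-the-envelope check. The numbers below are illustrative, not from the paper, but they show why large chunks shrink the master's metadata and the number of client-master interactions:

```python
# Chunks per 1 TB file under a traditional 64 KB block size vs GFS's 64 MB.
TB = 1024 ** 4

for block in (64 * 1024, 64 * 1024 ** 2):
    print(f"{block // 1024:>6} KB blocks -> {TB // block:>10,} chunks per TB")
# Output:
#     64 KB blocks -> 16,777,216 chunks per TB
#  65536 KB blocks ->     16,384 chunks per TB
```

Three orders of magnitude fewer chunks means correspondingly fewer metadata entries on the master and fewer master lookups for the same data.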
This paper’s purpose is to outline the Google file system (GFS), which boasts scalability and reliability in serving many clients and offering distributed services. It mentions the challenge of handling very large, ever-changing files. This matters because of how widespread Google’s services are in the average person’s daily life, and the system is quite robust. The paper then describes some of the transactions that the file system interface supports. We get to see the architecture: a single master and many chunkservers, which hold all of the data in the system in fragments. We look into chunk size and how it affects query performance, since larger chunks reduce the need for interaction with the master because more of the data can be found on one chunkserver. GFS deals with failures by keeping multiple replicas of each chunk; data is lost only if all replicas of a chunk fail before the master has performed its heartbeat “handshake” with the chunkservers. We also look into mutations, the operations that modify data, and how they are propagated to replica chunkservers whenever data changes, and into how GFS deletes files by renaming them and garbage-collecting after the file has sat under the new name for three days. Data integrity was also covered, as we have previously seen, using checksumming to detect corrupted data. I liked how thorough this paper was; this was one of the few times that the assumptions were explicitly outlined before going into the main contribution of the paper. We got to see what the authors assumed about the resources, performance, and challenges the file system would face, like recovering data from one of the cheap distributed computers, or dealing with a lot of rapidly changing files. The paper was very well organized as well; everything flowed well and was very digestible. I did not like how dense Figure 1 was. Figure 2 was great and very digestible, but Figure 1 was hard for me to understand because of how densely the information was packed into it, even though the text of that section covered it. Perhaps if this visual were broken up into multiple subgraphs and explained incrementally in a bit more detail, I would have received this information better.
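The rename-then-collect deletion scheme mentioned above is easy to sketch. The following is a toy model under assumed names (the real master keeps its namespace in a proper structure, and the three-day grace period is configurable):

```python
# Toy model of GFS-style lazy deletion: deleting renames the file to a
# hidden, timestamped name; a periodic master scan reclaims old entries.
import time

GRACE_PERIOD = 3 * 24 * 3600  # three days, per the paper's default

def delete_file(namespace, path, now=None):
    now = int(now if now is not None else time.time())
    # The file stays readable (and restorable) under its hidden name.
    namespace[f".deleted.{now}.{path}"] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    now = now if now is not None else time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            ts = int(name.split(".")[2])
            if now - ts > GRACE_PERIOD:
                # Chunk storage is reclaimed later, when chunkservers
                # report chunks that no longer belong to any file.
                del namespace[name]
```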
GFS is a file system created by Google to fit its own needs. It leans heavily on distributed-systems ideas, and the whole file system runs on many commodity servers so that it costs less. Its difficulties and assumptions are: 1) single-component failures are very common given the large number of servers; 2) files are huge by traditional standards; 3) there are two kinds of reads, large streaming reads and small random reads; 4) it favors appending data over overwriting; 5) it must handle many simultaneous accesses. The basic structure of GFS is a master-chunkservers architecture accessed by clients. It adopts a relaxed consistency model that is easy to implement yet realizes the desired distributed properties. The master maintains the metadata of files and communicates with the chunkservers mainly through HeartBeat messages. The master keeps three major types of metadata: 1) the file and chunk namespaces; 2) the mapping from files to chunks; 3) the locations of each chunk's replicas. The master is designed to minimize its involvement in reads and writes to avoid becoming a bottleneck. The chunk size is relatively large, which makes space allocation easy, and files are divided into fixed-size chunks for storage. GFS uses a lease mechanism, again designed to minimize the master's involvement in operations. GFS also pushes data in a way that uses each machine's full network bandwidth, which prevents network bottlenecks: data is pushed linearly along a chain of chunkservers rather than distributed in some other topology. Latency is minimized by pipelining the data transfer over TCP connections. GFS provides record append, an atomic operation that supports many clients on different machines appending to the same file concurrently. GFS also uses a smart copy-on-write technique to implement snapshots. The master server handles many tasks: 1) executing all namespace operations; 2) making placement decisions; 3) creating new chunks and replicas; 4) load balancing and reclaiming unused storage; 5) garbage collection (not deleting immediately, which is simpler and more reliable but sometimes constrains things when storage is tight); 6) stale replica detection. GFS achieves high availability by fast recovery and replication of both chunks and the master; notably, a shadow master provides read-only access to the file system in case the primary master is down. Also, each chunkserver uses checksumming to detect corruption of stored data, with little impact on performance, and the diagnostic logs are quite useful while also costing little. The contribution of this paper is that it uses commodity hardware to handle large-scale data processing workloads, which is really amazing, and many of its design ideas bring easy implementation yet good performance. It meets real-world needs for scalability, stability, concurrency, and integration. I think one flaw of GFS is its master, which may become a bottleneck for the system. My idea is to build a small group of masters (3-4), fully connected, to stabilize the master's work and ensure its performance.
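The pipelined push along a chain of chunkservers has a simple cost model that the paper works out: sending B bytes through R replicas takes roughly B/T + R*L, where T is the per-link throughput and L the per-hop latency, because every machine's outbound link runs at full rate. A small sketch with the paper's example numbers:

```python
# The paper's cost model for the pipelined, linear data push.
def push_time(bits, replicas, throughput_bps, hop_latency_s):
    return bits / throughput_bps + replicas * hop_latency_s

# 1 MB pushed through 3 replicas over 100 Mbps links with 1 ms hops:
print(push_time(1e6 * 8, 3, 100e6, 1e-3))  # ~0.083 s, i.e. about 80 ms
```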
The paper presents the Google File System (GFS), a novel distributed file system. The system is built on top of commodity hardware, and takes the approach of assuming that components *will fail* and that the system should be resilient. GFS uses a single-master approach, and stores each “chunk” of data on three chunkservers by default. A novel method, “record append,” is used to write data in an atomic fashion. Through a clever set of techniques involving caching metadata on client machines and circumventing larger network hops, the system manages to be quite efficient despite having many machines and only one master. Overall, this paper presents a system that gives up storage space and latency in order to achieve durability and scalability. The true strength of this paper in my opinion was that it covered almost all of the bases when it came to potential flaws. I constantly found myself marking that I thought something was a potential problem, only to read half a page later that the authors had a clean solution for that problem. One of the first things that I was concerned about was that the master seemed like a single point of failure; then on page 3, I read that the operation log was used to prevent problems in the case of master crashes, and on page 9, I read that the master’s state is indeed replicated. Another similar concern I had was that it might be possible or likely to have problems with multiple replicas, especially with the clear statement that machines are expected to fail. However, the decision to spread replicas of each chunk over multiple racks helps with at least some issues that could take down the data in multiple places. The way that the paper was written gave me confidence that the authors had built a reliable system. The empirical data provided was also useful in this regard. While most of my concerns were not ignored in the paper, I did still have some reservations about the system when I finished reading: 1. The system is highly inefficient when it comes to the number of machines used because of the (configurable) 3x replication factor. The authors briefly discussed using parity or erasure codes, and I’d like to know if this was ever implemented. If not, this might not be a great system for companies looking to save on infrastructure costs, although it does allow for some automation that may save money in the long run. Machines are cheap. 2. The system is clearly built for specific workloads at Google, but might not fit workloads at other companies as well. For instance, there is a clear emphasis put on overall bandwidth as opposed to latency seen by one client; other applications might not have the same priorities. 3. Simply put, this system sounds like a pain to set up. Google probably has some great automation tools, but I have managed HDFS servers on AWS and it wasn’t fun. My concerns aren’t a suggestion not to use GFS; they are merely a list of reasons why it shouldn’t be considered as the only option.
This paper proposed the Google File System, a scalable distributed file system for large distributed data-intensive applications. The motivation for this system is that Google observed a marked departure from original file system design assumptions. More specifically, Google found that component failures are common, files are huge, and most files are mutated by appending new data. A GFS cluster is designed as follows: it consists of a single master and multiple chunkservers. Files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Each chunk is stored and replicated multiple times (by default three) on multiple chunkservers. When a client needs to read or write data, it first communicates with the master to get metadata such as the chunk handle and chunk locations. The master is responsible for storing metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. Some of this information is also kept persistent so that these states can be retrieved when the master restarts. There is also a shadow master that can serve read operations when the primary master is down. Besides the high-level architecture, the paper also explains lots of details, for example, the consistency model, leases, mutation order, the locking scheme, garbage collection, etc. Together, they form a complete description of the Google file system. The main goal of GFS is to build a file system that meets the needs of Google’s workload. The paper presented measurements from their research & development cluster as well as a production data processing cluster, and GFS seems to work extremely well on these workloads. I think besides the specific design of GFS, another lesson we should learn from this paper is that commodity hardware is capable of supporting large-scale data processing workloads under the right design decisions and workload assumptions.
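The kinds of master metadata this review lists can be pictured with a small data model. Field names below are assumptions for illustration; the paper does not publish the actual structures:

```python
# Illustrative model of the GFS master's metadata.
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int          # globally unique 64-bit chunk handle
    version: int         # bumped on each new lease; detects stale replicas
    # Replica locations are NOT persisted; the master rebuilds them by
    # polling chunkservers at startup and via HeartBeat messages.
    locations: list = field(default_factory=list)

@dataclass
class MasterMetadata:
    # The namespace and the file-to-chunk mapping are also written to the
    # operation log, so the master can recover them after a restart.
    file_chunks: dict = field(default_factory=dict)  # path -> [handles]
    chunks: dict = field(default_factory=dict)       # handle -> ChunkInfo
```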
In this paper, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung discuss the implementation approach and considerations behind the Google File System (GFS). Specifically, this paper covers the design overview, system interactions, master operations, fault tolerance and diagnosis, and measurements, both on small data benchmarks and at industry scale. When developing GFS, the design was driven by Google's workloads and technological environment, both current and anticipated. Thus, some traditional choices were abandoned in order to suit their needs. As a result, they developed a scalable distributed file system that runs on inexpensive commodity hardware and serves high aggregate performance to a large number of clients. In one instance, their largest cluster involves hundreds of terabytes of data, across thousands of disks, concurrently accessed by hundreds of clients. Scaling out rather than scaling up enables Google to meet their research and development needs at a lower price. The decision to make this public knowledge is important as it allows any smaller company to gain insight and more room for growth. The authors laid out core assumptions that allowed them to choose the architecture carefully. These assumptions include commodity hardware that fails frequently, storage of multi-GB files, workloads with large streaming reads and small random reads, workloads with large sequential writes that append to files, and a prioritization of high sustained bandwidth over low latency. The chosen architecture consists of a single master and multiple chunkservers that are accessed by clients. The master mediates interactions with these chunks and oversees metadata storage. Interaction between clients and the master is minimized in order to reduce the overhead for the master: the master responds only with the chunk handle and chunk locations, and applications then use the appropriate chunkservers to extract their data. In addition, the master creates and manages chunk replicas to balance load across the chunkservers. This enables fast recovery of data, data integrity, and easy diagnosis of problems (since machines are not to be trusted). Even though there are many great technical contributions, there are drawbacks as well. The first and most obvious drawback is a security concern in their file system: GFS runs in user space, which in practice is arguably not a good thing to do. Using the kernel space would be more appropriate, since the Linux operating system then has a better chance of staving off unwanted viruses and protecting against attacks. Furthermore, having a single master might create a bottleneck in the system under constant random reads or writes to data. I would have appreciated graphs detailing the impact on performance such a workload might have.
This paper outlines the implementation of the Google File System. GFS is built with large workloads in mind; it uses many servers, and is optimized for multiple clients reading from and writing to large files. Files are broken up into fixed-size chunks, which are replicated and stored on a large number of chunkservers. Clients contact a single master server, which redirects them to the appropriate chunkserver. As little work as possible is done on the master server, so that it doesn’t get overloaded by client requests. The chunkservers communicate with each other to synchronize changes, although each chunk replica need not be in exactly the same state after a change. Because replicas don’t need to be identical, each must keep track of its data integrity individually. Replicas of chunks are created whenever needed, or whenever the number of replicas falls below the required threshold. Copies of files can be made cheaply from the user's perspective: by using copy-on-write, the actual copying can be deferred until changes are made to either copy (see the sketch below). GFS is optimized for large data loads, and is built with all of its components being replaceable without a significant negative effect on the system. This allows it to scale up to much larger sizes much more easily than other storage systems. Being built around fault tolerance also helps the system stay consistent. The general usage is mostly similar to other file systems as well, which should help portability. The focus on exclusively large datasets does limit its usage, however. The bulk of the optimization targets serial as opposed to random access, which makes GFS less useful for applications that don't read much data serially.
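The copy-on-write trick can be sketched briefly. Assume, for illustration, that the master keeps a `file_chunks` map and a per-chunk `refcount`; a snapshot then only touches metadata, and a chunk is physically copied the first time either copy writes to it:

```python
# Sketch of copy-on-write snapshots (illustrative structures).

def snapshot(file_chunks, refcount, src, dst):
    file_chunks[dst] = list(file_chunks[src])      # share all chunks
    for handle in file_chunks[src]:
        refcount[handle] = refcount.get(handle, 1) + 1

def chunk_for_write(file_chunks, refcount, path, index, clone_chunk):
    handle = file_chunks[path][index]
    if refcount.get(handle, 1) > 1:                # shared: copy first
        new_handle = clone_chunk(handle)           # hypothetical clone RPC
        refcount[handle] -= 1
        refcount[new_handle] = 1
        file_chunks[path][index] = new_handle
        handle = new_handle
    return handle                                  # now safe to mutate
```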
The Google File System was developed to meet industry demands in scenarios with more machines involved and more clients participating. It is crucial because it successfully solves the problem by building fault tolerance across multiple inexpensive machines and optimizing aggregate performance for clients. GFS differs from traditional file systems in its assumptions. It treats fault tolerance and recovery as routine. It emphasizes the append operation. The chunk size is reconsidered to manage big files efficiently while still supporting small files. The workloads consist of large streaming reads, small random reads, and large sequential writes that append data to files. The system must efficiently support multiple clients concurrently appending to the same file, and high sustained bandwidth is more important than low latency. The architecture of a GFS cluster consists of one master and multiple chunkservers, accessed by multiple clients. Files are divided into several chunks, delivered to and stored on different chunkservers, and accessed by unique chunk handles. For reliability, chunks are replicated across multiple chunkservers. The master manages metadata, which is mainly information about chunks rather than file data. The master and clients exchange only metadata; reads and writes are processed by the various chunkservers, and clients only need to cache metadata. One master, big chunks, metadata, and concurrency control are the crucial characteristics of GFS in my opinion. A single master simplifies operation and global scheduling. Big chunks decrease the number of chunks, which reduces interaction between master and client, relieves metadata storage pressure, and decreases the cost of client transfers. This lets the master focus on managing chunks instead of the heavy work of reading and writing. The concurrency control mainly addresses scenarios where multiple clients concurrently write to the same file; GFS implements record append to realize this multiple-producer, single-consumer pattern and to save synchronization cost compared with traditional file systems. One more thing to mention is the backup strategy of GFS. The data is stored on chunkservers and the metadata on the master, and fault tolerance and recovery must always be considered: data is replicated on chunkservers on different racks, and the master's metadata (its operation log) is replicated on multiple machines. There is no doubt that GFS achieved great success, but there may be some drawbacks. GFS assumes more big files than small ones and therefore uses a bigger chunk size, so too many small files (if any) would greatly decrease the efficiency of GFS. Another thing to mention is that clients cache metadata: if chunkservers crash frequently, the metadata cached by clients will also frequently become invalid.
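Record append's contract is worth spelling out: GFS, not the client, chooses the offset, and the record is appended atomically at least once, so producers simply retry on failure and consumers must tolerate occasional duplicates (for example by embedding record IDs). A minimal sketch of the client side, with `append` as a hypothetical RPC stub:

```python
# Client-side use of record append: retry until the append succeeds;
# the offset comes back from GFS, which serializes concurrent appenders.
def record_append(file_handle, record, append, max_retries=5):
    for _ in range(max_retries):
        ok, offset = append(file_handle, record)  # hypothetical GFS RPC
        if ok:
            return offset
    raise IOError("record append failed after retries")
```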
This paper introduces how the Google File System delivers high aggregate performance to a large number of clients with inexpensive hardware. GFS is essentially a distributed file system that shares goals like performance, scalability, reliability, and availability with other distributed systems. What makes GFS different is that it is application specific. The authors provide a design overview before going into technical details, which is very helpful for readers to understand the whole picture of the distributed system. The overview section provides assumptions that are specific to the application, a high-level explanation of the interface and architecture, how GFS keeps its metadata, and how it deals with consistency. The following sections go into the details of each aspect mentioned in the overview. Essentially, GFS has a single master that maintains all file system metadata. Files are divided into fixed-size 64 MB chunks, and the chunks have multiple replicas spread across racks for recovery. Clients interact with the master to get chunk location information and then send requests to the closest replica for chunk data. The master controls all chunk placement and monitors chunkserver status with HeartBeat messages. There is also an operation log that contains a historical record of metadata changes. The paper explains why this structure was adopted for the application, why a certain chunk size was chosen, why replicas are distributed in a certain way, how the key components interact to keep records consistent, what special features GFS has, how to detect stale replicas, and many more aspects. The last section analyzes read, write, and append times, recovery time, and the workload of real-world clusters. This paper flows very well and provides reasoning for almost all design decisions. It also points out potential problems of the system, like hot spots, and provides possible solutions. One thing that might improve the paper is a summary of Google's unique setting: the paper mentions here and there what types of services clients use the most, and a consolidated summary of the specific applications would be very helpful.
“The Google File System” by Sanjay Ghemawat et al. describes a new file system approach created at Google that aims to support many clients with large reads and writes on an architecture built from (many) commodity machines. Since there is a high likelihood that some of the (many) commodity machines will fail in a given time range, GFS must utilize a few approaches to ensure data is not lost and that the system does not go offline: data replication across multiple chunkservers, replicated master metadata, checksumming for confirming data integrity, and fast recovery. To support high aggregate throughput, GFS uses a master-chunkserver-client architecture where clients communicate read/write requests (but not data) to the master in order to get routed to an appropriate chunkserver. The master is not a bottleneck, as the master does not perform the read/write operations; it only tells clients which chunkserver they should read/write data to/from. The authors position the paper as questioning traditional file system standards and considering whether assumptions in prior research apply to their use case at Google: commodity hardware clusters running large-scale data processing workloads. In addition to their novel architecture and boldness in questioning the standard, I really appreciate that the paper outlines the assumptions (of scenarios to support and not support) that they made when designing their system. This makes it clearer for the reader to understand where and how the approach could generalize, and how the work compares to related work. It is clear that the authors were actively aware of the assumptions during their research process, and actively considering what their research contributions were. It is also nice that the authors evaluate their approach on real-world systems at Google, systems that would be expected to handle large and data-intensive workloads. The resulting file system architecture appears realistic, and the approach promising. The tradeoff of making many assumptions is that the approach likely would not work for a wide range of hardware and data workloads, but there is not always a one-size-fits-all solution. As another critique, I think the related work could have included a table or other easy-to-read summary of the different kinds of file system architectures and workloads, and which architectures are and are not effective for each kind of workload.
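Of the fault-tolerance measures listed above, checksumming is the easiest to illustrate. GFS keeps a 32-bit checksum for each 64 KB block of a chunk and verifies blocks on every read; the sketch below uses CRC-32 as a stand-in, since the paper does not name the exact checksum function:

```python
# Block-level checksum verification on the read path (illustrative).
import zlib

BLOCK = 64 * 1024  # GFS checksums each 64 KB block of a chunk

def checksum_blocks(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, checksums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            # The chunkserver reports the mismatch; the master re-replicates
            # the chunk from a good replica and discards this one.
            raise IOError(f"corrupt block {b}")
    return chunk_data[offset:offset + length]
```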
This paper describes the design and implementation details of the Google File System. It pursues the same goals as other distributed systems, such as performance, scalability, reliability, and availability, along with key observations about Google's application workloads and technological environment. In Section 2, the paper gives its assumptions for GFS. This is important because only by considering the assumptions can the system design be judged correct. For example, "the system is built from inexpensive commodity components that often fail"; that is why they need several replicas and fast recovery. The paper gives the architecture of the system, telling us why they chose a single-master, multi-chunkserver structure and the reasoning behind choosing 64 MB as the chunk size, which is very different from the Linux file system. Also, the paper gives a detailed description of how the client, master, and chunkservers interact with each other to perform different data operations while keeping the system consistent and available, all from a systems and engineering perspective. The paper also describes how GFS achieves availability through fault tolerance and diagnosis. At last, the paper shows its experimental results on a test cluster and real-world clusters. Overall, the paper presents the design of GFS in detail, and the authors fully considered the real-world situations and limitations of building such a large distributed file system. Besides, illustrating the data flow in a flow chart is very demonstrative. However, after reading the paper, I am still confused by one problem: how will files be stored if I upload lots of files much smaller than the chunk size, for example 1K photos of around 1 MB each? If the system just leaves the rest of each chunk empty, it will be a large waste.
This paper is one of the three most famous papers published by Google; the other two are MapReduce and Bigtable. The idea of GFS is a milestone in the area of distributed storage systems and was a big success in the market. The famous open-source Hadoop Distributed File System (HDFS) is designed around many ideas from GFS. It was a great pleasure for me to spend time reading this wonderful paper. With the coming of the Internet era, the volume of data grows at a crazy speed. How to effectively and efficiently manage this data became a question for every internet company, including Google. They needed to build a storage system providing high reliability, availability, scalability, and performance for the rapidly growing demands of Google's data processing needs. How to build such a system and apply business logic on top of it is a significant problem for every internet company, because for an internet company, data is the most important thing. It needs to keep a high availability rate for its websites, guarantee the reliability of the storage so that user data is not lost, and make sure it can handle accesses from many users at the same time. Based on these demands, the Google File System was introduced. GFS uses the master-slave pattern, consisting of a single master and multiple chunkservers, and achieves high performance as well as scalability, reliability, and availability. Next, I will summarize the crux of GFS as I understand it. In GFS, all the servers are commodity machines, and it is very flexible to add or remove chunkservers. Since commodity devices are subject to failure, GFS introduces several mechanisms for reliability, including monitoring, failure detection, fault tolerance, and recovery. GFS supports files of different sizes, and files are divided into fixed-size 64 MB chunks. GFS focuses on workloads with large streaming reads, small random reads, and large sequential (append) writes. For GFS, large bandwidth is required while latency is not a big problem. GFS supports the normal file operations create, delete, open, close, read, and write; new features like snapshot and record append are also introduced. One of the key ideas for maintaining the reliability of GFS is keeping chunk replicas on multiple chunkservers. The master server coordinates the operation of the system, including metadata management, chunk lease management, garbage collection, chunk migration, etc. The master uses periodic HeartBeat messages to control chunkservers and collect their state. However, the master is not involved in reads and writes, and the design minimizes its interaction in all operations; I think this is a good design that keeps the master from becoming a bottleneck. Besides, the design decouples the data flow from the control flow, which makes it easier to schedule the expensive data flow efficiently. The whole system design is driven by observations of workloads and the technological environment at Google, which is also something I learned from this paper: when we try to design or create something new, we should start from real-world practice, identify the requirements clearly, and then apply our knowledge to solve the problem. This is a pioneering paper in the area of distributed file systems, and it made a great contribution to their development. The master-slave mode and the usage of commodity hardware have had a great impact on modern distributed file systems.
In their design, they don’t introduce too many complicated mechanisms; they try to keep the design as simple as possible, which I think is something we still need to follow nowadays. Also, some parts of the design are quite innovative, like the HeartBeat protocol, the snapshot utility, chunk management, etc. Although this paper was presented in 2003, it already includes many important ideas for big data. GFS is a successful product that is still in use (maybe as Colossus) as an important piece of infrastructure for other products at Google. Overall, it is a great paper, and within its assumptions I do not find any major drawbacks. Still, this paper was written 15 years ago; nowadays it would be impossible to rely on a single master, as a single master would definitely become the bottleneck of the system. By the way, GFS is not open source like HDFS; it would be better if Google were willing to open-source it.
This paper introduces a distributed file system developed and used by Google for “large distributed data-intensive applications.” Its main architecture consists of a master node and many non-master “chunkservers” which host “chunks” containing the stored data. Clients access these chunks by asking the master node for the mapping that tells them which chunkserver to request data from; this mapping can be cached by the client to mitigate bottlenecking at the master node. In order to reduce complexity, GFS guarantees a weaker form of consistency (compared to serializability) and optimizes append operations over rewrite operations. In order to guarantee durability, a log file (called the “operation log”) is maintained for when things go wrong. Several specific feature optimizations are also described, like lazy garbage collection and stale replica detection, which allow for even better performance. The chief advantages/benefits of GFS are as follows: 1) It runs on a distributed network of cheap commodity servers. This is its biggest strength in my opinion, as it allows for high performance at a low cost. 2) It is robust under failures, as its distributed protocol and operation log ensure that the consistency guarantees always hold. 3) It is optimized for the environment assumed by the paper, namely data-intensive applications that use append operations far more than rewrites. On the flip side, although GFS is optimized for the environment assumed, that environment IS based on a number of assumptions, which can be treated as weaknesses. For example, big assumptions are made about the workloads, particularly that they will mostly be either large streaming reads, small random reads, or large appending writes. For workloads that do not conform to this assumption, GFS will not perform as well. Also, although the authors addressed this with a short-term fix, small files that occupy a single chunk can become hot spots when many clients try to access the same chunk at once. Higher replication was offered as a short-term solution, but the paper did not mention an algorithm to determine what level of replication is necessary for a given chunk at any given time.
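Stale replica detection, one of the optimizations this review mentions, hinges on a per-chunk version number: the master bumps it whenever it grants a new lease, so any replica that missed a mutation while down reports an old version. A minimal sketch under assumed structures:

```python
# Chunk version numbers for stale-replica detection (illustrative).

def grant_lease(versions, handle):
    versions[handle] += 1          # new version precedes any mutation
    return versions[handle]

def check_replica(versions, handle, reported):
    current = versions[handle]
    if reported < current:
        return "stale"               # schedule replica for garbage collection
    if reported > current:
        versions[handle] = reported  # master failed after granting a lease
        return "adopted newer version"
    return "up to date"
```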
A discussion between Kirk McKusick and Sean Quinlan about the origin and evolution of the Google File System.
During the early stages of development at Google, the initial thinking did not include plans for building a new file system. While work was still being done on one of the earliest versions of the company's crawl and indexing system, however, it became quite clear to the core engineers that they really had no other choice, and GFS (Google File System) was born.
First, given that Google's goal was to build a vast storage network out of inexpensive commodity hardware, it had to be assumed that component failures would be the norm—meaning that constant monitoring, error detection, fault tolerance, and automatic recovery would have to be an integral part of the file system. Also, even by Google's earliest estimates, the system's throughput requirements were going to be daunting by anybody's standards—featuring multi-gigabyte files and data sets containing terabytes of information and millions of objects. Clearly, this meant traditional assumptions about I/O operations and block sizes would have to be revisited. There was also the matter of scalability. This was a file system that would surely need to scale like no other. Of course, back in those earliest days, no one could have possibly imagined just how much scalability would be required. They would learn about that soon enough.
Still, nearly a decade later, most of Google's mind-boggling store of data and its ever-growing array of applications continue to rely upon GFS. Many adjustments have been made to the file system along the way, and—together with a fair number of accommodations implemented within the applications that use GFS—they have made the journey possible.
To explore the reasoning behind a few of the more crucial initial design decisions as well as some of the incremental adaptations that have been made since then, ACM asked Sean Quinlan to pull back the covers on the changing file-system requirements and the evolving thinking at Google. Since Quinlan served as the GFS tech leader for a couple of years and continues now as a principal engineer at Google, he's in a good position to offer that perspective. As a grounding point beyond the Googleplex, ACM asked Kirk McKusick to lead the discussion. He is best known for his work on BSD (Berkeley Software Distribution) Unix, including the original design of the Berkeley FFS (Fast File System).
The discussion starts, appropriately enough, at the beginning—with the unorthodox decision to base the initial GFS implementation on a single-master design. At first blush, the risk of a single centralized master becoming a bandwidth bottleneck—or, worse, a single point of failure—seems fairly obvious, but it turns out Google's engineers had their reasons for making this choice.
MCKUSICK One of the more interesting—and significant—aspects of the original GFS architecture was the decision to base it on a single master. Can you walk us through what led to that decision?
QUINLAN The decision to go with a single master was actually one of the very first decisions, mostly just to simplify the overall design problem. That is, building a distributed master right from the outset was deemed too difficult and would take too much time. Also, by going with the single-master approach, the engineers were able to simplify a lot of problems. Having a central place to control replication and garbage collection and many other activities was definitely simpler than handling it all on a distributed basis. So the decision was made to centralize that in one machine.
MCKUSICK Was this mostly about being able to roll out something within a reasonably short time frame?
QUINLAN Yes. In fact, some of the engineers who were involved in that early effort later went on to build BigTable, a distributed storage system, but that effort took many years. The decision to build the original GFS around the single master really helped get something out into the hands of users much more rapidly than would have otherwise been possible.
Also, in sketching out the use cases they anticipated, it didn't seem the single-master design would cause much of a problem. The scale they were thinking about back then was framed in terms of hundreds of terabytes and a few million files. In fact, the system worked just fine to start with.
MCKUSICK But then what?
QUINLAN Problems started to occur once the size of the underlying storage increased. Going from a few hundred terabytes up to petabytes, and then up to tens of petabytes... that really required a proportionate increase in the amount of metadata the master had to maintain. Also, operations such as scanning the metadata to look for recoveries all scaled linearly with the volume of data. So the amount of work required of the master grew substantially. The amount of storage needed to retain all that information grew as well.
In addition, this proved to be a bottleneck for the clients, even though the clients issue few metadata operations themselves—for example, a client talks to the master whenever it does an open. When you have thousands of clients all talking to the master at the same time, given that the master is capable of doing only a few thousand operations a second, the average client isn't able to command all that many operations per second. Also bear in mind that there are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests, and the master would be under a fair amount of duress.
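The arithmetic behind that duress is simple to check with illustrative numbers (nothing below comes from the interview beyond "a few thousand operations a second"):

```python
# A thousand MapReduce tasks each opening a few files vs a master
# capable of a few thousand metadata operations per second.
master_ops_per_sec = 2_000            # "a few thousand" per Quinlan
tasks, files_per_task = 1_000, 10     # hypothetical job shape
print(tasks * files_per_task / master_ops_per_sec)  # ~5 s of pure opens
```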
MCKUSICK Now, under the current schema for GFS, you have one master per cell, right?
QUINLAN That's correct.
MCKUSICK And historically you've had one cell per data center, right?
QUINLAN That was initially the goal, but it didn't work out like that to a large extent—partly because of the limitations of the single-master design and partly because isolation proved to be difficult. As a consequence, people generally ended up with more than one cell per data center. We also ended up doing what we call a "multi-cell" approach, which basically made it possible to put multiple GFS masters on top of a pool of chunkservers. That way, the chunkservers could be configured to have, say, eight GFS masters assigned to them, and that would give you at least one pool of underlying storage—with multiple master heads on it, if you will. Then the application was responsible for partitioning data across those different cells.
MCKUSICK Presumably each application would then essentially have its own master that would be responsible for managing its own little file system. Was that basically the idea?
QUINLAN Well, yes and no. Applications would tend to use either one master or a small set of the masters. We also have something we called Name Spaces, which are just a very static way of partitioning a namespace that people can use to hide all of this from the actual application. The Logs Processing System offers an example of this approach: once logs exhaust their ability to use just one cell, they move to multiple GFS cells; a namespace file describes how the log data is partitioned across those different cells and basically serves to hide the exact partitioning from the application. But this is all fairly static.
MCKUSICK What's the performance like, in light of all that?
QUINLAN We ended up putting a fair amount of effort into tuning master performance, and it's atypical of Google to put a lot of work into tuning any one particular binary. Generally, our approach is just to get things working reasonably well and then turn our focus to scalability—which usually works well in that you can generally get your performance back by scaling things. Because in this instance we had a single bottleneck that was starting to have an impact on operations, however, we felt that investing a bit of additional effort into making the master lighter weight would be really worthwhile. In the course of scaling from thousands of operations to tens of thousands and beyond, the single master had become somewhat less of a bottleneck. That was a case where paying more attention to the efficiency of that one binary definitely helped keep GFS going for quite a bit longer than would have otherwise been possible.
It could be argued that managing to get GFS ready for production in record time constituted a victory in its own right and that, by speeding Google to market, this ultimately contributed mightily to the company's success. A team of three was responsible for all of that—for the core of GFS—and for the system being readied for deployment in less than a year.
But then came the price that so often befalls any successful system—that is, once the scale and use cases have had time to expand far beyond what anyone could have possibly imagined. In Google's case, those pressures proved to be particularly intense.
Although organizations don't make a habit of exchanging file-system statistics, it's safe to assume that GFS is the largest file system in operation (in fact, that was probably true even before Google's acquisition of YouTube). Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders of magnitude of growth, Google quickly zoomed right past that.
In addition, the number of applications GFS was called upon to support soon ballooned. In an interview with one of the original GFS architects, Howard Gobioff (conducted just prior to his surprising death in early 2008), he recalled, "The original consumer of all our earliest GFS versions was basically this tremendously large crawling and indexing system. The second wave came when our quality team and research groups started using GFS rather aggressively—and basically, they were all looking to use GFS to store large data sets. And then, before long, we had 50 users, all of whom required a little support from time to time so they'd all keep playing nicely with each other."
One thing that helped tremendously was that Google built not only the file system but also all of the applications running on top of it. While adjustments were continually made in GFS to make it more accommodating to all the new use cases, the applications themselves were also developed with the various strengths and weaknesses of GFS in mind. "Because we built everything, we were free to cheat whenever we wanted to," Gobioff neatly summarized. "We could push problems back and forth between the application space and the file-system space, and then work out accommodations between the two."
The matter of sheer scale, however, called for some more substantial adjustments. One coping strategy had to do with the use of multiple "cells" across the network, functioning essentially as related but distinct file systems. Besides helping to deal with the immediate problem of scale, this proved to be a more efficient arrangement for the operations of widely dispersed data centers.
Rapid growth also put pressure on another key parameter of the original GFS design: the choice to establish 64 MB as the standard chunk size. That, of course, was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large. As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files requiring far less than 64 MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
MCKUSICK I gather from the original GFS paper [Ghemawat, S., Gobioff, H., Leung, S-T. 2003. The Google File System. SOSP (ACM Symposium on Operating Systems Principles)] that file counts have been a significant issue for you right along. Can you go into that a little bit?
QUINLAN The file-count issue came up fairly early because of the way people ended up designing their systems around GFS. Let me cite a specific example. Early in my time at Google, I was involved in the design of the Logs Processing system. We initially had a model where a front-end server would write a log, which we would then basically copy into GFS for processing and archival. That was fine to start with, but then the number of front-end servers increased, each rolling logs every day. At the same time, the number of log types was going up, and then you'd have front-end servers that would go through crash loops and generate lots more logs. So we ended up with a lot more files than we had anticipated based on our initial back-of-the-envelope estimates.
This became an area we really had to keep an eye on. Finally, we just had to concede there was no way we were going to survive a continuation of the sort of file-count growth we had been experiencing.
MCKUSICK Let me make sure I'm following this correctly: your issue with file-count growth is a result of your needing to have a piece of metadata on the master for each file, and that metadata has to fit in the master's memory.
MCKUSICK And there are only a finite number of files you can accommodate before the master runs out of memory?
QUINLAN Exactly. And there are two bits of metadata. One identifies the file, and the other points out the chunks that back that file. If you had a chunk that contained only 1 MB, it would take up only 1 MB of disk space, but it still would require those two bits of metadata on the master. If your average file size ends up dipping below 64 MB, the ratio of the number of objects on your master to what you have in storage starts to go down. That's where you run into problems.
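Quinlan's ratio argument can be made concrete with assumed numbers (the per-entry byte count below is a guess for illustration, not a figure from the interview):

```python
# Master memory is proportional to object count, not bytes stored.
PER_ENTRY = 100  # assumed bytes of master metadata per file and per chunk

def master_bytes(num_files, avg_file_mb, chunk_mb=64):
    chunks_per_file = max(1, -(-avg_file_mb // chunk_mb))  # ceil division
    return num_files * PER_ENTRY * (1 + chunks_per_file)

# 100 million files at a 64 MB average vs a 1 MB average:
print(master_bytes(100_000_000, 64) / 2**30)  # ~18.6 GB, ~6.4 PB stored
print(master_bytes(100_000_000, 1) / 2**30)   # ~18.6 GB, only ~100 TB stored
```

Same master footprint, 64 times less data served per byte of metadata; that is the ratio going the wrong way.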
Going back to that logs example, it quickly became apparent that the natural mapping we had thought of—and which seemed to make perfect sense back when we were doing our back-of-the-envelope estimates—turned out not to be acceptable at all. We needed to find a way to work around this by figuring out how we could combine some number of underlying objects into larger files. In the case of the logs, that wasn't exactly rocket science, but it did require a lot of effort.
MCKUSICK That sounds like the old days when IBM had only a minimum disk allocation, so it provided you with a utility that let you pack a bunch of files together and then create a table of contents for that.
QUINLAN Exactly. For us, each application essentially ended up doing that to varying degrees. That proved to be less burdensome for some applications than for others. In the case of our logs, we hadn't really been planning to delete individual log files. It was more likely that we would end up rewriting the logs to anonymize them or do something else along those lines. That way, you don't get the garbage-collection problems that can come up if you delete only some of the files within a bundle.
For some other applications, however, the file-count problem was more acute. Many times, the most natural design for some application just wouldn't fit into GFS—even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem. When we started using more shared cells, we put quotas on both file counts and storage space. The limit that people have ended up running into most has been, by far, the file-count quota. In comparison, the underlying storage quota rarely proves to be a problem.
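The pack-and-index workaround lends itself to a simple illustration. The sketch below packs many small files into one bundle and appends a table of contents so individual members can still be fetched with a seek. The bundle format here is invented for illustration; it is not the scheme any particular Google application used.

    # Minimal sketch of packing small files into one large file with a
    # table of contents, in the spirit of the workaround described above.
    import json

    def pack(bundle_path, files):
        """files: dict of name -> bytes. Writes data, then a JSON TOC."""
        toc = {}
        with open(bundle_path, "wb") as out:
            for name, data in files.items():
                toc[name] = (out.tell(), len(data))   # (offset, length)
                out.write(data)
            toc_bytes = json.dumps(toc).encode()
            out.write(toc_bytes)
            # Fixed-width footer records how long the TOC is.
            out.write(len(toc_bytes).to_bytes(8, "big"))

    def read_member(bundle_path, name):
        with open(bundle_path, "rb") as f:
            f.seek(-8, 2)                             # 2 = os.SEEK_END
            toc_len = int.from_bytes(f.read(8), "big")
            f.seek(-(8 + toc_len), 2)
            toc = json.loads(f.read(toc_len))
            offset, length = toc[name]
            f.seek(offset)
            return f.read(length)

From the master's point of view, a thousand logs stored this way cost the metadata of one file rather than a thousand.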
MCKUSICK What longer-term strategy have you come up with for dealing with the file-count issue? Certainly, it doesn't seem that a distributed master is really going to help with that—not if the master still has to keep all the metadata in memory, that is.
QUINLAN The distributed master certainly allows you to grow file counts, in line with the number of machines you're willing to throw at it. That certainly helps.
One of the appeals of the distributed multimaster model is that if you scale everything up by two orders of magnitude, then getting down to a 1-MB average file size is going to be a lot different from having a 64-MB average file size. If you end up going below 1 MB, then you're also going to run into other issues that you really need to be careful about. For example, if you end up having to read 10,000 10-KB files, you're going to be doing a lot more seeking than if you're just reading 100 1-MB files.
My gut feeling is that if you design for an average 1-MB file size, then that should provide for a much larger class of things than does a design that assumes a 64-MB average file size. Ideally, you would like to imagine a system that goes all the way down to much smaller file sizes, but 1 MB seems a reasonable compromise in our environment.
MCKUSICK What have you been doing to design GFS to work with 1-MB files?
QUINLAN We haven't been doing anything with the existing GFS design. Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters.
MCKUSICK So, essentially no single master would have all this data on it?
QUINLAN That's the idea.
With the recent emergence within Google of BigTable, a distributed storage system for managing structured data, one potential remedy for the file-count problem—albeit perhaps not the very best one—is now available.
The significance of BigTable goes far beyond file counts, however. Specifically, it was designed to scale into the petabyte range across hundreds or thousands of machines, as well as to make it easy to add more machines to the system and automatically start taking advantage of those resources without reconfiguration. For a company predicated on the notion of employing the collective power, potential redundancy, and economies of scale inherent in a massive deployment of commodity hardware, these rate as significant advantages indeed.
Accordingly, BigTable is now used in conjunction with a growing number of Google applications. Although it represents a departure of sorts from the past, it also must be said that BigTable was built on GFS, runs on GFS, and was consciously designed to remain consistent with most GFS principles. Consider it, therefore, as one of the major adaptations made along the way to help keep GFS viable in the face of rapid and widespread change.
MCKUSICK You now have this thing called BigTable. Do you view that as an application in its own right?
QUINLAN From the GFS point of view, it's an application, but it's clearly more of an infrastructure piece.
MCKUSICK If I understand this correctly, BigTable is essentially a lightweight relational database.
QUINLAN It's not really a relational database. I mean, we're not doing SQL and it doesn't really support joins and such. But BigTable is a structured storage system that lets you have lots of key-value pairs and a schema.
MCKUSICK Who are the real clients of BigTable?
QUINLAN BigTable is increasingly being used within Google for crawling and indexing systems, and we use it a lot within many of our client-facing applications. The truth of the matter is that there are tons of BigTable clients. Basically, any app with lots of small data items tends to use BigTable. That's especially true wherever there's fairly structured data.
MCKUSICK I guess the question I'm really trying to pose here is: Did BigTable just get stuck into a lot of these applications as an attempt to deal with the small-file problem, basically by taking a whole bunch of small things and then aggregating them together?
QUINLAN That has certainly been one use case for BigTable, but it was actually intended for a much more general sort of problem. If you're using BigTable in that way—that is, as a way of fighting the file-count problem where you might have otherwise used a file system to handle that—then you would not end up employing all of BigTable's functionality by any means. BigTable isn't really ideal for that purpose in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that's not super-aggressive, so that might not be the most efficient way to use your space. I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy, but there's no question that it is one way for people to handle that problem.
MCKUSICK What I've read about GFS seems to suggest that the idea was to have only two basic data structures: logs and SSTables (Sorted String Tables). Since I'm guessing the SSTables must be used to handle key-value pairs and that sort of thing, how is that different from BigTable?
QUINLAN The main difference is that SSTables are immutable, while BigTable provides mutable key-value storage, and a whole lot more. BigTable itself is actually built on top of logs and SSTables. Initially, it stores incoming data into transaction log files. Then it gets compacted, as we call it, into a series of SSTables, which in turn get compacted together over time. In some respects, it's reminiscent of a log-structured file system. Anyway, as you've observed, logs and SSTables do seem to be the two data structures underlying the way we structure most of our data. We have log files for mutable stuff as it's being recorded. Then, once you have enough of that, you sort it and put it into this structure that has an index.
Even though GFS does not provide a POSIX interface, it still has a pretty generic file-system interface, so people are essentially free to write any sort of data they like. It's just that, over time, the majority of our users have ended up using these two data structures. We also have something called protocol buffers, which is our data description language. The majority of data ends up being protocol buffers in these two structures.
Both provide for compression and checksums. Even though there are some people internally who end up reinventing these things, most people are content just to use those two basic building blocks.
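Quinlan's description suggests a simple mental model of the log-then-SSTable pattern. The toy Python sketch below captures its shape: mutations are buffered and logged, then frozen into immutable sorted tables that get merged over time. Real SSTables are indexed, compressed, and checksummed, none of which is modeled here, and this is an illustration of the idea rather than BigTable's actual code.

    # Toy sketch of the log + SSTable pattern described above.
    class ToyTablet:
        def __init__(self):
            self.log = []          # append-only record of recent mutations
            self.memtable = {}     # mutable view of those mutations
            self.sstables = []     # immutable sorted (key, value) lists

        def put(self, key, value):
            self.log.append((key, value))   # durable first, in a real system
            self.memtable[key] = value

        def compact(self):
            # Freeze the memtable into an immutable sorted table.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable, self.log = {}, []

        def merge_sstables(self):
            # Periodic merge: newer tables win on key collisions.
            merged = {}
            for table in self.sstables:     # oldest to newest
                merged.update(table)
            self.sstables = [sorted(merged.items())]

        def get(self, key, default=None):
            if key in self.memtable:
                return self.memtable[key]
            for table in reversed(self.sstables):    # newest first
                for k, v in table:                   # real SSTables use an index
                    if k == key:
                        return v
            return default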
Because GFS was designed initially to enable a crawling and indexing system, throughput was everything. In fact, the original paper written about the system makes this quite explicit: "High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write."
But then Google either developed or embraced many user-facing Internet services for which this is most definitely not the case.
One GFS shortcoming that this immediately exposed had to do with the original single-master design. A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving. The later addition of automated failover capabilities helped, but even then service could be out for up to a minute.
The other major challenge for GFS, of course, has revolved around finding ways to build latency-sensitive applications on top of a file system designed around an entirely different set of priorities.
MCKUSICK It's well documented that the initial emphasis in designing GFS was on batch efficiency as opposed to low latency. Now that has come back to cause you trouble, particularly in terms of handling things such as videos. How are you handling that?
QUINLAN The GFS design model from the get-go was all about achieving throughput, not about the latency at which that might be achieved. To give you a concrete example, if you're writing a file, it will typically be written in triplicate—meaning you'll actually be writing to three chunkservers. Should one of those chunkservers die or hiccup for a long period of time, the GFS master will notice the problem and schedule what we call a pullchunk, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.
When we do a pullchunk we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you're talking about 10 seconds for this recovery to take place. There are lots of other things like this that might take 10 seconds to a minute, which works just fine for batch-type operations. If you're doing a large MapReduce operation, you're OK just so long as one of the items is not a real straggler, in which case you've got yourself a different sort of problem. Still, generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.
We've had similar issues with our master failover. Initially, GFS had no provision for automatic master failover. It was a manual process. Although it didn't happen a lot, whenever it did, the cell might be down for an hour. Even our initial master-failover implementation required on the order of minutes. Over the past year, however, we've taken that down to something on the order of tens of seconds.
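The pullchunk arithmetic Quinlan cites is easy to check:

    # Back-of-the-envelope recovery latency for a pullchunk, using the
    # figures cited above: a 64-MB chunk re-replicated at 5-10 MB/s.
    chunk_mb = 64
    for rate in (5, 10):    # MB per second
        print(f"{rate} MB/s -> {chunk_mb / rate:.1f} s to restore a replica")
    # Prints 12.8 s and 6.4 s: about 10 seconds, invisible in an hour-long
    # batch job, but a disaster in the path of an interactive mutation.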
MCKUSICK Still, for user-facing applications, that's not acceptable.
QUINLAN Right. While these instances—where you have to provide for failover and error recovery—may have been acceptable in the batch situation, they're definitely not OK from a latency point of view for a user-facing application. Another issue here is that there are places in the design where we've tried to optimize for throughput by dumping thousands of operations into a queue and then just processing through them. That leads to fine throughput, but it's not great for latency. You can easily get into situations where you might be stuck for seconds at a time in a queue just waiting to get to the head of the queue.
Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren't quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.
MCKUSICK How exactly have you managed to deal with that?
QUINLAN Within GFS, we've managed to improve things to a certain degree, mostly by designing the applications to deal with the problems that come up. Take BigTable as a good concrete example. The BigTable transaction log is actually the biggest bottleneck for getting a transaction logged. In effect, we decided, "Well, we're going to see hiccups in these writes, so what we'll do is to have two logs open at any one time. Then we'll just basically merge the two. We'll write to one and if that gets stuck, we'll write to the other. We'll merge those logs once we do a replay—if we need to do a replay, that is." We tended to design our applications to function like that—which is to say they basically try to hide that latency since they know the system underneath isn't really all that great.
The guys who built Gmail went to a multihomed model, so if one instance of your Gmail account got stuck, you would basically just get moved to another data center. Actually, that capability was needed anyway just to ensure availability. Still, part of the motivation was that they wanted to hide the GFS problems.
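The two-logs trick can be sketched in a few lines. In the hypothetical Python below, a writer fails over between two open logs when the active one hiccups, and a replay step merges the two, deduplicating by sequence number. The timeout handling, record format, and names are invented for illustration; this is not BigTable's implementation.

    # Sketch of the "two logs open at any one time" technique.
    import time

    class DualLog:
        def __init__(self, log_a, log_b, stuck_after=1.0):
            self.logs = [log_a, log_b]      # two writable file-like objects
            self.active = 0
            self.stuck_after = stuck_after  # seconds before failing over
            self.seq = 0

        def append(self, record):
            self.seq += 1
            entry = f"{self.seq}\t{record}\n"
            start = time.monotonic()
            try:
                self.logs[self.active].write(entry)
                if time.monotonic() - start > self.stuck_after:
                    raise TimeoutError      # write finished, but too slowly
            except (OSError, TimeoutError):
                # Hiccup on the active log: switch and write there instead.
                # This may leave a duplicate entry; replay dedups by seq.
                self.active ^= 1
                self.logs[self.active].write(entry)

    def replay(log_a_lines, log_b_lines):
        # Merge both logs, keeping one copy of each sequence number.
        seen = {}
        for line in list(log_a_lines) + list(log_b_lines):
            seq, record = line.rstrip("\n").split("\t", 1)
            seen[int(seq)] = record
        return [seen[s] for s in sorted(seen)]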
MCKUSICK I think it's fair to say that, by moving to a distributed-master file system, you're definitely going to be able to attack some of those latency issues.
QUINLAN That was certainly one of our design goals. Also, BigTable itself is a very failure-aware system that tries to respond to failures far more rapidly than we were able to before. Using that as our metadata storage helps with some of those latency issues as well.
The engineers who worked on the earliest versions of GFS weren't particularly shy about departing from traditional choices in file-system design whenever they felt the need to do so. It just so happens that the approach taken to consistency is one of the aspects of the system where this is particularly evident.
Part of this, of course, was driven by necessity. Since Google's plans rested largely on massive deployments of commodity hardware, failures and hardware-related faults were a given. Beyond that, according to the original GFS paper, there were a few compatibility issues. "Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked but occasionally the mismatches would cause the drive and the kernel to disagree about the drive's state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption."
That didn't mean just any checksumming, however, but instead rigorous end-to-end checksumming, with an eye to everything from disk corruption to TCP/IP corruption to machine backplane corruption.
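The flavor of this block-level checksumming can be conveyed with a small sketch: checksum fixed-size blocks on write, verify every block on read. The original GFS paper describes 32-bit checksums over 64-KB blocks, and those parameters are borrowed below, but the code itself is an illustration, not GFS's implementation.

    # Sketch of per-block checksumming in the end-to-end spirit described.
    import zlib

    BLOCK = 64 * 1024    # 64-KB blocks, per the original paper

    def checksum_blocks(data: bytes):
        return [zlib.crc32(data[i:i + BLOCK])
                for i in range(0, len(data), BLOCK)]

    def verified_read(data: bytes, checksums):
        for i, expected in enumerate(checksums):
            block = data[i * BLOCK:(i + 1) * BLOCK]
            if zlib.crc32(block) != expected:
                # In GFS, a mismatch triggers re-replication from a good copy.
                raise IOError(f"corruption detected in block {i}")
        return data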
Interestingly, for all that checksumming vigilance, the GFS engineering team also opted for an approach to consistency that's relatively loose by file-system standards. Basically, GFS simply accepts that there will be times when people will end up reading slightly stale data. Since GFS is used mostly as an append-only system as opposed to an overwriting system, this generally means those people might end up missing something that was appended to the end of the file after they'd already opened it. To the GFS designers, this seemed an acceptable cost (although it turns out that there are applications for which this proves problematic).
Also, as Gobioff explained, "The risk of stale data in certain circumstances is just inherent to a highly distributed architecture that doesn't ask the master to maintain all that much information. We definitely could have made things a lot tighter if we were willing to dump a lot more data into the master and then have it maintain more state. But that just really wasn't all that critical to us."
Perhaps an even more important issue here is that the engineers making this decision owned not just the file system but also the applications intended to run on the file system. According to Gobioff, "The thing is that we controlled both the horizontal and the vertical—the file system and the application. So we could be sure our applications would know what to expect from the file system. And we just decided to push some of the complexity out to the applications to let them deal with it."
Still, there are some at Google who wonder whether that was the right call if only because people can sometimes obtain different data in the course of reading a given file multiple times, which tends to be so strongly at odds with their whole notion of how data storage is supposed to work.
MCKUSICK Let's talk about consistency. The issue seems to be that it presumably takes some amount of time to get everything fully written to all the replicas. I think you said something earlier to the effect that GFS essentially requires that this all be fully written before you can continue. If that's the case, then how can you possibly end up with things that aren't consistent?
QUINLAN Client failures have a way of fouling things up. Basically, the model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state.
Early on, that was sort of considered to be OK, but over time, we tightened the window for how long that inconsistency could be tolerated, and then we slowly continued to reduce that. Otherwise, whenever the data is in that inconsistent state, you may get different lengths for the file. That can lead to some confusion. We had to have some backdoor interfaces for checking the consistency of the file data in those instances. We also have something called RecordAppend, which is an interface designed for multiple writers to append to a log concurrently. There the consistency was designed to be very loose. In retrospect, that turned out to be a lot more painful than anyone expected.
MCKUSICK What exactly was loose? If the primary replica picks what the offset is for each write and then makes sure that actually occurs, I don't see where the inconsistencies are going to come up.
QUINLAN What happens is that the primary will try. It will pick an offset, it will do the writes, but then one of them won't actually get written. Then the primary might change, at which point it can pick a different offset. RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in the file.
There were even situations where you could get the data in a different order. It might appear multiple times in one chunk replica, but not necessarily in all of them. If you were reading the file, you could discover the data in different ways at different times. At the record level, you could discover the records in different orders depending on which chunks you happened to be reading.
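The at-least-once behavior Quinlan describes implies a particular discipline for readers. The sketch below shows the usual coping pattern: write a unique ID with each record, retry the append until it succeeds, and deduplicate on the read path. The helper names here are hypothetical, but the pattern matches the semantics described above.

    # Sketch of living with RecordAppend's loose consistency.
    import uuid

    def append_with_retry(append_once, payload, max_tries=5):
        record = (uuid.uuid4().hex, payload)   # ID is stable across retries
        for _ in range(max_tries):
            try:
                append_once(record)    # may succeed yet report failure,
                return record          # which is how duplicates arise
            except IOError:
                continue               # retry -> possible duplicate record
        raise IOError("append failed")

    def read_records(records):
        seen = set()
        for record_id, payload in records:
            if record_id in seen:
                continue               # skip duplicate appends
            seen.add(record_id)
            yield payload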
MCKUSICK Was this done by design?
QUINLAN At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
MCKUSICK In retrospect, how would you handle this differently?
QUINLAN I think it makes more sense to have a single writer per file.
MCKUSICK All right, but what happens when you have multiple people wanting to append to a log?
QUINLAN You serialize the writes through a single process that can ensure the replicas are consistent.
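That fix is easy to picture: many producers funnel their appends through one writer, which alone touches the file, so every replica sees the same order. A minimal sketch, with a thread and a queue standing in for a dedicated process:

    # Sketch of serializing appends through a single writer.
    import queue
    import threading

    def start_serialized_appender(write_record):
        """Returns (append, shutdown). All writes happen on one thread."""
        q = queue.Queue()

        def run():
            while True:
                record = q.get()
                if record is None:          # shutdown sentinel
                    break
                write_record(record)        # the only code path that writes

        worker = threading.Thread(target=run, daemon=True)
        worker.start()

        def shutdown():
            q.put(None)
            worker.join()

        return q.put, shutdown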
MCKUSICK There's also this business where you essentially snapshot a chunk. Presumably, that's something you use when you're essentially replacing a replica, or whenever some chunkserver goes down and you need to replace some of its files.
QUINLAN Actually, two things are going on there. One, as you suggest, is the recovery mechanism, which definitely involves copying around replicas of the file. The way that works in GFS is that we basically revoke the lease so that the client can't write it anymore, and this is part of that latency issue we were talking about.
There's also a separate issue, which is to support the snapshot feature of GFS. GFS has the most general-purpose snapshot capability you can imagine. You could snapshot any directory somewhere, and then both copies would be entirely equivalent. They would share the unchanged data. You could change either one and you could further snapshot either one. So it was really more of a clone than what most people think of as a snapshot. It's an interesting thing, but it makes for difficulties—especially as you try to build more distributed systems and you want potentially to snapshot larger chunks of the file tree.
I also think it's interesting that the snapshot feature hasn't been used more since it's actually a very powerful feature. That is, from a file-system point of view, it really offers a pretty nice piece of functionality. But putting snapshots into file systems, as I'm sure you know, is a real pain.
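The clone-like semantics Quinlan describes map naturally onto reference-counted, copy-on-write chunks: a snapshot shares every chunk with the original until one copy is modified. The sketch below is a toy model of that idea, not GFS's implementation; garbage collection of unreferenced chunks is omitted.

    # Toy copy-on-write model of GFS-style snapshots (clones).
    import collections

    class CowStore:
        def __init__(self):
            self.chunks = {}                      # chunk_id -> bytes
            self.refs = collections.Counter()     # chunk_id -> reference count
            self.files = {}                       # path -> list of chunk_ids
            self._next = 0

        def _new_chunk(self, data):
            cid = self._next
            self._next += 1
            self.chunks[cid] = data
            self.refs[cid] = 1
            return cid

        def write(self, path, index, data):
            ids = self.files.setdefault(path, [])
            while len(ids) <= index:
                ids.append(self._new_chunk(b""))
            cid = ids[index]
            if self.refs[cid] > 1:                # shared with a snapshot:
                self.refs[cid] -= 1               # copy on write
                ids[index] = self._new_chunk(data)
            else:
                self.chunks[cid] = data           # exclusive: write in place

        def snapshot(self, src, dst):
            # Both copies are fully equivalent and share unchanged chunks;
            # either can be written or snapshotted again.
            self.files[dst] = list(self.files[src])
            for cid in self.files[dst]:
                self.refs[cid] += 1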
MCKUSICK I know. I've done it. It's excruciating—especially in an overwriting file system.
QUINLAN Exactly. This is a case where we didn't cheat, but from an implementation perspective, it's hard to create true snapshots. Still, it seems that in this case, going the full deal was the right decision. Just the same, it's an interesting contrast to some of the other decisions that were made early on in terms of the semantics.
All in all, the report card on GFS nearly 10 years later seems positive. There have been problems and shortcomings, to be sure, but there's surely no arguing with Google's success and GFS has without a doubt played an important role in that. What's more, its staying power has been nothing short of remarkable given that Google's operations have scaled orders of magnitude beyond anything the system had been designed to handle, while the application mix Google currently supports is not one that anyone could have possibly imagined back in the late '90s.
Still, there's no question that GFS faces many challenges now. For one thing, the awkwardness of supporting an ever-growing fleet of user-facing, latency-sensitive applications on top of a system initially designed for batch-system throughput is something that's obvious to all.
The advent of BigTable has helped somewhat in this regard. As it turns out, however, BigTable isn't actually all that great a fit for GFS. In fact, it just makes the bottleneck limitations of the system's single-master design more apparent than would otherwise be the case.
For these and other reasons, engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS.
Accordingly, it now seems that beyond all the adjustments made to ensure the continued survival of GFS, the newest branch on the evolutionary tree will continue to grow in significance over the years to come.
© 2009 ACM 1542-7730/09/0800 $10.00