[repost] Kosmos File System (KFS) is a New High End Google File System Option


There’s a new clustered file system on the spindle: Kosmos File System (KFS). Thanks to Rich Skrenta for turning me on to KFS and I think his blog post says it all. KFS is an open source project written in C++ by search startup Kosmix. The team members have a good pedigree so there’s a better than average chance this software will be worth considering.

After you stop trying to turn KFS into “Kentucky Fried File System” in your mind, take a look at KFS’ intriguing feature set:

  • Incremental scalability: New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
  • Availability: Replication is used to provide availability in the face of chunkserver failures. Typically, files are replicated 3-way.
  • Per-file degree of replication: The degree of replication is configurable on a per-file basis, with a maximum of 64.
  • Re-replication: Whenever the degree of replication for a file drops below the configured amount (for example, due to an extended chunkserver outage), the metaserver forces the affected chunks to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
  • Re-balancing: Periodically, the metaserver may rebalance chunks amongst chunkservers to even out disk space utilization across nodes.
  • Data integrity: To handle disk corruption of data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk (a minimal sketch of this check appears after this list).
  • File writes: The system follows the standard model. When an application creates a file, the filename becomes part of the filesystem namespace. For performance, writes are cached at the KFS client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also, applications can force data to be flushed to the chunkservers. In either case, once data is flushed to the server, it is available for reading.
  • Leases: KFS client library uses caching to improve performance. Leases are used to support cache consistency.
  • Chunk versioning: Versioning is used to detect stale chunks.
  • Client-side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, it will fail over to another chunkserver and continue the read. This fail-over is transparent to the application.
  • Language support: KFS client library can be accessed from C++, Java, and Python.
  • FUSE support on Linux: By mounting KFS via FUSE, existing Linux utilities (such as ls) can interface with KFS.
  • Tools: A shell binary is included in the set of tools. This allows users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunk/meta-servers are also provided.
  • Deploy scripts: To simplify launching KFS servers, a set of scripts is provided to (1) install KFS binaries on a set of nodes and (2) start/stop KFS servers on those nodes.
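To make the data-integrity and re-replication bullets above concrete, here is a minimal Python sketch of the idea: every chunk is stored alongside a checksum, the checksum is verified on each read, and a mismatch triggers recovery from a healthy replica. This only illustrates the mechanism; the directory layout and helper names are assumptions, not the actual KFS code or API.

```python
import hashlib
import os

CHUNK_DIR = "/data/chunks"   # illustrative layout; not the real KFS on-disk format


def write_chunk(chunk_id: str, data: bytes) -> None:
    """Store a chunk together with its checksum (KFS checksums data blocks)."""
    with open(os.path.join(CHUNK_DIR, chunk_id), "wb") as f:
        f.write(data)
    with open(os.path.join(CHUNK_DIR, chunk_id + ".sum"), "w") as f:
        f.write(hashlib.sha1(data).hexdigest())


def fetch_from_replica(chunk_id: str, replicas: list) -> bytes:
    """Hypothetical helper: in KFS the corrupted chunk is recovered from another chunkserver."""
    raise NotImplementedError(f"fetch chunk {chunk_id} from one of {replicas} over the network")


def read_chunk(chunk_id: str, replicas: list) -> bytes:
    """Verify the checksum on every read; on a mismatch, recover the chunk from a healthy replica."""
    with open(os.path.join(CHUNK_DIR, chunk_id), "rb") as f:
        data = f.read()
    with open(os.path.join(CHUNK_DIR, chunk_id + ".sum")) as f:
        expected = f.read().strip()
    if hashlib.sha1(data).hexdigest() != expected:
        data = fetch_from_replica(chunk_id, replicas)   # recover the corrupted chunk
        write_chunk(chunk_id, data)                     # restore a good local copy
    return data
```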

    This seems to compare very favorably to GFS and is targeted at:

  • Primarily write-once/read-many workloads
  • A few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size
  • Mostly sequential access

    As Rich says, everyone needs to solve the “storage problem,” and this looks like an exciting option to add to your bag of tricks. What we are still missing, though, is a Bigtable-like database on top of the file system for scaling structured data.

    If anyone is using KFS, please consider sharing your experiences.

    Related Articles

  • Hadoop
  • Google Architecture
  • You Can Now Store All Your Stuff on Your Own Google Like File System.

[repost] Product: GlusterFS


    Adapted from their website:
    GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over an Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86-64 servers with SATA-II RAID and Infiniband HBAs.

    Cluster file systems are still not mature for the enterprise market. They are too complex to deploy and maintain, even though they are extremely scalable, cheap, and can be built entirely out of commodity OSes and hardware. GlusterFS hopes to solve this problem.

    GlusterFS achieved 35 GB/s of aggregate read throughput. The GlusterFS Aggregated I/O Benchmark was performed on a 64-brick clustered storage system over a 10 Gbps Infiniband interconnect. A cluster of 220 clients pounded the storage system with multiple dd (disk dump) instances, each reading/writing a 1 GB file with a 1 MB block size. GlusterFS was configured with the unify translator and round-robin scheduler.
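To give a feel for what each client in that benchmark was doing, here is a small Python stand-in for one dd instance: it writes and then reads back a 1 GB file in 1 MB blocks against an assumed GlusterFS mount point and reports per-client throughput. The mount path is an assumption; this is a sketch, not the benchmark harness that was actually used.

```python
import os
import time

MOUNT = "/mnt/glusterfs"    # assumed GlusterFS mount point
BLOCK = 1024 * 1024         # 1 MB block size, as in the benchmark
BLOCKS = 1024               # 1024 x 1 MB = 1 GB per file


def write_file(path: str) -> float:
    """Sequentially write a 1 GB file; return throughput in MB/s."""
    buf = os.urandom(BLOCK)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(BLOCKS):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    return BLOCKS / (time.time() - start)


def read_file(path: str) -> float:
    """Sequentially read the file back; return throughput in MB/s."""
    start = time.time()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    return BLOCKS / (time.time() - start)


if __name__ == "__main__":
    path = os.path.join(MOUNT, "testfile")
    print("write: %.1f MB/s" % write_file(path))
    print("read:  %.1f MB/s" % read_file(path))
```

For scale, the published 35 GB/s aggregate across 220 clients works out to roughly 160 MB/s per client.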

    The advantages of GlusterFS are:


    • Designed for O(1) scalability and feature-rich.
    • Aggregates on top of existing filesystems. Users can recover files and folders even without GlusterFS.
    • GlusterFS has no single point of failure. Completely distributed, with no centralized metadata server as in Lustre.
    • Extensible scheduling interface, with modules loaded based on the user’s storage I/O access patterns.
    • Modular and extensible through a powerful translator mechanism.
    • Supports Infiniband RDMA and TCP/IP.
    • Entirely implemented in user space. Easy to port, debug, and maintain.
    • Scales on demand.

    Related Articles

  • Technical Presentation on GlusterFS
  • Open Fest 5th Annual Conference
  • Zresearch
  • GlusterFS FAQ

[repost] Sify.com Architecture – A Portal at 3900 Requests Per Second


    Sify.com is one of the leading portals in India. Samachar.com is owned by the same company and is one of the top content-aggregation sites in India, primarily targeting non-resident Indians from around the world. Ramki Subramanian, an Architect at Sify, has been generous enough to describe the common back-end for both of these sites. One of the most notable aspects of their architecture is that Sify does not use a traditional database. They query Solr and then retrieve records from a distributed file system. Over the years many people have argued for file systems over databases. Filesystems can work for key-value lookups, but they don’t work well for queries; using Solr is a good way around that problem. Another interesting aspect of their system is the use of Drools for intelligent cache invalidation. As we have more and more data duplicated in multiple specialized services, the problem of how to keep them synchronized is a difficult one. A rules engine is a clever approach.

    Platform / Tools

    • Linux
    • Lighty
    • PHP5
    • Memcached
    • Apache Solr
    • Apache ActiveMQ / Camel
    • GFS (clustered File System)
    • Gearman
    • Redis
    • Mule with ActiveMQ.
    • Varnish
    • Drools

    Stats

    • ~150 million page views a month.
    • Serves 3,900 requests/second.
    • The back-end runs on 4 blades hosting about 30 VMs.

    Architecture

    • The system is completely virtualized. We make full use of the VM capabilities: for example, we move VMs across blades when one blade is down or when load needs to be redistributed. We have templatized the VMs, so we can provision systems in less than 20 minutes. This is currently manual, but in the next version of the system we are planning to automate the whole provisioning, commissioning, de-commissioning, and moving around of VMs, and also auto-scaling.
    • No Databases
    • 100% Stateless
    • RESTful interface supporting: XML, JSON, JS, RSS / Atom
    • Writes and reads have different paths.
      • Writes are queued, transformed and routed through ActiveMQ/Camel to other HTTP services. It is used as an ESB (enterprise service bus).
      • Reads, like search, are handled from PHP directly by the web-servers.
    • Solr is used as the indexing/search engine. If somebody asks for a file by its key, it is served directly out of storage. If somebody says “give me all files where author=Todd,” the request hits Solr and then storage. Queries are performed using Apache Solr as our search engine, and we have a distributed setup for it (a sketch of this read path appears after this list).
    • All files are stored in the clustered file system (GFS). Queries hit Solr and it returns the data that we want. If we need the full data, we hit the storage after fetching the IDs from the search. This approach makes the system completely horizontally scalable, and there is zero dependency on a database. It works very well for us after the upgrade to the latest version. We just run 2 nodes for the storage and can add a few more nodes if need be.
    • Lighty front ends GFS. Lighty is really very good for serving static files. It can casually take 8000+ requests per second for the kind of files we have (predominantly small XMLs and images).
    • All of the latest NoSQL databases like CouchDB, MongoDB, Cassandra, etc. would just be replacements for our storage layer. None of them are close to Solr/Lucene in search capability. MongoDB is the best of the lot in terms of querying, but “contains”-style searches need to be done with a regex, and that is a disaster with 5 million docs! We believe our distributed file-system based approach is more scalable than many of those NoSQL database systems for storage at this point.
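Here is a rough Python sketch of that read path: ask Solr only for the matching IDs, then pull the full records off the clustered file system by key. The Solr URL, core name, field names, and storage layout are assumptions for illustration, not Sify's actual configuration.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_URL = "http://solr.internal:8983/solr/content/select"   # assumed Solr endpoint
STORAGE_ROOT = "/mnt/gfs/content"                             # assumed GFS mount point


def search_ids(query: str, rows: int = 20) -> list:
    """Ask Solr only for matching document IDs (the 'all files where author=Todd' case)."""
    params = urlencode({"q": query, "fl": "id", "wt": "json", "rows": rows})
    with urlopen(f"{SOLR_URL}?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [doc["id"] for doc in docs]


def fetch_documents(ids: list) -> list:
    """Fetch the full records from the clustered file system, keyed by ID."""
    records = []
    for doc_id in ids:
        with open(os.path.join(STORAGE_ROOT, f"{doc_id}.xml"), "rb") as f:
            records.append(f.read())
    return records


if __name__ == "__main__":
    ids = search_ids("author:Todd")
    print(f"fetched {len(fetch_documents(ids))} documents")
```

A plain key lookup skips Solr entirely and goes straight to the file, which is why the setup needs no database for either path.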

    Future

    • CouchDB or Hadoop or Cassandra for event analytics (user clicks, real-time graphs and trends).
    • Intelligent cache invalidation using Drools. Data will be pushed through a queue, and a Drools engine will determine which URLs need to be invalidated. It will then go and clear them in our cache engine or Akamai. The approach is like this: once a query (URL) hits our backend, we log that query. The logged query is then parsed and pushed into the Drools system, which takes that input and creates rules dynamically if they do not already exist. That’s part A. Then our content ingestion system keeps pushing all content it receives into a Drools queue. Once the data comes in, we fire all the rules against the content. For every matched rule, we generate the URLs and give a delete request to the cache servers (Akamai or Varnish) for those URLs. That’s part B. Part B is not as simple as described above; there are many different cases. For example, we support “NOW”, greater than, less than, NOT, etc. in the query, and those will really give us a big headache. (A simplified sketch of this flow appears after this list.)
      • There are mainly 2 reasons we are doing all this: very high cache-hit rates and almost immediate updates to end users. And remember, those two goals have never gotten along well in the past!
      • I think it will perform well and scale. Drools is really good at this kind of problem. Also, on analysis, we figured out that the queries are mostly constant across many days. For example, we have close to 40,000 different queries a day, and they repeat every day in almost the same pattern; only the data for each query changes. So we could set up multiple instances and just replicate the rules in different systems; that way we can scale it horizontally too.
    • Synchronous reads, but with fewer layers, less PHP intervention, and fewer socket connections.
    • Distributed (write to different shards) and asynchronous writes using a Queue/ESB (Mule).
    • Heavy caching using Varnish or Akamai.
    • Daemons to kill off crons and stay closer to real time.
    • Parallel and background processing using Gearman, with automatic addition of processes for auto-scaling.
    • Realtime distribution of content using Kaazing or eJabberd to both end users and internal systems.
    • Redis for caching digests of content to determine duplicates.
    • We are looking at making the whole thing more easily administrable and at turning on VMs and processes from within the app admin. We have looked at Apache ZooKeeper and are looking at the RESTful APIs provided by VMware and Xen to do the integration with our system. This will enable us to do auto-scaling.
    • The biggest advantage we have is that bandwidth in the data center is not a constraint, as we are an ISP ourselves. I’m looking at ways to use that advantage in the system and to see how we can build clusters that can process huge amounts of content quickly, in parallel.
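The sketch below is a much-simplified Python stand-in for that Drools flow: part A turns logged queries into rules, and part B matches incoming content against the rules and issues purge requests for the affected URLs. The rule representation, the purge call, and all names are assumptions for illustration; the real system would express the rules in Drools and send purges to Varnish or Akamai.

```python
# Simplified stand-in for the Drools-based cache invalidation described above.
# Part A: every query (URL) seen at the backend becomes a rule if it is not already known.
# Part B: each piece of ingested content is matched against the rules; matched rules
#         produce URLs that are purged from the cache (Varnish / Akamai in the real system).

rules = {}   # query URL -> predicate over a content item


def register_query(query_url: str, field: str, value: str) -> None:
    """Part A: create a rule for a logged query if it does not already exist."""
    if query_url not in rules:
        rules[query_url] = lambda content: content.get(field) == value


def on_content_ingested(content: dict) -> None:
    """Part B: fire all rules against new content and purge every matched URL."""
    for url, matches in rules.items():
        if matches(content):
            purge(url)


def purge(url: str) -> None:
    """Hypothetical purge call; the real system sends a delete/purge to Varnish or Akamai."""
    print(f"PURGE {url}")


# Example: the query /articles?author=Todd was logged, so a new article by Todd
# should invalidate that cached URL.
register_query("/articles?author=Todd", "author", "Todd")
on_content_ingested({"id": "a1", "author": "Todd", "title": "New story"})
```

Handling “NOW”, range operators, and NOT, which the post calls out as the hard cases, would need richer predicates than the simple equality check shown here.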

    Lessons Learned

    • ActiveMQ proved disastrous many times! Very poor socket handling. We used to hit the TCP socket limits in less than 5 minutes from a restart. Though it is claimed to be fixed in 5.0 and 5.2, it wasn’t working for us. We tried many ways to make it live longer, like a day at least. We hacked around it by deploying old libraries with new releases and made it stay up longer. After all that, we deployed two MQs (message queues) to make sure that at least the editorial updates of content go through OK.
      • Later we figured out that this was not the only problem: using topics was also a problem. A topic with just four subscribers would make MQ hang within a few hours. We killed the whole topic-based approach after huge hair loss and moved everything to queues. Once the data comes into the main queue, we push it into four different queues. Problem fixed. Of course, every 15 days or so it will throw some exception or OOME (out of memory error) and force us to restart; we are just living with it. In the next version we are using Mule to handle all of this, and clustering it at the same time too. We are also trying to figure out a way to get rid of the dependency on message ordering, which will make it easier to distribute.
    • Solr
      • Restarts. We have to keep restarting it very frequently. We don’t really know the reason yet, but because it has redundancy we are better placed than with the MQ. We have gone to the extent of automating the restarts: we do a query, and if there is no response or it times out, we restart Solr (a minimal sketch of such a healthcheck appears after this list).
      • Complex queries. For complex queries the response time is really poor. We have about 5 million docs, and a lot of queries do return in less than a second, but when we have a query with a few “NOT”s and many fields and criteria, it takes 100+ seconds. We worked around this by splitting the query into simpler ones and merging the results in PHP space.
      • Realtime. Another serious issue is that Solr does not reflect committed changes in real time; it takes anywhere between 4 and 10 minutes! Given the industry we are in and the competition, news that is 10 minutes late makes us irrelevant. We looked at the Zoie-Solr plugin, but our IDs are alphanumeric and Zoie doesn’t support that. We are looking at fixing that ourselves in Zoie.
    • GFS locking issue. This used to be a very serious issue for us: GFS would lock down the whole cluster and make our storage completely inaccessible. There was an issue with GFS 4.0; we upgraded to 5.0 and it has been fine since then.
    • Lighty and PHP do not get along very well. Performance-wise both are good, but Apache/PHP is more stable. Lighty sometimes goes cranky, with the PHP_FCGI process hanging and CPU usage going to 100%.
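As a rough illustration of the automated Solr restart mentioned above (probe with a cheap query, restart on timeout or no response), here is a minimal Python sketch. The ping URL and restart command are assumptions; the post does not say how Sify's script is actually implemented.

```python
import subprocess
import time
from urllib.error import URLError
from urllib.request import urlopen

PING_URL = "http://localhost:8983/solr/content/select?q=*:*&rows=0"   # assumed probe query
RESTART_CMD = ["/etc/init.d/solr", "restart"]                          # assumed restart command


def solr_is_healthy(timeout: float = 10.0) -> bool:
    """Issue a cheap query; treat a timeout or any error as 'Solr is stuck'."""
    try:
        with urlopen(PING_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


if __name__ == "__main__":
    while True:
        if not solr_is_healthy():
            subprocess.call(RESTART_CMD)   # restart Solr when the probe fails
        time.sleep(60)                     # probe once a minute
```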

    I’d really like to thank Ramki for taking the time to write about how their system works. Hopefully you can learn something useful from their experience that will help you on your own adventures. If you would like to share the architecture of your fabulous system, both paying it forward and backward, please contact me and we’ll get started.
