In a previous post we shared our views on compute components for enterprise grade cloud infrastructure. In this post, we share our views on storage.
Recently, we have seen several trends converge that, in combination, strongly enable the building of enterprise-grade storage systems for the cloud, as the larger cloud service providers have demonstrated. Storage has been the main substrate on which cloud infrastructures are built: it was the case with Amazon (S3 and elastic block storage), Facebook (user content), Google (web and user content) and Azure (web content). Even across industries, highly scalable, cost-effective products and services have been built on one key offering that drove the cost economics and capacity (or scalability).
With that perspective, we wanted to explore the viability of building an enterprise-grade storage infrastructure for the cloud. We share some of our learnings from building one.
The key trends that enabled us to build relatively easily are:
- DFS: The maturing of software-only self-contained DFS (Distributed File System) both open source and commercial.
- COTS Storage Platforms: The abundance of attractively priced and sized COTS (Commercial Off-The-Shelf) storage nodes and high-capacity 3.5″ SAS/SATA disk drives, approaching 1PB per 4U chassis.
- The emergence of open source software stacks (such as OpenStack) with a multitude of standard storage interfaces: Swift (Object), Cinder (Block) and Manila (File).
The meta trend lines show that flash is altering the storage landscape as it moves into servers, focused on performance. Disk-based storage, with its ever-decreasing price, combined with distributed-file-system-based storage (like HDFS), is driving the cost dimension.
This is not lost on the storage value chain. Component companies (HGST, Seagate) see the opportunity to move up the value chain (blob storage). While the traditional enterprise storage solutions addressed what we categorize as mid-range in capacity (500TB – 10PB), the cloud storage providers (Box, Dropbox) and public cloud providers (Amazon AWS, Google GCE and Microsoft Azure) have built or are using exabyte-scale storage that reshapes the storage value chain for scalable, low cost, highly available storage services. At the other end (<500TB) is the emergence of all-flash arrays (AFAs).
Recognizing these trends, we have designed and built a storage system (we call it SPOD – Storage POD) that has the characteristics of cloud-based storage from a cost perspective, addressing enterprise-class requirements (performance and availability). We leverage the economics of the cloud value chain and the emergence of new distributed file systems (DFS). Initial goals were:
- Back up a 1TB DB (HANA) instance in < 5 min.
- Restore the same backup instance (1TB) in < 5 min.
- Cost no more than 1c/GB/month ($10/TB/month).
- Scalable storage to multi-hundred petabytes and geographically dispersed.
- HA: 99.9999% availability.
- Ability to extend this for DR services.
- Run analytics on petabytes of data.
- Secured data (encrypted).
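The first three goals above can be sanity-checked with back-of-the-envelope arithmetic. A quick sketch (the 5-minute window and 1c/GB/month figures come from the goals list; the decimal TB convention is an assumption):

```python
# Back-of-the-envelope check of the throughput and cost targets above.

TB = 1e12  # bytes, decimal convention as commonly used for drive capacities

# Backing up (or restoring) a 1TB HANA instance in under 5 minutes
# implies a sustained end-to-end throughput of roughly:
required_throughput = 1 * TB / (5 * 60)                # bytes per second
print(f"required throughput: {required_throughput / 1e9:.2f} GB/s")  # ~3.33 GB/s

# 1 cent per GB per month is the same as $10 per TB per month:
cost_per_gb_month = 0.01                               # dollars
cost_per_tb_month = cost_per_gb_month * 1000           # 1000 GB per TB (decimal)
print(f"cost: ${cost_per_tb_month:.0f}/TB/month")      # $10/TB/month
```

So the backup/restore goals alone demand sustained throughput in the multi-GB/s range, which shapes the network and drive-count decisions discussed below.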
We evaluated several DFSs (Distributed File Systems) and DBDs (Distributed Block Devices). We looked at three major areas for evaluation purposes:
- Features (replication, encryption, snapshotting, etc.)
- Price (CapEx and OpEx)
- Performance (throughput and IOPS)
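One way to combine the three evaluation areas is a simple weighted score per candidate. A minimal sketch, where the candidate names, weights, and scores are illustrative placeholders and not our actual evaluation data:

```python
# Hypothetical weighted-scoring sketch for comparing DFS/DBD candidates.
# Weights and per-candidate scores are made-up placeholders.

WEIGHTS = {"features": 0.4, "price": 0.3, "performance": 0.3}

# Scores on a 1-10 scale per criterion (illustrative only).
candidates = {
    "dfs_a": {"features": 8, "price": 6, "performance": 7},
    "dfs_b": {"features": 6, "price": 9, "performance": 6},
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores using the global weights."""
    return sum(WEIGHTS[criterion] * s for criterion, s in scores.items())

# Rank candidates from best to worst overall score.
ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]),
                reverse=True)
print(ranked)
```

In practice the weighting itself is a judgment call; price and performance trade off differently depending on the target use case.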
Application access to storage is via file, block or object. A brief overview of the respective access method is depicted below.
If we look at what is generally used by cloud providers, we see either Block (such as the AWS EBS service) or Object (such as AWS S3). Both are relatively simple services which scale out well. Block is used as a more reliable alternative to directly connected disks assigned exclusively to machines in the enterprise (SAN), while Object provides a highly scalable, distributed storage service with simple get/put semantics. Block and Object are both valid cloud storage paradigms, but they did not fit well with SAP's current application and platform stacks, which prefer the POSIX file abstraction. Recent surveys indicated that approximately 80% of all storage used within SAP today is file-based.
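The semantic difference matters more than it may appear: POSIX files are byte-addressable and support in-place partial updates, while objects are replaced whole on every put. A toy contrast, using an in-memory dict as a stand-in for an S3-style object store:

```python
import os
import tempfile

# POSIX file semantics: byte-addressable, supports seek and partial update.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w+b") as f:
    f.write(b"hello world")
    f.seek(6)                  # random access within the file
    f.write(b"SPOD!")          # in-place partial update of bytes 6..10
    f.seek(0)
    posix_data = f.read()
os.unlink(path)

# Object semantics: whole-object get/put keyed by name -- no seek,
# no partial update. A dict stands in for an S3-style store here.
object_store = {}
object_store["backups/hana-001"] = b"hello world"      # put
obj = object_store["backups/hana-001"]                 # get
# To "modify" an object, the whole thing is re-put:
object_store["backups/hana-001"] = obj[:6] + b"SPOD!"

print(posix_data)                        # b'hello SPOD!'
print(object_store["backups/hana-001"])  # b'hello SPOD!'
```

Applications written against the file abstraction (as most SAP stacks are) rely on exactly the seek/partial-write behavior that object stores do not provide, which is why a POSIX-capable DFS was the priority.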
Ultimately, we decided an HDFS-compliant solution was best, and chose MapR-FS as the NFS and POSIX-compliant file system. While this provided a file abstraction, we needed an S3-compliant object store as well. Given that there were many choices for that in the market, we decided to take that as a second phase design target. Below is a visual of MapR-FS and how it fits within the Apache Hadoop ecosystem.
On the platform hardware, we were inspired by the Backblaze design from a cost (BOM) perspective. There were two key hardware design goals: first, to get the total system cost as close as possible to the cost of the drives in the system; second, to adopt the fastest Ethernet technology and match the storage system's throughput to it. To that end, there are a number of aggressively priced platforms from a variety of partners that provide 48-90 disks per node. We decided to standardize on 40G for the connectivity between nodes. Adopting 40G was appropriate because its cost per Gbit is crossing below that of 10G, and overall networking accounted for less than 10% of system cost. With these two decisions, we looked at available components and what could fit in a standard EIA 19″ rack (42U).
The system design approach was both practical and simple. Once we settled on 40G Ethernet as the backplane within the rack as well as between racks, we identified servers with enough disks to sustain the throughput of the 2x40G pipes, which is approximately 10GB/s. At 200MB/s per disk (with some dropping to 100MB/s), we needed a minimum of roughly 50 drives, and perhaps up to 90-100, to sustain the network throughput.
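The drive-count sizing above reduces to a couple of divisions. A sketch, using the per-disk throughput figures quoted in the text:

```python
import math

# Sizing the drive count needed to saturate 2 x 40GbE, as described above.

GBIT = 1e9 / 8                       # bytes per second per Gbit/s of link speed

network_throughput = 2 * 40 * GBIT   # 2 x 40GbE = 10 GB/s
outer_track = 200e6                  # ~200 MB/s per disk, best case (outer tracks)
inner_track = 100e6                  # ~100 MB/s per disk, worst case (inner tracks)

min_drives = math.ceil(network_throughput / outer_track)   # 50
max_drives = math.ceil(network_throughput / inner_track)   # 100
print(min_drives, max_drives)        # 50 100
```

The 48-90 disk platforms available from partners bracket this idealized 50-100 range well, which is what made the design practical.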
The result is a rack composed of the following components:
Suggested BOM for what we built:

- Rack, PDU and others
- Total (3PB raw, 1PB with 3-way replication)
- Depreciated cost/GB/month (3-way replicated)
In comparison, consumer storage-as-a-service providers (Google and Dropbox) offer disk storage at 1c/GB/month. Microsoft is cheaper at 0.67c/GB/month, a price that includes operating cost and profit. The depreciated cost per drive (8TB at $450) is about $0.001/GB/month.
We have now built a single rack with 8 nodes, for a total of 1.6PB of raw disk storage. With 2-way replication we get 800TB of usable storage. With the migration to 8TB drives and new pricing for various components, we expect to achieve <0.67c/GB/month. This is better than our initial target.
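The capacity and drive-depreciation arithmetic behind these numbers can be sketched as follows. The 48-month depreciation period is our assumption for illustration; the post does not state the exact period used:

```python
# Capacity and cost arithmetic for the single-rack build described above.

nodes = 8
raw_per_node_tb = 200                # 1.6PB raw across 8 nodes -> 200TB each
raw_tb = nodes * raw_per_node_tb     # 1600 TB = 1.6 PB raw
usable_tb = raw_tb / 2               # 2-way replication -> 800 TB usable

drive_price = 450.0                  # 8TB drive, dollars (figure from the text)
drive_tb = 8
months = 48                          # ASSUMED depreciation period, not stated
cost_per_gb_month = drive_price / (drive_tb * 1000 * months)
print(usable_tb, round(cost_per_gb_month, 5))   # 800.0 0.00117
```

Over a four-year depreciation this lands near the ~$0.001/GB/month drive-only figure quoted earlier; the remaining gap to the 0.67c/GB/month system target is chassis, networking, and operations.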
Performance: We have run a number of benchmarks. Our target is 6-8GB/s per node and >20GB/s at the rack level. On read/write performance, we achieved >6GB/s with 3 clients reading and writing to the SPOD.
Use Cases: The primary and initial use case is HANA backup. While HANA is an in-memory database with its own Tier-1 persistence architecture, it still requires a storage tier for database backup and DR. Here we believe we can achieve a 10x reduction in cost and another 10x improvement in performance; that is our initial and primary target. With performance at these levels, this solution is also applicable as a main store for a class of databases like Sybase IQ, which has traditionally used block devices as its primary store. A new release of Sybase IQ is slated to use a distributed file system that could leverage such a high-capacity, high-performance storage tier. A third use case is Hadoop. There are a number of other use cases, including document stores and platforms like IoT, where there is a need for a cheap, scalable, reliable, append-only and streaming-oriented data store.
We have achieved >10x in cost reduction, while substantially improving overall storage read/write throughput. This platform could be the basis for all file storage for HANA, IQ and other data processing landscapes. It’s enterprise-ready in the sense that it’s highly available (3-way replication gives us 99.99999% availability) with built-in DR and HA capabilities leveraging MapR’s MFS.
Cloud Infrastructure #2: Enterprise grade storage in the cloud (SPOD)