Fortune 1000 companies can use this document to generate a comprehensive Request for Information (RFI) and a focused, efficient Proof of Concept (PoC) to test for current and future storage needs.
Transition was everywhere in 2019. Enterprises that had rushed to get to the public cloud started bringing applications back. They also began to deploy Tier 1 workloads with software-defined technologies instead of storage arrays. Change was the only constant as it related to the storage deployments of the Fortune 1000.
While change was in the data center air, the key requirements for the next generation of storage infrastructure became very clear. Datera comprises the world’s leading SDS architects and former end users, and we have worked with scores of Fortune 1000 companies to understand their data storage needs.
While we recognize that every organization’s applications and needs are different, from this unique vantage point we developed a set of common requirements and best practices to help you and your organization get a fast start on the path to a new and better data infrastructure.
Storage Categories and Goals the Fortune 1000 are Evaluating
When getting started, we recommend first investigating the four main categories of storage technology, each of which drives a different set of requirements.
- Enterprise-Class Flash Arrays. Arrays from the leading vendors remain ubiquitous on the data center floor, whether deployed on a standalone basis or, more recently, as converged appliances. Enterprises want to retain the pros of arrays (the performance levels and the 9s of availability) while distancing themselves from the cons: the high cost, the inflexibility, the lock-in, the narrow and homogeneous media choices, and the need for Fibre Channel to achieve performance and stability.
- Public Cloud Services. The impact of the public cloud cannot be overstated. AWS, Google Cloud, and Microsoft Azure, none of which use arrays to build their hyperscale data centers, showed the market that infrastructure could be built in a new, more agile, and more cost-effective way. Enterprises looked to see whether they too could build their infrastructure in this manner, achieving the same operational agility and velocity and running on Ethernet rather than Fibre Channel as the cloud players do. At the same time, they wanted to avoid the massive cost inflation they have experienced in their monthly cloud bills, often 5X higher than comparable on-premises infrastructure.
- Hyperconverged Infrastructure (HCI). HCI growth remains strong, particularly in emerging regions. It provides an easy on-ramp to a shared-infrastructure, software-defined approach, but it shows inherent limitations in scale, performance, and hardware utilization. Enterprises would like to retain the simple deployment and procurement models that HCI software vendors provide, but without the problems that have plagued the leading HCI platforms: the “noisy neighbor” syndrome, in which certain applications or tenants overtax the infrastructure and compromise other applications, and the inability to scale beyond monolithic orchestration within a single cluster.
- Software-Defined Storage (SDS). SDS is seen as combining the best attributes of the other storage choices (the dedicated performance of arrays, the agility of the public cloud, and the application and tenant consolidation of HCI) with the additional benefits of automation, while lessening the vendor lock-in that has pervaded the industry since its inception. While its benefits have long been evident, it is important to test multiple vendors against one another to understand differences in performance and availability with data management services turned on (e.g., encryption, compression, deduplication). Equally important is testing how reliably the automation delivers quality of service and quantifying its value in terms of administrator resourcing.
The Fortune 1000 test new storage approaches to maintain and expand the benefits they’ve seen in the past while finding new ways to eliminate old headaches and reduce the cost profile.
Fortune 1000 Requirements for High-Performance Block Workloads at Hyperscale
In this section, we list the core requirements the Fortune 1000 should test against to understand which storage category can deliver. You may refine them further based on your particular use cases.
- LATENCY: The system must provide 1 million or more IOPS at under 1 millisecond of latency. Storage needs can change at a moment’s notice, so it is essential that a system can expand rapidly to meet performance and capacity requirements. SQL and NoSQL databases in particular require high-IOPS, low-latency storage that can scale performance and capacity with ease. One million IOPS under 1 millisecond is a common threshold, so we’d suggest starting there and adding more if your specific workloads require it; a minimal benchmark sketch covering both the latency and throughput tests appears after this list. Also test the ability to expand this with the fully supported addition of asymmetric media nodes, including NVMe and Storage Class Memory (SCM) such as Intel Optane.
- THROUGHPUT: The system must support a minimum of 64 GB/s of overall throughput. For most organizations, throughput has become more important than raw IOPS, since it is the ultimate measure of application (rather than storage) performance and is highly valuable in multi-tenant environments. The combination of database and other workloads may also push the network’s overall limits, which can require the network and storage teams to agree on the testing. Joint testing of this kind has proven valuable in enabling a move to 100GbE and 200GbE networks (similar to the public cloud providers) and can yield massive savings in administration time and cost compared with complex Fibre Channel networks.
- ASYMMETRIC SCALING: The system must be able to scale granularly (node by node) to a hyperscale threshold of multiple petabytes, yielding additional capacity, performance, durability, and resilience with each added node. It must be able to scale asymmetrically and rapidly, typically from a few hundred terabytes to multiple petabytes, and to do so non-disruptively, without downtime. The test should include adding different kinds of nodes along the way to demonstrate that the environment not only incorporates the new capacity and horsepower but also rebalances the system without manual tuning. Scaling the environment should not drive significant new admin time, because the savings achieved on the capital side can be offset by extra personnel costs. Pay close attention here, since many enterprises see massive differences in scaling behavior between systems. At a minimum, test the ability to scale up within a rack and scale out across racks and aisles within a single data center, since this is what a scale-out architecture must achieve to provide the flexibility enterprises seek.
- PERFORMANCE WITH DATA MANAGEMENT SERVICES ENABLED: The system should show minimal performance degradation even when more than 60% utilized. Vendors have a habit of painting a very rosy picture of theoretical performance, which is often measured without the features that consume CPU cycles in the storage hardware. Enterprises often see a massive drop-off in system-wide performance when even basic data management services (compression, encryption, snapshots, and deduplication) are enabled, a drop-off that renders those systems non-starters. Be sure to test the systems while application traffic is high to understand how they respond under load. The tests should incorporate both dimensions (data management services off and on, traffic high and low) to give the best picture of real-world performance; a sketch of such a sweep appears after this list. Architects testing the system should also record a time sequence from the monitoring tools to show the ebbs and flows of the system over time and how it responds. Not doing so invites trouble in an actual deployment.
- CONTINUOUS DATA AVAILABILITY: The system must be architected to remain available through multi-node and multi-rack failures within the data center. More than just data durability or uptime, the system must offer non-disruptive software updates and survive multiple component failures, power outages, rack failures, and unexpected data center events. A real test of availability is possible using a combination of snapshots (replicated locally and remotely to a public cloud), stretch clusters, failure domains, and replica counts. Vendors frequently speak about “9s” of availability, but planned downtime is often excluded from those calculations. The test should incorporate the ability to maintain complete availability while simultaneously changing QoS policies and adding new nodes; a simple availability probe, sketched after this list, makes any interruption measurable.
- CLOUD OPERATIONS: The system must support application and tenant aggregation and consolidation with simple provisioning and self-service utilization for application owners. The term “cloud” encompasses a wide variety of needs for the Fortune 1000, with much less consistency than among service providers or software-as-a-service companies. The common thread is the need to support multiple orchestrators, including VMware, Kubernetes, OpenStack, and bare metal, in order to support a variety of applications and the velocity of stateful and stateless events. It is essential to test these not merely in isolation, with a separate cluster for each, but in a common cluster for all; otherwise you run the risk of bringing on a new system that becomes an island of its own, with stranded data and hardware and added administrative overhead. Further, we highly recommend that the test include policy-based administration, which allows administrators to set up and manage groups of applications as a class rather than individually. Testing the ability to support multiple application orchestrators is simply the baseline requirement; a provisioning sketch for one of them (Kubernetes) appears after this list.
- AUTONOMOUS DATA PLACEMENT: The system must autonomously assign and re-assign workloads to the proper nodes according to preset requirements. Whether driven by application traffic (to move the data as close to the application as possible) or by the storage media resident on a node (for instance, putting the right data on an NVMe drive), the system should automatically self-optimize system-wide performance and availability. Initial testing should evaluate the system’s ability to place data based on policy; advanced tests should examine the quality of service delivered per workload to determine whether the system is placing data properly and whether the policy is aligned to the required SLA.
- NEW TECHNOLOGY INCORPORATION: New technologies, at both the server (CPU) and media level, must be rapidly deployable and usable by the system without added administration time. To test this capability, enterprises should start with a variety of server and media types and then add new and different nodes during the life of the test. As with the autonomous data placement tests, administrators should verify that data is indeed moved automatically to each new node and determine specifically what data is moved to take advantage of the new CPU and media. Growing an environment can be easy, but if the system does not automatically take advantage of the new capacity and horsepower, that growth generates needless expense.
- ETHERNET-ENABLED BGP PEERING: The system should be able to use standard iSCSI deployed over L3 networking at the core for data operations. The test should include a demonstration of BGP integration into the routing fabric, which can drive a new layer of agility in placing data across the data center and significantly greater flexibility than Fibre Channel or standard L2 networking.
- SELF-HEALING: The system should have predictive analytic capabilities that incorporate system-wide information, often called telemetry, into a feedback loop to continuously improve against desired attributes. Testing the system-wide monitoring capability should include an understanding of latency, performance, and availability information from each node, as well as the system’s ability to notify the test administrator of any issue at both the network and storage layers. Advanced systems use telemetry to help all users with practical capacity and performance planning and real-time best practices. Testing for this capability ensures that you select a system that can learn from itself and improve your environment over its lifecycle.
- LOCK-IN: The system should support a wide variety of hardware profiles (different server vendors, models, and generations, and a variety of media) to eliminate the potential for vendor lock-in. The infrastructure industry is notorious for vendors locking customers into an artificially limited set of choices designed to enrich their top lines. Enterprises that experienced this phenomenon with their array purchases, and even with public cloud contracts, are looking for open systems that generate hardware options, not lock-in. Test environments should therefore incorporate a variety of hardware options from the start, and advanced testing should combine multiple variables (different vendors, node profiles, media types, and server generations) in a single cluster. Ensuring this variety is key to the long-term value of the system and to getting the best terms on hardware at every expansion opportunity.
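For the latency and throughput requirements above, the benchmark itself is easy to script. The sketch below is a minimal example, assuming fio is installed on the test hosts and the storage system presents test volumes at block device paths such as /dev/sdb; the device list, run times, and targets are placeholders to adjust for your environment, not prescribed values. It runs a 4K random-read job for IOPS and latency and a 1M sequential-read job for throughput, then parses fio’s JSON output.

```python
#!/usr/bin/env python3
"""Minimal fio harness for the IOPS/latency and throughput tests.

Assumes fio 3.x is installed and the device paths below point at volumes
presented by the storage system under test. Device names and thresholds
are placeholders for your own environment.
"""
import json
import subprocess

DEVICES = ["/dev/sdb", "/dev/sdc"]      # placeholder test volumes
IOPS_TARGET = 1_000_000                 # 1M IOPS requirement
THROUGHPUT_TARGET_MBS = 64 * 1024       # 64 GB/s expressed in MB/s

def run_fio(name, rw, bs, iodepth, jobs):
    """Run one fio job across all test devices and return the parsed JSON report."""
    cmd = [
        "fio", "--name", name, "--rw", rw, "--bs", bs,
        "--ioengine=libaio", "--direct=1", "--time_based", "--runtime=120",
        f"--iodepth={iodepth}", f"--numjobs={jobs}", "--group_reporting",
        "--output-format=json", "--filename=" + ":".join(DEVICES),
    ]
    return json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)

# 4K random reads: IOPS and completion latency
rand = run_fio("randread-4k", "randread", "4k", iodepth=32, jobs=16)["jobs"][0]["read"]
iops = rand["iops"]
mean_lat_ms = rand["clat_ns"]["mean"] / 1e6

# 1M sequential reads: aggregate throughput (fio reports bw in KiB/s)
seq = run_fio("seqread-1m", "read", "1m", iodepth=8, jobs=8)["jobs"][0]["read"]
throughput_mbs = seq["bw"] / 1024

print(f"IOPS: {iops:,.0f}  (target {IOPS_TARGET:,})")
print(f"Mean completion latency: {mean_lat_ms:.2f} ms  (target < 1 ms)")
print(f"Throughput: {throughput_mbs:,.0f} MB/s  (target {THROUGHPUT_TARGET_MBS:,} MB/s)")
```

Running the random-read job from one set of hosts while the sequential job runs from another also gives a rough first look at multi-tenant behavior before the formal consolidation tests.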
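To compare performance with data management services off and on, one practical approach is to provision two otherwise identical volumes, one with compression, encryption, and deduplication disabled and one with them enabled (how the services are toggled is vendor-specific and outside this sketch), then run the same mixed-I/O profile against each at light and heavy load. The volume paths and load levels below are assumptions to adapt to your test bed.

```python
#!/usr/bin/env python3
"""Sweep one fio profile over volumes provisioned with data services off and on.

The volume paths are placeholders: 'baseline' is assumed to have compression,
encryption, and deduplication disabled, 'services_on' to have them enabled
(toggled through the vendor's own tooling, not here). Load is varied via
queue depth and job count.
"""
import json
import subprocess

VOLUMES = {
    "baseline": "/dev/sdb",      # placeholder: data services disabled
    "services_on": "/dev/sdc",   # placeholder: data services enabled
}
LOAD_LEVELS = {"light": (4, 2), "heavy": (32, 16)}   # (iodepth, numjobs)

def measure(device, iodepth, numjobs):
    """Run a 70/30 random read/write job and return (total IOPS, mean read latency ms)."""
    cmd = [
        "fio", "--name=dm-sweep", "--rw=randrw", "--rwmixread=70", "--bs=8k",
        "--ioengine=libaio", "--direct=1", "--time_based", "--runtime=300",
        f"--iodepth={iodepth}", f"--numjobs={numjobs}", "--group_reporting",
        "--output-format=json", f"--filename={device}",
    ]
    job = json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)["jobs"][0]
    return job["read"]["iops"] + job["write"]["iops"], job["read"]["clat_ns"]["mean"] / 1e6

for load, (depth, jobs) in LOAD_LEVELS.items():
    for label, device in VOLUMES.items():
        iops, lat_ms = measure(device, depth, jobs)
        print(f"{load:5s} load, {label:11s}: {iops:10,.0f} IOPS, {lat_ms:6.2f} ms mean read latency")
```

The ratio between the two volumes at each load level is the number to record; a system that sheds most of its performance once services are enabled fails the requirement regardless of its peak numbers.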
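For the availability requirement, a client-side probe that issues small timed reads against a test volume while you inject failures (pull a node, fail a rack, change a QoS policy, apply a software update) turns “we didn’t notice an outage” into a measured result. The sketch below assumes a Linux host with the test volume at a placeholder device path; any read that stalls past the threshold, or fails outright, is logged with a timestamp.

```python
#!/usr/bin/env python3
"""Client-side availability probe: timed direct reads against a test volume.

Run this while injecting failures and correlate logged stalls with the
events you trigger. The device path and stall threshold are placeholders.
Requires Linux (O_DIRECT, os.preadv) and a 4K-aligned device.
"""
import mmap
import os
import random
import time

DEVICE = "/dev/sdb"          # placeholder test volume
STALL_THRESHOLD_S = 1.0      # report any read taking longer than this
PROBE_INTERVAL_S = 0.5

fd = os.open(DEVICE, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, 4096)    # anonymous mapping is page-aligned, as O_DIRECT requires
size = os.lseek(fd, 0, os.SEEK_END)

while True:
    # Read a random 4K block so the page cache cannot mask a stalled backend.
    offset = random.randrange(0, size // 4096) * 4096
    start = time.monotonic()
    try:
        os.preadv(fd, [buf], offset)
        elapsed = time.monotonic() - start
        if elapsed > STALL_THRESHOLD_S:
            print(f"{time.strftime('%H:%M:%S')} stall: read took {elapsed:.2f}s at offset {offset}")
    except OSError as exc:
        print(f"{time.strftime('%H:%M:%S')} read failed at offset {offset}: {exc}")
    time.sleep(PROBE_INTERVAL_S)
```

Keeping the probe running through every failure scenario, and through planned operations such as software updates, is what exposes the gap between advertised “9s” and observed availability.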
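For the multi-orchestrator test, self-service provisioning should be scriptable through each orchestrator’s native API. As one example, the sketch below uses the official Kubernetes Python client to request a volume through a CSI-backed StorageClass; the StorageClass name, claim name, and size are placeholders for whatever the system under test exposes, and an equivalent request should be repeated through the VMware and OpenStack paths against the same cluster.

```python
#!/usr/bin/env python3
"""Request a block volume from Kubernetes via a CSI-backed StorageClass.

Assumes the kubernetes Python client is installed and a kubeconfig is
available. The StorageClass and claim names are placeholders.
"""
from kubernetes import client, config

config.load_kube_config()                 # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="poc-claim-001"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="sds-block",    # placeholder StorageClass exposed by the system under test
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)

created = core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
print(f"Created PVC {created.metadata.name}; watch for it to reach the Bound phase.")
```

Timing how long each claim takes to bind, and how the same cluster behaves when claims arrive from multiple orchestrators at once, gives a like-for-like view of provisioning velocity and of whether the system truly supports a common cluster rather than per-orchestrator islands.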
Fortune 1000 customers have every vendor in the IT industry at their beck and call. Selecting the right technologies to test and using the right test parameters—outlined above—will enable them to make the transition to a more automated, scalable and performant future for their data operations.
To learn more about the Datera platform and why the Fortune 1000 are using our software-defined storage solution to architect a new data future, please examine the following core whitepaper library: