Note: I had written this post 2 years ago but somehow never noticed it in my drafts folder…
What was old is new. Long ago we used internal RAID on servers for most applications; in some cases we would go as far as using internal HBAs with external JBODs to allow two physical servers to share some logical volumes, or to get the most out of a “high capacity” RAID enclosure (at the time they seemed high, but by today’s standards many phones offer more addressable capacity). Over time we moved all of this critical data to a shared storage system, perhaps a SAN (storage area network). SAN vendors have continued to charge high prices for decreasing value, leaving the storage market ripe for disruption by distributed storage that leverages commodity hardware and is delivered as software. No longer will we find it acceptable to pay $2,500 for a disk drive in a SAN that we can buy on the street for $250.
This leads me to repeating the past: I find myself in desperate need of brushing up on managing the RAID controllers in my hosts. Perhaps this is for VSAN, or ScaleIO, or some other converged storage offering that can leverage my existing compute nodes and all of that formerly idle storage potential. As we make this transition we find that the selection criteria we had for our compute hosts are no longer valid, or at least not ideal for a converged deployment. Up until now the focus has been on compute density, either CPU cores per rack unit or physical RAM per rack unit…in fact, many blade vendors found a nice market by focusing on exactly that.
What these siloed compute servers all had in common was minimal internal storage; we didn’t need it. We needed massive compute density to make room for our really expensive SAN with all of its pretty lights. As we move down this path of converged compute and storage, we need to dig out some of our selection criteria from a decade ago. We now need to weigh disk slots per rack unit into our figures. It turns out we can give up considerable CPU+RAM density, yet by implementing converged storage we can drastically reduce the cost of providing the entire package of compute and storage. We must look at the balance of compute to storage more closely as these resources become tightly coupled; there are new considerations we are not accustomed to that, if not accounted for, can lead to project failure.
When the hypervisor first started gaining ground there was a lot of debate over the consolidation ratio that made sense. Some vendors and integrators argued that Big Iron made the most sense: a server with massive CPU and RAM density that allowed for ridiculous VM:host ratios. What we found is that this creates a massive failure domain, and the larger the failure domain, the more capacity we have to reserve. The cost of our HA (high availability) insurance scales with our host density: to survive the loss of one host out of N equally sized hosts, we must reserve 1/N of the cluster’s capacity, so fewer, denser hosts mean a larger reserve. Likewise, the time to enter maintenance mode for each host correlates directly with the utilized RAM density on that host: the more RAM in use, the longer every maintenance cycle takes.
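The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, not any vendor's sizing tool; the host count, RAM figure, and network throughput below are hypothetical, and real migrations rarely sustain full line rate.

```python
# Hypothetical HA-reserve and maintenance-mode arithmetic for the
# trade-off described above. All inputs are illustrative.

def ha_reserve_fraction(n_hosts: int, failures_to_tolerate: int = 1) -> float:
    """Fraction of cluster capacity reserved to survive host failures,
    assuming equally sized hosts."""
    return failures_to_tolerate / n_hosts

def ram_evacuation_seconds(used_ram_gb: float, net_gbps: float = 10.0) -> float:
    """Rough time to migrate a host's active RAM away at line rate."""
    gb_per_sec = net_gbps / 8          # 10 Gb/s ~= 1.25 GB/s
    return used_ram_gb / gb_per_sec

# Four Big Iron hosts: a full 25% of the cluster sits idle as HA insurance.
print(ha_reserve_fraction(4))                        # 0.25
# Sixteen smaller hosts: only 6.25% reserved for the same protection.
print(ha_reserve_fraction(16))                       # 0.0625
# Draining 512 GB of used RAM over 10 GbE takes roughly 7 minutes.
print(round(ram_evacuation_seconds(512) / 60, 1))    # 6.8
```

The same total capacity split across more, smaller hosts shrinks both the HA reserve and the per-host maintenance window.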
This is relevant because when we look at converged storage (or hyperconverged, as some call it) we have to consider the exact same thing. We still have the traditional compute items to account for, but now we must factor in storage as well. Our host is now a failure domain for storage, so we must reserve one host (or more) of storage capacity…this also means that when a host enters maintenance mode, in the worst case we have to move an entire host’s worth of stored data to ensure accessibility.
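To see why this changes the maintenance-mode calculus, compare moving a host's stored data to moving its RAM. A rough sketch, with hypothetical drive counts and utilization; real distributed storage systems rebuild from many peers in parallel rather than over a single link, so treat this as a worst-case bound.

```python
# Hypothetical worst-case data evacuation for a converged host:
# every figure below (drive count, size, utilization, link speed)
# is illustrative, not from any particular product.

def data_evacuation_hours(drives: int, drive_tb: float,
                          utilization: float, net_gbps: float = 10.0) -> float:
    """Hours to move one host's stored data over a single network link."""
    data_gb = drives * drive_tb * 1000 * utilization   # TB -> GB (decimal)
    gb_per_sec = net_gbps / 8
    return data_gb / gb_per_sec / 3600

# 12 x 4 TB drives at 70% full is ~33.6 TB to relocate:
print(round(data_evacuation_hours(12, 4.0, 0.7), 1))   # 7.5 (hours)
```

Where draining a host's RAM was a matter of minutes, draining its storage is a matter of hours, which is exactly why the compute-to-storage balance per host now needs to be weighed so carefully.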