Storage Benchmarking (part 1 of…)

This will be a series of blog posts that will try to help you establish a consistent storage benchmarking methodology.

Storage is an area that I have focused on for much of my career, I’ve been fortunate to be involved in a lot of very challenging and fun storage projects over the years.  I spent a good period of time in the trenches, either for storage manufacturers or a reseller, performing professional services for custom storage implementations.  It is common place for apprehension on any changes in storage systems, if that is vendor, disk types, RAID type and grouping of disks, and more.  Storage is a complicated (and expensive) beast, it is still one of the most expensive (and profitable) components within the entire datacenter.

Virtualization changed the world for IT, it is one of the most disruptive single concepts that truly changed how IT does business.  I know I don’t have to give history lessons on the impact this has had for the hardware vendors, but it is apparent it was mostly negative for server and network vendors and extremely positive for storage vendors. Prior to virtualization being common place storage area networks (SANs), or other shared storage, was not overly common…when it did exist it was for niche applications that represented a very small part of the services an IT department managed.  With virtualization, and really all of the great things that came with VMware’s virtualization (e.g. high availability, Vmotion, etc), shared storage became a fundamental component in all datacenters.  Shared storage, if it be fiber channel, iSCSI, or NFS, went from representing a fraction of the data storage in a company to addressing essentially all data stored.

Storage has become more critical today than it ever was, we have optimized and streamlined all other aspects of the IT platform…however most storage vendors are still building products that are based on ancient principles, changes are complex and high risk.  It is critical that you have a methodology to actually assess any changes to your storage environment to determine if it is a net gain or loss for your business, and storage performance directly impacts the success of the business.  The challenge is that storage benchmarking is really hard, it is a monumental task to take on and actually do it well.  Various tools have been used to try and do point performance tests, however they are not adequate for assessing replacing an entire storage platform or changing the configuration of an existing one.

Now that I have a long winded introduction lets start looking at what it takes to be successful in storage benchmarking.  Like any initiative it needs to be an actual project with a process, while you can run and download Iometer and run a test within minutes…does that test actually mean anything?  Does it correlate to anything other than another similar run of Iometer?  Likely not.

I happen to focus on storage for a growing public cloud provider, I have spent a significant portion of time over the past 3+ years benchmarking storage platforms.  I have tried various tools to try to assess a rating for storage systems that are under evaluation, and what is important to me may not be important to you.  You need to determine what are the critical aspects for your business, here are the primary areas that I score systems on, in no particular order:

  • Reliability
  • Availability
  • Durability & Security
  • Sustainability
  • Scalability
  • Performance

That is a lot of abilities, so I will break down what I mean in each one into what I am referring to.  You will need to determine what are the specific requirements within each of the scoring areas are for your use case, as nothing is valid if it isn’t within the context of your use case.

Reliability

Reliability is a pretty critical for my use case, and this extends to how predictable are all of the other areas that we are evaluating.  When a component fails, do you have a consistent outcome?  Does performance become unpredictable?  Does the “useable” capacity that is reported change after a failure?  There are a lot of gotchas, and the key is to know and be able to predict them as you must factor them into your implementation plan for the solution.  This is an area that isn’t entirely objective as you just can’t test all conditions and context is everything.

Availability

This is a measure of resiliency, fault tolerance, survivability or in other words how consistent can you access my data..even during failure conditions.  This is where storage vendors fit their dog-and-pony show in the datacenter in, they pull this disk and that cable to “prove” that the data is available even after these failures.  I won’t go into the specific tests that I use for this, you may trust your vendor or you may not…how this is assessed must be within the context of each specific storage system.  How you test/prove availability for legacy architecture, scale-out architecture, or hyper converged architecture type storage systems can vary greatly.

Durability & Security

How confident are you that when written is the data that returned and, perhaps even more critical, that the data returned is the actual data that was written.  Security is grouped with durability as there are two aspects of security, confidence that an unauthorized entity cannot modify or otherwise access my data.  This primarily relates to checksums on written data (to set confident point of reference), scrubbing on stored data (to compare to point of reference), and repair of data (create new copy from parity or mirroring).  This is nearly impossible to actually test for, it is something you must query your vendor about.  There are many ways faults can be introduced that can cause loss of data integrity: bit rot, medium read error, controller/cable failures, and more.  Most vendors address this through checksums, in fact it wasn’t long ago that many vendors used this as the primary differentiator between “enterprise” storage systems or not…and it is often assumed to be present in modern enterprise storage systems, but you may be surprised to find that it doesn’t exist in many popular solutions.

Security is often addressed through encryption, some vendors may claim that their self encrypting disks (SED) are all that is necessary, however if the keys are stored on the disk then the encryption does nothing more than provide rapid-erase (if the disk is still operational), such as before you send it for replacement or otherwise decommission it.

In proprietary systems you really have to trust your vendor, and ideally the vendor has 3rd party validation of their offering so that you aren’t just taking the word of an individual that may be more interested in closing the deal than you keeping your job or your company surviving the “what if”, when it does happen.

Sustainability

Another area that may be more subjective than not, as it is difficult to assess this without a great amount of actual experience using the particular system…so if you are looking at a new product you have to be subjective based on your exposure while evaluating, or try to find IDC, or other, rankings comparing the offering.

Ultimately, is it operational within your environmental boundaries, both physical, staffing (expertise), etc.  Does it integrate seamlessly with any existing processes or tools that you depend on (e.g. monitoring and alerting systems)?  Can you or your staff adequately manage the system during a time of crisis (you know, 6am on a holiday when something horrific happens)?  Having logical and intuitive interfaces is a big differentiator here, or do you refer to documentation anytime you try to manage the system?

Scalability

Scalability is absolutely critical for my use case, and it actually is a comprehensive topic that addresses all of the other abilities and performance.  Does reliability of the system decrease, maintain or increase with scale?  Does the risk for data loss increase or decrease with scale?  Can you sustainably operate 100s or 1000s of these systems?  Do you need 100 operators to manage 1000 systems?

Performance

This is the area that I will devote other posts to, as this is where things get more complex.  Vendors typically have comparison between their solution and others that cover the other topics, however reference performance benchmarks are just that, a reference.  Any benchmark is only valid as a comparison to another benchmark executed the same way using the same workload, if that is a synthetic benchmark or not…and if all instances being compared were setup consistently to the respective vendors “best” practices.

To be continued…

Advertisement

vCloud Director – Using Guest Customization Scripts (Linux)

The intent of this article is to cover the steps for leveraging scripting within guest customization. A vCloud user may wish to peruse this as an avenue of automatically installing additional software that is hostname specific, e.g. security management software that integrates a Linux OS to Active Directory.

I am going to assume the reader knows how to login to vCloud Director, either within an organization or within the system context. I also assume that an existing virtual machine exists that we will work with, in my example I will use Linux (CentOS).

  1. Stop the vApp if it is currently running (we cannot edit the properties of a running VM)
  2. Open the vApp so that we can see the individual virtual machines

    wpid-voila_capture569-2012-03-15-18-24.png

  3. Right click the virtual machine (or use the action menu) to access the Properties
  4. Switch to the Guest OS Customization tab
  5. Select the option to “Enable guest customization”

    wpid-voila_capture582-2012-03-15-18-24.png

  6. This enables basic guest customization, such as configuring the guest OS hostname, setting the root password and network configuration.
  7. Scroll down within the guest customization tab
  8. You will see a text box, we can input script content within this text box. Alternatively you can upload the script that will be injected into the guest OS during the customization process. I will first start with a simple script that calls an existing shell script within the guest OS. Please also notice that we have specific sections for “precustomization” and “postcustomization”, pre-customization is before the standard vCloud Director customization process and the other is post this process. If the script that you wish to use is dependent upon the hostname or network connectivity, then you would be best served by using a post-customization script. 

In my example I am calling out to two scripts myscript-pre.sh and myscript-post.sh — these scripts must be in place within the OS file system before it can be ran

    .wpid-voila_capture581-2012-03-15-18-24.png

    NOTE: If you wish to upload a script using the Browse button it must be a text only script, it cannot be an executable binary.

  9. Click OK to save those changes
  10. Power on the virtual machine as usual
  11. Create your script within the guest OS in the path you specified
  12. My test script is quite lame, so don’t laugh. The goals are to answer questions that I’ve seen, such as if the network is available and which user context the script runs under.
    • Pre-customization:

      wpid-voila_capture587-2012-03-15-18-24.png

    • Post-customization:

      wpid-voila_capture588-2012-03-15-18-24.png

  13. Shutdown your virtual machine

    wpid-voila_capture577-2012-03-15-18-24.png

  14. Right click and select to Power On and Force Recustomization
  15. After customization completes, login and verify that your script ran.
    • Pre-customization:
      wpid-voila_capture589-2012-03-15-18-24.png
    • Post-customization:

      wpid-voila_capture585-2012-03-15-18-24.png

Observations:

There seems to be little documentation from VMware on “when” exactly a pre-customization script is ran vs a post-customization script. The time is only 23 seconds apart, so what exactly occurs during those 23 seconds? Logging services (syslogd) and most other system services do not start until after the pre-customization script has ran, so little output exists for what occurs during that window (or prior). It appears that pre-customization occurs at the time that vmware-tools start, on my system that is S03…which is the 2nd service to start (after microcode_ctl). You can also compare your time stamps to /var/log/messages in order to see what events are occurring.

In looking at the /var/log/vmware-imc/customization.log we can see a bit more detail as to timing.

wpid-voila_capture586-2012-03-15-18-24.png

Pre-customization occurs before the default vCloud Director customization scripts set execute, which set hostname and network configs (and generate SID or join an AD domain on Windows).

Post-customization is likely the area that most scripts will need to be executed, after the network configuration is set. In testing I encountered a situation that a script that was dependent on additional network services (e.g. to support NFS) would fail if executed directly as a post-customization script, a work around that resolved this was just adding a “sleep 30” prior to the script execution.

An area of challenge is troubleshooting these scripts as there is no way to run customization in an interactive form. The easiest way to confirm things are going to work is making sure the script can run as root if you execute it directly from a login shell. Next you can insert it into the post-customization process and assume that it will work. VMware has published a couple of KB articles that discuss which log files are relevant to the process, you can review those logs for any errors. Ideally your script itself will have error logging capability.

If you wish for advanced customization capabilities, then your best bet is probably to not use the vCloud Director customization at all…or at least only use it to configure the networking. vCenter Orchestrator is far more feature rich and extensible, the limitations on what can be done in vCenter are most likely only constrained by the amount of effort you put into developing your workflows. The customization process used within vCloud Director is more similar to that of Lab Manager than of vCenter, so if you run into trouble you may try searching under Lab Manager discussion groups.

References: