Skip to main content

Troubleshooting Storage Performance in vSphere


When we troubleshoot performance related issues, the first think which would hit our mind it "Storage". So let's have a sneak peak about the basic troubleshooting of the storage related issues. 

Poor storage performance is generally the result of high I/O latency. vCenter or esxtop will report the various latencies at each level in the storage stack from the VM down to the storage hardware.  vCenter cannot provide information for the actual latency seen by the application since that includes the latency at the Guest OS and the application itself, and these items are not visible to vCenter. vCenter can report on the following storage stack I/O latencies in vSphere.

 Storage Stack Components in a vSphere environment
LatencyInStorageStack
GAVG (Guest Average Latency) total latency as seen from vSphere
KAVG (Kernel Average Latency) time an I/O request spent waiting inside the vSphere storage stack. 
QAVG (Queue Average latency) time spent waiting in a queue inside the vSphere Storage Stack.
DAVG (Device Average Latency) latency coming from the physical hardware, HBA and Storage device.

To provide some rough guidance, for most application workloads (typically 8k I/O size, 80% Random, 80% Read) we generally say anything greater than 20 to 30 ms of I/O Latency may be a performance concern. Of course as with all things performance related some applications are more sensitive to I/O latency then others so the 20-30ms guidance is a rough guidance rather than a hard rule. So we expect that GAVG or total latency as seen from vCenter should be less than 20 to 30 ms.  as seen in the picture, GAVG is made up of KAVG and DAVG.  Ideally we would like all our I/O to quickly get out on to the wire and thus spend no significant amount of time just sitting in the vSphere storage stack,  so we would ideally like to see KAVG very low.  As a rough guideline KAVG should usual be 0 ms and anything greater than 2ms may be an indicator of a performance issue. 
So what are the rule of thumb indicators of bad storage performance? 
•             High Device Latency: Device Average Latency (DAVG) consistently greater than 20 to 30 ms may cause a performance problem for your typical application. 
•             High Kernel Latency: Kernel Average Latency (KAVG) should usually be 0 in an ideal environment, but anything greater than 2 ms may be a performance problem.

Poor storage performance is generally the result of high I/O latency, but what can cause high storage performance and how to address it?   There are a lot of things that can cause poor storage performance
– Under sized storage arrays/devices unable to provide the needed performance
– I/O Stack Queue congestion
– I/O Bandwidth saturation, Link/Pipe Saturation
– Host CPU Saturation
– Guest Level Driver and Queuing Interactions
– Incorrectly Tuned Applications
– Under sized storage arrays 

As I mentioned above the key storage performance indicators to look out for are 1. High Device Latency  (DAVG consistently greater than 20 to 30 ms) and 
2. High Kernel Latency( KAVG greater than 2 ms). Once you have identified that you have High Latency you can now proceed to trying to understand why the latency is high and what is causing the poor storage performance. In this post, we will look at the top reason for high Device latency.
The Top reason for high device latency is simply not having enough storage hardware to meet your application’s needs (Yes, I have said it a third time now), that is a sure fire way to have storage performance issues.  It may seem basic, but too often administrators only size their storage on the capacity size they need to support their environment but not on the Performance IOPS/Latency/Throughput that they need.   When sizing your environment you really should consult your Application and Storage Vendor’s best practices and sizing guidelines to understand what storage performance your application will need any what your storage hardware can deliver.
How you configure your storage hardware, the type of drives you use, the raid configuration, the number of disk spindles in the array, etc… will all affect the maximum storage performance your hardware will be able to deliver.  Your storage vendor will be able to provide you the most accurate model and advice for the particular storage product you own, but if you need some rough guidance you can use the guidance provided in the chart below.
  Untitled-1 copy

The slide shows the general IOPs and Read & Write throughput you can expect per spindle depending on the RAID configuration and/or drive type you have in your array.    Also frequently I’m asked what is the typical I/O profile for a VM, the guidance varies greatly depending on the applications running in your environment, but a “typical” I/O workload for a VM would roughly be 8KB I/O size, 80% Random, 80% Read.  Storage intensive applications like Databases, Mail Servers,  Media Streaming, … have their own I/O profiles that may differ greatly from this “typical” profile.
One good way to make sure your storage is able to handle the demands of your datacenter, is to benchmark your storage.  There are several free and Open Source tools like IOmeter that can be used to stress test and benchmark your storage.  If you haven’t already taken a look at the I/O Analyzer tool delivered as a VMware Fling,  you might want to take a peek at it.  I/O Analyzer is a virtual appliance tool that provides a simple and standardized approach to storage performance analysis in VMware vSphere virtualized environments ( http://labs.vmware.com/flings/io-analyzer ).
Also when sizing your storage make sure your storage workloads are balanced “appropriately” across the paths in the environment, across the controllers and storage processors in the array and balanced and spread across the appropriate number of spindles in the array.  I’ll talk a bit more about “appropriately” balanced later on in this series as it varies depending on your storage array and your particular goals/needs.     
Simply sizing your storage correctly for the expected workload, in terms of size and performance capabilities, will go very far to making sure you don’t run into storage performance problems and making sure your Device Latency (DAVG) is less than that 20-30ms guidance.  

Source & Courtesy: http://blogs.vmware.com/vsphere/2012/06/troubleshooting-storage-performance-in-vsphere-part-3-ssd-performance.html


Popular posts from this blog

AD LDS – Syncronizing AD LDS with Active Directory

First, we will install the AD LDS Instance: 1. Create and AD LDS instance by clicking Start -> Administrative Tools -> Active Directory Lightweight Directory Services Setup Wizard. The Setup Wizard appears. 2. Click Next . The Setup Options dialog box appears. For the sake of this guide, a unique instance will be the primary focus. I will have a separate post regarding AD LDS replication at some point in the near future. 3. Select A unique instance . 4. Click Next and the Instance Name dialog box appears. The instance name will help you identify and differentiate it from other instances that you may have installed on the same end point. The instance name will be listed in the data directory for the instance as well as in the Add or Remove Programs snap-in. 5. Enter a unique instance name, for example IDG. 6. Click Next to display the Ports configuration dialog box. 7. Leave ports at their default values unless you have conflicts with the default values. 8. Click N...

HOW TO EDIT THE BCD REGISTRY FILE

The BCD registry file controls which operating system installation starts and how long the boot manager waits before starting Windows. Basically, it’s like the Boot.ini file in earlier versions of Windows. If you need to edit it, the easiest way is to use the Startup And Recovery tool from within Vista. Just follow these steps: 1. Click Start. Right-click Computer, and then click Properties. 2. Click Advanced System Settings. 3. On the Advanced tab, under Startup and Recovery, click Settings. 4. Click the Default Operating System list, and edit other startup settings. Then, click OK. Same as Windows XP, right? But you’re probably not here because you couldn’t find that dialog box. You’re probably here because Windows Vista won’t start. In that case, you shouldn’t even worry about editing the BCD. Just run Startup Repair, and let the tool do what it’s supposed to. If you’re an advanced user, like an IT guy, you might want to edit the BCD file yourself. You can do this...

DNS Scavenging.

                        DNS Scavenging is a great answer to a problem that has been nagging everyone since RFC 2136 came out way back in 1997.  Despite many clever methods of ensuring that clients and DHCP servers that perform dynamic updates clean up after themselves sometimes DNS can get messy.  Remember that old test server that you built two years ago that caught fire before it could be used?  Probably not.  DNS still remembers it though.  There are two big issues with DNS scavenging that seem to come up a lot: "I'm hitting this 'scavenge now' button like a snare drum and nothing is happening.  Why?" or "I woke up this morning, my DNS zones are nearly empty and Active Directory is sitting in a corner rocking back and forth crying.  What happened?" This post should help us figure out when the first issue will happen and completely av...