Wednesday, November 30, 2016

 

Storage Benchmarking with diskspd plus a LINQPad Script for Generating diskspd Batch Scripts

It is always a good idea to measure a performance baseline when commissioning (or choosing) new storage hardware or a new server, particularly for SQL Server. It is not uncommon for SANs to be configured sub-optimally, so knowing how close the storage’s performance comes to the vendor’s advertised numbers is important. You should also re-benchmark whenever you make hardware or configuration changes to the storage.

In the past, SQLIO was one of the most commonly used tools to perform I/O testing, but SQLIO has now been superseded. diskspd.exe is Microsoft’s replacement for SQLIO, with a more comprehensive set of testing features and expanded output. Like SQLIO, diskspd is a command line tool, which means it can easily be scripted to perform reads and writes of various I/O block sizes, including random and sequential access patterns, to simulate different types of workloads.

Where can I download diskspd?

diskspd is a stand-alone executable with no dependencies. You can download it from Microsoft TechNet – Diskspd, a Robust Storage Testing Tool.
Download the executable and unzip it into an appropriate folder. Once unzipped, you will see three subfolders with different executable targets: amd64fre (for 64-bit systems: the most common server target), x86fre (for 32-bit systems) and armfre (for ARM systems). The source code is hosted on GitHub here.

Analyzing I/O Performance: What Metrics should I measure?

The three main characteristics that are used to describe storage performance are (from Glenn Berry’s post: Analyzing I/O Performance for SQL Server):

Latency

Latency is the duration between issuing a request and receiving the response. The measurement begins when the operating system sends a request to the storage and ends when the storage completes the request. Reads are complete when the operating system receives the data; writes are complete when the drive signals the operating system that it has received the data.
For writes, the data may still be in a cache on the drive or disk controller, depending on your caching policy and hardware. Write-back caching is much faster than write-through caching, but it requires a battery backup for the disk controller. For SQL Server usage, you want to make sure you are using write-back caching rather than write-through caching if at all possible. You also want to make sure your hardware disk cache is actually enabled: some vendor disk management tools disable it by default.

IOPS (Input/Output Operations per Second)

The second metric is Input/Output Operations per Second (IOPS). A constant latency of 1 ms means that a drive can process 1,000 I/Os per second at a queue depth of 1; as more I/Os are added to the queue, latency will increase. One of the key advantages of flash storage is that it can read/write multiple NAND channels in parallel, and it has no electro-mechanical moving parts to slow access down. IOPS equals queue depth divided by latency, and IOPS by itself does not consider the transfer size of an individual disk transfer. You can translate IOPS to MB/sec, and MB/sec to latency, as long as you know the queue depth and transfer size.
The majority of storage vendors report their IOPS performance using a 4K block size, which is largely irrelevant for SQL Server workloads, since the majority of the time SQL Server reads data in 64K chunks (see “IOPS Are A Scam”). To convert a 4K-block IOPS figure into a 64K-block figure, simply divide by 16; to convert IOPS into MB/sec, multiply IOPS by the block transfer size.
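Since the LINQPad script later in this post is C#, here is a minimal C# sketch of those two conversions (the 100,000 IOPS vendor figure is a made-up example):

  // Hypothetical vendor figure: 100,000 IOPS measured at a 4K block size
  const double vendorIops4K = 100000;

  // A 64K block is 16 times the size of a 4K block, so divide the IOPS figure by 16
  double iops64K = vendorIops4K / 16;                  // 6,250 IOPS at 64K

  // Throughput = IOPS x block transfer size (decimal megabytes here)
  double mbPerSec = vendorIops4K * 4096 / 1000000;     // ~410 MB/sec

  Console.WriteLine("{0:N0} IOPS @ 64K, {1:N0} MB/sec", iops64K, mbPerSec);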

Throughput

Sequential throughput is the rate at which you can transfer data, typically measured in megabytes per second (MB/sec) or gigabytes per second (GB/sec). Your sequential throughput in MB/sec equals IOPS times the transfer size. For example, 135,759 IOPS times a 4096-byte transfer size equals 556 MB/sec, while 135,759 IOPS times an 8192-byte transfer size would be 1112 MB/sec of sequential throughput. Despite its everyday importance to SQL Server, sequential disk throughput often gets short-changed in enterprise storage, both by storage vendors and by storage administrators. It is also fairly common to see the actual magnetic disks in a direct attached storage (DAS) enclosure or a storage area network (SAN) device be so busy that they cannot deliver their full rated sequential throughput.
Sequential throughput is critical for many common database server activities, including full database backups and restores, index creation and rebuilds, and large data warehouse-type sequential read scans (when your data does not fit into the SQL Server buffer pool).
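Working the formula in the other direction is useful for capacity planning. Here is a minimal sketch (the 1 GB/sec backup-read target is an arbitrary assumption; 512K is a typical backup transfer size, as noted in the script below):

  // Required IOPS = target throughput / transfer size
  const double targetBytesPerSec = 1000000000;    // 1 GB/sec backup read target (assumed)
  const double transferBytes = 524288;            // 512K blocks, typical of backup I/O

  double requiredIops = targetBytesPerSec / transferBytes;   // ~1,907 IOPS

  Console.WriteLine("{0:N0} IOPS needed at 512K", requiredIops);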

How do I use diskspd?

WARNING: Ideally, you should perform diskspd testing when there is no other activity on the server and storage. You could be generating a large amount of disk I/O, network traffic and/or CPU load when you run diskspd. If you’re in a shared environment, you might want to talk to your administrator(s) before running such a test, as it could negatively impact anyone else using other VMs on the same host, other LUNs on the same SAN or other traffic on the same network.

Ensure the account that will run diskspd has been granted the ‘Perform volume maintenance tasks’ right: run secpol.msc -> Local Policies -> User Rights Assignment -> ‘Perform volume maintenance tasks’.
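You can confirm the right is present on the account’s token with the whoami command (the privilege’s internal name is SeManageVolumePrivilege; note you may need to log off and back on after granting it before it appears):
> whoami /priv | findstr /i SeManageVolume
The privilege will typically be listed with a state of ‘Disabled’; that is fine, as programs such as diskspd enable it at run time. What matters is that it is listed at all.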

NOTE: You should run diskspd from an elevated command prompt (by choosing “Run as Administrator”). Together with the right above, this ensures the test file can be created almost instantly; otherwise, diskspd will fall back to a slower method of creating files.

diskspd parameters

You can get a complete list of all the supported command line parameters and usage by entering the following at a command prompt:
> diskspd.exe
The most common parameters are:
Parameter            Description
-d                   Test duration in seconds. Aim for at least 60 seconds
-W                   Test warm-up time in seconds
-b                   I/O block size (K/M/G). e.g. -b8K means an 8KB block size, -b64K means a 64KB block size: both are relevant for SQL Server
-o                   Number of outstanding I/Os (queue depth) per target, per worker thread
-t                   Worker threads per test file
-Su                  Disable software caching
-Sw                  Enable write-through (no hardware write caching). Normally used together (-Suw) to replace the deprecated -h (or, equivalently, use -Sh)
-L                   Capture latency information
-r                   Random data access tests
-si                  Thread-coordinated sequential data access tests
-w                   Write percentage. For example, -w10 means 10% writes, 90% reads
-Z<size>[K|M|G|b]    Workload test write source buffer size. Used to supply random data (entropy) for writes, which is a good idea for SQL Server testing and for testing de-duplication behaviour on flash arrays
-c<size>[K|M|G|b]    Create workload file(s) of the specified size
e.g.:
diskspd.exe -Suw -L -W5 -Z1G -d60 -c440G -t8 -o4 -b8K -r -w20 E:\iotest.dat > output.txt

This will run a 60 second random I/O test using a 440GB test file located on the E: drive, with a 20% write and 80% read ratio, using an 8K block size and a 5 second warm up. It will use eight worker threads, each with four outstanding I/Os and a write entropy buffer of 1GB, and save the results to a text file named output.txt. This set of parameters is representative of a SQL Server OLTP workload.
Note: The test file size (you can have multiple test files) should be larger than the SAN’s DRAM cache (and ideally not an exact multiple of it).

LINQPad Script

To automate the creation of a batch of testing scenarios, rather than manually editing command lines (which is tedious and error-prone), I’ve written a simple C# LINQPad script:

  
  const string batchScriptFilename = @"c:\temp\diskspd.bat";

  // Flags used in each run and do not vary
  string disableCaching    = "-Suw";            // -Suw: Disable both software and hardware write caching. SQL Server does this.
                                                // Su = disable software caching, Sw = enable write-through (no hardware write caching)
  string captureLatency    = "-L";              // capture disk latency numbers
  string warmWorkLoad      = "-W5";             // Warm up time in seconds
  string entropyRandomData = "-Z1G";            // Used to supply random data (K/M/G) for writes, which is good for SQL Server testing.
  string testduration      = "-d120";           // Test duration in seconds NB: At least 60 seconds, 2-3 minutes is good 
  string testFileSize      = "-c440G";          // Nothing smaller than the SAN's cache size (and not an exact multiple of it)
  string testFileFullPath  = @"E:\iotest.dat";  // Test file name (goes at the end of the command)
  string resultsFilename   = @"output.txt";     // File to output the text results

  // prefix results file name with date
  resultsFilename = DateTime.Now.Date.ToString("yyyyMMdd") + "_" + resultsFilename;
  
  // Lists of varying params to use
  var randomOrSequential = new List<string> { "-r", "-si" };                 // -r = Random, -si = Sequential
  var writepercentage = new List<string> { "-w0", "-w10", "-w25", "-w100" }; // -w0 means no writes: -w10 = 90%/10% reads/writes           
  var blocksize = new List<string> { "-b8K", "-b64K", "-b512K", "-b2M" };    // 2M represents SQL Server read ahead, 512K backups
  var overlappedIOs = new List<string> { "-o2", "-o4", "-o8", "-o16"};       // This is queue depth
  var workerthreads = new List<string> { "-t4", "-t8", "-t16", "-t32" };     // Worker threads

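  // Estimated total run time: one diskspd run per parameter combination, each lasting (duration + warm-up) seconds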
  int runTimeSeconds = randomOrSequential.Count() * writepercentage.Count() * blocksize.Count() * 
                       overlappedIOs.Count() * workerthreads.Count() * 
                       (Int32.Parse(testduration.Substring(2)) + Int32.Parse(warmWorkLoad.Substring(2)));

  using (StreamWriter fs = new StreamWriter(batchScriptFilename))
  {
      fs.WriteLine("REM Expected run time: {0} Minutes == {1:0.0} Hours", runTimeSeconds / 60, runTimeSeconds / 3600.0);

      string cmd = string.Format("diskspd.exe {0} {1} {2} {3} {4} {5} ",
                                 disableCaching, captureLatency, warmWorkLoad,
                                 entropyRandomData, testduration, testFileSize);
      // Yes, LINQ could be used!
      for (int i1 = 0; i1 < writepercentage.Count(); i1++)
      {
          for (int i2 = 0; i2 < randomOrSequential.Count(); i2++)
          {
              for (int i3 = 0; i3 < blocksize.Count(); i3++)
              {
                  for (int i4 = 0; i4 < overlappedIOs.Count(); i4++)
                  {
                      for (int i5 = 0; i5 < workerthreads.Count(); i5++)
                      {
                          fs.WriteLine(string.Format("{0} {1} {2} {3} {4} {5} {6} >> {7}",
                                              cmd,
                                              workerthreads[i5],
                                              overlappedIOs[i4],
                                              blocksize[i3],
                                              randomOrSequential[i2],
                                              writepercentage[i1],
                                              testFileFullPath,
                                              resultsFilename
                                            ));
                      }
                  }
                  fs.WriteLine("");
              }
              fs.WriteLine("");
          }
          fs.WriteLine("");
      }
  }
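As the comment in the loop notes, the five nested for loops are really just a cross join. If you prefer LINQ, an equivalent query expression (a sketch that produces the same command lines, minus the blank separator lines) would be:

  var commandLines = from w  in writepercentage
                     from rs in randomOrSequential
                     from b  in blocksize
                     from o  in overlappedIOs
                     from t  in workerthreads
                     select string.Format("{0} {1} {2} {3} {4} {5} {6} >> {7}",
                                          cmd, t, o, b, rs, w,
                                          testFileFullPath, resultsFilename);

  // then simply: foreach (var line in commandLines) fs.WriteLine(line);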

 

Short Test Batch Script

A batch script to perform an initial (relatively) quick test would look something like the following:
REM Expected run time: 98 Minutes == 1.6 Hours
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o2 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o2 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o2 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o4 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o4 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o4 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o8 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o8 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o8 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o16 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o16 -b8K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o16 -b8K -r -w0 E:\iotest.dat >> output.txt

diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o2 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o2 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o2 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o4 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o4 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o4 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o8 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o8 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o8 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o16 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o16 -b64K -r -w0 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o16 -b64K -r -w0 E:\iotest.dat >> output.txt

diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o2 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o2 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o2 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o4 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o4 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o4 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o8 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o8 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o8 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o16 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o16 -b8K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o16 -b8K -r -w20 E:\iotest.dat >> output.txt

diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o2 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o2 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o2 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o4 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o4 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o4 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o8 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o8 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o8 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t4 -o16 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t8 -o16 -b64K -r -w20 E:\iotest.dat >> output.txt
diskspd.exe -Suw -L -W3 -Z1G -d120 -c440G  -t16 -o16 -b64K -r -w20 E:\iotest.dat >> output.txt

 

Interpreting the diskspd results

diskspd produces quite a bit of output per run. The first section is a recap of the parameters that were used in the command line:
Command Line: diskspd.exe -Suw -L -W5 -Z1G -d120 -c440G -t16 -o4 -b64K -r -w10 E:\iotest.dat

Input parameters:

    timespan:   1
    -------------
    duration: 120s
    warm up time: 5s
    cool down time: 0s
    measuring latency
    random seed: 0
    path: 'E:\iotest.dat'
        think time: 0ms
        burst size: 0
        software cache disabled
        hardware write cache disabled, writethrough on
        write buffer size: 1073741824
        performing mix test (read/write ratio: 90/10)
        block size: 65536
        using random I/O (alignment: 65536)
        number of outstanding I/O operations: 4
        thread stride size: 0
        threads per file: 16
        using I/O Completion Ports
        IO priority: normal

This is a great improvement over SQLIO, which did not echo the run parameters or provide a readable summary of them, making it hard to decipher runs at a later date.
Next is a summary of CPU information, which can help determine whether your storage test is CPU-bottlenecked:
actual test time:   120.00s
thread count:       16
proc count:     32

CPU |  Usage |  User  |  Kernel |  Idle
-------------------------------------------
   0|  10.21%|   1.09%|    9.11%|  89.79%
   1|  10.31%|   1.09%|    9.22%|  89.69%
   2|  10.14%|   1.08%|    9.06%|  89.86%
   3|  18.26%|   0.94%|   17.32%|  81.74%
   4|   7.86%|   1.12%|    6.74%|  92.14%
   5|   7.79%|   0.91%|    6.87%|  92.21%
   6|   7.55%|   1.15%|    6.41%|  92.45%
   7|   7.71%|   1.13%|    6.58%|  92.29%
   8|   0.00%|   0.00%|    0.00%|   0.00%
 ...
-------------------------------------------
avg.|   2.49%|   0.27%|    2.23%|  22.51%

The results for each thread should be very similar in most cases.
After the CPU summary is the I/O summary, split into total (read + write), followed by separate read and write statistics:
Total IO
thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
-----------------------------------------------------------------------------------------------------
     0 |     10107486208 |       154228 |      80.33 |    1285.23 |    3.109 |     3.640 | E:\iotest.dat (440GB)
     1 |     10038870016 |       153181 |      79.78 |    1276.50 |    3.130 |     4.082 | E:\iotest.dat (440GB)
     2 |     10062594048 |       153543 |      79.97 |    1279.52 |    3.123 |     4.048 | E:\iotest.dat (440GB)
     3 |     10012590080 |       152780 |      79.57 |    1273.16 |    3.138 |     3.954 | E:\iotest.dat (440GB)
     4 |     10169417728 |       155173 |      80.82 |    1293.10 |    3.090 |     3.909 | E:\iotest.dat (440GB)
     5 |     10148446208 |       154853 |      80.65 |    1290.44 |    3.096 |     4.159 | E:\iotest.dat (440GB)
     6 |     10158669824 |       155009 |      80.73 |    1291.74 |    3.093 |     4.024 | E:\iotest.dat (440GB)
     7 |     10205724672 |       155727 |      81.11 |    1297.72 |    3.079 |     3.901 | E:\iotest.dat (440GB)
     8 |     10096607232 |       154062 |      80.24 |    1283.85 |    3.112 |     3.896 | E:\iotest.dat (440GB)
     9 |     10057023488 |       153458 |      79.93 |    1278.81 |    3.124 |     4.187 | E:\iotest.dat (440GB)
    10 |     10092347392 |       153997 |      80.21 |    1283.30 |    3.113 |     3.951 | E:\iotest.dat (440GB)
    11 |      9996730368 |       152538 |      79.45 |    1271.15 |    3.143 |     3.894 | E:\iotest.dat (440GB)
    12 |     10157883392 |       154997 |      80.73 |    1291.64 |    3.093 |     4.040 | E:\iotest.dat (440GB)
    13 |     10157424640 |       154990 |      80.72 |    1291.58 |    3.093 |     3.934 | E:\iotest.dat (440GB)
    14 |     10177937408 |       155303 |      80.89 |    1294.19 |    3.087 |     3.978 | E:\iotest.dat (440GB)
    15 |     10223681536 |       156001 |      81.25 |    1300.00 |    3.073 |     3.642 | E:\iotest.dat (440GB)
-----------------------------------------------------------------------------------------------------
total:      161863434240 |      2469840 |    1286.37 |   20581.94 |    3.106 |     3.955

Remember: the I/Os are recorded in whatever block size the test specified. In the case above, each I/O is 64K.
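You can sanity-check the totals line with the same IOPS-to-throughput arithmetic as earlier: 20,581.94 I/Os per second × 65,536 bytes ≈ 1,348,858,000 bytes per second, and dividing by 1,048,576 gives the reported 1,286.4 MB/s (the MB/s column is evidently in binary megabytes, since the numbers only reconcile that way).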
Last, but not least are the latency measurements:
  %-ile |  Read (ms) | Write (ms) | Total (ms)
----------------------------------------------
    min |      0.535 |      0.729 |      0.535
   25th |      2.531 |      3.446 |      2.565
   50th |      2.796 |      3.792 |      2.849
   75th |      3.088 |      4.227 |      3.211
   90th |      3.439 |      4.743 |      3.745
   95th |      3.763 |      5.179 |      4.169
   99th |      4.818 |      6.761 |      5.274
3-nines |     38.694 |     42.926 |     39.374
4-nines |    207.585 |    209.483 |    207.734
5-nines |    208.562 |    210.939 |    209.483
6-nines |    209.058 |    211.330 |    210.939
7-nines |    209.256 |    211.330 |    211.330
8-nines |    209.256 |    211.330 |    211.330
9-nines |    209.256 |    211.330 |    211.330
    max |    209.256 |    211.330 |    211.330

This last section shows the latency percentile distribution of the test results, from the minimum to the maximum value in milliseconds, split into reads, writes and total latency. It’s essential to know how the storage will perform and respond under load, so this section should be examined carefully. The “n-nines” entries in the ‘%-ile’ column refer to the number of nines: 3-nines means 99.9%, 4-nines means 99.99%, and so on. If you want to accurately measure the higher percentiles, you should run longer duration tests that generate a larger number of I/O operations.
What you want to look for in the latency results is the point at which the values make a large jump. In this test, 99% of the reads had a latency of 4.818 milliseconds or less, but if we go higher, 99.9% of the reads had a latency of 38.694 milliseconds or less.

    
