SMB Direct and RDMA performance demo from TechEd (includes summary, PowerShell scripts and links)

Overview

 

My last TechEd demo showed some interesting performance data for SMB3 over RDMA (SMB Direct), including the latest small IO improvements in Windows Server 2012 R2.

Since I keep getting questions about the setup I used, here are some of the details about the hardware, software, results and script used.

 

Hardware

 

For that demo, I used a single storage server and a single compute server.

I used an EchoStreams FlacheSAN2 working as the File Server, with 2 Intel CPUs at 2.40 GHz and 64GB of RAM. It includes 6 LSI SAS adapters and 48 Intel SSDs attached directly to the server. This is an impressively packed 2U unit.

The Hyper-V Server was a Dell PowerEdge R720 with 2 Intel CPUs at 2.70 GHz and 64GB of RAM.

Both the file server and the Hyper-V host had 3 RDMA-capable 54 Gbps NICs (Mellanox ConnectX-3 using FDR InfiniBand), all used simultaneously via SMB Multichannel.
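As a quick back-of-the-envelope check of what this networking setup can deliver in aggregate (a sketch using the decimal units the post uses throughout):

```python
# Aggregate capacity of the three 54 Gbps RDMA NICs (decimal units).
nics = 3
gbps_per_nic = 54                   # FDR InfiniBand link speed
total_gbps = nics * gbps_per_nic    # 162 Gbps combined
total_gbytes = total_gbps / 8       # 20.25 GB/s theoretical ceiling

print(total_gbps, total_gbytes)
```

This 162 Gbps figure is the ceiling the demo results below are measured against.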

 

image

 

Software and Configuration

 

For that demo, I ran Windows Server 2012 R2 on both the storage and the compute servers.

I used a standalone SMB3 file server backed by Storage Spaces, with a single share backed by a single mirrored space carved from a single storage pool containing all 48 SSDs.

In previous demos I used multiple pools and spaces, but I switched to using a single pool and a single space, since it was simpler and provided me with similar performance.
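The column count used in the provisioning script below follows from this layout: with a two-way mirror, every stripe is written twice, so 48 disks support at most 24 columns per copy. A quick sketch of that arithmetic:

```python
# Why the script uses 24 columns: a two-way mirror writes each stripe
# twice, so the 48 disks are split evenly between the two data copies.
disks = 48
data_copies = 2                       # two-way mirror
max_columns = disks // data_copies    # 24

print(max_columns)
```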

From a networking perspective, the 3 NICs were configured each in a separate subnet. I also had a fourth NIC (1GbE) for DNS, AD and management traffic.

Speaking of that, this setup also communicated with a third Windows Server 2012 R2 server used solely as DNS Server and Active Directory domain controller.

 

Script to prepare the demo

 

Here’s the PowerShell script that I used to configure the environment for this specific demo:

 

# Create pool and mirrored space

$s = Get-StorageSubSystem -FriendlyName *Spaces*
$d = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName Pool1 -StorageSubSystemFriendlyName $s.FriendlyName -PhysicalDisks $d

Set-ResiliencySetting -Name Mirror -NumberOfColumnsDefault 24 -StoragePool ( Get-StoragePool -FriendlyName Pool1 )
New-VirtualDisk -FriendlyName Space1 -StoragePoolFriendlyName Pool1 -ResiliencySettingName Mirror -UseMaximumSize

# Initialize disk, partition and volume

$c = Get-VirtualDisk -FriendlyName Space1 | Get-Disk
Set-Disk -Number $c.Number -IsReadOnly 0
Set-Disk -Number $c.Number -IsOffline 0
Initialize-Disk -Number $c.Number -PartitionStyle GPT
New-Partition -DiskNumber $c.Number -DriveLetter X -UseMaximumSize
Format-Volume -DriveLetter X -FileSystem NTFS -Confirm:$false

# Create data files for SQLIO

1..16 | % {
    $f = "X:\test" + $_ + ".dat"
    fsutil file createnew $f (128GB)
    fsutil file setvaliddata $f (128GB)
    $f = "X:\small" + $_ + ".dat"
    fsutil file createnew $f (8MB)
    fsutil file setvaliddata $f (8MB)
}

# Create the SMB Share

New-SmbShare -Name Share1 -Path X:\ -FullAccess Domain\Administrator, Domain\HV1$, Domain\HV2$
Set-SmbPathAcl -ShareName Share1

 

Results

 

Here’s the summary of the results from the 3 phases of the demo:

 

Demo 1: Small IOs (8KB) from real storage

 

The first demo used 16 instances of SQLIO generating 8KB IOs against 16 distinct files on the SMB server.

As shown in the screenshot below, we hit over 600,000 IOs per second (IOPS). At that point, the data rate was about 5 gigabytes per second and the client was using a little over 60% of the CPU.

In this demo, every one of those 600,000 IOs flows through the entire Microsoft storage stack: from the physical disks to Storage Spaces to NTFS to the SMB server, over the network to the SMB client, and finally to the SQLIO app.

There was a fair amount of queuing to keep all 48 SSDs and the entire stack busy (over 235 queue depth), but the overall latency was still below 1 millisecond (performance monitor shows 0).
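These figures are self-consistent. A quick arithmetic check (a sketch, using the numbers quoted above) confirms both the data rate and, via Little's law, the sub-millisecond latency implied by the observed queue depth:

```python
# Sanity-check Demo 1: 600,000 IOPS at 8 KB each, ~235 outstanding IOs.
iops = 600_000
io_bytes = 8 * 1024
queue_depth = 235

data_rate_gb = iops * io_bytes / 1e9     # ~4.92 GB/s ("about 5 GB/s")
latency_ms = queue_depth / iops * 1000   # Little's law: ~0.39 ms, under 1 ms

print(round(data_rate_gb, 2), round(latency_ms, 2))
```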

 

image

 

Demo 2: Small IOs (8KB) from the SMB server cache

 

The second demo focuses on the raw network performance of SMB Direct and SMB Multichannel by using cached IOs.

We’re still travelling from SQLIO to the SMB client over the network to the SMB server, but we’re satisfying the 8KB IOs from the RAM-based cache on the server side.

To accomplish this, I used a similar workload as before, but with smaller files and the SQLIO option that allows caching of the IOs.

As shown on the screenshot below, we have 1.1 million IOPS of 8KB each. At this rate, we are CPU bound at around 98% of the SMB client CPU.

Again you can see a deep queue (nearly 300 queued IOs), but the latency is still under 1 millisecond (performance monitor showing 0 again).

Note that, even with small IOs, we are hitting over 9 gigabytes per second in terms of bandwidth.
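Again the numbers line up; multiplying the IOPS by the IO size (a quick sketch) recovers the quoted bandwidth:

```python
# Sanity-check Demo 2: 1.1 million IOPS at 8 KB each.
iops = 1_100_000
io_bytes = 8 * 1024
gb_per_sec = iops * io_bytes / 1e9   # ~9.01 GB/s, "over 9 GB/s"

print(round(gb_per_sec, 2))
```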

 

 image

 

Demo 3: Larger IOs (32KB) from real storage

 

The last of the 3 demos used larger IOs (32KB) in order to reach higher bandwidth utilization.

Before the Windows Server 2012 R2 optimizations, reaching high bandwidth in this configuration would require larger IOs, like 128KB, 256KB or 512KB.

In fact, this was the first time I was able to nearly saturate this 3 * 54 Gbps network setup using 32KB IOs, which is not a particularly large IO size.

You can see we’re hitting the incredible rate of 16.4 gigabytes per second, which is nearly saturating our 162 Gbps bandwidth.

To put it into perspective, that's about 14 times the throughput of a regular 10GbE NIC (which typically delivers 1.1 gigabytes per second each way) and over 20 times the rate of a regular 8Gb Fibre Channel HBA (which delivers about 800 megabytes per second each way).

Note also that we’re using about 64% of the CPU and the latency is under 2ms (performance monitor shows 1 millisecond).

 

image

 

Script to run the demo

 
Finally, as requested, here is the PowerShell script I used to generate the workload during the demo.

It effectively runs 16 instances of SQLIO to give me 16 independent processes, each running against one of the 16 cores in the machine.

Each instance uses a separate file on the share, which is mapped to the X: drive. I used either X:\Test<n>.dat (demos 1 and 3) or X:\Small<n>.dat (demo 2).

Note also the somewhat unusual SQLIO options: -BYRT to buffer IOs (demo 2) and -a to affinitize each instance to a specific CPU core (used in all 3 demos).
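The affinity masks come from the script's `1 -shl ($file - 1)` expression, which builds a one-bit mask per instance. The Python equivalent, as a quick illustration:

```python
# PowerShell `1 -shl ($file - 1)` builds a one-bit CPU affinity mask
# per SQLIO instance; in Python the equivalent is 1 << (n - 1).
masks = [1 << (n - 1) for n in range(1, 17)]

print(masks[:4])   # [1, 2, 4, 8]: cores 0 through 3
```

Since each of the 16 masks has a different single bit set, no two SQLIO instances ever share a core.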

 

During the demo, I focus on Performance Monitor itself (not the output of SQLIO), so the role of the script is really just to drive the workload.

However, to make it look pretty, I used a few tricks. For instance, I clear the screen between demos and repaint the history of the results.

I also use a Write-Host trick to stay on the same line and overwrite its contents. This is useful, for instance, when counting from job 1 to job 16.
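The same-line trick is just a carriage return written without a trailing newline, as the script's `Write-Host "`r..." -NoNewLine` calls do. A minimal sketch of the idea (the `progress` helper is hypothetical, for illustration only):

```python
import sys

# Mimics Write-Host "`r..." -NoNewLine: a carriage return ("\r") with
# no newline moves the cursor back to column 0, so each message
# overwrites the previous one on the same console line.
def progress(messages, out=sys.stdout):
    for msg in messages:
        out.write("\r" + msg)   # no newline: stay on the same line
        out.flush()
    out.write("\n")             # finish with a newline at the end

progress([f"Starting job {n}" for n in range(1, 17)])
```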

 

Cls
"   ___  _  ______            _ _     _    "
"  ( _ )| |/ / __ )        __| (_)___| | __"
"  / _ \| ' /|  _ \ _____ / _' | / __| |/ /"
" | (_) | . \| |_) |_____| (_| | \__ \   < "
"  \___/|_|\_\____/       \__,_|_|___/_|\_\"
""
"Workload: 8KB random read IOs coming from remote disk"

1..16 | % {
    $file = $_
    $cpum = 1 -shl ($file - 1)
    Write-Host "`rStarting job ", $file -NoNewLine
    $ScriptBlock = {
        param($f, $m)
        $pa = "-a" + $m
        $pf = "X:\Test" + $f + ".dat"
        c:\sqlio\sqlio2.exe -s1000 -T100 -t1 -o16 -b8 $pa -BN -LS -frandom $pf
    }

    $job = Start-Job -ScriptBlock $ScriptBlock -ArgumentList $file, $cpum
}
Write-Host "`rAll jobs have been started"
" "
Read-Host "Press [ENTER] to continue"

# Stop All jobs

$item = 0
get-job | % {
   $item++
   Write-Host "`rStopping job ", $item -NoNewLine
   Stop-Job $_
   Remove-Job $_
}

Cls
"   ___  _  ______            _ _     _    "
"  ( _ )| |/ / __ )        __| (_)___| | __"
"  / _ \| ' /|  _ \ _____ / _' | / __| |/ /"
" | (_) | . \| |_) |_____| (_| | \__ \   < "
"  \___/|_|\_\____/       \__,_|_|___/_|\_\"
""
"Workload: 8KB random read IOs coming from remote disk"
"Results: around 600,000 8KB IOPs"
" "
"   ___  _  ______                       _          " 
"  ( _ )| |/ / __ )        ___ __ _  ___| |__   ___ "
"  / _ \| ' /|  _ \ _____ / __/ _' |/ __| '_ \ / _ \"
" | (_) | . \| |_) |_____| (_| (_| | (__| | | |  __/"
"  \___/|_|\_\____/       \___\__,_|\___|_|_|_|\___|"
" "
"Workload: 8KB random read IOs coming remote cached (RAM)"
" "

1..16 | % {
    $file = $_
    $cpum = 1 -shl ($file - 1)
    Write-Host "`rStarting job ", $file -NoNewLine
    $ScriptBlock = {
        param($f, $m)
        $pa = "-a" + $m
        $pf = "X:\Small" + $f + ".dat"
        c:\sqlio\sqlio2.exe -s1000 -T100 -t1 -o32 -b8 $pa -BYRT -LS -frandom $pf
    }
    $job = Start-Job -ScriptBlock $ScriptBlock -ArgumentList $file, $cpum
}
Write-Host "`rAll jobs have been started"
" "
Read-Host "Press [ENTER] to continue"

# Stop All jobs

$item = 0
get-job | % {
   $item++
   Write-Host "`rStopping job ", $item -NoNewLine
   Stop-Job $_
   Remove-Job $_
}
Write-Host "`rAll jobs have been stopped"

Cls
"   ___  _  ______            _ _     _    "
"  ( _ )| |/ / __ )        __| (_)___| | __"
"  / _ \| ' /|  _ \ _____ / _' | / __| |/ /"
" | (_) | . \| |_) |_____| (_| | \__ \   < "
"  \___/|_|\_\____/       \__,_|_|___/_|\_\"
""
"Workload: 8KB random read IOs coming from remote disk"
"Results: around 600,000 8KB IOPs"
" "
"   ___  _  ______                       _          " 
"  ( _ )| |/ / __ )        ___ __ _  ___| |__   ___ "
"  / _ \| ' /|  _ \ _____ / __/ _' |/ __| '_ \ / _ \"
" | (_) | . \| |_) |_____| (_| (_| | (__| | | |  __/"
"  \___/|_|\_\____/       \___\__,_|\___|_|_|_|\___|"
" "
"Workload: 8KB random read IOs coming remote cached (RAM)"
"Results: around 1,000,000 8KB IOPs"
" "
"  _____ ___  _  __ ___            _ _     _        "   
" |___ /___ \| |/ / __ )        __| (_)___| | __    "
"   |_ \ __) | ' /|  _ \ _____ / _' | / __| |/ /    "
"  ___) / __/| . \| |_) |_____| (_| | \__ \   <     "
" |____/_____|_|\_\____/       \__,_|_|___/_|\_\    "
" "
"Workload: 32KB random read IOs coming from remote disk"
" "
1..16 | % {
    $file = $_
    $cpum = 1 -shl ($file - 1)
    Write-Host "`rStarting job ", $file -NoNewLine
    $ScriptBlock = {
        param($f, $m)
        $pa = "-a" + $m
        $pf = "X:\Test" + $f + ".dat"
        c:\sqlio\sqlio2.exe -s1000 -T100 -t1 -o32 -b32 $pa -BN -LS -frandom $pf
    }

    $job = Start-Job -ScriptBlock $ScriptBlock -ArgumentList $file, $cpum
}
Write-Host "`rAll jobs have been started"
" "
Read-Host "Press [ENTER] to continue"

# Stop All jobs

$item = 0
get-job | % {
   $item++
   Write-Host "`rStopping job ", $item -NoNewLine
   Stop-Job $_
   Remove-Job $_
}
Write-Host "`rAll jobs have been stopped"

Cls
"   ___  _  ______            _ _     _    "
"  ( _ )| |/ / __ )        __| (_)___| | __"
"  / _ \| ' /|  _ \ _____ / _' | / __| |/ /"
" | (_) | . \| |_) |_____| (_| | \__ \   < "
"  \___/|_|\_\____/       \__,_|_|___/_|\_\"
""
"Workload: 8KB random read IOs coming from remote disk"
"Results: around 600,000 8KB IOPs"
" "
"   ___  _  ______                       _          " 
"  ( _ )| |/ / __ )        ___ __ _  ___| |__   ___ "
"  / _ \| ' /|  _ \ _____ / __/ _' |/ __| '_ \ / _ \"
" | (_) | . \| |_) |_____| (_| (_| | (__| | | |  __/"
"  \___/|_|\_\____/       \___\__,_|\___|_|_|_|\___|"
" "
"Workload: 8KB random read IOs coming remote cached (RAM)"
"Results: around 1,000,000 8KB IOPs"
" "
"  _____ ___  _  __ ___            _ _     _        "   
" |___ /___ \| |/ / __ )        __| (_)___| | __    "
"   |_ \ __) | ' /|  _ \ _____ / _' | / __| |/ /    "
"  ___) / __/| . \| |_) |_____| (_| | \__ \   <     "
" |____/_____|_|\_\____/       \__,_|_|___/_|\_\    "
" "
"Workload: 32KB random read IOs coming from remote disk"
"Results: around 500,000 IOPs, around 16.5 GBytes/sec throughput"
" "
" "

 

Links

 

Finally, if you want to review the demo or the full presentation, here are a few links:

Comments
  • Hello Jose,

    could you please provide detailed part numbers for the LSI Controllers?

  • @Marco

We used six PCIe x8 6Gbps SAS HBAs. I think they were the LSI SAS 9207-8i model, but I'm not 100% sure. The trick was to use six of them, each controlling 8 of the SSDs in the host.

    You can ask the folks from EchoStreams for details. The blog post has a link to their site.

  • Hi,

    If you have multiple adapters which you can't team what do you do about your clients connecting to just one address? DNS Round Robin?

    Thanks

  • Dear Jose,

    Please note that the images above are missing.
    Could you please upload them again?

    Thanks,

  • Dear Jose,
    Could you please advise about the images in above post?

    (404 - File or directory not found.
    The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable.)

    Thank you.

  • I did notice the problem with the images. Ticket opened with the technet.com folks.

  • @Cornishpasty02

    SMB Multichannel will take care of discovering the additional paths and connecting to all of them. More at http://blogs.technet.com/b/josebda/archive/2012/05/13/the-basics-of-smb-multichannel-a-feature-of-windows-server-2012-and-smb-3-0.aspx