To scale out or not to scale out, that is the question


Hi folks, Ned here again. If Shakespeare had run Windows, Hamlet would be a play about configuring failover clusters for storage. Today I discuss the Scale-Out File Server configuration and why general use file server clustered shares may still have a place in your environment. These two types of Windows file servers have different goals and capabilities that come with trade-offs; it’s critical to understand them when designing a solution for you or your customer. We’ve not always done a great job explaining the differences between these two file servers, and this post aims to clear things up.

It's not enough to speak, but to speak true. So let’s get crackin’.

The taming of the SOFS

We released Scale-Out File Server (SOFS) in Windows Server 2012. SOFS adds highly available, active-active file data access for application data workloads to Windows Server clusters through SMB, Continuous Availability (CA), and Cluster Shared Volumes (CSV). CA file shares ensure - amongst other things - that when you connect through SMB 3, the server synchronously writes through to the disk for data integrity in the event of a node failure. In other words, it makes sure that your files are consistent and safe even if the power goes out.

You get the option when you configure the File Server role in a Windows Server failover cluster:

[Screenshot: choosing the file server type in the High Availability Wizard]
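
If you prefer the command line to the wizard, the FailoverClusters PowerShell module exposes the same choice. A minimal sketch, with hypothetical role and disk names:

    # General use file server role (one node owns the share at a time),
    # backed by a dedicated clustered disk. Names are hypothetical.
    Add-ClusterFileServerRole -Name FS-General -Storage 'Cluster Disk 2'

    # Scale-Out File Server role (active-active), which uses CSV rather
    # than a dedicated physical disk resource.
    Add-ClusterScaleOutFileServerRole -Name SOFS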

SOFS has some other key benefits:

  • Simultaneous active shares – This is the actual “scale-out” part of SOFS. Every cluster node makes the CA shares accessible simultaneously, rather than one node hosting a share actively and the others available to host it after a failover. The nodes then redirect the user to the underlying data on the storage-owning node with the help of Cluster Shared Volumes (CSV). Even better, no matter which node’s share you connect to, if that node later crashes, you automatically connect to another node right away through SMB Transparent Failover. This requires shares configured with the CA attribute, which are set by default on clustered shares during creation. The effect is a single share namespace for the cluster available from all nodes.
  • More bandwidth – Since all the servers host all the shares, all of the nodes’ aggregate network throughput becomes available - although naturally, the underlying storage is still potentially limiting. Adding more nodes adds more bandwidth. SMB Multichannel and SMB Direct (aka Remote Direct Memory Access) take this further by ensuring that the client-server conversations efficiently utilize the available network throughput and minimize the CPU usage on the nodes.
  • Simpler management - SOFS streamlines share management by bringing the old active-passive share and physical disk resource juggling under the scale-out umbrella; with active-active nodes, you no longer need to balance shares across nodes just to keep every server busy. Server Manager and the SmbShare Windows PowerShell module also unify the command-line experience and expose useful functionality in a straightforward fashion (a quick sketch follows this list).
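
As a taste of that unified experience, here is a minimal sketch using the SmbShare module; the share name, CSV path, scope, and group are hypothetical:

    # Create a scale-out share on a CSV path, scoped to the SOFS role.
    # Clustered shares get Continuous Availability by default; shown
    # explicitly here for clarity.
    New-SmbShare -Name VMStore -Path C:\ClusterStorage\Volume1\VMStore `
        -ScopeName SOFS -FullAccess CONTOSO\HyperVAdmins -ContinuouslyAvailable:$true

    # Check which shares are continuously available and where they're scoped.
    Get-SmbShare | Select-Object Name, ScopeName, Path, ContinuouslyAvailable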


Claus Joergensen has a good blog post on these capabilities, and if you want to set up a test environment, Jose Barreto is the man with the step-by-step plans.

All of this has a single customer in mind: application data accessed via SMB, like Hyper-V virtual machine disks and SQL database files. With your hypervisor running on one cluster and your storage on another cluster, you can manage – and scale - each aspect of the stack as separate high-performance entities, without the need for expensive SAN fabrics. Truly awesome stuff that sounds like a great fit for any high availability scenarios.
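
For example, once a SOFS share exists, pointing Hyper-V at it is just a UNC path. A minimal sketch, with hypothetical server, share, and VM names:

    # Create a VM whose configuration files and VHDX both live on a SOFS share.
    # \\SOFS\VMStore is a hypothetical continuously available share.
    New-VM -Name APP01 -Generation 2 -MemoryStartupBytes 4GB `
        -Path '\\SOFS\VMStore' -NewVHDPath '\\SOFS\VMStore\APP01\APP01.vhdx' `
        -NewVHDSizeBytes 60GB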


For those who want to use SOFS for regular user shares, though: proceed with caution.

Enter Information Worker, holding a laptop

Inside Microsoft, we define “Information Worker” as the standard business user scenario. In other words, a person sitting at their physical client or virtual desktop session, and connecting to file servers to access unstructured data. This means SMB shares filled with home folders, roaming user profiles, redirected folders, departmental data, and common shared data; decades of documents, spreadsheets, and PDFs.

[Image caption: Ooh, there's leftover cake in the break room!]

Performance

The typical file operations from IW users are very different from application data workloads like Hyper-V or SQL. IW workloads are metadata heavy (operations like opening files, closing files, creating new files, or renaming existing files). IW operations also involve a great many files, with plenty of copies and deletes, and of course, tons of editing. Even though individual users aren't doing much, file servers have many users. These operations may involve masses of opens, writes, and closes, often on files without pre-allocated space. This can mean frequent VDL (valid data length) extension, which means many trips to the disk and back, all over SMB.
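
To picture what that churn looks like to the server, here is a crude, hypothetical sketch of an IW-flavored workload - many small creates, writes, renames, and deletes rather than large sequential IO (the share path is made up):

    # Crude IW-style churn against a hypothetical share: lots of small
    # file creates, writes, renames, and deletes.
    1..500 | ForEach-Object {
        $file = "\\FS01\UserData\scratch_$_.txt"
        Set-Content -Path $file -Value ('x' * 4KB)   # create + small write
        Rename-Item -Path $file -NewName "scratch_${_}_renamed.txt"
        Remove-Item -Path "\\FS01\UserData\scratch_${_}_renamed.txt"
    }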

Right away, you can see that going through a CA-enabled share - with its data integrity guarantee - might have an impact on performance compared to previous releases of Windows Server, which had no CA shares and therefore no such guarantee. Continuous Availability requires that data write through to the disk to ensure integrity in the event of a node failure in SOFS, so everything is synchronous and any buffering only helps subsequent reads, not writes. A user who needs to copy many big files to a file server - such as by adding them to a redirected My Documents folder - can see significantly slower performance on CA shares. A user who spent a week working from home and returns with a brimming Offline Files cache will see slower uploads to CA shares.

Nothing is broken here – this is just a consequence of how IW workloads operate. A big VHDX or SQL database file also sees slower creation time through a CA share, but it’s largely a cost paid once, because the files have plenty of pre-allocated space to use up, and subsequent IO prices are comparatively much lower. We also optimize SMB for them, such as with SMB Direct’s handling of 8K IOs.

To demonstrate this, I performed a few small-scale tests in my gross test environment. Don’t worry too much about the raw numbers; just focus on the relative performance differences.

Environment:

  • Single-node Windows Server 2012 R2 RTM cluster with CSV on Storage Spaces with 1-column mirrors and two scale-out SMB shares (one with CA enabled and one without)
  • One Windows Server 2012 R2 RTM client and one Windows 8.1 RTM client
  • A single DC in its own forest
  • All of the above virtualized in Hyper-V
  • Each test repeated many times to ensure a reasonably reliable average.
  • I used Windows PowerShell’s Measure-Command cmdlet for timing in the first three test types and event logging for the redirected folders test.

Note: to set this up, see Jose’s demo here. My only big change was to use one node instead of three, so I had more resources in my very gross test environment.

Methodologies:

  • Internal MS test tool that generates a synthetic 1GB file with random contents.
  • Robocopy of a 1GB file with random contents (using no optional parameters).
  • Windows PowerShell Copy-Item cmdlet copy of a real-life sample user data set comprising 1,238 Files in 96 Folders for 2GB total (using –force –recurse).
  • Sync of a redirected My Documents shell folder comprising 4,609 Files in 143 Folders for 5GB total, calculating the time between Folder Redirection operational event log events 1006 and 1001 (a rough sketch of the timing approach follows this list).
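
For the curious, here is roughly how the timing worked; a minimal sketch with hypothetical paths and share names (the real runs used the data sets and internal tool described above):

    # Time the same recursive copy against a CA and a non-CA scale-out share.
    $ca    = Measure-Command { Copy-Item C:\SampleData \\SOFS\UserData-CA -Recurse -Force }
    $nonCa = Measure-Command { Copy-Item C:\SampleData \\SOFS\UserData    -Recurse -Force }
    "CA: {0:N0}s   Non-CA: {1:N0}s" -f $ca.TotalSeconds, $nonCa.TotalSeconds

    # Time a single large file with robocopy (no optional parameters);
    # C:\Sample is a folder holding the 1GB test file.
    Measure-Command { robocopy.exe C:\Sample \\SOFS\UserData-CA bigfile.bin }

    # For the Folder Redirection test, take the gap between the operational
    # log events 1006 and 1001 mentioned above (log name assumed).
    $events = Get-WinEvent -LogName 'Microsoft-Windows-Folder Redirection/Operational' |
        Where-Object { $_.Id -in 1006, 1001 } | Sort-Object TimeCreated | Select-Object -Last 2
    ($events[1].TimeCreated - $events[0].TimeCreated).TotalSeconds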

Results:

| Test method | CA, avg sec | Non-CA, avg sec | Non-CA to CA IW perf comparison |
| --- | --- | --- | --- |
| MS Internal synthetic file creation (1GB) | 59 | 40 | 1.475 X faster |
| Robocopy.exe (1GB) | 58 | 42 | 1.38 X faster |
| Copy-Item cmdlet (2GB) | 107 | 73 | 1.465 X faster |
| Folder Redirection full sync (5GB) | 689 | 545 | 1.26 X faster |

Important: again, this could be faster in absolute terms on your systems with similar data, as my test system is very gross. It could also be slower if your server is quite busy, has crusty drivers installed, is on a choked-out network, etc.

The good news

MS Office 2013’s big three – Word, Excel, and PowerPoint – performed well with both CA and non-CA shares and showed no notable performance differences in my tests, even when editing and saving individual files that were hundreds of MB in size. This is because later versions of Office operate very asynchronously, using local temporary files rather than forcing the user to wait on remote servers. On a remote 210MB PPTX, the save times for an edited file were nearly identical, so I didn’t bother posting any results.

The not-so-good news

Don’t expect Office’s good behavior from every user application, though; MS Office has been at this game for 22 years. One internal test application I used to generate files had non-CA performance similar to the synthetic file creation test above. However, when the same tool ran against a CA share, it was 8.6 times slower, because it continuously asked the server to allocate more space for the file and kept paying the synchronous write-through cost. There’s no way to know which of your apps are the more “write-through inefficient” ones until you test them.

Important: even general-purpose file server clusters have CA set on their shares by default when you create them via the cluster admin tool, Server Manager, or New-SmbShare. You should consider removing that setting on clustered shares where you value performance over write-through data integrity. On non-clustered file servers, you cannot enable CA at all.
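
If raw copy performance matters more than write-through integrity for a given IW share, you can flip the attribute off. A minimal sketch, with a hypothetical share name:

    # Turn off Continuous Availability on an existing clustered IW share.
    # "UserHomes" is a hypothetical share name.
    Set-SmbShare -Name UserHomes -ContinuouslyAvailable:$false -Force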

This is conceivably useful even with SOFS and application data workloads: for instance, you could create two shares pointing to the same folder - one with CA for Hyper-V to mount VHDXs remotely, and one without CA for copying VHDXs into that folder when configuring new VMs, such as through SCVMM.
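
A rough sketch of that two-share idea, with hypothetical names; the CSV path, scope, and permissions will differ in your environment:

    $path = 'C:\ClusterStorage\Volume1\VMs'

    # CA share that the Hyper-V hosts use to run VHDXs.
    New-SmbShare -Name VMStore -Path $path -ScopeName SOFS `
        -ContinuouslyAvailable:$true -FullAccess CONTOSO\HyperVHosts

    # Non-CA share over the same folder for faster bulk copies of new VHDXs.
    New-SmbShare -Name VMStoreCopy -Path $path -ScopeName SOFS `
        -ContinuouslyAvailable:$false -FullAccess CONTOSO\VMMAdmins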

Final important note: make sure you install (at a minimum) KB2883200 on your Windows Server 2012 R2 servers and Windows 8.1 clients; it makes copying to shares a little faster. Better yet, stay up to date on your file server by using this list of currently available hotfixes for the File Services technologies in Windows Server 2012 and in Windows Server 2012 R2.

Capabilities

The performance issues are actually manageable; many users probably won’t notice any write-through impact, depending on their work patterns. The real issue is that Scale-Out requires CSV, and that paints your environment into a corner, because many IW features do not support that file system.

At first, you put your files on a scale-out cluster share and everything works fine. Then, a year later, when you decide you need more file server capabilities like Work Folders, Dynamic Access Control, File Classification Infrastructure, and FSRM file quotas and screens – you are blocked.

Let’s go to the big board.

| Technology Area | Feature | General Use File Server Cluster | Scale-Out File Server |
| --- | --- | --- | --- |
| SMB | SMB Continuous Availability | Yes | Yes |
| | SMB Multichannel | Yes | Yes |
| | SMB Direct | Yes | Yes |
| | SMB Encryption | Yes | Yes |
| | SMB Transparent Failover | Yes 1 | Yes |
| File System | NTFS | Yes | NA |
| | Resilient File System (ReFS) | Yes | NA |
| | Cluster Shared Volume File System (CSV) | NA | Yes |
| File Management | BranchCache | Yes | No 4 |
| | Data Deduplication (Windows Server 2012) | Yes | No 4 |
| | Data Deduplication (Windows Server 2012 R2) | Yes | Yes |
| | DFS Namespace (DFSN) root server | Yes | No 4 |
| | DFS Namespace (DFSN) folder target server | Yes | Yes |
| | DFS Replication (DFSR) | Yes | No 4 |
| | File Server Resource Manager (Screens and Quotas) | Yes | No 4 |
| | File Classification Infrastructure | Yes | No 4 |
| | Dynamic Access Control (claim-based access, CAP) | Yes | No 4 |
| | Folder Redirection | Yes | Yes 2 |
| | Offline Files (client side caching) | Yes | Yes 5 |
| | Roaming User Profiles | Yes | Yes 2 |
| | Home Directories | Yes | Yes 2 |
| | Work Folders | Yes | No 4 |
| NFS | NFS Server | Yes | No 4 |
| Applications | Hyper-V | Yes 3 | Yes |
| | Microsoft SQL Server | Yes 3 | Yes |

1 Only works if CA is enabled on shares

2 Not recommended on Scale-Out File Servers.

3 Not recommended on general use file servers.

4 Requires NTFS

5 CSC (client-side caching) is less compatible with CA shares than the other IW technologies, due to how it decides a share is offline in combination with the SMB 3 client. This means that Offline Files can stay in online mode for 3-6 minutes even after the user has lost access to the share.

Ultimately, this means that if you, your boss, or your customer decides “after that recent audit, we need to use DAC+FCI for more manageable security and we definitely need to screen out MP3 files and Grumpy Cat meme pics”, you will be forced to recreate the entire configuration using NTFS and general use file server clusters. This does not sound pleasant, especially when you now have to shift around terabytes of data.


Moreover, let’s not forget about down-level clients like Windows 7; Continuous Availability requires SMB 3.0 or later, so older clients connecting to CA shares cannot use it or the other SOFS features. A Windows 7 or Vista client can still connect to a CA share, but you need Windows 8 or later to actually benefit from CA.
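
If you’re not sure which SMB dialect your clients actually negotiate, you can check on the file server itself; a quick sketch - anything below 3.0 in the Dialect column won’t get CA or Transparent Failover:

    # On the file server: list sessions and their negotiated SMB dialect.
    Get-SmbSession | Select-Object ClientComputerName, ClientUserName, Dialect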

As for XP? It cannot connect to a CA share at all. This doesn’t matter though, because you already got rid of XP. Right?

The wheel is come full circle

Finally, though, there is the big question: if you accept the performance overhead, what does the continuous availability provided by SOFS buy you with IW workloads?

The answer: little.

Many end-user applications don’t need the continuous availability guarantees that SQL and Hyper-V workloads demand. IW applications like Office and Windows Explorer are often quite resilient to the short server outages that occur during a traditional cluster failover. MS Office especially – it has lived for years in a world of unreliable networking; it uses temp files, works offline, and retries constantly without telling the user about intermittent problems contacting a file on a share.

The bottom line is that Word and all its friends will be just fine using traditional general use shares on clusters. Make sure that before you go down the scale-out route in a particular cluster design, it’s the right approach for the task.


If you caught all the pseudo-Shakespeare references in this article, post the count in the comments and win a fabulous No-Prize!

Until next time,

- Ned “Exit, pursued by a bear” Pyle

  • This is perhaps the most valuable Microsoft Blog I have ever read.  Please keep writing in this style!

  • High quality post.  Full of sound and fury!

  • Thank you, Ned. This is very helpful and timely as I'm in the testing and validation stage of our SoFS storage solution.

    My performance results are all over the map--some astoundingly great and some bafflingly slow. As it is now, I'm very hesitant to put this into production, because I really can't tell what are expected results and what aren't. Your post here makes me wonder if at least some of what I'm seeing is to be expected.

    Setup:

    - Two-node 2012 R2 Hyper-V cluster

    <-> Infiniband RDMA NICs

    - Two-node 2012 R2 SoFS cluster

    <-> LSI SAS controllers

    - 3 JBOD enclosures with a mix of 4TB HGST SAS HDDs* and 200 GB STEC SAS SSDs*

    *MPIO is globally set to Least Blocks

    I've created shares of mirrored spaces using virtual disks from 1 to 8 columns, and the throughput when creating a VHDX is consistently awful--it's always about 47 MB/s maximum, regardless how many columns. If I create the same file from the SoFS coordinator node (pointed to the UNC path, though), the throughput is near 200 MB/s for even a single disk, and 500-600 MB/s for 3 columns. Should I really expect performance to be so poor? The performance difference for the examples you listed never exceeded 1.5x, but I'm seeing worse than 12x (and potentially worse still if I tried 8 columns).

    When I connect to the CAP share from a (separate) 2008 R2 Hyper-V cluster (using a 1 GbE connection--there are no RDMA NICs in that cluster) and create a VHDX file, I get ~110 MB/s. That makes me think something is quite wrong, but I don't know what or why.

    Additionally, in numerous SQLIO tests on HDD mirrored spaces, the SoFS solution often outperforms our other two SAN environments, sometimes by far ... with the exception of 8 KB sequential write tests. I think it's important to make sure the HDD environment is working properly before enhancing it with flash, so I've been testing HDDs alone, and then adding SSDs later. The HDD-only numbers seem low to me, but how do I know whether the performance is at expected levels? Regardless of the number of columns, they hover around 7-8 MB/s at a queue depth of 2 (8-12 threads), whereas our other HDD-based SANs are consistently around 60-100 MB/s (1 GbE connections only) regardless of queue depth or number of threads. Even at a queue depth of 16, our SoFS solution doesn't push more than 30 MB/s at 3 columns and 46 MB/s at 8 columns.

    (The 8 KB random write numbers also seem relatively low, but none of the other SAN environments seem to do well at that, either. Flash helps significantly here, but I'm having trouble finding the right size for the WBC. The performance numbers--IOPS, MB/s, and especially latency--completely plummet, even much worse than without the WBC, in certain scenarios. Presumably the WBC gets full, but the behavior here is not good--it seems simply to stop accepting writes until the cache is written to disk, resulting in latency numbers exceeding 60s in some cases.)

    Is what I'm seeing normal? Or is something wrong with my setup? The 2008 R2 VHDX creation test makes me think it's the latter.

  • Thanks guys. :)

    That is very interesting, Ryan. I have some further questions and want to also get some thoughts from our Spaces team here, can you email us at filecabml@microsoft.com? Once we figure everything out we can reply on the comment. :)

  • Thank you very much, Ned. I just sent a message to filecabml@microsoft.com and will be happy to answer any questions. I'm grateful for your guidance!

  • Not a single unnecessary phrase...talk about precise communication! A genius and a Marine! Guess it wouldn't be the first time...