This blog post is a compilation of my raw notes from SNIA’s SDC 2013 (Storage Developers Conference).

Notes and disclaimers:

  • These notes were typed during the talks and they may include typos and my own misinterpretations.
  • Text in the bullets under each talk are quotes from the speaker or text from the speaker slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent limited time fixing typos or correcting the text after the event. Just so many hours in a day...
  • I have not attended all sessions (since there are 4 or 5 at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.
  • You can find the event agenda at http://www.snia.org/events/storage-developer2013/agenda2013

SMB3 Meets Linux: The Linux Kernel Client
Steven French, Senior Engineer SMB3 Architecture, IBM

  • Title showing is (with the strikethrough text): CIFS SMB2 SMB2.1 SMB3 SMB3.02 and Linux, a Status Update.
  • How do you use it? What works? What is coming?
  • Who is Steven French: maintains the Linux kernel client, at SMB3 Architect for IBM Storage
  • Excited about SMB3
  • Why SMB3 is important: cluster friendly, large IO sizes, more scalable.
  • Goals: local/remote transparency, near POSIX semantics to Samba, fast/efficient/full function/secure method, as reliable as possible over bad networks
  • Focused on SMB 2.1, 3, 3.02 (SMB 2.02 works, but lower priority)
  • SMB3 faster than CIFS. SMB3 remote file access near local file access speed (with RDMA)
  • Last year SMB 2.1, this year SMB 3.0 and minimal SMB 3.02 support
  • 308 kernel changes this year, a very active year. More than 20 developers contributed
  • A year ago 3.6-rc5 – now at 3.11 going to 3.12
  • Working on today copy offload, full linux xattr support, SMB3 UNIX extension prototyping, recover pending locks, starting work on Multichannel
  • Outline of changes in the latest releases (from kernel version 3.4 to 3.12), version by version
  • Planned for kernel 3.13: copy chunk, quota support, per-share encryption, multichannel, considering RDMA (since Samba is doing RDMA)
  • Improvements for performance: large IO sizes, credit based flow control, improved caching model. Still need to add compounding,
  • Status: can negotiate multiple dialects (SMB 2.1, 3, 3.02)
  • Working well: basic file/dir operations, passes most functional tests, can follow symlinks, can leverage durable and persistent handles, file leases
  • Need to work on: cluster enablement, persistent handles, witness, directory leases, per-share encryption, multichannel, RDMA
  • Plans: SMB 2.1 no longer experimental in 3.12, SMB 2.1 and 3 passing similar set of functional tests to CIFS
  • Configuration hints: adjusting rsize, wsize, max_pending, cache, smb3 signing, UNIX extension, nosharelock
  • UNIX extensions: POSIX pathnames, case sensitive path name, POSIX delete/rename/create/mkdir, minor extensions to stat/statfs, brl, xattr, symlinks, POSIX ACLs
  • Optional POSIX SMB3 features outlined: list of flags used for each capability
  • Question: Encryption: Considering support for multiple algorithms, since AES support just went in the last kernel.
  • Development is active! Would like to think more seriously about NAS appliances. This can be extended…
  • This is a nice, elegant protocol. SMB3 fits well with Linux workloads like HPC, databases. Unbelievable performance with RDMA.
  • Question: Cluster enablement? Durable handle support is in. Pieces missing for persistent handle and witness are small. Discussing option to implement and test witness.
  • Need to look into the failover timing for workloads other than Hyper-V.
  • Do we need something like p-NFS? Probably not, with these very fast RDMA interfaces…

Mapping SMB onto Distributed Storage
Christopher R. Hertel, Senior Principal Software Engineer, Red Hat
José Rivera, Software Engineer, Red Hat

  • Trying to get SMB running on top of a distributed file system, Gluster
  • Chris and Jose: Both work for RedHat, both part of the Samba team, authors, etc…
  • Metadata: data about data, pathnames, inode numbers, timestamps, permissions, access controls, file size, allocation, quota.
  • Metadata applies to volumes, devices, file systems, directories, shares, files, pipes, etc…
  • Semantics are interpreted in different contexts
  • Behavior: predictable outcomes. Make them the same throughout the environments, even if they are not exactly the same
  • Windows vs. POSIX: different metadata + different semantics = different behavior
  • That’s why we have a plugfest downstairs
  • Long list of things to consider: ADS, BRL, deleteonclose, directory change notify, NTFS attributes, offline ops, quota, etc…
  • Samba is a Semantic Translator. Clients expect Windows semantics from the server, Samba expects POSIX semantics from the underlying file system
  • UNIX extensions for SMB allows POSIX clients to bypass some of this translation
  • If Samba does not properly handle the SMB protocol, we call it a bug. If cannot handle the POSIX translation, it’s also a bug.
  • General Samba approach: Emulate the Windows behavior, translate the semantics to POSIX (ensure other local processes play by similar rules)
  • The Samba VFS layers SMB Protocol Initial Request Handling  VFS Layer  Default VFS Layer  actual file system
  • Gluster: Distributed File System, not a cluster file system. Brick  Directory in the underlying file system. Bricks bound together as a volume. Access via SMB, NFS, REST.
  • Gluster can be FUSE mounted. Just another access method. FUSE hides the fact that it’s Gluster underneath.
  • Explaining translations: Samba/Gluster/FUSE. Gluster is adaptable. Translator stack like Samba VFS modules…
  • Can add support for: Windows ACLs, oplocks, leases, Windows timestamps.
  • Vfs_glusterfs: Relatively new code, similar to other Samba VFS modules. Took less than a week to write.
  • Can bypass the lower VFS layers by using libgfapi. All VFS calls must be implemented to avoid errors.
  • CTDB offers three basics services: distributed metadata database (for SMB state), node failure detection/recovery, IP address service failover.
  • CTDB forms a Samba cluster. Separate from the underlying Gluster cluster. May duplicate some activity. Flexible configuration.
  • SMB testing, compared to other access methods: has different usage patterns, has tougher requirements, pushes corner cases.
  • Red Hat using stable versions, kernel 2.x or something. So using SMB1 still…
  • Fixed: Byte range locking. Fixed a bug in F_GETLK to get POSIX byte range locking to work.
  • Fixed:  SMB has strict locking and data consistency requirements. Stock Gluster config failed ping_pong test. Fixed cache bugs  ping_pong passes
  • Fixed: Slow directory lookups. Samba must do extra work to detect and avoid name collisions. Windows is case-INsensitive, POSIX is case-sensitive. Fixed by using vfs_glusterfs.
  • Still working on: CTDB node banning. Under heavy load (FSCT), CTDB permanently bans a running node. Goal: reach peak capacity without node banning. New CTDB versions improved capacity.
  • Still working on: CTDB recovery lock file loss. Gluster is a distributed FS, not a Cluster FS. In replicated mode, there are two copies of each file. If Recovery Lock File is partitioned, CTDB cannot recover.
  • Conclusion: If implementing SMB in a cluster or distributed environment, you should know enough about SMB to know where to look for trouble… Make sure metadata is correct and consistent.
  • Question: Gluster and Ceph have VFS. Is Samba suitable for that? Yes. Richard wrote a guide on how to write a VFS. Discussing a few issues around passing user context.
  • Question: How to change SMB3 to be more distributed? Client could talk to multiple nodes. Gluster working on RDMA between nodes. Protocol itself could offer more about how the cluster is setup.

Pike - Making SMB Testing Less Torturous
Brian Koropoff, Consulting Software Engineer, EMC Isilon

  • Pike – written in Python – starting with a demo
  • Support for a modest subset of SMB2/3. Currently more depth than breadth.
  • Emphasis on fiddly cases like failover, complex creates
  • Mature solutions largely in C (not convenient for prototyping)
  • Why python: ubiquitous, expressive, flexible, huge ecosystem.
  • Flexibility and ease of use over performance. Convenient abstractions. Extensible, re-usable.
  • Layers: core primitives (abstract data model), SMB2/3 packet definitions, SMB2/3 client model (connection, state, request, response), test harness
  • Core primitives: Cursor (buffer+offset indicating read/write location), frame (packet model), enums, anti-boilerplate magic. Examples.
  • SMB2/SMB3 protocol (pike.smb2) header, request/response, create {request/response} context, concrete frame. Examples.
  • SMB2/SMB3 model: SMB3 object model + glue. Future, client, connection (submit, trasceive, error handling), session, channel (treeconnect, create, read), tree, open, lease, oplocks.
  • Examples: Connect, tree connect, create, write, close. Oplocks. Leases.
  • Advanced uses. Manually construct and submit exotic requests. Override _encode. Example of a manual request.
  • Test harness (pike,test): quickly establish connection, session and tree connect to server. Host, credentials, share parameters taken from environment.
  • Odds and ends: NT time class, signing, key derivation helpers.
  • Future work: increase breadth of SMB2/3 support. Security descriptors, improvement to mode, NTLM story, API documentation, more tests!
  • http://github.com/emc-isilon/pike - open source, patches are welcome. Has to figure out how to accept contributions with lawyers…
  • Question: Microsoft has a test suite. It’s in C#, doesn’t work in our environment. Could bring it to the plugfest.
  • Question: I would like to work on implementing it for SMB1. What do you think? Not a priority for me. Open to it, but should use a different model to avoid confusion.
  • Example: Multichannel. Create a session, bind another channel to the same session, pretend failover occurred. Write fencing of stable write.

 Exploiting the High Availability features in SMB 3.0 to support Speed and Scale
James Cain, Principal Software Architect, Quantel Ltd

  • Working with TV/Video production. We only care about speed.
  • RESTful recap. RESTful filesystems talk from SDC 2010. Allows for massive scale by storing application state in the URLs instead of in the servers.
  • Demo (skipped due to technical issues): RESTful SMB3.
  • Filling pipes: Speed (throughput) vs. Bandwidth vs. Latency. Keeping packets back to back on the wire.
  • TCP Window size used to limit it. Mitigate by using multiple wires, multiple connections.
  • Filling the pipes: SMB1 – XP era. Filling the pipes required application participation. 1 session could do about 60MBps. Getting Final Cut Pro 7 to lay over SMB1 was hard. No choice to reduce latency.
  • Filling the pipes: SMB 2.0 – Vista era. Added credits, SMB2 server can control overlapped requests using credits. Client application could make normal requests and fill the pipe.
  • Filling the pipes: SMB 2.1 – 7 era. Large MTU helps.
  • Filling the pipes: SMB 3 – 8 era. Multi-path support. Enables: RSS, Multiple NICs, Multiple machines, RDMA.
  • SMB3 added lots of other features for high availability and fault tolerance. SignKey derivation.
  • Filesystem has DirectX GUI :-) - We use GPUs to render, so our SMB3 server has Cuda compute built in too. Realtime visualization tool for optimization.
  • SMB3 Multi-machine with assumed shared state. Single SMB3 client talking to two SMB3 servers. Distributed non-homogeneous storage behind the SMB servers.
  • Second NIC (channel) initiation has no additional CREATE. No distinction on the protocol between single server or multiple server. Assume homogeneous storage.
  • Asking Microsoft to consider “NUMA for disks”. Currently, shared nothing is not possible. Session, trees, handles are shared state.
  • “SMB2++” is getting massive traction. Simple use cases are well supported by the protocol. SMB3 has a high cost of entry, but lower than writing n IFS in kernel mode.
  • There are limits to how far SMB3 can scale due to its model.
  • I know this is not what the protocol is designed to do. But want to see how far I can go.
  • It could be help by changing the protocol to have duplicate handle semantics associated with the additional channels.
  • The protocol is really, really flexible. But I’m having a hard time doing what I was trying to do.
  • Question: You’re basic trying to do Multichannel  to multiple machines. Do you have a use case? I’m experimenting with it. Trying to discover new things.
  • Question: You could use CTDB to solve the problem. How much would it slow down? It could be a solution, not an awful lot of state.             

SMB3 Update
David Kruse, Development Lead, Microsoft

  • SMB 3.02 - Don’t panic! If you’re on the road to SMB3, there are no radical changes.
  • Considered not revving the dialect and doing just capability bits, but thought it would be better to rev the dialect.
  • Dialects vs. Capabilities: Assymetric Shares, FILE_ATTRIBUTE_INTEGRITY_STREAMS.
  • SMB 2.0 client attempting MC or CA? Consistency/documentation question.
  • A server that receives a request from a client with a flag/option/capability that is not valid for the dialect should ignore it.
  • Showing code on how to mask the capabilities that don’t make sense for a specific dialect
  • Read/Write changes: request specific flag for unbuffered IO. RDMA flag for invalidation.
  • Comparing “Traditional” File Server Cluster vs. “Scale-Out” File Server cluster
  • Outlining the asymmetric scale-out file server cluster. Server-side redirection. Can we get the client to the optimal node?
  • Asymmetric shares. New capability in the TREE_CONNECT response. Witness used to notify client to move.
  • Different connections for different shares in the same scale-out file server cluster. Share scope is the unit of resource location.
  • Client processes share-level “move” in the same fashion as a server-level “move” (disconnect, reconnects to IP, rebinds handle).
  • If the cost accessing the data is the same for all nodes, there is no need to move the client to another node.
  • Move-SmbWitnessClient will not work with asymmetric shares.
  • In Windows, asymmetric shares are typically associated with Mirrored Storage Spaces, not iSCSI/FC uniform deployment. Registry key to override.
  • Witness changes: Additional fields: Sharename, Flags, KeepAliveTimeOutInSeconds.
  • Witness changes: Multichannel notification request. Insight into arrival/loss of network interfaces.
  • Witness changes: Keepalive. Timeout for async IO are very coarse. Guarantees client and server discover lost peer in minutes instead of hours.
  • Demos in Jose’s blog. Thanks for the plug!
  • Diagnosability events. New always-on events. Example: failed to reconnect a persistent handle includes previous reconnect error and reason. New events on server and client.
  • If Asymmetric is not important to you, you don’t need to implement it.
  • SMB for IPC (Inter-process communications) – What happened to named pipes?
  • Named pipes over SMB has been declined in popularity. Performance concerns with serialized IO. But this is a property of named pipes, not SMB.
  • SMB provides: discovery, negotiation, authentication, authorization, message semantics, multichannel, RDMA, etc…
  • If you can abstract your application as a file system interface, you could extend it to removte via SMB.
  • First example: Remote Shared Virtual Disk Protocol
  • Second example: Hyper-V Live Migration over SMB. VID issues writes over SMB to target for memory pages. Leverages SMB Multichannel, SMB Direct.
  • Future thoughts on SMB for IPC. Not a protocol change or Microsoft new feature. Just ideas shared as a thought experiment.
    • MessageFs – User mode-client and user-mode server. Named Pipes vs. MessageFs. Each offset marks a distinct transaction, enables parallel actions.
    • MemFs – Kernel mode component on the server side. Server registers a memory region and clients can access that memory region.
    • MemFs+ - What if we combine the two? Fast exchange for small messages plus high bandwidth, zero copy access for large transfers. Model maps directly to RDMA: send/receive messages, read/write memory access.
  • One last thing… On Windows 8.1, you can actually disable SMB 1.0 completely.

Architecting Block and Object Geo-replication Solutions with Ceph
Sage Weil, Founder & CTO, Inktank

  • Impossible to take notes, speaker goes too fast :-)

1 S(a) 2 M 3 B(a) 4
Michael Adam, SerNet GmbH - Delivered by Volker

  • What is Samba? The open source SMB server (Samba3). The upcoming open source AD controller (Samba4). Two different projects.
  • Who is Samba? List of team members. Some 35 or so people… www.samba.org/samba/team
  • Development focus: Not a single concentrated development effort. Various companies (RedHat, SuSE, IBM, SerNet, …) Different interests, changing interests.
  • Development quality: Established. Autobuild selftest mechanism. New voluntary review system (since October 2012).
  • What about Samba 4.0 after all?
    • First (!?) open source Active Directory domain controller
    • The direct continuation of the Samba 3.6 SMB file server
    • A big success in reuniting two de-facto separated projects!
    • Also a big and important file server release (SMB 2.0 with durable handles, SMB 2.1 (no leases), SMB 3.0 (basic support)
  • History. Long slide with history from 2003-06-07 (Samba 3.0.0 beta 1) to 2012-12-11 (Samba 4.0.0). Samba4 switched to using SMB2 by default.
  • What will 4.1 bring? Current 4.1.0rc3 – final planned for 2013-09-27.
  • Samba 4.1 details: mostly stabilization (AD, file server). SMB2/3 support in smbclient, including SMB3 encryption. Server side copy. Removed SWAT.
  • Included in Samba 4.0: SMB 2.0 (durable handles). SMB 2.1 (multi-credit, large MTU, dynamic reauth), SMB 3.0 (signing, encryption, secure negotiate, durable handles v2)
  • Missing in Samba 4.0: SMB 2.1 (leasing*, resilient file handles), SMB 3.0 (persistent file handles, multichannel*, SMB direct*, witness*, cluster features, storage features*, …) *=designed, started or in progress
  • Leases: Oplocks done right. Remove 1:1 relationship between open and oplock, add lease/oplock key. http://wiki.samba.org/index.php/Samba3/SMB2#Leases
  • Witness: Explored protocol with Samba rpcclient implementation. Working on pre-req async RPC. http://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
  • SMB Direct:  Currently approaching from the Linux kernel side. See related SDC talk. http://wiki.samba.org/index.php/Samba3/SMB2#SMB_Direct
  • Multichannel and persistent handles: just experimentation and discussion for now. No code yet.

Keynote: The Impact of the NVM Programming Model
Andy Rudoff, Intel

  • Title is Impact of NVM Programming Model (… and Persistent Memory!)
  • What do we need to do to prepare, to leverage persistent memory
  • Why now? Programming model is decades old!
  • What changes? Incremental changes vs. major disruptions
  • What does this means to developers? This is SDC…
  • Why now?
  • One movements here: Block mode innovation (atomics, access hints, new types of trim, NVM-oriented operations). Incremental.
  • The other movement: Emerging NVM technologies (Performance, performance, perf… okay, Cost)
  • Started talking to companies in the industry  SNIA NVM Programming TWG - http://snia.org/forums/sssi/nvmp
  • NVM TWG: Develop specifications for new software “programming models”as NVM becomes a standard feature of platforms
  • If you don’t build it and show that it works…
  • NVM TWG: Programming Model is not an API. Cannot define those in a committee and push on OSVs. Cannot define one API for multiple OS platforms
  • Next best thing is to agree on an overall model.
  • What changes?
  • Focus on major disruptions.
  • Next generation scalable NVM: Talking about resistive RAM NVM options. 1000x speed up over NND, closer do DRAM.
  • Phase Change Memory, Magnetic Tunnel Junction (MT), Electrochemical Cells (ECM), Binary Oxide Filament Cells, Interfacial Switching
  • Timing. Chart showing NAND SATA3 (ONFI2, ONFI3), NAND PCIe Gen3 x4 ONFI3 and future NVM PCIE Gen3 x4.
  • Cost of software stack is not changing, for the last one (NVM PCIe) read latency, software is 60% of it?!
  • Describing Persistent Memory…
  • Byte-addressable (as far as programming model goes), load/store access (not demand-paged), memory-like performance (would stall a CPU load waiting for PM), probably DMA-able (including RDMA)
  • For modeling, think battery-backed RAM. These are clunky and expensive, but it’s a good model.
  • It is not tablet-like memory for the entire system. It is not NAND Flash (at least not directly, perhaps with caching). It is not block-oriented.
  • PM does not surprise the program with unexpected latencies (no major page faults). Does not kick other things out of memory. Does not use page cache unexpectedly.
  • PM stores are not durable until data is flushed. Looks like a bug, but it’s always been like this. Same behavior that’s been around for decades. It’s how physics works.
  • PM may not always stay in the same address (physically, virtually). Different location each time your program runs. Don’t store pointers and expect them to work. You have to use relative pointers. Welcome to the world of file systems…
  • Types of Persistent Memory: Battery-backed RAM. DRAM saved on power failure. NVM with significant caching. Next generation NVM (still quite a bit unknown/emerging here).
  • Existing use cases: From volatile use cases (typical) to persistent memory use case (emerging). NVDIMM, Copy to Flash, NVM used as memory.
  • Value: Data sets with no DRAM footprint. RDMA directly to persistence (no buffer copy required!). The “warm cache” effect. Byte-addressable. Direct user-mode access.
  • Challenges: New programming models, API. It’s not storage, it’s not memory. Programming challenges. File system engineers and database engineers always did this. Now other apps need to learn.
  • Comparing to the change that happened when we switched to parallel programming. Some things can be parallelized, some cannot.
  • Two persistent memory programming models (there are four models, more on the talk this afternoon).
  • First: NVM PM Volume mode. PM-aware kernel module. A list of physical ranges of NVMs (GET_RANGESET).
  • For example, used by file systems, memory management, storage stack components like RAID, caches.
  • Second: NVM PM File. Uses a persistent-memory-aware file system. Open a file and memory map it. But when you do load and store you go directly to persistent memory.
  • Native file APIs and management. Did a prototype on Linux.
  • Application memory allocation. Ptr=malloc(len). Simple, familiar interface. But it’s persistent and you need to have a way to get back to it, give it a name. Like a file…
  • Who uses NVM.PM.FILE. Applications, must reconnect with blobs of persistence (name, permissions)
  • What does it means to developers?
  • Mmap() on UNIX, MapViewOfFile() on Windows. Have been around for decades. Present in all modern operating systems. Shared or Copy-on-write.
  • NVM.PM.FILE – surfaces PM to application. Still somewhat raw at this point. Two ways: 1-Build on it with additional libraries. 2-Eventually turn to language extensions…
  • All these things are coming. Libraries, language extensions. But how does it work?
  • Creating resilient data structures. Resilient to a power failure. It will be in state you left it before the power failure. Full example: resilient malloc.
  • In summary: models are evolving. Many companies in the TWG. Apps can make a big splash by leveraging this… Looking forward to libraries and language extensions.

Keynote: Windows Azure Storage – Scaling Cloud Storage
Andrew Edwards, Microsoft

  • Turning block devices into very, very large block devices. Overview, architecture, key points.
  • Overview
  • Cloud storage: Blobs, disks, tables and queues. Highly durable, available and massively scalable.
  • 10+ trillion objects. 1M+ requests per seconds average. Exposed via easy and open REST APIs
  • Blobs: Simple interface to retrieve files in the cloud. Data sharing, big data, backups.
  • Disks: Built on top on blobs. Mounted disks as VHDs stored on blobs.
  • Tables: Massively scalable key-value pairs. You can do queries, scan. Metadata for your systems.
  • Queues: Reliable messaging system. Deals with failure cases.
  • Azure is spread all over the world.
  • Storage Concepts: Accounts  ContainerBlobs/TableEntities/QueuesMessages. URLs to identify.
  • Used by Microsoft (XBOX, SkyDrive, etc…) and many external companies
  • Architecture
  • Design Goals: Highly available with strong consistency. Durability, scalability (to zettabytes). Additional information in the SOSP paper.
  • Storage stamps: Access to blog via the URL. LB  Front-end  Partition layer  DFS Layer. Inter-stamp partition replication.
  • Architecture layer: Distributed file system layer. JBODs, append-only file system, each extent is replicated 3 times.
  • Architecture layer: Partition layer. Understands our data abstractions (blobs, queues, etc). Massively scalable index. Log Structure Merge Tree. Linked list of extents
  • Architecture layer: Front-end layer. REST front end. Authentication/authorization. Metrics/logging.
  • Key Design Points
  • Availability with consistency for writing. All writes we do are to a log. Append to the last extent of the log.
  • Ordered the same across all 3 replicas. Success only if 3 replicas are commited. Extents get sealed (no more appends) when they get to a certain size.
  • If you lose a node, seal the old two copies, create 3 new instances to append to. Also make a 3rd copy for the old one.
  • Availability with consistency for reading. Can read from any replica. Send out parallel read requests if first read is taking higher than 95% latency.
  • Partition Layer: spread index/transaction processing across servers. If there is a hot node, split that part of the index off. Dynamically load balance. Just the index, this does not move the data.
  • DFS Layer: load balancing there as well. No disk or node should be hot. Applies to both reads and writes. Lazily move replicas around to load balancing.
  • Append only system. Benefits: simple replication, easier diagnostics, erasure coding, keep snapshots with no extra cost, works well with future dirve technology. Tradeoff: GC overhead.
  • Our approach to the CAP theorem. Tradeoff in Availability vs. Consistency. Extra flexibility to achieve C and A at the same time.
  • Lessons learned: Automatic load balancing. Adapt to conditions. Tunable and extensible to tune load balancing rules. Tune based on any dimension (CPU, network, memory, tpc, GC load, etc.)
  • Lessons learned: Achieve consistently low append latencies. Ended up using SSD journaling.
  • Lessons learned: Efficient upgrade support. We update frequently, almost consistently. Handle them almost as failures.
  • Lessons learned: Pressure point testing. Make sure we’re resilient despite errors.
  • Erasure coding. Implemented at the DFS Layer. See last year’s SDC presentation.
  • Azure VM persistent disks: VHDs for persistent disks are directly stored in Windows Azure Storage blobs. You can access your VHDs via REST.
  • Easy to upload/download your own VHD and mount them. REST writes are blocked when mounted to a VM. Snapshots and Geo replication as well.
  • Separating compute from storage. Allows them to be scaled separately. Provide flat network storage. Using a Quantum 10 network architecture.
  • Summary: Durability (3 copies), Consistency (commit across 3 copies). Availability (can read from any of the 3 relicas). Performance/Scale.
  • Windows Azure developer website: http://www.windowsazure.com/en-us/develop/net
  • Windows Azure storage blog: http://blogs.msdn.com/b/windowsazurestorage
  • SOSP paper/talk: http://blogs.msdn.com/b/windowsazure/archive/2011/11/21/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx

SMB Direct update
Greg Kramer, Microsoft
Tom Talpey, Microsoft

  • Two parts: 1 - Tom shares Ecosystem status and updates, 2 - Greg shares SMB Direct details
  • Protocols and updates: SMB 3.02 is a minor update. Documented in MS-SMB2 and MS-SMBD. See Dave's talk yesterday.
  • SMB Direct specifies the SMB3 RDMA transport, works with both SMB 3.0 and SMB 3.02
  • Windows Server 2012 R2 – GA in October, download from MSDN
  • Applications using SMB3 and SMB Direct: Hyper-V VHD, SQL Server
  • New in R2: Hyper-V Live Migration over SMB, Shared VHDX (remote shared virtual disk, MS-RSVD protocol)
  • RDMA Transports: iWARP (IETF RDMA over TCP), InfiniBand, RoCE. Ethernet: iWARP and RoCE – 10 or 40GbE, InfiniBand: 32Gbps (QDR) or 54Gbps (FDR)
  • RDMA evolution: iWARP (IETF standard, extensions currently active in IETF). RoCE (routable RoCE to improve scale, DCB deployment still problematic). InfiniBand (Roadmap to 100Gbps, keeping up as the bandwidth/latency leader).
  • iWARP: Ethernet, routable, no special fabric required, Up to 40GbE with good latency and full throughput
  • RoCE: Ethernet, not routable, requires PFC/DCB, Up to 40GbE with good latency and full throughput
  • InfinBand: Specialized interconnect, not routable, dedicated fabric and switching, Up to 56Gbps with excellent latency and throughput
  • SMB3 Services: Connection management, authentication, multichannel, networking resilience/recovery, RDMA, File IO Semantics, control and extension semantics, remote file system access, RPC
  • The ISO 7-layer model: SMB presents new value as a Session layer (RDMA, multichannel, replay/recover). Move the value of SMB up the stack.
  • SMB3 as a session layer: Applications can get network transparency, performance, recovery, protection (signing, encryption, AD integration). Not something you see with other file systems or file protocols.
  • Other: Great use by clustering (inter-node communication), quality of service, cloud deployment
  • In summary. Look to SMB for even broader application (like Hyper-V Live Migration did). Broader use of SMB Direct. Look to see greater application “fidelity” (sophisticated applications transparently server by SMB3)
  • Protocol enhancements and performance results
  • Where can we reduce IO costs? We were extremely happy about performance, there was nothing extremely easy to do next, no low-hanging fruit.
  • Diagram showing the App/SMB client/Client RNIC/Server RNIC. How requests flow in SMB Direct.
  • Interesting: client has to wait for the invalidation completion. Invalidation popped up as an area of improvement. Consumes cycles, bus. Adds IO, latency. But it’s required.
  • Why pend IO until invalidation is completed? This is storage, we need to be strictly correct. Invalidation guarantees: data is in a consistent state after DMA, peers no longer has access.
  • Registration caches cannot provides these guarantees, leading to danger of corruption.
  • Back to the diagram. There is a way to decorate a request with the invalidation  Send and Invalidate. Provides all the guarantees that we need!
  • Reduces RNIC work requests per IO by one third for high IOPs workload. That’s huge! Already supported by iWARP/RoCE/InfiniBand
  • No changes required at the SMB Direct protocol. Minor protocol change in SMB 3.02 to support invalidation. New channel value in the SMB READ and SMB WRITE.
  • Using Send and Invalidate (Server). Only one invalidate per request, have to be associated with the request in question. You can leverage SMB compounding.
  • Only the first memory descriptor in the SMB3 read/write array may be remotely invalidated. Keeping it simple.
  • Using Send and Invalidate (Client). Not a mandate, you can still invalidate “manually” if not using remote invalidate. Must validate that the response matches.
  • Performance Results (drumroll…)
  • Benchmark configuration: Client and Server config: Xeon E5-2660. 2 x ConnectX-3 56Gbps InfiniBand. Shunt filter in the IO path. Comparing WS2012 vs. WS2012 R2 on same hardware.
  • 1KB random IO. Uses RDMA send/receive path. Unbuffered, 64 queue depth.
    • Reads: 881K IOPs. 2012 R2 is +12.5% over 2012. Both client and server CPU/IO reduced (-17.3%, -36.7%)
    • Writes: 808K IOPs. 2012 R2 is +13.5% over 2012. Both client and server CPU/IO reduced (-16%, -32.7%)
  • 8KB random IO. Uses RDMA read/writes. Unbuffered, 64 queue depth.
    • Reads: 835K IOPs. 2012 R2 is +43.3% over 2012. Both client and server CPU/IO reduced (-37.1%, -33.2%)
    • Writes: 712K IOPs. 2012 R2 is +30.2% over 2012. Both client and server CPU/IO reduced (-26%, -14.9%)
  • 512KB sequential IO. Unbuffered, 12 queue depth. Already maxing out before. Remains awesome. Minor CPU utilization decrease.
    • Reads: 11,366 MBytes/sec. 2012 R2 is +6.2% over 2012. Both client and server CPU/IO reduced (-9.3%, -14.3%)
    • Writes: 11,412 MBytes/sec: 2012 R2 is +6% over 2012. Both client and server CPU/IO reduced (-12.2%, -10.3%)
  • Recap: Increased IOPS (up to 43%) and high bandwidth. Decrease CPU per IO (up to 36%).
  • Client has more CPU for applications. Server scales to more clients.
  • This includes other optimization in both the client in the server. NUMA is very important.
  • No new hardware required. No increase number of connections, MRs, etc.
  • Results reflect the untuned, out-of-the-box customer experience.
  • One more thing… You might be skeptical, especially about the use of shunt filter.
  • We never get to see this in our dev environment, we don’t have the high end gear. But...
  • Describing the 3U Violin memory array running Windows Server in a clustered configuration. All flash storage. Let’s see what happens…
  • Performance on real IO going to real, continuously available storage:
    • 100% Reads – 4KiB: >1Million IOPS
    • 100% Reads – 8KiB: >500K IOPS
    • 100% Writes – 4KiB: >600K IOPS
    • 100% Writes – 8KiB: >300K IOPS
  • Questions?

A Status Report on SMB Direct (RDMA) for Samba
Richard Sharpe, Samba Team Member, Panzura

  • I work at Panzura but has been done on my weekends
  • Looking at options to implement SMB Direct
  • 2011 – Microsoft introduced SMB direct at SDC 2011. I played around with RDMA
  • May 2012 – Tutorial on SMB 3.0 at Samba XP
  • Mellanox supplied some IB cards to Samba team members
  • May 2013 – More Microsoft presentations with Microsoft at Samba XP
  • June 2013 – Conversations with Mellanox to discuss options
  • August 2013 – Started circulating a design document
  • Another month or two before it’s hooked up with Samba.
  • Relevant protocol details: Client connections via TCP first (port 445). Session setup, connects to a share. Queries network interfaces. Place an RDMA Connection to server on port 5445, brings up SMB Direct Protocol engine
  • Client sends negotiate request, Dialect 0x300, capabilities field. Server Responds.
  • Diagram with SMB2 spec section 4.8 has an example
  • SMB Direct: Small protocol - Negotiate exchange phase, PDU transfer phase.
  • Structure of Samba. Why did it take us two years? Samba uses a fork model. Master smbd forks a child. Easy with TCP. Master does not handle SMB PDUs.
  • Separate process per connection. No easy way to transfer connection between them.
  • Diagram with Samba structure. Problem: who should listen on port 5445? Wanted RDMA connection to go to the child process.
  • 3 options:
  • 1 - Convert Samba to a threaded model, everything in one address space. Would simplify TCP as well. A lot of work… Presents other problems.
  • 2 - Separate process to handle RDMA. Master dmbd, RDMA handler, multiple child smbd, shared memory. Layering violation! Context switches per send/receive/read/write. Big perf hit.
  • 3 - Kernel driver to handle RDMA. Smbdirect support / RDMA support (rdmacm, etc) / device drivers. Use IOCTLs. All RDMA work in kernel, including RDMA negotiate on port 5445. Still a layering violation. Will require both kernel and Samba knowledge.
  • I decided I will follow this kernel option.
  • Character mode device. Should be agnostic of the card used. Communicate via IOCTLs (setup, memory params, send/receive, read/write).
  • Mmap for RDMA READ and RDMA WRITE. Can copy memory for other requests. Event/callback driven. Memory registration.
  • The fact that is looks like a device driver is a convenience.
  • IOCTLs: set parameters, set session ID, get mem params, get event (includes receive, send complete), send pdu, rdma read and write, disconnect.
  • Considering option 2. Doing the implementation of option 3 will give us experience and might change later.
  • Amortizing the mode switch. Get, send, etc, multiple buffers per IOCTL. Passing an array of objects at a time.
  • Samba changes needed….
  • Goals at this stage: Get something working. Allow others to complete. It will be up on github. Longer term: improve performance with help from others.
  • Some of this work could be used by the SMB client
  • Status: A start has been made. Driver loads and unloads, listens to connections. Working through the details of registering memory. Understand the Samba changes needed.
  • Weekend project! http://github.com/RichardSharpe/smbdirect-driver
  • Acknowledgments: Microsoft. Tom Talpey. Mellanox. Or Gerlitz. Samba team members.

CDMI and Scale Out File System for Hadoop
Philippe Nicolas, Scality

  • Short summary of who is Scality. Founded 2009. HQ in SF. ~60 employees, ~25 engineers in Paris. 24x7 support team. 3 US patents. $35Min 3 rounds.
  • Scality RING. Topology and name of the product. Currently in the 4.2 release. Commodity servers and storage. Support 4 LINUX distributions. Configure Scality layer. Create a large pool of storage.
  • Ring Topology. End-to-end Paralelism. Object Storage. NewSQL DB. Replication. Erasure coding. Geo Redundancy. Tiering. Multiple access methods (HTTP/REST, CDMI, NFS, CIFS, SOFS). GUI/CLI management.
  • Usage: e-mail, file storage, StaaS, Digital Media, Big Data, HPC
  • Access methods: APIs: RS2 (S3 compatible API), Sproxyd, RS2 light, SNIA CDMI. File interface: Scality Scale Out File System (SOFS), NFS, CIFS, AFP, FTP. Hadoop HDFS. OpenStack Cinder (since April 2013).
  • Parallel network file system. Limits are huge – 2^32 volumes (FS), 2^24 namespaces, 2^64 files. Sparse files. Aggregated throughput, auto-scaling with storage or access node addition.
  • CDMI (path and ID based access). Versions 1.0.1., 1.0.2. On github. CDMI client java library (CaDMIum), set of open source filesystem tools. On github.
  • Apache Hadoop. Transform commodity hard in data storage service. Largely supported by industry and end user community. Industry adoption: big names adopting Hadoop.
  • Scality RING for Hadoop. Replace HDFS with the Scality FS. We validate Hortonworks and Cloudera. Example with 12 Hadoop nodes for 12 storage nodes. Hadoop task trackers on RING storage nodes.
  • Data compute and storage platform in ONE cluster. Scality Scale Out File System (SOFS)  instead of HDFS. Advanced data protection (data replication up to 6 copies, erasure coding). Integration with Hortonworks HDP 1.0 & Cloudera CDH3/CDH4. Not another Hadoop distribution!
  • Summary: This is Open Cloud Access: access local or remotely via file and block interface. Full CDMI server and client. Hadoop integration (convergence approach). Comprehensive data storage platform.

Introduction to HP Moonshot
Tracy Shintaku, HP

  • Today’s demands – pervasive computing estimates. Growing internet of things (IoT).
  • Using SoC technologies used in other scenarios for the datacenter.
  • HP Moonshot System. 4U. World’s first low-energy software-defined server. HP Moonshot 1500 Chassis.
  • 45 individually serviceable hot-plug artrdges. 2 network switches, private fabric. Passive base plane.
  • Introducing HP Proliant Moonshot Server (passing around the room). 2000 of these servers in a rack. Intel Atom S1260 2GHz, 8GB DDR ECC 1333MHz. 500GB or 1TBHDD or SSD.
  • Single server = 45 servers per chassis. Quad-server = 180 servers per chassis. Compute, storage or combination. Storage cartridges with 2 HDD shared by 8 servers.
  • Rear view of the chassis. Dual 4QSFP network uplinks each with 4 x 40GB), 5 hot-plug fans, Power supplies, management module.
  • Ethernet – traffic isolation and stacking for resiliency with dual low-latency switches. 45 servers  dual switches  dual uplink modules.
  • Storage fabric. Different module form factors allow for different options: Local storage. Low cost boot and logging. Distributed storage and RAID. Drive slices reduce cost of a boot drive 87%.
  • Inter-cartridge private 2D Taurus Ring – available in future cartridges. High speed communication lanes between servers. Ring fabric, where efficient localized traffic is benefitial.
  • Cartridge roadmap. Today to near future to future. CPU: Atom  Atom, GPU, DSP, x86, ARM. Increasing variety of workloads static web servers now to hosting, financial servers in the future.
  • Enablement: customer and partner programs. Partner program. Logo wall for technology partners. Solution building program. Lab, services, consulting, financing.
  • Partners include Redhat, Suse, Ubuntu, Hortonworks, MapR, Cloudera, Couchbase, Citrix, Intel, AMD, Calxeda, Applied Micro, TI, Marvell, others. There’s a lot of commonality with OpenStack.
  • Web site: http://h17007.www1.hp.com/us/en/enterprise/servers/products/moonshot/index.aspx 

NFS on Steroids: Building Worldwide Distributed File System
Gregory Touretsky, Intel

  • Intel is a big organization, 6500 Its @ 59 sites, 95,200 employes, 142000 devices.
  • Every employee doing design has a Windows machines, but also interact with NFS backend
  • Remote desktop in interactive pool. Talks to NFS file servers, glued together with name spaces.
  • Large batch pools that do testing. Models stored in NFS.
  • Various application servers, running various systems. Also NIS, Cron servers, event monitors, configuration management.
  • Uses Samba to provide access to NFS file servers using SMB.
  • Many sites, many projects. Diagram with map of the world and multiple projects spanning geographies.
  • Latency between 10s t 100s ms. Bandwidth: 10s Mbps to 10s of Gbpsp.
  • Challenge: how to get to our customers and provide the ability to collaborate across the globe in a secure way
  • Cross-site data access in 2012. Multiple file servers, each with multiple exports. Clients access servers on same site. Rsync++ for replication between sites (time consuming).
  • Global user and group accounts. Users belong to different groups in different sites.
  • Goals: Access every file in the world from anywhere, same path, fault tolerant, WAN friendly, every user account and group on every site, Local IO performance not compromised.
  • Options: OpenAFS (moved out many years ago, decided not to got back). Cloud storage (concern with performance). NFS client-side caching (does not work well, many issues). WAN optimization (have some in production, help with some protocols, but not suitable for NFS). NFS site-level caching (proprietary and open source NFS Ganesha). In house (decided not to go there).
  • Direct NFS mount over WAN optimized tunnel. NFS ops terminated at the remote site, multiple potential routes, cache miss. Not the right solution for us.
  • Select: Site-level NFS caching and Kerberos. Each site has NFS servers and Cache servers. Provides instant visibility and minimizes amount of data transfer across sites.
  • Cache is also writable. Solutions with write-through and write-back caching.
  • Kerberos authentication with NFS v3. There are some problems there.
  • Cache implementations: half a dozen vendors, many not suitable for WAN. Evaluating alternatives.
  • Many are unable to provide a disconnected mode of operation. That eliminated a number of vendors.
  • Consistency vs. performance. Attribute cache timeout. Nobody integrates this at the directory level. Max writeback delay.
  • Optimizations. Read cache vs. Read/Write cache, delegations, proactive attribute validation for hot files, cache pre-population.
  • Where is it problematic? Application is very NFS unfriendly, does not work well with caching. Some cases it cannot do over cache, must use replication.
  • Problems: Read once over high latency link. First read, large file, interactive work. Large % of non-cacheable ops (write-through). Seldom access, beyond cache timeout.
  • Caching is not a business continuity solution. Only a partial copy of the data.
  • Cache management implementation. Doing it at scale is hard. Nobody provides a solution that fits our needs.
  • Goal: self-service management for data caching. Today administrators are involved in the process.
  • Use cases: cache my disk at site X, modify cache parameters, remove cache, migrate source/cache, get cache statistics, shared capacity management, etc.
  • Abstract the differences between the different vendors with this management system.
  • Management system example: report with project, path (mount point), size, usage, cached cells. Create cell in specific site for specific site.
  • Cache capacity planning. Goal: every file to be accessible on-demand everywhere.
  • Track cache usage by org/project. Shared cache capacity, multi-tenant. Initial rule of thumb: 7-10% of the source capacity, seeding capacity at key locations
  • Usage models: Remote validation (write once, read many). Get results back from remote sites (write once, read once). Drop box (generate in one site, get anywhere). Single home directory (avoid home directory in every site for every user, cache remote home directories). Quick remote environment setup, data access from branch location.
  • NFS (RPC) Authentication. Comparing AUTH_SYS and RPCSEC_GSS (KRB5). Second one uses an external KDC, gets past the AUTH_SYS limitation of 16 group IDs.
  • Bringing Kerberos? Needs to make sure this works as well as Windows with Active Directory. Need to touch everything (Linux client, NFS file servers, SSH, batch scheduler, remote desktop/interactive servers, name space/automounter (trusted hosts vs. regular hosts), Samba (used as an SMB gateway to NFS), setuid/sudo, cron jobs and service accounts (keytab management system),
  • Supporting the transition from legacy mounts to Kerberos mount. Must support a mixed environment. Introducing second NIS domain.
  • Welcome on board GDA airlines (actual connections between different sites). Good initial feedback from users (works like magic!)
  • Summary: NFS can be accessed over WAN – using NFS caching proxy. NFSv3 environment can be kerberized (major effort is required, transition is challenging, it would be as challenging for NFSv5/KRB)

Forget IOPS: A Proper Way to Characterize & Test Storage Performance
Peter Murray, SwiftTest

  • About what we learned in the last few years
  • Evolution: Vendor IOPs claims, test in production and pray, validate with freeware tools (iometer, IOZone), validate with workload models
  • What is storage validation? Characterize the various applications, workloads. Diagram: validation appliance, workload emulations, storage under test.
  • Why should you care? Because customers do care! Product evaluations, vendor bakeoffs, new feature and technology evaluations, etc…
  • IOPS: definition from the SNIA dictionary. Not really well defined. One size does not fit all. Looking at different sizes.
  • Real IO does not use a fixed size. Read/write may be a small portion of it in certain workloads. RDMA read/write may erode the usefulness of isolated read/write.
  • Metadata: data about data. Often in excess of 50%, sometimes more than 90%. GoDaddy mentioned that 94% of workloads are not read/write.
  • Reducing metadata impact: caching with ram, flash, SSD helps but it’s expensive.
  • Workloads: IOPS, metadata and your access pattern. Write/read, random/sequential, IO/metadata, block/chunk size, etc.
  • The important of workloads: Understand overload and failure conditions. Understand server, cluster, deduplication, compression, network configuration and conditions
  • Creating and understanding workloads. Access patterns (I/O mix: read/write%, metadata%) , file system (depth, files/folder, file size distribution), IO parameters (block size, chunk size, direction), load properties (number of users, actions/second, load variability/time)
  • Step 1 - Creating a production model. It’s an art, working to make it a science. Production stats + packet captures + pre-built test suites = accurate, realistic work model.
  • Looking at various workload analysis
  • Workload re-creation challenges: difficult. Many working on these. Big data, VDI, general VM, infinite permutations.
  • Complex workloads emulation is difficult and time consuming. You need smart people, you need to spend the time.
  • Go Daddy shared some of the work on simulation of their workload. Looking at diagram with characteristics of a workload.
  • Looking at table with NFSv3/SMB2 vs. file action distribution.
  • Step 2: Run workload model against the target.
  • Step 3: Analyze the results for better decisions. Analytics leads to Insight. Blocks vs. file. Boot storm handing, limits testing, failure modes, effects of flash/dedup/tiering/scale-out.
  • I think we’ll see dramatic changes with the use of Flash. Things are going to change in the next few years.
  • Results analysis: Performance. You want to understand performance, spikes during the day, what causes them. Response times, throughput.
  • Results analysis: Command mix. Verify that the execution reflects the expected mix. Attempts, successes, errors, aborts.
  • Summary: IOPs alone cannot characterize real app storage performance. Inclusion of metadata is essential, workload modeling and purpose-build load generation appliances are the way to emulate applications. The more complete the emulation, the deeper the understanding.
  • If we can reduce storage cost from 40% to 20% of the solution by better understanding the workload, you can save a lot of money.

pNFS, NFSv4.1, FedFS and Future NFS Developments
Tom Haynes, NetApp

  • Tom covering for Alex McDonald, who is sick. His slides.
  • We want to talk about how the protocol get defined, how it interfact with different application vendors and customers.
  • Looking at what is happening on the Linux client these days.
  • NFS: Ubiquitous and everywhere. NFSv3 is very successful, we can’t dislodge it. We though everyone would go for NFSv4 and it’s now 10 years later…
  • NFSv2 in 1983, NFSv3 in 1995, NFSv4 in 2003, NFSv4.1 in 2010. NFSv4.2 to be agreed at the IETF – still kinks in the protocol that need to be ironed out. 2000=DAS, 2010=NAS, 2020=Scale-Out
  • Evolving requirements. Adoption is slow. Lack of clients was a problem with NFSv4. NFSv3 was just “good enough”. (It actually is more than good enough!)
  • Industry is changing, as are requirements. Economic trends (cheap and fast cluster, cheap and fast network, etc…)
  • Performance: NFSv3 single threaded bottlenecks in applications (you can work around it).
  • Business requirements. Reliability (sessions) is a big requirement
  • NFSv4 and beyond.
  • Areas for NFSv4, NFSv4.1 and pNFS: Security, uniform namespaces, statefulness/sessions, compound operations, caching (directory and file delegations) parallelization (layout and pFNS)
  • Future NFSv4.2 and FedFS (Global namespace; IESG has approved Dec 2012)
  • NFSv4.1 failed to talk to the applications and customers and ask what they needed. We did that for NFSv4.2
  • Selecting the application for NFSv4.1, planning, server and client availability. High level overview
  • Selecting the parts: 1 – NFSv4.1 compliant server (Files, blocks or objects?), 2-compliant client. The rise of the embedded client (Oracle, VMware). 3 – Auxiliary tools (Kerberos, DNS, NTP, LDAP). 4 – If you can, use NFS v4.1 over NFSv4.
  • If you’re implementing something today, skip NFS v4 and go straight to NFS v4.1
  • First task: select an application: Home directories, HPC applications.
  • Don’t select: Oracle (use dNFS built in), VMware and other virtualization tools (NFSv3). Oddball apps that expect to be able to internally manage NFSv3 “maps”. Any application that required UDP, since v4.1 doesn’t support anything but TCP.
  • NSFv4 stateful clients. Gives client independence (client has state). Allows delegation and caching. No automounter required, simplified locking
  • Why? Compute nodes work best with local data, NFSv4 eliminates the need for local storage, exposes more of the backed storage functionality (hints), removes stale locks (major source of NFSv3 irritation)
  • NFSv4.1 Delegations. Server delegates certain responsabilities to the client (directory and file, caching). Read and write delegation. Allows clients to locally service operations (open, close, lock, etc.)
  • NFSv4.1 Sessions. In v3, server never knows if client got the reply message. In v4.1, sessions introduced.
  • Sessions: Major protocol infrastructure change. Exactly once semantics (EOS), bounded size of reply cache. Unlimited parallelism. Maintains server’s state relative to the connections belonging to a client.
  • Use delegation and caching transparently; client and server provide transparency. Session lock clean up automatically.
  • NFSv4 Compound operations – NFSv3 protocol can be “chatty”, unsuitable for WANs with poor latency. Typical NFSv3: open, read & close a file. Compounds many operations into one to reduce wire time and simple error recovery.
  • GETATTR is the bad boy. We spent 10 years with the Linux client to get rid of many of the GETATTR (26% of SPECsfs2008).
  • NFSv4 Namespace. Uniform and “infinite” namespace. Moving from user/home directories to datacenter and corporate use. Meets demand for “large scale” protocol. Unicode support for UTF-8 codepoints. No automounter required (simplifies administration). Pseudo-file system constructed by the server.
  • Looking at NFSv4 Namespace example. Consider the flexibility of pseudo-filesystems to permit easier migration.
  • NFSv4 I18N Directory and File Names. Uses UTF-8, check filenames for compatibility, review filenames for compatibility. Review existing NFSv3 names to ensure they are 7-bit ASCII clean.
  • NFSv4 Security. Strong security framework. ACLs for security and Windows compatibility. Security with Kerberos. NFSv4 can be implemented without Kerberos security, but not advisable.
  • Implementing without Kerberos (no security is a last resort!). NFSv4 represents users/groups as strings (NFSv3 used 32-bit integers, UID/GID). Requires UID/GID to be converted to all numeric strings.
  • Implementing with Kerberos. Find a security expert. Consider using Windows AD Server.
  • NFSv4 Security. Firewalls. NFSv4 ha no auxiliary protocols. Uses port 2049 with TCP only. Just open that port.
  • NFSv4 Layouts. Files, objects and block layouts. Flexibility for storage that underpins it. Location transparent. Layouts available from various vendors.
  • pNFS. Can aggregate bandwidth. Modern approach, relieves issues associated with point-to-point connections.
  • pNFS Filesystem implications.
  • pNFS terminology. Important callback mechanism to provide information about the resource.
  • pNFS: Commercial server implementations. NetApp has it. Panasas is the room as well. Can’t talk about other vendors…
  • Going very fast through a number of slides on pNFS: NFS client mount, client to MDS, MDS Layout to NFS client, pNFS client to DEVICEINFO from MDS,
  • In summary: Go adopt NFS 4.1, it’s the greatest thing since sliced bread, skip NFS 4.0
  • List of papers and references. RFCs: 1813 (NFSv3), 3530 (NFSv4), 5661 (NFSv4.1), 5663 (NFSv4.1 block layout), 5664 (NFSv4.1 object layout)

pNFS Directions / NFSv4 Agility
Adam Emerson, CohortFS, LLC

  • Stimulate discussion about agility as a guiding vision for future protocol evaluation
  • NFSv4: A standard file acces/storage protocol, that is agile
  • Incremental advances shouldn’t require a new access protocol. Capture more value from the engineering already done. Retain broad applicability, yet adapt quickly to new challenges/opportunities
  • NFSv4 has delivered (over 10+ years of effort) on a set of features designers had long aspired to: atomicity, consistency, integration, referrals, single namespaces
  • NFSv4 has sometimes been faulted for delivering slowly and imperfect on some key promises: flexible and easy wire security , capable and interoperable ACLs, RDMA acceleration
  • NFSv4 has a set of Interesting optional features not widely implemented: named attributes, write delegations, directory delegations, security state verifier, retention policy
  • Related discussion in the NFSv4 Community (IETF): The minor version/extension debate: de-serializing independent, potentially parallel extension efforts, fixing defect in prior protocol revisions, rationalizing past and future extension mechanisms
  • Related discussion in the NFSv4 Community (IETF): Extensions drafts leave my options open, but prescribes: process to support development of new features proposals in parallel, capability negotiation, experimentation
  • Embracing agility: Noveck formulation is subtle: rooted in NFS and WG, future depends on participants, can encompass but perhaps does not call out for an agile future.
  • Capability negotiation and experimental codepoint ranges strongly support agility. What we really want is a model that encourages movement of features from private experimentation to shared experimentation to standardization.
  • Efforts promoting agility: user-mode (and open source) NFSv4 servers (Ganesha, others?) and clients (CITI Windows NFSv4.1 client, library client implementations)
  • Some of the people in the original CITI team now working with us and are continuing to work on it
  • library client implementations: Allow novel semantics and features like pre-seeding of files, HPC workloads, etc.
  • NFSv4 Protocol Concepts promoting agility: Not just new RPCs and union types.
  • Compound: Grouping operations with context operations. Context evolves with operations and inflects the operations. It could be pushed further…
  • Named Attributes: Support elaboration of conventions and even features above the protocol, with minimal effort and coordination. Subfiles, proplists. Namespace issues: System/user/other, non-atomicity, not inlined.
  • Layout: Powerful structuring concept carrying simplified transaction pattern. Typed, Operations carry opaque data nearly everywhere, application to data striping compelling.
  • Futures/experimental work – some of them are ridicuolous and I apologize in advance
  • pNFS striping flexibility/flexible files (Halevy). Per-file striping and specific parity applications to file layout. OSDv2 layout, presented at IETF 87.
  • pNFS metastripe (Eisler, further WG drafts). Scale-out metadata and parallel operations for NFSv4. Generalizing parallel access concept of NFSv4 for metadata. Built on layout and attribute hints. CohortFS prototyping metastripe on a parallel version of the Ceph file system. NFSv4 missing a per-file redirect, so this has file redirection hints.
  • End-to-end Data Integrity (Lever/IBM). Add end-to-end data integrity primitives (NFSv4.2). Build on new READ_PLUS and WRITE ops. Potentially high value for many applications.
  • pNFS Placement Layouts (CohortFS). Design for algorithmic placement in pNFS layout extension. OSD selection and placement computed by a function returned at GETDEVICEINFO. Client execution of placement codes, complex parity, volumes, etc.
  • Replication Layouts (CohortFS). Client based replication with integrity. Synchronous wide-area replication. Built on Layout.
  • Client Encryption (CohortFS). Relying on named attribute extension only, could use atomicity. Hopefully combined with end-to-end integrity being worked on
  • Cache consistency. POSIX/non-CTO recently proposed (eg, Eshel/IBM). Potentially, more generality. Eg, flexible client cache consistency models in NFSv4. Add value to existing client caching like CacheFS.
  • New participants. You? The future is in the participants…

Closing Keynote: Worlds Colliding: Why Big Data Changes How to Think about Enterprise Storage
Addison Snell, CEO, Intersect360 Research

  • Analyst, researching big data. There are hazard of forecasting big data. Big Data goes beyond Hadoop.
  • What is the real opportunity? Hype vs. Reality. Perception: HPC is not something I want in my Enterprise computing environment
  • Technical computing vs. Enterprise computing. Both defined as mission-critical.
  • Vendors do not know how much money is going on each, we need to do a lot of user research.
  • Technical computing is about the top-line mission of the business. Enterprise computing is about keeping the business running.
  • TC is driven by price/performance, faster adoption of new tech. EC is driven by RAS (reliability, availability, serviceability), slow adoption of new tech.
  • Survey results, 278 respondents, April to August 2013. Built on earlier surveys. 178 technical, 100 enterprise, 165 commercial, 67 academic, 46 government
  • Insight 1: big data is a big opportunity. Money is being spent. But people that do not do big data are not likely to answer the survey…
  • Some to spend 25% of total IT budget in 2013. Use caution when describing the big data market. Not sure about what’s being counted.
  • What are big data applications? Vendors talk “Hadoop” and “graph”. Users talk about “analyze” and “algorithm”. ISV software for big data is thinly scattered.
  • Insight #2: it’s not just Hadoop: Out of 574 apps mentioned in the survey, about half is internally developed and this grew from last year. Only 75 mentioned Hadoop in 2012 and it’s going down.
  • It’s like HPC, top 20 applications cover only 40% of the market. It’s like measuring the tail of a snake…
  • Defining the opportunity looking at IT budget growth. Expect two year change in IT budget. Look at “satisfaction gaps”.
  • Insight #3: Performance counts. Top three satisfaction gaps are IO performance, storage capacity and RAS. For both TC and EC.
  • HPC-like mentality is starting to creep into the enterprise. Big data will be a driver for expanded usage of HPC, IF they can still meet enterprise requirements.
  • Faulty logic: I like sushi, I like ice cream, therefore  I like sushi-flavored ice cream… big data and public cloud not necessarily go together.
  • Maybe private cloud is better. Lead with private cloud and burst some of it to the public cloud.
  • Technologies in the discussion. Storage beyond spinning disks (flash for max IO, tape for capacity), parallel file systems, high speed fabrics (InfiniBand), MPI, large shared memory spaces, accelerators.

Tunneling SCSI over SMB: Shared VHDX files for Guest Clustering in Windows Server 2012 R2
Jose Barreto, Microsoft
Matt Kurjanowicz, Microsoft

  • Could not take notes while presenting :-)
  • It was really good, though…

An SMB3 Engineer’s View of Windows Server 2012 Hyper-V Workloads
Gerald Carter, EMC

  • FAS paper on Virtual Machine workloads
  • Does the current workload characterization done for NFS work for SMB3?
  • What do the IO and jump distance patterns look like?
  • What SMB3 protocol features does Hyper-V use?
  • Original scope abstract was reduced due to time constraint and hardware availability (no SMB Direct, no SMB Multichannel, no failover)
  • Building of tools that look into this using just pcap files.
  • Hardware: couple of quad-core Intel Q9650 @ 3GHz, 16GB RAM, SATA II disks, networking is 1Gbps, Windows Server 2012 (not R2)
  • OS on the VMs: Ubuntu 12.04 x64, Windows XP SP3 x86, Windows Server 2012 Standard
  • Deployment to SMB file share. Sharing PowerShell scripts.
  • Scenarios. 1: Boot host idle for 10 minutes. Scenario 2: Build Linux Kernel. 3: Random file operations (different size file copies)
  • Cold boot, start packet capture, cold boot hyper-v host, launch vm, cleanly shutdown scenario, stop packet capture.
  • Wanted to understand distribution. Read/write mix, how much of each command was used. Distribution of read size/write size/jumps.
  • Sharktools: Python extension library for data analysis. Example code: import pyshark.
  • Dictionary composition: request/response by exchanges by file handles.
  • Scenario 1 - Single client boot
  • Boot, leave it running for 10 minutes, then shutdown.
  • Huge table with lots of results. You expected lots of reads and writes. Large number of query info – hypervisor queries configuration files over and over. Lots of reads (second place), then writes.
  • Another huge table with command occurrences. More IOs in Windows Server 2012 than on Windows XP and Linux.
  • VHDX handle (one slide for Linux, Windows XP, Windows Server 2012). One handle  has most of the operations.
  • Did not look into the create options.
  • IO Size distribution (one slide for Linux, Windows XP, Windows Server 2012). Theory: read and write sizes relate to the latency of the network. Most frequent; 4KB, 32KB, 128KB.
  • Jump distance. Tables wit Time, Jump distance, offset, length. Scatter plot for the three operating systems.
  • Trying to measure the degree of randomness over time. Linux is more sequential, maybe due to the nature of the file system. Not authoritative, just an observation.
  • Comment from audience: different styles of booting Linux, some more sequential read, some more parallel.
  • Boot – single host summary. All read/writes are multiple of 512. 4KB, 32KB, 128KB are favorites. Size and jump distance distribution changes with guest OS (file system).
  • Multiple persistent handles (4 or more) opened per VM instance.
  • Scenario 2 – Linux kernel compile
  • Command distribution. Consistent – 24% query info, 33% read, 33% write.
  • Command occurrence.
  • IO size: Lots of 4KB and 128KB reads. Big spikes in writes (lots of 128KB).
  • Jump distances: Read and write charts. Random IO, no real surprise.
  • Scenario 3 – Random file operations
  • Guest using SCSI virtual disk.
  • Command distribution. 62% read, 32% write, IOCTL: 3.1%  (request for resilient handles, trim), queryinfo 1.3%.
  • Random tree copy. IO Size: Lots of 4KB, some 1MB writes.
  • Jump – random tree copy. Nice and sequential. Looks like a spaceship, or the Eiffel tower sideways. Two threads contending for the same file? Metadata vs. data?
  • Closing thoughts. Heavy query info. Not as many large size IOs as expected. Multiple long lived file handles per guest.
  • Future work: Include SMB Direct, SMB Multichannel, Failover. Include more workloads. More generalized SMB3 workloads.

 That's it for this year. I'm posting this last update from the San Jose airport, on my way back to Redmond...