Here is a compilation of my live tweets from SNIA’s SDC 2012 (Storage Developers Conference).
You can also read those directly from twitter at http://twitter.com/josebarreto (in reverse order)

Notes and disclaimers

  • These tweets were typed during the talks and they include typos and my own misinterpretations.
  • Text under each talk consists of quotes from the speaker or text from the speaker's slides, not my personal opinion.
  • If you feel that I misquoted you or badly represented the content of a talk, please add a comment to the post.
  • I spent just limited time fixing typos or correcting the text after the event. Just so many hours in a day...
  • I have not attended all sessions (since there are 4 or 5 at a time, that would actually not be possible :-)…
  • SNIA usually posts the actual PDF decks a few weeks after the event. Attendees have access immediately.

Linux CIFS/SMB2 Kernel Clients - A Year In Review by Steven French, IBM

  • SMB3 will be important for Linux, not just Windows #sdc2012
  • Linux kernel supports SMB. Kernel 3.7 (Q4-2012) includes 71 changes related to SMB (including SMB 2.1), 3.6 has 61, 3.5 has 42
  • SMB 2.1 kernel code in Linux enabled as experimental in 3.7. SMB 2.1 will replace CIFS as the default client when stable.
  • SMB3 client (CONFIG_EXPERIMENTAL) expected by Linux kernel 3.8.
  • While implementing Linux client for SMB3, focusing on strengths: clustering, RDMA. Take advantage of great protocol docs

Multiuser CIFS Mounts, Jeff Layton, Red Hat

  • I attended this session, but tweeted just the session title.

How Many IOPS is Enough by Thomas Coughlin, Coughlin Associates

  • 79% of surveyed people said they need between 1K and 1M IOPs. Capacity: from 1GB to 50TB with sweet spot on 500GB.
  • 78% of surveyed people said hardware delivers between 1K and 1M IOPs, with a sweet spot at 100K IOPs. Matches requirements  
  • Minimum latency system hardware (before other bottleneck) ranges between >1sec to <10ns. 35% at 10ms latency.
  • $/GB for SSD and HDD both declining in parallel paths. $/GB roughly follows IOPs.
  • Survey results will be available in October...

SMB 3.0 ( Because 3 > 2 ) - David Kruse, Microsoft

  • Fully packed room to hear David's SMB3 talk. Plus a few standing in the back... pic.twitter.com/TT5mRXiT
  • Time to ponder: When should we recommend disabling SMB1/CIFS by default?

Understanding Hyper-V over SMB 3.0 Through Specific Test Cases with Jose Barreto

  • No tweets during this session. Hard to talk and tweet at the same time :-)

Continuously Available SMB – Observations and Lessons Learned - David Kruse and Mathew George.

  • I attended this session, but tweeted just the session title.

Status of SMB2/SMB3 Development in Samba, Michael Adam, Samba Team

  • SMB 2.0 officially supported in Samba 3.6 (about a year ago, August 2011)
  • SMB 2.1 work done in Samba for Large MTU, multi-credit, dynamic re-authentication
  • Samba 4.0 will be the release to incorporate SMB 3.0 (encryption and secure negotiate already done)

The Solid State Storage (R-)Evolution, Michael Krause, Hewlett-Packard

  • Storage (especially SSD) performance constrained by SAS interconnects
  • Looking at serviceability from DIMM to PCIe to SATA to SAS. Easy to replace x perfor
  • No need to re-invent SCSI. All OS, hypervisors, file systems, PCIe storage support SCSI.
  • Talking SCSI Express. Potential to take advantage of PCIe capabilities.
  • PCIe has benefits but some challenges: Non optimal DMA "caching", non optimal MMIO performance
  • everything in the world of storage is about to radically change in a few years: SATA, SAS, PCIe, Memory
  • Downstream Port Containment. OS informed of async communications lost.
  • OCuLink: new PCIe cable technology
  • Hardware revolution: stacked media, MCM / On-die, DIMM. Main memory in 1 to 10 TB. Everything in memory?
  • Express Bay (SFF 8639 connector), PCIe CEM (MMIO based semantics), yet to be developed modules
  • Media is going to change. $/bit, power, durability, performance vs. persistence. NAND future bleak.
  • Will every memory become persistent memory? Not sci-fi, this could happen in a few years...
  • Revolutionary changes coming in media. New protocols, new hardware, new software. This is only the beginning

Block Storage and Fabric Management Using System Center 2012 Virtual Machine Manager and SMI-S, Madhu Jujare, Microsoft

  • Windows Server 2012 Storage Management APIs are used by VMM 2012. An abstraction of SMI-S APIs.
  • SMAPI Operations: Discovery, Provisioning, Replication, Monitoring, Pass-thru layer
  • Demo of storage discovery and mapping with Virtual Machine Manager 2012.SP1. Using Microsoft iSCSI Target!

Linux Filesystems: Details on Recent Developments in Linux Filesystems and Storage by Chris Mason, Fusion-io

  • Many journaled file systems introduced in Linux 2.4.x in the early 2000s.
  • Linux 2.6.x. Source control at last. Kernel development moved more rapidly. Especially after Git.
  • Backporting to Enterprise. Enterprise kernels are 2-3 years behind mainline. Some distros more than others.
  • Why are there so many filesystems? Why not pick one? Because it's easy and people need specific things.
  • Where Linux is now. Ext4, XFS (great for large files). Btrfs (snapshots, online maintenance). Device Mapper.
  • Where Linux is now. CF (Compact Flash). Block. SCSI (4K, unmap, trim, t10 pi, multipath, Cgroups).
  • NFS. Still THE filesystem for Linux. Revisions introduce new features and complexity. Interoperable.
  • Futures. Atomic writes. Copy offload (block range cloning or new token based standard). Shingled drives (hybrid)
  • Futures. Hinting (tiers, connect blocks, IO priorities). Flash (seems appropriate to end here :-)

Non-volatile Memory in the Storage Hierarchy: Opportunities and Challenges by Dhruva Chakrabarti, HP

  • Will cover a few technologies coming the near future. From disks to flash and beyond...
  • Flash is a huge leap, but NVRAM presents even bigger opportunities.
  • Comparing density/retention/endurance/latency/cost for hdd/ssd (nand flash)/dram/nvram
  • Talking SCM (Storage Class Memory). Access choices: block interface or byte-addressable model.
  • Architectural model for NVRAM. Coexist with DRAM. Buffers/caches still there. Updates may linger...
  • Failure models. Fail-stop. Byzantine. Arbitrary state corruption. Memory protection.
  • Store to memory must be failure-atomic.
  • NVRAM challenges. Keep persistent data consistent. Programming complexity. Models require flexibility.
  • Visibility ordering requirements. Crash can lead to pointers to uninitialized memory, wild pointers.
  • Potential inconsistencies like persistent memory leaks. There are analogs in multi-threading.
  • Insert a cache line flush to ensure visibility in NVRAM. Reminiscent of a disk cache flush.
  • Many flavors of cache flushes. Intended semantics must be honored. CPU instruction or API?
  • Fence-based programming has not been well accepted. Higher level abstractions? Wrap in transactions?
  • Conclusion. What is the right API for persistent memory? How much effort? What's the implementation cost?
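
The visibility-ordering point above (a crash leaving a pointer to data that never reached NVRAM, and a cache line flush as the fix) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyNvram`, `publish_unordered` and `publish_ordered` are hypothetical names, and a Python dictionary stands in for the CPU cache. It is not the API or the hardware discussed in the talk.

```python
class ToyNvram:
    """Toy model (an assumption, not real hardware): stores sit in a
    volatile cache until a flush makes them durable."""

    def __init__(self):
        self.cache = {}        # lost on a crash
        self.persistent = {}   # survives a crash

    def store(self, addr, value):
        self.cache[addr] = value

    def flush(self, addr):
        # Stand-in for a cache line flush: write one dirty line back.
        if addr in self.cache:
            self.persistent[addr] = self.cache.pop(addr)

    def crash(self):
        # Power loss: everything still in the cache is gone.
        self.cache.clear()
        return dict(self.persistent)


def is_consistent(image):
    """The head pointer is either absent or refers to a persisted node."""
    head = image.get("head")
    return head is None or head in image


def publish_unordered():
    mem = ToyNvram()
    mem.store("node", {"value": 42})
    mem.store("head", "node")
    mem.flush("head")          # the pointer happens to reach NVRAM first
    return mem.crash()


def publish_ordered(crash_after_step):
    mem = ToyNvram()
    steps = [
        lambda: mem.store("node", {"value": 42}),
        lambda: mem.flush("node"),     # node is durable before it is published
        lambda: mem.store("head", "node"),
        lambda: mem.flush("head"),
    ]
    for step in steps[:crash_after_step]:
        step()
    return mem.crash()
```

In this model `publish_unordered()` leaves a persistent `head` that refers to a node that was never written back, while `publish_ordered(n)` stays consistent no matter after which step the crash happens.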

Building Next Generation Cloud Networks for Big Data Applications by Jayshree Ullal, Arista Networks

  • Agenda: Big Data Trends, Data Analytics, Hadoop.
  • 64-bit CPUs trends, Data storage trends. Moore's law is alive and well.
  • Memory hierarchy is not changing. Hard drives not keeping up, but Flash...
  • Moore's law for Big Data, Digital data doubling every 2 years. DAS/NAS/SAN not keeping up.
  • Variety of data. Raw, unstructured. Not enough minds around to deal with all the issues here.
  • Hadoop means the return of DAS. Racks of servers, DAS, flash cache, non-blocking fabric.
  • Hadoop. 3 copies of the data, one in another rack. Protect your main node, single point of failure.
  • Hadoop. Minimum 10Gb. Shift from north-south communications to east-west. Servers talking to each other.
  • From mainframe, to client/server, to Hadoop clusters.
  • Hadoop pitfalls. Not a layer 2 thing. Highly redundant, many paths, routing. Rack locality. Data integrity
  • Hadoop. File transfers in chunks and blocks. Pipelines replication east-west. Map and Reduce.
  • showing sample 2-rack solution. East-west interconnect is very important. Non-blocking. Buffering.
  • Sample conf. 4000 nodes. 48 servers per cabinet. High speed network backbone. Fault tolerant main node
  • Automating cluster provisioning. Script using DHCP for zero touch provisioning.
  • Buffer challenges. Dynamic allocations, survive micro bursts.
  • Advanced diagnostics and management. Visibility to the queue depth and buffering. Graph historical latency.
  • my power is running out. I gotta speak fast. :-)
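
The "3 copies of the data, one in another rack" rule above can be sketched as a small placement function. This is a simplified illustration, not Hadoop's actual placement code: `place_replicas`, the rack names and the node names are all hypothetical, and the only guarantee modeled is that the first copy stays in the writer's rack and at least one copy lands in a different rack.

```python
import random


def place_replicas(nodes_by_rack, writer_rack, copies=3, rng=None):
    """Pick `copies` distinct (rack, node) pairs (copies >= 2), with the
    first in the writer's rack and at least one in another rack."""
    rng = rng or random.Random(0)
    other_racks = [rack for rack in nodes_by_rack if rack != writer_rack]
    if not other_racks:
        raise ValueError("need at least two racks for rack-aware placement")

    # First copy: a node in the writer's own rack.
    placement = [(writer_rack, rng.choice(nodes_by_rack[writer_rack]))]

    # Second copy: a node in a different rack, so a rack failure is survivable.
    remote_rack = rng.choice(other_racks)
    placement.append((remote_rack, rng.choice(nodes_by_rack[remote_rack])))

    # Remaining copies: any node that does not already hold a copy.
    remaining = [
        (rack, node)
        for rack, nodes in nodes_by_rack.items()
        for node in nodes
        if (rack, node) not in placement
    ]
    placement += rng.sample(remaining, copies - len(placement))
    return placement
```

For example, with `{"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}` and a writer in `rack1`, the result always holds three distinct nodes spanning both racks, which is why the east-west traffic between racks matters so much.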

Windows File and Storage Directions by Surendra Verma, Microsoft

  • Landscape: pooled resources, self-service, elasticity, virtualization, usage-based, highly available
  • Industry-standard parts to build very high scale, performing systems. Greater number of less reliable parts.
  • Services influencing hardware. New technologies to address specific needs. Example: Hadoop.
  • OS storage built to address specific needs. Changing that requires significant effort.
  • You have to assume that disks and other parts will fail. Need to address that in software.
  • If you have 1000 disks in a system, some are always failing, you're always reconstructing.
  • ReFS: new file system in Windows 8, assumes that everything is unreliable underneath.
  • Other relevant features in Windows Server 2012: Storage Spaces, Clustered Shared Volumes, SMB Direct.
  • Storage Spaces provides resiliency to media failures. Mirror (2 or 3 way), parity, hot spares.
  • Shared Storage Spaces. Resiliency to node and path failures using shared SAS disks.
  • Storage Spaces is aware of enclosures, can tolerate failure of an entire enclosure.
  • ReFS provides resiliency to media failures. Never write metadata in place. Integrity streams checksum.
  • integrity Streams. User data checksum, validated on every read. Uses Storage Spaces to find a good copy.
  • Your own application can use an API to talk to Storage Spaces, find all copies of the data, correct things.
  • Resiliency to latent media errors. Proactively detect and correct, keeping redundancy levels intact.
  • ReFS can detect/correct corrupted data even for data not frequently read. Do it on a regular basis.
  • What if all copies are lost? ReFS will keep the volume online, you can still read what's not corrupted.
  • example configuration with 4 Windows Server 2012 nodes connected to multiple JBODs.
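
The integrity-stream idea above (checksum the user data, validate on every read, use another copy to fix a bad one) can be sketched in a few lines. This is a toy model under stated assumptions: `MirroredStore` is a hypothetical class, CRC32 is only a stand-in checksum, and in-memory dictionaries stand in for mirrored disks. It is not the ReFS or Storage Spaces implementation.

```python
import zlib


class MirroredStore:
    """Toy mirror: every write goes to all copies, checksum kept separately."""

    def __init__(self, copies=2):
        self.copies = [dict() for _ in range(copies)]
        self.checksums = {}

    def write(self, key, data):
        self.checksums[key] = zlib.crc32(data)
        for copy in self.copies:
            copy[key] = data

    def read(self, key):
        expected = self.checksums[key]
        # Validate on every read: look for any copy whose checksum matches.
        good = next(
            (c[key] for c in self.copies if zlib.crc32(c[key]) == expected),
            None,
        )
        if good is None:
            raise IOError("all copies of %r are corrupted" % key)
        # Repair copies that failed validation, restoring full redundancy.
        for copy in self.copies:
            if zlib.crc32(copy[key]) != expected:
                copy[key] = good
        return good
```

A read that finds one damaged copy returns the good data and quietly repairs the bad copy; only when every copy fails the checksum does the read report an error, which matches the "what if all copies are lost?" case in the notes.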

Hyper-V Storage Performance and Scaling with Joe Dai & Liang Yang, Microsoft

Joe Dai:

  • New option in Windows Server 2012: Virtual Fibre Channel. FC to the guest. Uses NPIV. Live migration just works.
  • New in WS2012: SMB 3.0 support in Hyper-V. Enables Shared Nothing Live Migration, Cross-cluster Live Migration.
  • New in WS 2012: Storage Spaces. Pools, Spaces. Thin provisioning. Resiliency.
  • Clustered PCI RAID. Host hardware RAID in a cluster setup.
  • Improved VHD format used by Hyper-V. VHDX. Format specification at http://www.microsoft.com/en-us/download/details.aspx?id=29681 Currently v0.95. 1.0 soon
  • VHDX: Up to 64TB. Internal log for resiliency. MB aligned. Larger blocks for better perf. Custom metadata support.
  • Comparing performance. Pass thru, fixed, dynamic, differencing. VHDX dynamic ~= VHD fixed ~= physical disk.
  • Offloaded Data Transfers (ODX). Reduces times to merge, mirror and create VHD/VHDX. Also works for IO inside the VM.
  • Hyper-V support for UNMAP. Supported on VHDX, Pass-thru. Supported on VHDX Virtual SCSI, Virtual FC, Virtual IDE.
  • UNMAP in Windows Server 2012 can flow from virtual IDE in VM to VHDX to SMB share to block storage behind share.

Liang Yang:

  • My job is to find storage bottlenecks in Hyper-V storage and hand over to Joe to fix them. :-)
  • Finding scale limits in Hyper-V synthetic SCSI IO path in WS2008R2. 1 VSP thread, 1 VMBus channel per VM, 256 queue depth per
  • WS2012: From 4 VPs per VM to 64 VP per VM. Multi-threaded IO model. 1 channel per 16 VPs. Breaks 1 million IOPs.
  • Huge performance jump in WS2012 Hyper-V. Really close to physical even with high performance storage.
  • Hyper-V Multichannel (not to be confused with SMB Multichannel) enables the jump on performance.
  • Built 1 million IOPs setup for about $10K (excluding server) using SSDs. Demo using IOmeter. Over 1.22M IOPs...

The Virtual Desktop Infrastructure Storage Behaviors and Requirements with Spencer Shepler, Microsoft

  • Storage for Hyper-V in Windows Server 2012: VHDX, NTFS, CSV, SMB 3.0.
  • Review of SMB 3.0 advantages for Hyper-V: active recovery, Multichannel, RDMA.
  • Showing results for SMB Multichannel with four traditional 10GbE. Line rate with 64KB IOs. CPU bound with 8KB.
  • Files used by Hyper-V. XML, BIN, CSV, VHD, VHDX, AVHDX. Gold, diff and snapshot disk relationships.
  • improvements in VHDX. Up to 64TB size. 4KB logical sector size, 1MB alignment for allocations. UNMAP. TRIM.
  • VDI: Personal desktops vs. Pooled desktops. Pros and cons.
  • Test environment. WS2012 servers. Win7 desktops. Login VSI http://www.loginvsi.com - 48 10K rpm HDD.
  • Workload. Copy, word, print pdf, find/replace, zip, outlook e-mail, ppt, browsing, freemind. Realistic!
  • Login VSI fairly complex to setup. Login frequency 30 seconds. Workload started "randomly" after login.
  • Example output from Login VSI. Showing VSI Max.
  • Reading of BIN file during VM restore is sequential. IO size varies.
  • Gold VHDX activity. 77GB over 1 hour. Only reads, 512 bytes to 1MB size IOs. 25KB average. 88% are <=32KB
  • Distribution for all IO. Reads are 90% 64KB or less. Writes mostly 20KB or less.
  • AVHD activity 1/10 read to write ratio. Flush/write is 1/10. Range 512 bytes to 1MB. 90% are 64KB or less.
  • At the end of test run for 1 hour with 85 desktops. 2000 IOPs from all 85 VMs, 2:1 read/write ratio.
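
The closing numbers above (2000 IOPs from 85 desktops at a 2:1 read/write ratio) are easier to use for sizing when broken down per desktop. A minimal arithmetic sketch; `per_desktop_iops` is a hypothetical helper name, and the result only describes this one test run, not VDI workloads in general.

```python
def per_desktop_iops(total_iops, desktops, read_parts=2, write_parts=1):
    """Split an aggregate IOPs figure into per-desktop read and write IOPs."""
    per_vm = total_iops / desktops
    reads = per_vm * read_parts / (read_parts + write_parts)
    writes = per_vm - reads
    return {"total": per_vm, "read": reads, "write": writes}


result = per_desktop_iops(total_iops=2000, desktops=85)
print({name: round(value, 1) for name, value in result.items()})
```

With the figures from the talk this works out to roughly 23.5 IOPs per desktop, about two thirds of them reads.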

SQL Server: Understanding the Data Workload by Gunter Zink, Microsoft (original title did not fit a tweet)

  • Looking at OLTP and data warehousing workloads. What's new in SQL Server 2012.
  • Understanding SQL Server. Store and retrieve structured data, Relational, ACID, using schema.
  • Data organized in tables. tables have columns. Tables stored in 8KB pages. Page size fixed, not configurable.
  • SQL Server Datafile. Header, GAM page (bitmap for 4GB of pages), 4GB of pages, GAM page, 4GB of pages, etc...
  • SQL Server file space allocated in extents. An extent is 8 pages or 64KB. Parameter for larger extent size.
  • SQL Server log file: Header, log records (512 bytes to 60KB). Checkpoint markers. Truncated after backup.
  • If your storage reports 4KB sector size, minimum log write for SQL Server is 4KB. Records are padded.
  • 2/3 of SQL Servers run OLTP workloads. Many active users, lightweight transactions.
  • Going over what happens when you run OLTP. Read cache or read disk, write log to disk and mark page as dirty
  • Log buffer. Circular buffer, no fixed size. One buffer written to disk, another being filled with changes.
  • If storage is not fast enough, writing the log takes longer and the buffer of changes grows larger.
  • Lazy writer. Writes dirty pages to disk (memory pressure). Checkpoint: Writes pages, marks log file (time limit)
  • Checkpoint modes: Automatic, Indirect, Manual. Write rate reduced if latency reaches 20ms (can be configured)
  • Automatic SQL Checkpoint. Write intensity controlled by recovery interval. Default is 0 = every two minutes.
  • New in SQL Server 2012. Target_Recovery_Time. Makes checkpoint less spikey by constantly writing dirty pages.
  • SQL Server log file. Change records in sequence. Mostly just writes. Except in recovery or transaction rollback.
  • Data file IO. 8KB random reads, buffered (based on number of user queries). Can be done in 64KB at SQL start up.
  • Log file IO: unbuffered small sequential writes (depends on how many inserts/updates/deletes).
  • About 80% of SQL Server performance problems are storage performance problems. Not enough spindles or memory.
  • SQL Server problems. 20ms threshold too high for SSDs. Use -k parameter to limit (specified in MB/sec)
  • Issues. Checkpoint floods array cache (20ms). Cache de-staging hurts log drive write performance.
  • Log writes must go to disk, no buffering. Data writes can be buffered, since it can recover from the log.
  • SQL Server and Tiered Storage. We probably won't read what we've just written.
  • Data warehouse. Read large amounts of data, mostly no index, table scans. Hourly or daily updates (from OLTP).
  • Understanding a data warehouse query. Lots of large reads. Table scans and range scans. Reads: 64KB up to 512KB.
  • DW. Uses TempDB to handle intermediate results, sort. Mostly 64KB writes, 8KB reads. SSDs are good for this.
  • DW common problems: Not enough IO bandwidth. 2P server can ingest 10Gbytes/sec. Careful with TP, pooled LUNs.
  • DW common problems. Arrays don't read from multiple mirror copies.
  • SMB file server and SQL Server. Limited support in SQL Server 2008 R2. Fully supported with SQL Server 2012.
  • I got my fastest data warehouse performance using SMB 3.0 with RDMA. Also simpler to manage.
  • Comparing steps to update SQL Server with Fibre Channel and SMB 3.0 (many more steps using FC).
  • SQL Server - FC vs. SMB 3.0 connectivity cost comparison. Comparing $/MB/sec with 1GbE, 10GbE, QDR IB, 8G FC.
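
The page, extent and log-record sizes quoted above can be turned into a quick arithmetic sketch. The constants come from the notes (8KB pages, 8 pages per extent, a GAM page per 4GB of pages, log records padded to the sector size); the helper names are hypothetical, and treating the GAM interval as exactly 4GB is a simplifying assumption.

```python
PAGE_BYTES = 8 * 1024                         # fixed 8KB page
PAGES_PER_EXTENT = 8
EXTENT_BYTES = PAGE_BYTES * PAGES_PER_EXTENT  # 64KB
GAM_INTERVAL_BYTES = 4 * 1024 ** 3            # "4GB of pages", taken as exact


def extents_per_gam_interval():
    """Number of 64KB extents in one 4GB stretch of the data file."""
    return GAM_INTERVAL_BYTES // EXTENT_BYTES


def padded_log_write(record_bytes, sector_bytes):
    """Size of a log write once the record is padded to whole sectors."""
    sectors = -(-record_bytes // sector_bytes)   # ceiling division
    return sectors * sector_bytes


print(EXTENT_BYTES, extents_per_gam_interval())
print(padded_log_write(512, 512), padded_log_write(512, 4096))
```

The last line shows the sector-size effect from the notes: the same 512-byte log record costs a 512-byte write on 512-byte sectors but a full 4KB write when the storage reports 4KB sectors.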

The Future of Protocol and SMB2/3 Analysis with Paul Long, Microsoft

  • We'll talk about Message Analyzer. David is helping.
  • Protocol Engineering Framework
  • Like Network Monitor. Modern message analysis tool built on the Protocol Engineering Framework
  • Source for Message Analyzer can be network packets, ETW events, text logs, other sources. Can validate messages.
  • Browse for message sources, Select a subset of messages, View using a viewer like a grid..
  • New way of viewing starting from the top down, instead of the bottom up in NetMon.
  • Unlike NetMon, you can group by any field or message property. Also payload rendering (like JPG)
  • Switching to demo mode...
  • Guidance shipped online. Starting with the "Capture/Trace" option.
  • Trace scenarios: NDIS, Firewall, Web Proxy, LAN, WLAN, Wifi. Trace filter as well.
  • Doing a link layer capture (just like old NetMon). Start capture. Generate some web traffic.
  • Stop the trace. Group by module. Look at all protocols. Like HTTP. Drill in to see operations.
  • Looking at operations. HTTP GET. Look at the details. High level stack view.
  • Now grouping on both protocol and content type. Easily spots pictures over HTTP. Image preview.
  • Easier to see time elapsed per operation when you group messages. You dig to individual messages
  • Now looking at SMB trace. Trace of a file copy. Group on the file name (search for the property)
  • Now grouped on SMB.Filename. You can see all SMB operations to copy a specific file.
  • Now looking at a trace of SMB file copy to an encrypted file share.
  • Built in traces to capture from the client side or server side. Can do full PDU or header only.
  • This can also be used to capture SMB Direct data, using the SMB client trace.
  • Showing the trace now with both network traffic and SMB client trace data (unencrypted).
  • Want to associate the wire capture with the SMB client ETW trace? Use the message ID
  • Showing mix of firewall trace and SMB client ETW trace. You see it both encrypted and not.
  • SMB team at Microsoft is the first to add native protocol unit tracing. Very useful...
  • Most providers have ETW debug logging but not the actual messages.
  • You can also get the trace with just NetSh or LogMan and load the trace in the tool later.
  • We also can deploy the tool and use PowerShell to start/stop capture.
  • If the event provider offers them, you can specify level and keywords during the capture.
  • Add some files (log file and wireshark trace). Narrow down the time. Add selection filter.
  • Mixing wireshark trace with a Samba text log file (pattern matching text log).
  • Audience: As a Samba hacker, Message Analyzer is one of the most interesting tools I have seen!
  • Jaws are dropping as Paul demos analyzing a trace from WireShark + Samba taken on Linux.
  • Next demo: visualizations. Two separate file copies. Showing summary view for SMB reads/writes
  • Looking at a graph of bytes/second for SMB reads and writes. Zooming into a specific time.
  • From any viewer you should be able to do any kind of selection and then launch another viewer.
  • If you're a developer, you can create a very sophisticated viewer.
  • Next demo: showing the protocol dashboard viewer. Charts with protocol bars. Drills into HTTP.

Storage Systems for Shingled Disks, with Garth Gibson, Panasas

  • Talking about disk technology. Reaction of HDD to what's going with SSDs.
  • Kryder's law for magnetic disks. Expectation is that disks will cost next to nothing.
  • High capacity disk. As bits get smaller, the bit might not hold its orientation 10 years later.
  • Heat assisted to make it possible to write, then keep it longer when cold. Need to aim that laser precisely..
  • New technology. Shingled writing. Write head is wider than read head. Density defined by read head, not write head.
  • As you write, you overwrite a portion of what you wrote before, but you can still read it.
  • Shingled can be done with today's heads with minor changes, no need to wait for heat assisted technology.
  • Shingled disks. Large sequential writes. Disks become tape!!
  • Hard to see just the one bit. Safe plan is to see the bit from slightly different angles and use signal processing.
  • if aiming at 3x the density: cross talk. Signal processing using 2 dimensions TDMR. 3-5 revs to read a track.
  • Shingled disks. Initial multiplier will be a factor of 2. Seek 10nm instead of 30 nm. Wider band with sharp edges.
  • Write head edge needs to be sharp on one side, where the tracks will overlap. Looking at different widths.
  • Areal density favors large bands that overlap. Looking at some math that proves this.
  • You could have a special place in the disk with no shingles for good random write performance, mixed with shingled.
  • Lots of questions on shingled disks. How to handle performance, errors, etc.
  • Shingled disks. Same problem for Flash. Shingled disks - same algorithms as Flash.
  • modify software to avoid or minimize read, modify, write. Log structured file systems are 20 years old.
  • Key idea is that disk attribute says "sequential writing". T13 and t10 standards.
  • Shingled disks. Hadoop as initial target. Project with mix of shingled and unshingled disks. Could also be SSD+HDD.
  • Prototype banded disk API. Write forward or move back to 0. Showing test results with new file system.
  • future work. Move beyond hadoop to general workloads, hurts with lots of small files. Large files ok.
  • future work. Pack metadata. All of the metadata into tables, backed on disk by large blob of changes.
  • Summary of status. Appropriate for Big Data. One file = one band. Hadoop is write once. Next steps: pack metadata.
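
The "write forward or move back to 0" behavior of the prototype banded disk API mentioned above can be sketched as an append-only band with a write pointer. This is only an illustration of the idea: the `Band` class and its method names are hypothetical and are not the prototype's actual interface.

```python
class Band:
    """One shingled band: data can only be appended at the write pointer,
    or the whole band can be reset so writing restarts at offset 0."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray(capacity)
        self.write_pointer = 0

    def append(self, payload):
        # Writes only move forward; there is no overwrite in place.
        if self.write_pointer + len(payload) > self.capacity:
            raise ValueError("band is full")
        start = self.write_pointer
        self.data[start:start + len(payload)] = payload
        self.write_pointer += len(payload)
        return start

    def read(self, offset, length):
        # Only data below the write pointer is valid.
        if offset < 0 or offset + length > self.write_pointer:
            raise ValueError("read beyond the write pointer")
        return bytes(self.data[offset:offset + length])

    def reset(self):
        # Move back to 0: everything previously in the band is discarded.
        self.write_pointer = 0
```

The "one file = one band" summary fits this model well: a write-once Hadoop file is appended to its band sequentially, and deleting the file is just a reset of that band.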

The Big Deal of Big Data to Big Storage with Benjamin Woo, Neuralytix

  • Can't project to both screens because laptop does not have VGA. Ah, technology... Will use just right screen.
  • Even Batman is into big data. ?!
  • What's the big picture for big data. Eye chart with lots of companies, grouped into areas...
  • We have a problem with storage/data processing today. Way too many hops. (comparing to airline routes ?!)
  • Sample path: Oracle to Informatica to MicroStrategy and Hadoop. Bring them together. Single copy of "the truth".
  • Eliminate the process of ETL. Eliminate the need for exports. Help customers to find stuff in the single copy.
  • You are developers. You need to find a solution for this problem. Do you buy into this?
  • Multiple copies OK for redundancy or performance, but shouldn't it all be same source of truth?
  • Single copy of the truth better for discovery. Don't sample, don't summarize. You will find more than you expect.
  • We're always thinking about the infrastructure. Remove yourself from the hardware and think about the data!
  • The challenge is how to think about the data. Storage developers can map that to the hardware.
  • Send complaints to /dev/null. Tweet at @BenWooNY
  • Should we drop RDBMS altogether? Should we add more metadata to them? Maybe.
  • Our abstractions are already far removed from the hardware. Think virtual disks in VM to file system to SAN array.
  • Software Defined Storage is something we've been doing for years in silicon.
  • Remember what we're here for. It's about the data. Otherwise there is no point in doing storage.
  • Is there more complexity in having a single copy of the truth? Yes, but that is part of what we do! We thrive there!
  • Think about Hadoop. They take on all the complexity and use dumb hardware. That's how they create value!

Unified Storage for the Private Cloud with Dennis Chapman, NetApp

  • 10th anniversary of SMI-S. Also 10th anniversary of pirate day. Arghhh...
  • application silos to virtualization to private clouds (plus public and hybrid clouds)
  • Focusing on the network. Fundamentally clients talking to storage in some way...
  • storage choices for physical servers. Local (DAS) and remote (FC, iSCSI, SMB). Local for OS, remote for data.
  • Linux pretty much the same as Windows. Difference is NFS instead of SMB. Talking storage affinities.
  • Windows OS. Limited booting from iSCSI and FC. Mostly local.
  • Windows. Data mostly on FC and iSCSI, SMB still limited (NFS more well established on Linux).
  • shifting to virtualized workloads on Windows. Opts for local and remote. More choices, storage to the guest.
  • Virtualized workloads are the #1 configuration we provide storage for.
  • Looking at options for Windows and Linux guests, hosted on both VMware and Hyper-V hosts. Table shows options
  • FC to the guest. Primary on Linux, secondary on Windows. Jose: FC to the guest new in WS2012.
  • File storage (NFS) primary on Linux, but secondary on Windows (SMB). Jose: again, SMB support new in WS2012.
  • iSCSI secondary for Linux guest, but primary for Windows guests.
  • SMB still limited right now, expect it to grow. Interested in how it will play, maybe as high as NFS on Linux
  • Distributed workload state. Workload domain, hypervisors domain, storage domain.
  • Guest point in time consistency. Crash consistency or application consistency. OS easier, applications harder
  • Hibernation consistency. Put the guest to sleep and snapshot. Works well for Windows VMs. Costs time.
  • Application consistency. Specific APIs. VSS for Windows. I love this! Including remote VSS for SMB shares.
  • Application consistency for Linux. Missing VSS. We have to do specific things to make it work. Not easy.
  • hypervisors PIT consistency. VMware, cluster file system VMFS. Can store files on NFS as well.
  • Hypervisors PIT for Hyper-V. Similar choices with VHD on CSV. Also now option for SMB in WS2012.
  • Affinities and consistency. Workload domain, Hypervisors domain and Storage domain backups. Choices.
  • VSS is the major difference between Windows and Linux in terms of backup and consistency.
  • Moving to the Storage domain. Data ONTAP 8 Clustering. Showing 6-node filer cluster diagram.
  • A NetApp Vserver owns a set of FlexVols, which contain close objects (either LUN or file).
  • Sample workflow with NetApp with remote SMB storage. Using remote VSS to create a backup using clones.
  • Sample workflow. App consistent backup from a guest using an iSCSI LUN.
  • Showing eye charts with integration with VMware and Microsoft.
  • Talking up the use of PowerShell, SMB when integrating with Microsoft.
  • Talk multiple protocols, rich services, deep management integration, highly available and reliable.

SNIA SSSI PCIe SSD Round Table. Moderator + four members.

  • Introductions, overview of SSSI PCIe task force and committee.
  • 62 companies in the last conference. Presentations available for download. http://www.snia.org/forums/sssi/pcie
  • Covering standards, sites and tools available from the group. See link posted
  • PCIe SSDs look just like other drives, but there are differences. Bandwidth is one of them.
  • Looking at random 4KB write IOPs and response time for different types of disks: HDD, MLC, SLC, PCIe.
  • Different SSD tech offer similar response rates. Some high latencies due to garbage collection.
  • comparing now DRAM, PCIe, SAS and SATA. Lower latencies in first two.
  • Comparing CPU utilization. From less than 10% to over 50%. What CPU utilization to achieve IOPs...
  • Other system factors. Looking at CPU affinity effect on random 4KB writes... Wide variation.
  • Performance measurement. Response time is key when testing PCIe SSDs. Power mgmt? Heat mgmt? Protocol effect on perf?
  • Extending the SCSI platform for performance. SCSI is everywhere in storage.
  • Looking at server attached SSDs and how much is SATA, SAS, PCIe, boot drive. Power envelope is a consideration.
  • SCSI is everywhere. SCSI Express protocol for standard path to PCIe. SoP (SCSI over PCIe). Hardware and software.
  • SCSI Express: Controllers, Drive/Device, Drivers. Express bay connector. 25 watts of power.
  • Future: 12Gbps SAS in volume at the end of 2013. Extended copy feature. 25W devices. Atomic writes. Hinting. SCSI Express.
  • SAS controllers > 1 million IOPs and increased power for SAS. Reduces PCIe SSD differentiation. New form factors?
  • Flash drives: block storage or memory.
  • Block versus Memory access. Storage SSDs, PCIe SSDs, memory class SCM compared in a block diagram. Looking at app performance
  • optimization required for apps to realize the memory class benefits. Looking at ways to address this.
  • Open industry directions. Make all storage look like SCSI or offer apps other access models for storage?
  • Mapping NVMExpress capability to SCSI commands. User-level abstractions. Enabling SCM by making it easy.
  • Panel done with introductions. Moving to questions.
  • How is Linux support for this? NVMExpress driver is all that exists now.
  • How much of the latency is owned by the host and the PCIe device? Difficult to answer. Hardware, transport, driver.
  • Comparing to DRAM was excellent. That was very helpful.
  • How are form factors moving forward? 2.5" HDD format will be around for a long time. Serviceability.
  • Memory like access semantics - advantages over SSDs. Lower overhead, lots in the hardware.
  • Difference between NVMe and SOP/PQI? Capabilities voted down due to complexity.
  • What are the abstractions like? Something like a file? NVMe has a namespace. Atomic write is a good example. How to overlay?
  • It's easy to use just a malloc, but it's a cut the block, run with memory. However, how do you transition?

NAS Management using System Center 2012 Virtual Machine Manager and SMI-S with Alex Naparu and Madhu Jujare

  • VMM for Management of Virtualized Infrastructure: VMM 2012 SP1 covers block storage and SMB3 shares
  • Lots of SMB 3.0 sessions here at SDC...
  • VMM offers to manage your infrastructure. We'll be focusing on storage. Lots enabled by Windows Server 2012.
  • There's an entire layer in Windows Server 2012 dedicated to manage storage. Includes translation of WMI to SMI-S
  • All of this can be leveraged using PowerShell.
  • VMM NAS Management: Discovery (Servers, Systems, Shares), Creation/Removal (Systems, Shares), Share Permissions
  • How did we get there? With a lot of help from our partners. Kick-off with EMC and NetApp. More soon. Plugfests.
  • Pre-release providers. If you have any questions on the availability of providers, please ask EMC and NetApp.
  • Moving now into demo mode. Select provider type. Specify discovery scope. Provide credentials. Discovering...
  • Discovered some block storage and file storage. Some providers expose one of them, some expose both.
  • Looking at all the pools and all the shares. Shallow discovery at first. After selection, we do deep discovery.
  • Each pool is given a tag, called classification. Tagged some as Gold, some as Platinum. Finishing discovery.
  • Deep discovery completed. Looking at the Storage tree in VMM, with arrays, pools, LUNs, file shares.
  • Now using VMM to create a file share. Provide a name, description, file server, storage pool and size.
  • Creates a logical disk in the pool, format with a file system, then create a file share. All automated.
  • Now going to a Hyper-V host, add a file share to the host using VMM. Sets appropriate permissions for the share.
  • VMM also checks the file access is good from that host.
  • Now let's see how that works for Windows. Add a provider, abstracted. Using WMI, not SMI-S. Need credentials.
  • Again, shows all shares, select for deep discovery. Full management available after that.
  • Now we can assign Windows file share to the host, ACLs are set. Create a share. All very much the same as NAS.
  • VMM also verifies the right permissions are set. VMM can also repair permission to the share if necessary.
  • Now using VMM to create a new VM on the Windows SMB 3.0 file share. Same as NAS device with SMB 3.0.
  • SMI-S support. Basic operations supported on SMI-S 1.4 and later. ACL management requires SMI-S 1.6.
  • SMI-S 1.4 profiles: File server, file share, file system discovery, file share creation, file share removal.
  • Listing profiles that are required for SMI-S support with VMM. Partial list: NAS Head, File System, File Export
  • SMI-S defines a number of namespaces. "Interop" namespace required. Associations are critical.
  • Details on discovery: namespaces, protocol support. Filter to get only SMB 3.0 shares.
  • Discovery of File Systems. Reside on logical disks. That's the tie from file storage to block storage.
  • Different vendors have different ways to handle File Systems. Creating a new one is not trivial. Another profile.
  • VMM creates the file system and file share in one step. Root of FS is the share. Keeping things simple.
  • Permissions management. Integrated with Active Directory. Shares "registered" with Hyper-V host. VMM adds ACLs.
  • Demo of VMM specific PowerShell walking the hierarchy from the array to the share and back.
  • For VMM, NAS device and SMI-S must be integrated with Active Directory. Simple Identity Management Subprofile.
  • CIM Passthrough API. WMI provider can be leveraged via code or PowerShell.

SMB 3, Hyper-V and ONTAP, Garrett Mueller, NetApp

  • Senior Engineer at NetApp focused on CIFS/SMB.
  • What we've done with over 30 developers: features, content for Windows Server 2012. SMB3, Witness, others.
  • Data ONTAP cluster-mode architecture. HA pairs with high speed interconnect. disk "blade" in each node.
  • Single SMB server spread across multiple nodes in the cluster. Each an SMB server with same configuration
  • Each instance of the SMB server in a node has access to the volumes.
  • Non-disruptive operations. Volume move (SMB1+). Logical Interface move (SMB2+). Move node/aggregate (SMB3).
  • We did not have a method to preserve the locks between nodes. That was disruptive before SMB3.
  • SMB 3 and Persistent Handles. Showing two nodes and how you can move a persistent SMB 3 handle.
  • Witness can be used in lots of different ways. Completely separate protocol. NetApp scoped it to an HA pair.
  • Diagram explaining how NetApp uses Witness protocol with SMB3 to discover, monitor, report failure.
  • Remote VSS. VSS is Microsoft's solution for app consistent snapshot. You need to back up your shares!
  • NetApp implemented a provider for Remote VSS for SMB shares using the documented protocol. Showing workflow.
  • All VMs within a share are SIS cloned. SnapManager does backup. After done, temp SIS clones are removed.
  • Can a fault occur during a backup? If there is a failure, the backup will fail. Not protected in that way.
  • Offloaded Data Transfer (ODX). Intra-volume: SIS clones. Inter-volume/inter-node: back-end copy engine.
  • ODX: The real benefit is in the fact that it's used by default in Windows Server 2012. It just works!
  • ODX implications for Hyper-V over SMB: Rapid provisioning, rapid storage migrations, even disk within a VM.
  • Hyper-V over SMB. Putting it all together. Non-disruptive operations, Witness, Remote VSS, ODX.
  • No NetApp support for SMB Multichannel or SMB Direct (RDMA) with SMB 3.

Design and Implementation of SMB Locking in a Clustered File System with Aravind Velamur Srinivasan, EMC - Isilon

  • Part of SMB team at EMC/Isilon. Talk agenda covers OneFS and its distributed locking mechanism.
  • Overview of OneFS. NAS file server, scalable, 8x mirror, +4 parity. 3 to 144 nodes, using commodity hardware.
  • Locking: avoid multiple writers to the same file. Potentially in different file server nodes.
  • DLM challenges: Performance, multiple protocols and requirements. Expose appropriate APIs.
  • Diagram explaining the goals and mechanism of the Distributed Locking Manager (DLM) in Isilon's OneFS
  • Going over requirements of the DLM. Long list...

Scaling Storage to the Cloud and Beyond with Ceph with Sage Weil, Inktank

  • Trying to catch up with ongoing talk on ceph. Sage Weil talks really fast and uses dense slides...
  • Covering RADOS block device being used by virtualization, shared storage. http://ceph.com/category/rados/
  • Covering ceph-fs. Metadata and data paths. Metadata server components. Combined with the object store for data.
  • Legacy metadata storage: bad. Ceph-fs metadata does not use block lists or inode tables. Inode in directory.
  • Dynamic subtree partitioning very scalable. Hundreds of metadata servers. Adaptive. Preserves locality.
  • Challenge: dealing with metadata IO. Use metadata server as cache, prefetch dir/inode. Large journal or log.
  • What is journaled? Lots of state. Sessions, metadata changes. Lazy flush.
  • Client protocol highly stateful. Metadata servers, direct access to OSDs.
  • explaining the ceph-fs workflow using ceph-mon, ceph-mds, ceph-osd.
  • Snapshots. Volume and subvolume unusable at petabyte scale. Snapshot arbitrary directory
  • client implementations. Linux kernel client. Use Samba to reexport as CIFS. Also NFS and Hadoop.
  • Current status of the project: most components: status=awesome. Ceph-fs nearly awesome :-)
  • Why do it? Limited options for scalable open source storage. Proprietary solutions expensive.
  • What to do with hard links? They are rare. Using auxiliary table, a little more expensive, but works.
  • How do you deal with running out of space? You don't. Make sure utilization on nodes balanced. Add nodes.

Introduction to the last day

  • Big Data is like crude oil, it needs a lot of refining and filtering...
  • Growing from 2.75 Zettabytes in 2012 to 8 ZB in 2015. Nice infographic showing projected growth...

The Evolving Apache Hadoop Eco System - What It Means for Big Data and Storage Developers, Sanjay Radia, Hortonworks

  • One of the surprising things about Hadoop is that it does not use RAID on the disks. It does surprise people.
  • Data is growing. Lots of companies developing custom solutions since nothing commercial could handle the volume.
  • web logs with terabytes of data. Video data is huge, sensors. Big Data = transactions + interactions + observations.
  • Hadoop is commodity servers, JBOD disks, horizontal scaling. Scale from small to clusters of thousands of servers.
  • Large table with use cases for Hadoop. Retail, intelligence, finance, ...
  • Going over classic processes with ETL, BI, Analytics. A single system cannot process huge amounts of data.
  • Big change is introducing a "big data refinery". But you need a platform that scales. That's why we need Hadoop.
  • Hadoop can use a SQL engine, or you can do key-value store, NoSQL. Big diagram with Enterprise data architecture.
  • Hadoop offers a lot of tools. Flexible metadata services across tools. Helps with the integration, format changes.
  • Moving to Hadoop and Storage. Looking at diagram showing racks, servers, 6k nodes, 120PB. Fault tolerant, disk or node
  • manageability. One operator managing 3000 nodes! Same boxes do both storage and computation.
  • Hadoop uses very high bandwidth. Ethernet or InfiniBand. Commonly uses 40GbE.
  • Namespace layer and Block storage layer. Block pool is a set of blocks, like a LUN. Dir/file abstraction on namespace.
  • Data is normally accessed locally, but can pull from any other servers. Deals with failures automatically.
  • Looking at HDFS. Goes back to 1978 paper on separating data from function in a DFS. Lustre, Google, pNFS.
  • I attribute the use of commodity hardware and replication to the GoogleFS. Circa 2003. Non-posix semantics.
  • Computation close to data is an old model. Map Reduce.
  • Significance of not using disk RAID. Replication factor of Hadoop is 3. Node can be fixed when convenient.
  • HDFS recovers at a rate of 12GB in minutes, done in parallel. Even faster for larger clusters. Recovers automatically.
  • Clearly there is an overhead. It's 3x instead of much less for RAID. Used only for some of the data.
  • Generic storage service opportunities for innovation. Federation, partitioned namespace, independent block pools.
  • Archival data. Where should it sit? Hadoop encourages keeping old data for future analysis. Hot/ cold? Tiers? Tape?
  • Two versions of Hadoop. Hadoop 1 (GA) and Hadoop 2 (alpha). One is stable. Full stack HA work in progress.
  • Hadoop full stack HA architecture diagram. Slave nodes layer + HA Cluster layer. Improving performance, DR, upgrades.
  • upcoming features include snapshots, heterogeneous storage (flash drives), block grouping, other protocols (NFS).
  • Which Apache Hadoop distro should you use? Little marketing of Hortonworks. Most stable version of components.
  • It's a new product. At yahoo we needed to make sure we did not lose any data. Needs it to be stable.
  • Hadoop changes the game. Cost, storage and compute. Scales to very very large. Open, growing ecosystem, no lock in.
  • Question from the audience. What is Big Data? What is Hadoop? You don't need to know what it is, just buy it :-)
  • Sizing? The CPU performance and disk performance/capacity varies a lot. 90% of disk performance for sequential IO.
  • Question: Security? Uses Kerberos authentication, you can connect to Active Directory. There is a paper on this.
  • 1 name node to thousands of nodes, 200M files. Hadoop moving to more name nodes to match the capacity of working set.
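
A quick back-of-the-envelope check on the "3x instead of much less for RAID" overhead. This is my own arithmetic, not from the talk. The replication factor of 3 is from the slides; the 8+2 parity layout is only an assumed example of a RAID-6 style group, and the 100 TB figure is hypothetical.

```python
def raw_capacity_replication(usable_tb, replicas=3):
    """Raw capacity needed when every block is stored `replicas` times."""
    return usable_tb * replicas


def raw_capacity_parity(usable_tb, data_disks, parity_disks):
    """Raw capacity needed for a parity group of data_disks + parity_disks."""
    return usable_tb * (data_disks + parity_disks) / data_disks


if __name__ == "__main__":
    usable = 100  # TB of user data (hypothetical figure)
    print("3x replication:", raw_capacity_replication(usable), "TB raw")
    print("8+2 parity    :", raw_capacity_parity(usable, 8, 2), "TB raw")
```

The trade-off the speaker described is that the extra raw capacity buys parallel recovery and the freedom to fix a failed node when convenient.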

Primary Data Deduplication in Windows Server 2012 with Sudipta Sengupta, Jim Benton

Sudipta Sengupta:

  • Growing file storage market. Dedup is the #1 feature customers asking for. Lots of acquisitions in dedup space.
  • What is deduplication, how to do it. Content based chunking using a sliding window, computing hashes. Rabin method.
  • Dedup for data at rest, data on the wire. Savings in your primary storage more valuable, more expensive disks...
  • Dimensions of the problem: Primary storage, locality, service data to components, commodity hardware.
  • Extending the envelope from backup scenarios only to primary deduplication.
  • Key design decisions: post-processing, granularity and chunking, scale slowly with data size, crash consistent
  • Large scale study of primary datasets. Table with different workloads, chunking.
  • Looking at whole-file vs. sub-file. Decided early on to do chunking. Looking at chunk size. Compress the chunks!
  • Compression is more efficient on larger chunk sizes. Decided to use larger chunk size, pays off in metadata size.
  • You don't want to compress unless there's a bang for the buck. 50% of chunks = 80% of compression savings.
  • Basic version of the Rabin fingerprinting based chunking. Large chunks, but more uniform chunk size distribution
  • In Windows average chunk size is 64KB. Jose: Really noticing this guy is in research :-) Math, diagrams, statistics
  • Chunk indexing problem. Metadata too big to fit in RAM. Solution via unique chunk index architecture. Locality.
  • Index very frugal on both memory usage and IOPs. 6 bytes of RAM per chunk. Data partitioning and reconciliation.
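
For readers who have not seen content-based chunking before, here is a minimal Python sketch of the general idea (mine, not the Windows Server implementation). It uses a simple polynomial rolling hash as a stand-in for Rabin fingerprinting, and the window size, chunk size limits and boundary mask are all assumed values chosen to keep the example small. The `store` and `stream_maps` names only loosely mirror the chunk store and stream map mentioned in the talk.

```python
import hashlib

WINDOW = 48             # sliding window size in bytes (assumed)
BASE = 257              # polynomial base for the rolling hash (assumed)
MOD = (1 << 61) - 1     # large prime modulus
MASK = (1 << 12) - 1    # cut when the low 12 bits of the hash are zero (assumed)
MIN_CHUNK = 1024        # assumed minimum chunk size
MAX_CHUNK = 16384       # assumed maximum chunk size


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)


def dedup(files):
    """Return (chunk store, per-file stream maps) for a dict of name -> bytes."""
    store = {}        # chunk id -> chunk bytes, each unique chunk stored once
    stream_maps = {}  # file name -> ordered list of chunk ids
    for name, data in files.items():
        ids = []
        for s, e in chunk_boundaries(data):
            chunk = data[s:e]
            cid = hashlib.sha256(chunk).hexdigest()
            store.setdefault(cid, chunk)
            ids.append(cid)
        stream_maps[name] = ids
    return store, stream_maps


def rehydrate(store, stream_map):
    """Rebuild the original file contents from its stream map."""
    return b"".join(store[cid] for cid in stream_map)
```

Because boundaries depend on the bytes inside the window rather than on fixed offsets, an insertion near the start of a file tends to change only the nearby chunks instead of shifting every chunk after it.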

Jim Benton:

  • Windows approach to data consistency and integrity. Mandatory block diagram with deduplication components.
  • Looking at deduplication on-disk structures. Identify duplicate data (chunks), optimize target files (stream map)
  • Chunk store file layout. Data container files: chunks and stream maps. Chunk ID has enough data to locate chunk
  • Look at the rehydration process. How to get the file back from the stream map and chunks.
  • Deduplicated file write path partial recall. Recall bitmap allows serving IO from file stream or chunk store.
  • Crash consistency state diagram. One example with partial recall. Generated a lot of these diagrams for confidence.
  • Used state diagrams to allow test team to induce failures and verify deduplication is indeed crash consistent.
  • Data scrubbing. Induce redundancy back in, but strategically. Popular chunks get more copies. Checksum verified.
  • Data scrubbing approach: Detection, containment, resiliency, scrubbing, repair, reporting. Lots of defensive code!
  • Deduplication uses Storage Spaces redundancy. Can use that level to recover the data from another copy if possible.
  • Performance for deduplication. Looking at a table with impact of dedup. Looking at options using less/more memory.
  • Looking at resource utilization for dedup. Focus on converging them.
  • Dedup performance varies depending on data access pattern. Time to open office file, almost no difference.
  • Dedup. Time to copy large VHD file. Lots of common chunks. Actually reduces copy time for those VHD files. Caching.
  • dedup write performance. Crash consistency hurts performance, so there is a hit. In a scenario, around 30% slower.
  • Deduplication around the top features in Windows Server 2012. Mentions at The Register, Ars Technica, Windows IT Pro
  • Lots of great questions being asked. Could not capture it all.

High Performance File Serving with SMB3 and RDMA via the SMBDirect Protocol with Tom Talpey and Greg Kramer

Tom Talpey:

  • Where we are with SMB Direct, where we are going, some pretty cool performance results.
  • Last year here at SDC we had our coming out party for SMB Direct. Review of what's SMB Direct.
  • Nice palindromic port for SMB Direct: 5445. Protocol documented at MS-SMBD. http://msdn.microsoft.com/en-us/library/hh536346(v=PROT.13).aspx
  • Covering the basics of SMB Direct. Only 3 message types. 2 way full duplex. Discovered via SMB Multichannel.
  • Relationship with the NDKPI in Windows. Provider interface implemented by adapter vendors.
  • Send/receive model. Possibly sent as train. Implements crediting. Direct placement (read/write). Scatter/gather list
  • Going over the details on SMB Direct send transfers. Reads and writes, how they map to SMB3. Looking at read transfer
  • looking at exactly how the RDMA reads and writes work. Actual offloaded transfers via RDMA. Also covering credits.
  • Just noticed we have a relatively packed room for such a technical talk...And it's the larger room here...
  • interesting corner cases for crediting. Last credit case. Async, cancels and errors. No reply, many/large replies
  • SMB Direct efficiency. Two pipes, one per direction, independent. Truly bidirectional. Server pull model. Options.
  • SMB Direct options for RDMA efficiency. FRMR, silent completions, coalescing, etc.
  • Server pull model allows for added efficiency, in addition to improved security. Server controls all RDMA operations.
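
To make the crediting idea a bit more concrete, here is a toy Python model (mine, not from the slides, and not the MS-SMBD wire format). The assumption is only the general principle: a sender may post a message while it holds a credit, and the peer grants more credits as it makes receive buffers available. The class and method names are hypothetical, and the corner cases from the talk (last credit, cancels, errors) are not modeled.

```python
from collections import deque


class CreditedSender:
    """Toy credit-based flow control: one credit allows one send."""

    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.pending = deque()  # messages waiting for a credit
        self.sent = []          # messages actually put on the wire

    def queue(self, msg):
        """Queue a message and send it right away if a credit is available."""
        self.pending.append(msg)
        self._drain()

    def grant(self, n):
        """Peer granted n more credits (for example, in a response)."""
        self.credits += n
        self._drain()

    def _drain(self):
        # Send as many pending messages as the current credits allow.
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.sent.append(self.pending.popleft())


if __name__ == "__main__":
    sender = CreditedSender(initial_credits=2)
    for name in ["read1", "read2", "read3"]:
        sender.queue(name)
    print("sent so far:", sender.sent, "waiting:", list(sender.pending))
    sender.grant(1)
    print("after grant:", sender.sent)
```

The point of the scheme is that the sender never posts more messages than the receiver has buffers for, so nothing has to be dropped or retried.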

Greg Kramer:

  • On the main event. That's why you're here, right? Performance...
  • SDC 2011 results. 160k iops, 3.2 GBytes/sec.
  • New SDC 2012 results. Dual CX3 InfiniBand, Storage Spaces, two SAS HBAs, SSDs. SQLIO tool.
  • Examining the results. 7.3 Gbytes / sec with 512KB IOs at 8.6% CPU. 453K 8KB IOs at 60% CPU.
  • Taking it to 11. Three InfiniBand links. Six SAS HBAs. 48 SSDs. 16.253 GBytes/sec!!! Still low CPU utilization...
  • NUMA effects on performance. Looking at NUMA disabled versus enabled. 16% in CPU utilization.
  • That's great! Now what? Looking at potential techniques to reduce the cost of IOs, increase IOPs further.
  • Looking at improving how invalidation consumes CPU cycles, RNIC bus cycles. But you do need to invalidate aggressively
  • Make invalidate cheaper. Using "send with invalidate". Invalidate done as early as possible, fewer round trips.
  • Send with invalidate: supported in InfiniBand, iWARP and RoCE. No changes to SMB Direct protocol. Not committed plan
  • Shout out to http://smb3.info  Thanks, Greg!
  • Question: RDMA and encryption? Yes, you can combine them. SMB Direct will use RDMA send/receives in that case.
  • Question: How do you monitor at packet level? Use Message Analyzer. But careful drinking from the fire hose :-)
  • Question: Performance monitor? There are counters for RDMA, look out for stalls, hints on how to optimize.
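
A small helper for relating the IOPs and throughput numbers quoted above (my own arithmetic, not from the slides). It assumes 8KB means 8 x 1024 bytes and that GBytes are decimal (10^9 bytes); the slides did not say which convention was used, so treat the results as approximate.

```python
KB = 1024      # assumed binary kilobyte for IO sizes
GB = 10 ** 9   # assumed decimal gigabyte for throughput


def throughput_bytes_per_sec(iops, io_size_bytes):
    """Throughput implied by an IO rate at a fixed IO size."""
    return iops * io_size_bytes


def iops_for_throughput(bytes_per_sec, io_size_bytes):
    """IO rate implied by a throughput at a fixed IO size."""
    return bytes_per_sec / io_size_bytes


if __name__ == "__main__":
    small = throughput_bytes_per_sec(453_000, 8 * KB)
    print("453K IOPs at 8KB  :", round(small / GB, 2), "GBytes/sec")
    large = iops_for_throughput(7.3 * GB, 512 * KB)
    print("7.3 GBytes/sec at 512KB:", round(large), "IOPs")
```

This is why the small IO case costs so much more CPU: it takes roughly thirty times as many IOs per second while moving about half the data.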

SMB 3.0 Application End-to-End Performance with Dan Lovinger

  • Product is released now, unlike last year. We're now showing final results...
  • Scenarios with OLTP database, cluster motion, Multichannel. How we found issues during development.
  • Summary statistics. You can drown in a river with an average depth of six inches.
  • Starting point: Metric versus time. Averages are not enough, completely miss what's going on.
  • You should think about distribution. Looking at histogram. The classic Bell Curve. 34% to each side.
  • Standard deviation and median. Mid point of all data points. What makes sense for latency, bandwidth?
  • Looking at percentiles. Cumulative distributions. Remember that from College?
  • OLTP workload. Transaction rate, cumulative distribution. How we found and solved an issue that makes SMB ~= DAS
  • OLTP. Log file is small to midsize sequential IO, database file is small random IO.
  • Found 18-year-old performance bug that affects only SMB and only in an OLTP workload. Leftover from FAT implementation.
  • Found this "write bubble" performance bug looking at average queue length. Once fixed, SMB =~ DAS.
  • back to OLTP hardware configuration. IOPs limited workload does not need fast interconnect.
  • Comparing SMB v. DAS transaction rate at ingest. 1GbE over SMB compared to 4GbFC. Obviously limited by bandwidth.
  • As soon as the ingest phase is done, then 1GbE is nearly identical to 4GbFC. IOPs limited on disks. SMB=~DAS.
  • This is just a sample of why workload matters, why we need this kind of performance analysis to find what we can improve.
  • IOmeter and SQLIO are not enough. You need to look at a real workload to find these performance issues.
  • Fix for this issue in Windows Server 2012 and also back ported to Windows Server 2008 R2.
  • Another case: Looking at what happens when you move a cluster resource group from one node to another.
  • 3 file server cluster groups, 40 disk on each. How resource control manager handles the move. Needed visualization.
  • Looking at a neat visualization of how cluster disks are moved from one node to another. Long pole operations.
  • Found that every time we offline a disk, there was a long running operation that was not needed. We fixed that.
  • We also found a situation that took multiple TCP timeouts, leading to long delay in the overall move. Fixed!
  • Final result, dramatic reduction of cluster move time. Entire move time from 55 seconds to under 10 seconds.
  • Now we can do large cluster resource group moves with 120 disks in under 10 seconds. Not bad...
  • Last case study. SMB Multichannel performance. Looking at test hardware configuration. 24 SSDs, 2 SAS HBAs, IOmeter
  • Looking at local throughput at different IO sizes, as a baseline.
  • SMB Multichannel. We can achieve line rate saturation at about 16KB with four 40GbE interfaces.
  • Curve for small IOs matches between DAS and SMB at line rate.
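
The point about averages hiding what is going on is easy to show with a few lines of Python (my own illustration, with made-up latency samples, not data from the talk). The percentile helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Collect the summary statistics discussed in the talk."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical latencies in ms: most IOs are fast, a few stall badly.
    latencies_ms = [1] * 95 + [200] * 5
    for name, value in summarize(latencies_ms).items():
        print(f"{name:>6}: {value}")
```

With samples like these the mean lands far from what a typical IO sees and far from the stalls as well, while the median and the high percentiles show both sides of the distribution.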

Closing tweet

  • #SDConference is finished. Thanks for a great event! Meet you at SNW Fall 2012 in a month, right back here. On my way back to Redmond now...