Microsoft Research Connections Blog
Next at Microsoft
Social Media Collective
Windows on Theory
Posted by John Davis, researcher at Microsoft Research Silicon Valley
Storage in the data center has been dominated by expensive, aggregated systems that provide consistency but not fault tolerance, or by cheaper, partitioned designs that provide performance but not consistency. Leveraging the properties of NAND flash, CORFU provides strong consistency and high performance, both in terms of throughput and low latency, in a fault-tolerant, distributed storage system.
CORFU creates new opportunities in the data center, enabling designs that are impractical on hard disk or too expensive on RAM. CORFU achieves these goals by exposing a global shared log, where hundreds of client machines append to the tail of a single log and read from its body concurrently.
A shared log is a powerful, versatile primitive for ensuring strong consistency in the presence of failures and asynchrony. It can play many roles in a distributed system: a consensus engine for consistent replication; a transaction arbitrator for isolation and atomicity; an execution history for replica creation, consistent snapshots, and geo-distribution; and even a primary data store that leverages fast appends on underlying media. Flash is an ideal medium for implementing a scalable shared log, supporting fast, contention-free random reads to the body of the log and fast sequential writes to its tail.
CORFU provides the foundation for building multiple data-center or cloud applications, such as key-value stores, databases, or state-machine replication engines, which require concurrent and consistent access to a distributed flash array. As Figure 1 below shows, application servers or “clients” access the shared global log using the CORFU library, which has a simple API. We implement most of the intelligence of CORFU in this library, enabling the use of simple NAND flash storage units. We are prototyping a simplified solid-state drive (SSD) with an Ethernet port instead of SATA or PCI-E.
In CORFU, each position in the shared log is mapped to a set of flash pages on different flash units (see Figure 1). This map is maintained—consistently and compactly—at the clients by the CORFU library. To read a particular position in the shared log, a client uses its local copy of this map to determine a corresponding physical flash page and then directly issues a read to the flash unit storing that page. To append data, a client first determines the next available position in the shared log—using a sequencer node as an optimization for avoiding contention with other appending clients—and then writes data directly to the set of physical flash pages mapped to that position.
CORFU’s client-centric design has two objectives. First, it ensures that the append throughput of the log is not a function of the bandwidth of any single flash unit. Second, placing functionality at the clients reduces the complexity, cost, latency, and power consumption of the flash units. In fact, CORFU can operate over SSDs that are attached directly to the network, eliminating general-purpose storage servers from the critical path.
We are prototyping a network-attached flash unit on an FPGA platform. When used with the CORFU stack, this custom hardware provides the same rate of networked throughput as a XEON-hosted SSD, but with 33 percent lower read latency and an order of magnitude less power. Over a cluster of such flash units, CORFU’s logging design acts as a distributed SSD, implementing functionality found inside conventional SSDs—such as wear-leveling—at cluster scale, while still providing simpler per-unit garbage collection.
On the software side, we also are implementing various infrastructure applications atop CORFU. Visit the CORFU project page to follow the exciting developments as we build CORFU-specific hardware and applications on top of the CORFU library.