THE DIRECT PATH
TO INFINITE BANDWIDTH

Say bye-bye to Storage Area Networks and hello to the Systems Area Networks that will deliver fast interconnect speed with very much reduced latency by cutting out the software middleman – the device driver on the operating system.

   
  by Jack Fegreus      
     
 

The Internet bubble may have burst, but that rupture has left many lasting effects. The time compression that occurs in a 24x7x52 always-on environment continues today stronger than before. Under the gaze of continuous customer scrutiny, already short product life cycles continue to get shorter and shorter. To deal with such a scenario, corporations continue to focus their attention on the process of product innovation.

How to better innovate is certainly a useful question to ask. However, there is still an even more fundamental question to answer: Why is it necessary? That question is the essence of “customer value,” which market cognoscenti from Cambridge to London to Lausanne have made the B-School equivalent to unified field theory.

What they are talking about is a customer’s perception of the total benefits to be derived from a particular product or service vs. that customer’s perception of the total cost of acquisition and ownership for that product. In short, product value is not an absolute concept, but one that is relative. Product value only has meaning in the way a product is perceived by a customer relative to competitive product offerings. Twenty-four centuries ago, Plato asked where can men look for anything that they know is truth. Today, business executives ask where can they find the things that shape what their customers perceive as truth.

How to find the answer is the essence of what has become known as “business intelligence.” For IT, the new quest for business intelligence is manifested in the imperative to build data marts and the mother of all data marts: the data warehouse.

The fundamental tactic for success in such a quest is to collect all manner of factual data, from historical operational statistics to the latest in competitive intelligence hot off the Internet. The key point here is the subtle difference between factual and transactional data. Factual data deals with business entities, such as customers, products, and sales, rather than processes like posting an order, stocking inventory, or issuing an invoice. IT has long been able to set up the necessary infrastructure for processing transactional data. The new burden falling upon IT is to organize, integrate, and consolidate factual data and then to marshal it for analysis.

This requires the creation of a central data store that defines a single, consistent, organization-wide data model. What’s more, a data warehouse must also be able to handle facts that can be quite inconsistent over time. In marked contrast to a transactional system, which must be consistent at the current point in time, a data warehouse contains a series of snapshots about each of the key subject areas of the business over a long period. As a result, analyzing the contents of a data warehouse involves a very different set of constraints from those involved with On Line Transaction Processing (OLTP). Analyzing the contents of a corporate data warehouse or a departmental data mart involves On Line Analytical Processing (OLAP).

 
         
 

OLTP is characterized by many discrete small—on the order of 8KB—reads that are spread across the database as transactions retrieve or update specific data items. The architecture of an OLTP database must therefore concentrate on maintaining data integrity. OLAP transactions, however, almost exclusively involve reading and aggregating very large amounts of data. The design objective for an OLAP database is to promote the analysis of gross patterns, trends, and exception conditions.

 
 
The strength of Linux I/O is in doing sequential I/O on large files. This is a function of the way file I/O requests are bundled. This makes Linux a good OS to support OLAP, whose data components tend to be enormous sparse matrices. The weakness of Linux I/O is in processing short, discrete I/O requests of the type found in an OLTP environment. The lack of a native asynchronous I/O processing scheme can be overcome with InfiniBand, which can externalize an asynchronous I/O model.
 

 

To meet these analysis demands, database architects historically built specialized Multidimensional OLAP (MOLAP) databases that were distinctly nonrelational in their design. As a result, OLAP servers tended to be quite proprietary and quite expensive. In turn, an important technology was left distinctly out of the IT mainstream.

These specialized OLAP databases were built around a data model, which conceptually viewed information in terms of multidimensional hypercubes. By describing these information hypercubes in terms of descriptive categories dubbed dimensions and quantitative values dubbed measures, the OLAP data model simplifies the formulation of complex queries.
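To make the vocabulary of dimensions and measures concrete, here is a minimal Python sketch that rolls a handful of sales facts up along whichever dimensions a query names. The field names (region, product, quarter, revenue) and all of the figures are invented for illustration; they do not come from any product discussed in this article.

```python
from collections import defaultdict

# Hypothetical facts: each row carries three dimensions and one measure.
FACTS = [
    {"region": "East", "product": "Widget", "quarter": "Q1", "revenue": 120},
    {"region": "East", "product": "Gadget", "quarter": "Q1", "revenue": 80},
    {"region": "West", "product": "Widget", "quarter": "Q1", "revenue": 200},
    {"region": "West", "product": "Widget", "quarter": "Q2", "revenue": 150},
]


def roll_up(facts, dimensions, measure):
    """Sum one measure over every combination of the named dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# One slice of the cube per query: by region, then by region and quarter.
by_region = roll_up(FACTS, ["region"], "revenue")
by_region_quarter = roll_up(FACTS, ["region", "quarter"], "revenue")
```

The point of the model is that a complex question becomes a choice of which dimensions to keep and which measure to aggregate, rather than a hand-built query for each report.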

Recent work in data warehouse design has shown that there are classes of denormalized relational table constructs that can be used as the foundation of a strong OLAP implementation. IBM has been at the forefront of the relational OLAP movement. By developing rich business intelligence features for DB2 EE on Linux that make use of these structures to create cubes, IBM has been able to make OLAP much less formidable and much less expensive. By exploiting this relational underpinning, IBM is making multidimensional analysis accessible to a much broader audience. But that’s only half the story. IBM’s DB2 EEE, with its ability to partition a database over multiple independent nodes, provides the opportunity to bring the stratospheric high end of business intelligence applications, such as SAP Business Information Warehouse (BW), onto PC-class servers running Linux.

The constraints to running terabyte-size data warehouses on PC servers are entirely I/O bandwidth-related. The two issues that need to be addressed are wire-speed throughput and overhead latency. Today’s Fibre Channel SANs certainly address the first part of this problem. With wire speed specifications of either 1 or 2 Gbits/sec, Fibre Channel-based Storage Area Networks raise the bar significantly over direct connect SCSI storage. Nonetheless, the use of standard SCSI protocols on Fibre Channel does nothing to address the latency issue. Processing SCSI requests at the host is no different for a storage device on a SAN and a storage device at the end of a directly attached SCSI bus.

To address the issue of I/O latency, it is necessary to think outside of the bus. The PC’s bus structure is at the root of the problem. To find a solution, it’s necessary to study I/O paradigms of old-line mainframes and clusters.

The origins of the road to infinite I/O bandwidth meander through the misty past of computing history. One of the key way stations along this road was The Mill in a small Massachusetts town called Maynard. Here at DEC, the VAXcluster was conceived as a way to configure a large number of processors—1Mip VAX11/780s as big as a minivan—and storage controllers—HSC50s about the size of a washing machine. Everything was on a grand scale except the 70Mbits/sec throughput. Nonetheless, there was real genius underlying the VAXcluster.

DEC’s key insight was to create a blueprint for clusters called Systems Communication Architecture (SCA). Even though the VAX nodes in a VAXcluster logically resembled tightly coupled processors in a single system more than a loosely integrated LAN, the system communications and not hardware specifications were the foundation on which VAXclusters evolved.

At the heart of SCA was a port-to-port communications layer for data transfer services. Under this scheme, hardware controllers guaranteed the delivery of messages and performed block transfers of data from one node’s memory to another’s. More importantly, all of this was done without much intervention from the host’s CPU.

Flash-forward to 1997 and you’ll find Intel, Compaq, and Microsoft trying to solve the same problem. Only this time they were trying to harness the power of multiple PCs running Windows NT to compete with big SMP Unix systems. On the surface, the environment that they were working in was a lot different from those early days with the VAX. New network hardware delivered faster network speeds. At the same time, the requirements for inter-computer security increased. Nonetheless, network devices still had the same fundamental structure and required the same fundamental software support layers.

The transmission of data across a traditional network begins with a user application calling an operating system service function to build a connection to a remote machine. Almost immediately, the operating system’s response is to do a context switch to kernel mode and validate the request in that trusted environment. From that point, the user cannot make any changes to the request because memory management protects the data structures.

Once the application’s request has been validated, the operating system allocates buffer space for the data, copies the data from the user’s address space into the buffer and passes the request down through the network protocol layers. Each layer of the protocol implementation code adds its own security, information, and signature and passes the result below. When the packet has been fully built, it is passed to the device driver, which interprets the coded destination and transmission type and queues the request to the hardware.
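The cost of that journey is easier to see in a toy model. The Python sketch below imitates the path just described: one copy from user space into a kernel buffer, then one header added per protocol layer. The layer names, header sizes, and the idea of counting each prepend as a fresh buffer are assumptions made purely for illustration; no real protocol stack is being reproduced here.

```python
# Hypothetical layers and header sizes in bytes, chosen only for illustration.
LAYERS = [("transport", 20), ("network", 20), ("link", 14)]


def send_traditional(user_data: bytes):
    """Model the kernel path: copy the data out of user space, then let
    each protocol layer prepend its header before the driver queues it."""
    copies = 0
    kernel_buffer = bytes(bytearray(user_data))  # copy into a kernel buffer
    copies += 1
    packet = kernel_buffer
    for name, header_len in LAYERS:
        header = name[:1].upper().encode() * header_len  # placeholder header
        packet = header + packet  # each prepend builds a new buffer
        copies += 1
    return packet, copies


# A 1,000-byte payload picks up 54 bytes of headers and is copied four times.
packet, copies = send_traditional(b"x" * 1000)
```

Every one of those copies is CPU work done on behalf of the network rather than the application, which is the overhead the rest of this story sets out to remove.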

In the early days of 10Mbit/sec Ethernet and serial communication devices, the software layering represented less than 10% of the total end-to-end transmission time. In today’s world of Gigabit Ethernet, the relative time consumption by the software layers has increased to be more like 95% of the end-to-end transport time. That’s because the latency of network access hasn’t changed very significantly. Any measurement of network data transmission speed must account for the setup costs on the transmit side, the actual data transfer and the processing on the receive side. The components to these operations must be serial. The receive side cannot start to copy data buffers until all the data has been received. The hardware cannot start to transmit the packet until it is complete, and so it goes.
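A back-of-the-envelope calculation shows why the software share balloons. The Python sketch below assumes, for illustration only, that software overhead was exactly 10% of the end-to-end time on 10Mbit/sec Ethernet and has stayed constant, while the time on the wire shrinks by the 100x ratio between 10Mbit/sec and Gigabit Ethernet. The time units are arbitrary.

```python
def software_share(software_time, wire_time):
    """Fraction of end-to-end time spent in the software layers."""
    return software_time / (software_time + wire_time)


# Assumed starting point: 10 units of software time, 90 units on the wire.
SOFTWARE = 10.0
WIRE_10MBIT = 90.0
WIRE_GIGABIT = WIRE_10MBIT / 100  # Gigabit Ethernet is 100x faster on the wire

share_old = software_share(SOFTWARE, WIRE_10MBIT)   # 10 / 100
share_new = software_share(SOFTWARE, WIRE_GIGABIT)  # 10 / 10.9
```

Under those assumptions the software share climbs from 10% to roughly 92%, the same ballpark as the figure quoted above, without a single line of code getting slower.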

What’s more, when an application requests a protected service from the operating system, the CPU must be switched to a more protected mode: kernel mode. Most CPUs implement this context switch as a software interrupt—a software mechanism that emulates a hardware interrupt. Each interrupt, whether it originated in software or hardware, causes at least partial destruction of the pipelining that has been set up by the CPU and its cache environment. The latest CPU designs provide for the ability to ‘hold the thought’ of the currently executing sequence so that only the actual pipeline needs to be thrown away. Nonetheless, any interrupt causes disruption to the CPU performance.

So Intel, Compaq and Microsoft developed the preliminary Virtual Interface (VI) Architecture (VIA) specification, which defined a high-speed cluster communication interface specifically designed for low-latency as well as high-bandwidth. Like the old VAXcluster SCA, VIA sought to improve the speed of data transmission by pushing security and transmission protocol management into the VIA hardware so as to remove the need for a kernel mode device driver for most operations.

In particular, with the demise of The Mill there arrived at Redmond a number of the best and brightest of the VAXcluster architects. It was this crew that first developed the proof that VIA could be directly mapped onto Winsock. That was the key to selling Microsoft on the standard they could embrace and extend. In essence, they demonstrated how a standard TCP/IP connection could be set up through Winsock and then automatically switched to VIA upon recognition of the hardware. Rather than having all control and data packets flowing through the OS kernel, VIA would pass data to/from an application directly from the network interface once the initial setup and control had taken place. This meant a VI could be provided for free to any Microsoft or third-party application that invoked Winsock.

 
         

 

 The latency involved in establishing a VI is approximately the same duration as that for transmitting a traditional network packet. The difference is that once the VI is established, the transmission and receive latencies drop to nearly zero. What’s more, VIA transfers the data by direct memory access at the device level. This eliminates the need for buffer copying as part of I/O completion.
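The payoff from that one-time setup cost depends on how long the connection lives. Here is a small Python sketch of the arithmetic; every figure in it is an assumption picked for illustration (100 microseconds per traditional packet, the same for VI setup, and 1 microsecond per message afterward), not a measurement.

```python
def traditional_total(n_messages, per_packet_us):
    """Every message pays the full software latency."""
    return n_messages * per_packet_us


def via_total(n_messages, setup_us, per_message_us):
    """Pay for VI setup once, then a small cost per message."""
    return setup_us + n_messages * per_message_us


# Assumed figures in microseconds; not measurements.
PACKET_US = 100.0  # traditional per-packet latency
SETUP_US = 100.0   # VI setup, about one traditional packet per the text
VI_MSG_US = 1.0    # stand-in for "nearly zero" once the VI exists

one = (traditional_total(1, PACKET_US), via_total(1, SETUP_US, VI_MSG_US))
many = (traditional_total(1000, PACKET_US), via_total(1000, SETUP_US, VI_MSG_US))
```

For a single message the two paths cost about the same. Over a thousand messages the traditional path costs 100,000 microseconds against 1,100 for the VI, which is why long-lived connections are where this approach shines.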

The typical VIA implementation of a Systems Area Network consists of specialized interface cards (including a processor and some memory), low-level device management drivers for establishing connections, and a set of user-mode libraries that implement the VIA API for the particular hardware.

 When an application needs to communicate across the VIA transport, it calls a function in the Virtual Interface Provider Library (VIPL). VIPL triggers a transition to kernel mode and employs the VIA card device driver to establish a link, called a Virtual Interface (VI).

 

Any management functions concerned with the link (creation, destruction, and so on) require kernel-mode device driver intervention. Nonetheless, once the VI is established, the operating system device driver is not required to transmit or receive data. For all of its promise, however, VIA ran into a number of stumbling blocks because of its very proprietary nature. First, at the hardware device level, VIA specifies neither a low-level wire protocol nor the form in which data is to be transmitted across the wire. Any company choosing to implement VIA can therefore make these decisions based on economics or performance. This all but ensures that products from different vendors will not interoperate.

The obvious need was for an open hardware standard. The milestone here was the merger of the Next Generation I/O effort (led by Intel and Microsoft) and the Future I/O effort (led by IBM, Compaq and Sun) into the InfiniBand Trade Association (IBTA). This ensured that there would be a standard implementation for communication interconnects and system I/O: InfiniBand.

Naturally, nothing in the standardization of InfiniBand precludes the implementation of VIA on InfiniBand hardware. Indeed this has been done and demonstrated by IBM with DB2 EEE as the driving application.

Even more important for the underwhelming adoption of VIA was the issue of proprietary interfaces at the upper or application layer. You may be shocked to learn that some vendors thought it a strategic advantage to attempt to dictate just how applications would have to interface with VIPL on their OS platforms. As a result, in parallel to the InfiniBand Trade Association for hardware, the Direct Access Transport (DAT) collaborative emerged to garner broad industry support for standards at the applications programming interface layer.

The InfiniBand hardware specification coming out of the IBTA incorporates all of the proven message passing, memory mapping, and point-to-point link technologies of mainframe and cluster networks. Topologically, InfiniBand creates a point-to-point switched network fabric much like Fibre Channel does in today’s Storage Area Networks. Indeed, InfiniBand switches can be integrated into existing Fibre Channel Storage Area Networks to extend their functionality.

In particular, the hardware pieces in an InfiniBand Systems Area Network fabric are host channel adapters (HCAs), target channel adapters (TCAs), and switches. The end-node channel adapters—the HCAs and the TCAs—generate and consume packets. Both the HCA and TCA provide reliable end-to-end connections that do not require processor intervention. Wire speeds for these connections are identified in multiples of the base speed, 1x, which is 2.5Gbits/sec. Common speeds are 1x, 4x—10 Gbits/sec—and 12x—30Gbits/sec.
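The link-width arithmetic is simple enough to sketch in a few lines of Python. The 2.5Gbits/sec base rate comes from the figures above. The payload calculation is an added assumption not stated in this article: it takes these links to use 8b/10b encoding, in which 8 of every 10 bits on the wire carry data.

```python
BASE_GBITS = 2.5  # 1x signaling rate, from the figures above


def link_rate_gbits(width):
    """Raw signaling rate for an InfiniBand link of the given width."""
    return BASE_GBITS * width


def payload_rate_gbits(width):
    """Usable data rate, ASSUMING 8b/10b encoding (8 data bits per 10 sent)."""
    return link_rate_gbits(width) * 8 / 10


rates = {f"{w}x": link_rate_gbits(w) for w in (1, 4, 12)}
payload = {f"{w}x": payload_rate_gbits(w) for w in (1, 4, 12)}
```

Under that encoding assumption, a 4x link signaling at 10Gbits/sec would carry 8Gbits/sec of actual data.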

The HCA resides in a system node and requires the most complex ASIC with extensive logic circuits. As an active component, an HCA is able to manage a connection and create a path from the system’s memory to the InfiniBand network. To this end, it has a direct-memory-access (DMA) engine with special protection and address-translation features to support both local and remote DMA operations. Remote DMA operations are initiated by another HCA or TCA.

The TCA resides in an I/O unit such as a disk drive, RAID array, or tape library and provides the connection between the I/O device and the InfiniBand network. It implements the physical, link, and transport layers of the InfiniBand protocol to deliver requested data. This simple device replaces the function of a SCSI or FC interface.

The switches are located between the end-node channel adapters. They serve to direct data packets to the correct destinations based on information bundled into the data packets’ route header. Bridging capabilities in switches provide the means to interconnect InfiniBand networks with other I/O networks such as SCSI-based Fibre Channel or TCP/IP-based gigabit Ethernet.

Given the throughput specifications for InfiniBand, along with the design criterion of letting multiple I/O devices make requests simultaneously to the system CPU for data without delays or congestion, the obvious next question is how the new HCAs and TCAs will deliver this level of performance on a PCI bus. The answer is they really won’t. To get the full benefit of the InfiniBand fabric, the fabric must be internalized within the system. In other words, you’ll need an InfiniBand server.

In the meantime, one HCA on a 64-bit PCI card clocked at 66 MHz or—even better—PCI-X card will serve to put today’s high-end servers on to an InfiniBand System Area Network. While a single PCI-X card with a 4x interface should serve nicely, it will saturate the bus.
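A quick Python sketch of the peak numbers shows why the bus is the bottleneck. It assumes one transfer of the full bus width on every clock, which is a theoretical ceiling rather than a sustained rate, and it assumes a 133 MHz clock for PCI-X, a figure this article does not state.

```python
def bus_gbits(width_bits, clock_mhz):
    """Peak bus throughput in Gbits/sec: one full-width transfer per clock."""
    return width_bits * clock_mhz / 1000.0


pci_64_66 = bus_gbits(64, 66)   # 64-bit PCI at 66 MHz
pci_x_133 = bus_gbits(64, 133)  # PCI-X; the 133 MHz clock is an assumption
IB_4X_GBITS = 10.0              # 4x InfiniBand link rate quoted earlier

# True when the link can deliver more than the bus can carry.
saturates_pci = IB_4X_GBITS > pci_64_66
saturates_pci_x = IB_4X_GBITS > pci_x_133
```

Even at its theoretical peak of about 8.5Gbits/sec, the assumed PCI-X bus falls short of a 4x link, and 64-bit PCI at 66 MHz tops out at about 4.2Gbits/sec.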

IBM has a PCI-X-to-InfiniBand HCA, also backward compatible with 64-bit PCI 2.2, that integrates a dual InfiniBand interface. The dual InfiniBand connections improve the reliability of the HCA by eliminating a single point of failure. The HCA sports two embedded PowerPC 405 processors, one on the transmit path and one on the receive path. Separate direct memory access (DMA) engines permit concurrent receive and transmit data-path processing. Connection-related information is stored in either 256KB of on-device SDRAM or up to 256MB of off-device SDRAM attached directly to the HCA.

While all of these hardware pyrotechnics are very important, the lesson of VIA is that they are meaningless without an open, robust applications interface to take advantage of this extraordinary hardware firepower. This is precisely the mission of the Direct Access Transport (DAT) collaborative, and from DAT has come an open asynchronous communications programming interface dubbed the Direct Access Programming Library (DAPL). To provide for the implementation of high-speed interconnects, DAPL has two components: user-level DAPL (uDAPL) and kernel-level DAPL (kDAPL). In particular, uDAPL extends the functionality of VIPL for application developers and kDAPL unifies the semantic differences between InfiniBand and VIA at the HCA device level.

         

 

The power of an open API can be seen in the Direct Access File System (DAFS) collaborative. This group is building a file system to use with InfiniBand. The DAFS protocol defines low-latency, high-throughput file access operations that use remote memory-to-memory copy and other high performance primitives provided by DAT. The current revision of DAFS borrows heavily from the IETF NFS Version 4 specification.

All of this is just the tip of the iceberg for IT. Systems Area Networks will radically change how we look at processing and overhead. The most obvious implementation of VIA is where fast and long-lived connections are needed between software components on geographically close computer systems. That description applies naturally to several familiar circumstances. Clusters, storage and partitioned databases are the most immediately obvious.

 
     
  Reading data from a remote disk on a SAN will offer exactly the same performance as reading that data from a local disk. Adding a few microseconds of DAT latency to the disk head movement latency of milliseconds makes no difference to the overall result.

Even more important for applications and system implementation is that reading data from a remote computer’s memory in a SAN will be faster than reading the same data from a local disk. Consequently, SANs will render the notion of partitioned databases and the implied I/O shipping far more practical in terms of scalability and performance.
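Both of those claims come down to comparing microseconds with milliseconds. The Python sketch below uses assumed round numbers (8 milliseconds of disk head movement, 10 microseconds of DAT latency, 1 microsecond to read data already sitting in memory) purely to show the orders of magnitude; none of these are measurements.

```python
# Assumed latencies in microseconds; illustrative round numbers only.
DISK_SEEK_US = 8000.0  # disk head movement, on the order of milliseconds
DAT_HOP_US = 10.0      # stand-in for "a few microseconds" of DAT latency
MEMORY_READ_US = 1.0   # reading data that is already in memory

local_disk = DISK_SEEK_US
remote_disk = DISK_SEEK_US + DAT_HOP_US
remote_memory = MEMORY_READ_US + DAT_HOP_US

# Extra cost of going remote for disk data, as a percentage.
penalty_pct = 100.0 * (remote_disk - local_disk) / local_disk

# How many times faster remote memory is than a local disk read.
speedup = local_disk / remote_memory
```

With these numbers the remote disk read costs about a tenth of a percent more than the local one, while a read from a remote machine's memory comes back several hundred times faster than a local disk access.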