|
THE DIRECT PATH
Say bye-bye to Storage Area Networks and hello to the Systems Area Networks that will deliver fast interconnect speeds with much lower latency by cutting out the software middleman: the device driver in the operating system. |
|||
by Jack Fegreus |
|
The Internet bubble may have burst, but that rupture has left many lasting effects. The time compression that occurs in a 24x7x52 always-on environment continues today, stronger than before. Under the gaze of continuous customer scrutiny, the already out-of-control rate of product obsolescence continues to accelerate. To deal with such a scenario, corporations continue to focus their attention on the process of product innovation. How to better innovate is certainly a useful question to ask. However, there is an even more fundamental question to answer: Why is it necessary?

That question is the essence of “customer value,” which market cognoscenti from Cambridge to London to Lausanne have made the B-School equivalent of unified field theory. What they are talking about is a customer’s perception of the total benefits to be derived from a particular product or service vs. that customer’s perception of the total cost of acquisition and ownership for that product. In short, product value is not an absolute concept, but a relative one. Product value only has meaning in the way a product is perceived by a customer relative to competitive product offerings.

Twenty-four centuries ago, Plato asked where men can look for anything that they know is truth. Today, business executives ask where they can find the things that shape what their customers perceive as truth. How to find the answer is the essence of what has become known as “business intelligence.” For IT, the new quest for business intelligence is manifested in the imperative to build data marts and the mother of all data marts: the data warehouse. The fundamental tactic for success in such a quest is to collect all manner of factual data, from historical operational statistics to the latest in competitive intelligence hot off the Internet. The key point here is the subtle difference between factual and transactional data. 
Factual data deals with business entities, such as customers, products, and sales, rather than processes like posting an order, stocking inventory, or issuing an invoice. IT has long been able to set up the necessary infrastructure for processing transactional data. The new burden falling upon IT is to organize, integrate, and consolidate factual data and then to marshal it for analysis. This requires the creation of a central data store that defines a single, consistent, organization-wide data model. What’s more, a data warehouse must also be able to handle facts that can be quite inconsistent over time. In marked contrast to a transactional system, which must be consistent at the current point in time, a data warehouse contains a series of snapshots about each of the key subject areas of the business over a long period. As a result, analyzing the contents of a data warehouse involves a very different set of constraints from those involved with On Line Transaction Processing (OLTP). Analyzing the contents of a corporate data warehouse or a departmental data mart involves On Line Analytical Processing (OLAP). |
|
OLTP is characterized by many small, discrete reads (on the order of 8KB each) that are spread across the database as transactions retrieve or update specific data items. The architecture of an OLTP database must therefore concentrate on maintaining data integrity. OLAP transactions, however, almost exclusively involve reading and aggregating very large amounts of data. The design objective for an OLAP database is to promote the analysis of gross patterns, trends, and exception conditions. |
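To make the contrast between the two access patterns concrete, here is a minimal Python sketch. The `SALES` rows, the column names, and both helper functions are hypothetical and exist only for illustration; a real OLTP or OLAP engine works on indexed disk pages, not an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales facts: (order_id, region, product, amount).
SALES = [
    (1, "East", "widget", 120.0),
    (2, "West", "widget", 80.0),
    (3, "East", "gadget", 200.0),
    (4, "West", "gadget", 50.0),
    (5, "East", "widget", 30.0),
]

# OLTP-style access: a key index lets a transaction touch one small record.
ORDER_INDEX = {row[0]: row for row in SALES}


def oltp_lookup(order_id):
    """Fetch a single order by its key, touching exactly one record."""
    return ORDER_INDEX[order_id]


def olap_total_by(column):
    """Scan every record and aggregate the amount by one descriptive column."""
    position = {"region": 1, "product": 2}[column]
    totals = defaultdict(float)
    for row in SALES:
        totals[row[position]] += row[3]
    return dict(totals)


if __name__ == "__main__":
    print(oltp_lookup(3))           # one record, found by key
    print(olap_total_by("region"))  # every record read, then summarized
```

The lookup cost stays flat as the table grows, while the aggregate has to read every row, which is why bulk I/O bandwidth dominates OLAP performance.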
|
|
To meet these analysis demands, database architects historically built specialized Multidimensional OLAP (MOLAP) databases that were distinctly nonrelational in their design. As a result, OLAP servers tended to be quite proprietary and quite expensive. In turn, an important technology was left distinctly out of the IT mainstream. These specialized OLAP databases were built around a data model that conceptually viewed information in terms of multidimensional hypercubes. By describing these information hypercubes in terms of descriptive categories dubbed dimensions and quantitative values dubbed measures, the OLAP data model simplifies the formulation of complex queries.

Recent work in data warehouse design has shown that there are classes of denormalized relational table constructs that can be used as the foundation of a strong OLAP implementation. IBM has been at the forefront of the relational OLAP movement. By developing rich business intelligence features for DB2 EE on Linux that make use of these structures to create cubes, IBM has been able to make OLAP much less formidable and much less expensive. By exploiting this relational underpinning, IBM is making multidimensional analysis accessible to a much broader audience.

But that’s only half the story. IBM’s DB2 EEE, with its ability to partition a database over multiple independent nodes, provides the opportunity to bring the stratospheric high end of business intelligence applications, such as SAP Business Information Warehouse (BW), onto PC-class servers running Linux. The constraints on running terabyte-size data warehouses on PC servers are entirely I/O bandwidth-related. The two issues that need to be addressed are wire-speed throughput and overhead latency. Today’s Fibre Channel SANs certainly address the first part of this problem. With wire-speed specifications of either 1 or 2 Gbits/sec, Fibre Channel-based Storage Area Networks raise the bar significantly over direct-connect SCSI storage. 
Nonetheless, the use of standard SCSI protocols on Fibre Channel does nothing to address the latency issue. Processing SCSI requests at the host is no different for a storage device on a SAN than for a storage device at the end of a directly attached SCSI bus. To address the issue of I/O latency, it is necessary to think outside of the bus. The PC’s bus structure is at the root of the problem. To find a solution, it’s necessary to study the I/O paradigms of old-line mainframes and clusters.

The origins of the road to infinite I/O bandwidth meander through the misty past of computing history. One of the key way stations along this road was The Mill in a small Massachusetts town called Maynard. Here at DEC, the VAXcluster was conceived as a way to configure a large number of processors (1-MIPS VAX-11/780s as big as a minivan) and storage controllers (HSC50s about the size of a washing machine). Everything was on a grand scale except the 70Mbits/sec throughput. Nonetheless, there was real genius underlying the VAXcluster. DEC’s key insight was to create a blueprint for clusters called Systems Communication Architecture (SCA). Even though the VAX nodes in a VAXcluster logically resembled tightly coupled processors in a single system more than a loosely integrated LAN, the system communications and not hardware specifications were the foundation on which VAXclusters evolved.

At the heart of SCA was a port-to-port communications layer for data transfer services. Under this scheme, hardware controllers guaranteed the delivery of messages and performed block transfers of data from one node’s memory to another’s. More importantly, all of this was done without much intervention from the host’s CPU. Flash-forward to 1997 and you’ll find Intel, Compaq, and Microsoft trying to solve the same problem. Only this time they’re trying to harness the power of multiple PCs running Windows NT to compete with big SMP Unix systems. 
On the surface, the environment that they were working in was very different from those early days with the VAX. New network hardware delivered faster network speeds. At the same time, the requirements for inter-computer security increased. Nonetheless, network devices still had the same fundamental structure and required the same fundamental software support layers.

The transmission of data across a traditional network begins with a user application calling an operating system service function to build a connection to a remote machine. Almost immediately, the operating system’s response is to do a context switch to kernel mode and validate the request in that trusted environment. From that point, the user cannot make any changes to the request because memory management protects the data structures. Once the application’s request has been validated, the operating system allocates buffer space for the data, copies the data from the user’s address space into the buffer, and passes the request down through the network protocol layers. Each layer of the protocol implementation code adds its own security, information, and signature and passes the result to the layer below. When the packet has been fully built, it is passed to the device driver, which interprets the coded destination and transmission type and queues the request to the hardware.

In the early days of 10Mbit/sec Ethernet and serial communication devices, the software layering represented less than 10% of the total end-to-end transmission time. In today’s world of Gigabit Ethernet, the share of time consumed by the software layers has grown to something closer to 95% of the end-to-end transport time. That’s because the latency of network access hasn’t changed very significantly. Any measurement of network data transmission speed must account for the setup costs on the transmit side, the actual data transfer, and the processing on the receive side. The components of these operations are necessarily serial. 
The receive side cannot start to copy data buffers until all the data has been received. The hardware cannot start to transmit the packet until it is complete, and so it goes.

What’s more, when an application requests a protected service from the operating system, the CPU must be switched to a more protected mode: kernel mode. Most CPUs implement this context switch as a software interrupt, a software mechanism that emulates a hardware interrupt. Each interrupt, whether it originated in software or hardware, causes at least partial destruction of the pipelining that has been set up by the CPU and its cache environment. The latest CPU designs provide for the ability to ‘hold the thought’ of the currently executing sequence so that only the actual pipeline needs to be thrown away. Nonetheless, any interrupt causes disruption to the CPU performance.

So Intel, Compaq, and Microsoft developed the preliminary Virtual Interface (VI) Architecture (VIA) specification, which defined a high-speed cluster communication interface specifically designed for low latency as well as high bandwidth. Like the old VAXcluster SCA, VIA sought to improve the speed of data transmission by pushing security and transmission protocol management into the VIA hardware so as to remove the need for a kernel mode device driver for most operations.

In particular, with the demise of The Mill there arrived at Redmond a number of the best and brightest of the VAXcluster architects. It was this crew that first developed the proof that VIA could be directly mapped onto Winsock. That was the key to selling Microsoft on a standard they could embrace and extend. In essence, they demonstrated how a standard TCP/IP connection could be set up through Winsock and then automatically switched to VIA upon recognition of the hardware. 
Rather than having all control and data packets flowing through the OS kernel, VIA would pass data to/from an application directly from the network interface once the initial setup and control had taken place. This meant a VI could be provided for free to any Microsoft or third-party application that invoked Winsock. |
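The arithmetic behind the software-overhead figures quoted earlier can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: a 1500-byte frame, and a fixed per-packet software cost chosen so that it amounts to exactly 10% of the end-to-end time at 10Mbit/sec. The function names are likewise hypothetical.

```python
def wire_time_us(frame_bytes, link_mbits_per_sec):
    """Microseconds one frame spends on the wire (1 Mbit/sec = 1 bit/usec)."""
    return frame_bytes * 8 / link_mbits_per_sec


def software_share(software_us, frame_bytes, link_mbits_per_sec):
    """Fraction of end-to-end time spent in the software layers."""
    wire_us = wire_time_us(frame_bytes, link_mbits_per_sec)
    return software_us / (software_us + wire_us)


FRAME_BYTES = 1500  # assumed frame size

# Assumed fixed software cost: solving s / (s + wire) = 0.10 at 10Mbit/sec
# gives s = wire / 9, or roughly 133 microseconds per packet.
SOFTWARE_US = wire_time_us(FRAME_BYTES, 10) / 9

if __name__ == "__main__":
    for mbits in (10, 100, 1000):
        share = software_share(SOFTWARE_US, FRAME_BYTES, mbits)
        print(f"{mbits:>5} Mbit/sec: software share {share:.0%}")
```

Under these assumptions the software share climbs from 10% at 10Mbit/sec to roughly 92% at Gigabit speed without the software getting any slower. The wire simply got a hundred times faster while the per-packet overhead stayed put.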
|
|
The latency involved in establishing a VI is approximately the same duration as that for transmitting a traditional network packet. The difference is that once the VI is established, the transmission and receive latencies drop to nearly zero. What’s more, VIA transfers the data by direct memory access at the device level. This eliminates the need for buffer copying as part of I/O completion. The typical VIA implementation of a Systems Area Network consists of specialized interface cards (including a processor and some memory), low-level device management drivers for establishing connections, and a set of user-mode libraries that implement the VIA API for the particular hardware. When an application needs to communicate across the VIA transport, it calls a function in the Virtual Interface Provider Library (VIPL). VIPL triggers a transition to kernel mode and employs the VIA card device driver to establish a link, called a Virtual Interface (VI). |
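The payoff of paying the setup cost once can be sketched with a simple cost model in Python. The figures below (100 microseconds for a traditional packet and for VI setup, 5 microseconds per message once the VI exists) are assumptions chosen purely for illustration, not measurements, and the function names are hypothetical.

```python
def traditional_total_us(messages, per_message_us):
    """Every message pays the full trip through the kernel and protocol stack."""
    return messages * per_message_us


def vi_total_us(messages, setup_us, per_message_us):
    """One kernel-mode setup, then each message pays only a small residual cost."""
    return setup_us + messages * per_message_us


KERNEL_PATH_US = 100.0  # assumed cost of one traditional packet
VI_SETUP_US = 100.0     # assumed: about the same as one traditional packet
VI_MESSAGE_US = 5.0     # assumed residual per-message cost over the VI

if __name__ == "__main__":
    for count in (1, 2, 10, 1000):
        old = traditional_total_us(count, KERNEL_PATH_US)
        new = vi_total_us(count, VI_SETUP_US, VI_MESSAGE_US)
        print(f"{count:>5} messages: traditional {old:>9.0f} us, VI {new:>7.0f} us")
```

With these numbers a single message is marginally slower over a VI, the second message already puts the VI ahead, and the gap widens with every message after that.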
| Reading data from a remote disk on a SAN will
offer exactly the same performance as reading that data from a local disk.
Adding a few microseconds of DAT latency to the disk head movement latency
of milliseconds makes no difference to the overall result.
Even more important for applications and system
implementation is that reading data from a remote computer’s memory in a SAN
will be faster than reading the same data from a local disk. Consequently,
SANs will render the notion of partitioned databases and the implied I/O
shipping far more practical in terms of scalability and performance. |