TREKKING THE SAN PIT AT LIGHT SPEED

For sites anguishing over storage bottlenecks, QLogic’s SAN-in-a-Box puts a light at the end of the cable.

   
  by Jack Fegreus      
     
 

Just 18 months ago, openBench Labs was preparing to build our first test Storage Area Network (SAN). At the time we had no appreciation for the Through the Looking-Glass world we were about to enter. After all, SANs had been the coming IT panacea for nearly five years, and the first major revision of the Fibre Channel hardware, from 1Gb per second to 2Gb per second, was about to be unleashed on the market. Then came our only premonition, when the vendor of a very hefty (physically and fiscally) RAID array confronted us with the caveat, “But you’ve got to send the cable back.”

The problems that we ran into were not as much technical as physical. Cables, Gigabit Interface Connectors (GBICs), and all the other network plumbing were insanely hard to come by and priced like a Patek Philippe watch. Now we understood all the concern for what seemed like a trivial copper Fibre Channel cable required by that array.

 
         
 
openBENCH LABS SCENARIO
UNDER EXAMINATION
Linux SAN Performance and Functionality

UNDER EXAMINATION
QLogic SAN Connectivity Kit 1000
http://www.qlogic.com
Imperial Technologies MegaRam-2000
http://www.imperialtech.com
Overland Neo Series Library
http://www.overlanddata.com

http://www.suse.com
Ximian Evolution 1.0
http://www.ximian.com
StarOffice 6.0
http://www.sun.com

HOW WE TESTED
HP Netserver LP 1000r
http://www.hp.com
Dell PowerEdge 2400 Servers
http://www.dell.com
Red Hat Linux v7.3
http://www.redhat.com
SuSE Linux 8.0
http://www.suse.com
Windows 2000 Server
http://www.microsoft.com
openBench Labs oblLoad v1.0 benchmark
openBench Labs oblDisk v1.0 benchmark
openBench Labs oblTape v1.0 benchmark

KEY FINDINGS
Linux performance in the SAN was superior to that of Windows 2000 Server with the one exception of I/O loading, which is a function of Windows’ ability to do asynchronous I/O.
The current distributions of both Red Hat and SuSE supported all of the SAN hardware tested with no additional driver requirements.
Performance of the MegaRam-2000 saturated the 1Gb SAN.
Streaming performance of the Quantum SDLT drive in the Overland Neo library improved with Linux on the SAN.

Fibre Channel switches and many HBAs come with a slot that requires a module to provide either a copper or a fiber optic interface. So in addition to the pricey cables, we discovered that we needed to factor in several hundred more dollars to put GBICs on one or both ends of these cables. For openBench Labs, the only hard part of setting up our first SAN was the tedious business of dealing with nasty connectors.

Still, this left the burning issue of why Linux had been a standout no-show when it came to SANs. From the beginning, it was Sun Solaris über alles with Windows NT/2000 making dramatic inroads. In the intervening months, the proliferation of Linux distributions based on the 2.4 kernel has sparked dramatic changes in the environment. The latest releases of both Red Hat and SuSE provide support for 1-Gb SAN HBAs. What’s more, QLogic has introduced its SAN-in-a-Box, formally dubbed the SAN Connectivity Kit 1000, which goes a long way toward making the basics of installing a SAN easy.

With 2-Gb SANs already becoming ensconced at the high end of IT shops and the InfiniBand revolution on the beta horizon, the cost of equipment for a 1Gb SAN has become far less dear. In line with that trend, QLogic has introduced a kit that contains a SANbox 8-port Fibre Channel switch, four QLA2200 64-bit PCI Fibre Channel HBAs, four 10-meter fiber optic cables, and four GBICs, along with QLogic’s SAN management and High Availability HBA management software. The price is not cheap, but at $9,999 the SAN Connectivity Kit 1000 puts $25,000 worth of gear into a turnkey package that makes moving to the next generation of storage architecture accessible for small to medium-sized businesses.

Switches are at the heart of any SAN, and we used multiple SANbox-8 switches in our first endeavor. For this first in a series of SAN reviews, we’re starting over with the most basic setup: a single QLogic SAN Connectivity Kit 1000 with its single SANbox switch. As is the custom with all such devices, the QLogic SANbox switch has a Java-based GUI, dubbed SANmanager. This SAN administration GUI is quite intuitive; however, as is often the custom, the Java applet runs correctly only on Solaris and Windows systems. The applet can be launched from any of the Mozilla-based browsers on Linux, but it is unable to authenticate with the SANbox host.

 
         

 

SANmanager provides the SAN administrator with a classic drill-down paradigm with which to view dynamic performance statistics for each on-line port of any SANbox switch. Performance data includes frames-in, frames-out, dropped frames, and errors. In addition, the administrator can view name server data for each device connected to the selected chassis, the type of GBIC installed in each port, and the status of each loop device connected to any port on the selected chassis.

For systems, we put three servers on our SAN. Our principal Linux test server was an HP Netserver LP 1000r, which was running Red Hat v7.3. Our secondary Linux test server was a Dell PowerEdge 2400 running SuSE v8.0. Into this mix we added another Dell PowerEdge 2400 running Windows 2000. As a welcome change, this time it was both Linux distributions that contained all the necessary drivers for our tests.

For devices in this first series of tests, openBench Labs employed two that are quite at home on SANs at high-end IT sites. What’s more, to function well these devices need precisely those characteristics that epitomize a high speed SAN: high throughput of very large data blocks and fast switching for large numbers of I/O requests.

 
Using the SANmanager administration GUI, the fabric topology can be mapped out in a logical order by dragging and dropping the icons that appear on the left frame when devices are plugged into the switch. In our initial tests we set up two servers with QLogic QLA2200 HBAs and plugged them into ports 1 and 7. The SANbox switch reports these devices, but not the systems within which the HBAs reside. The same is true for the Overland Neo Series library in port 2 and the MegaRam-2000 disk in port 8. In addition to simply mapping the icons, local names can be assigned. For openBench Labs it was much easier to recognize the library as the Overland Neo than by the name of Chaparral Technologies, which makes the library’s interface card.
         

 

Our first test device was a solid state disk (SSD)—better known as a RAM disk—from Imperial Technologies. The MegaRam-2000 that openBench Labs tested had 8GB of error correcting DRAM and two Fibre Channel ports. In addition, the MegaRam has both battery and disk-based backup storage systems in case of power failure.

SSDs are often used as alternatives to large system caches. The fundamental weakness of any such system cache is its susceptibility to cache misses. As a result, the challenge for any cache is to ensure that the needed data resides in the cache, and ensuring a reasonable chance that the right data is in cache is no small matter. Naturally the odds that the desired data bits are in cache are affected by the architecture of the cache: Is a least-frequently-used algorithm implemented? How are disk addresses mapped into the cache’s address space? How large is the cache? Nonetheless, the odds are equally affected by the nature of the data: How frequently do users access data hot spots? How large are the hot spots relative to the entire data set? As a result, doubling the size of a memory cache often improves performance by only a modest percentage.
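The interplay between cache size and hot-spot size can be explored with a small simulation. The following is a minimal sketch, not a model of any product discussed here: it assumes a least-recently-used (LRU) replacement policy, chosen only because it is simple to write down, and every number in the workload (block counts, hot-spot size, share of accesses hitting the hot spot) is hypothetical.

```python
from collections import OrderedDict
import random


def lru_hit_rate(accesses, cache_blocks):
    """Replay a sequence of block numbers through an LRU cache; return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(accesses)


def make_workload(n_accesses, total_blocks, hot_blocks, hot_fraction, seed=1):
    """Hypothetical workload: hot_fraction of accesses land in a small hot spot."""
    rng = random.Random(seed)
    workload = []
    for _ in range(n_accesses):
        if rng.random() < hot_fraction:
            workload.append(rng.randrange(hot_blocks))
        else:
            workload.append(rng.randrange(total_blocks))
    return workload


workload = make_workload(n_accesses=20000, total_blocks=100000,
                         hot_blocks=500, hot_fraction=0.8)
for size in (1000, 2000, 4000):
    print(size, round(lru_hit_rate(workload, size), 3))
```

Changing hot_blocks and hot_fraction shows how strongly the payoff from a larger cache depends on the data rather than on the cache itself.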

Three classic applications that are often targets of caching are web content serving, high-transaction-rate OLTP databases, and large OLAP data warehouses used in business intelligence applications. For databases, transaction logs and indices are ideal structures to be put on an SSD. In particular, a transaction log records all database inserts, deletes, and updates, and therefore, in an OLTP scenario, it governs the throughput of the database.

As a result, the number of I/Os per second that can be processed is crucial. With the MegaRam-2000 there is naturally no rotational latency to be measured in milliseconds, and data access time is a blinding 0.035ms, two orders of magnitude faster than a RAID array. From our Windows 2000 server, we were able to process just under 10,000 I/Os per second against the MegaRam-2000.
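To put the access time in perspective, the arithmetic below converts it into a ceiling on I/Os per second for requests issued strictly one at a time. It is a back-of-the-envelope sketch: the 3.5ms RAID figure is an assumption derived only from the "two orders of magnitude" comparison, and the calculation ignores HBA, switch, and operating system overhead, which presumably accounts for the gap between this ceiling and the measured rate.

```python
def max_serial_iops(access_time_ms):
    """Upper bound on I/Os per second when each request waits for the previous one."""
    return 1000.0 / access_time_ms


ssd_ceiling = max_serial_iops(0.035)   # access time quoted for the MegaRam-2000
raid_ceiling = max_serial_iops(3.5)    # assumed: 100 times slower than the SSD

print(round(ssd_ceiling))    # device-only ceiling for the SSD
print(round(raid_ceiling))   # device-only ceiling for the assumed RAID array
```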

 
Linux had a distinct edge in streaming throughput from the MegaRam-2000. This is a function of the way the Linux I/O subsystem bundles I/O requests into the largest possible reads (128KB) whenever possible. As a result, when streaming data from a file, all reads are effectively done at the 128KB size, even when much smaller requests are issued. At around 105MB per second, the Linux server and the MegaRam-2000 had essentially saturated the connection’s bandwidth. When we ran our disk benchmark simultaneously on the Windows 2000 Server and the Linux server, throughput was on the order of 40MB per second and 75MB per second respectively, for a total throughput of 115MB per second or 920Mb per second. Drilling down on the SANbox switch shows the performance of each port on the switch: here, the frames being sent while our streaming disk benchmark runs simultaneously on our Windows 2000 Server (port 1) and Linux server (port 7) against the MegaRam-2000 (port 8).
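The bundling behavior described above can be illustrated with a short sketch. This is not the Linux kernel's code; it is a simplified, hypothetical coalescing routine that merges contiguous reads up to the 128KB ceiling cited in the text, and the function name and request format are invented for the example.

```python
MAX_REQUEST = 128 * 1024  # 128KB ceiling cited in the text


def coalesce(requests, max_bytes=MAX_REQUEST):
    """Merge contiguous (offset, length) reads into requests of at most max_bytes."""
    merged = []
    for offset, length in sorted(requests):
        if merged:
            last_offset, last_length = merged[-1]
            contiguous = last_offset + last_length == offset
            if contiguous and last_length + length <= max_bytes:
                # Extend the previous request instead of issuing a new one.
                merged[-1] = (last_offset, last_length + length)
                continue
        merged.append((offset, length))
    return merged


# Sixty-four sequential 4KB reads (256KB in total) collapse into two 128KB requests.
small_reads = [(i * 4096, 4096) for i in range(64)]
print(coalesce(small_reads))
```

Reads that are not contiguous are left as separate requests, which is why the benefit shows up in streaming workloads rather than in random I/O.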
         

 

At the other end of the database environment, fast I/O gives way to the streaming of large blocks of data in a data warehousing business intelligence scenario. Here the goal is analytical processing rather than transaction processing. The basic technique in OLAP is to build multidimensional cubes, which are enormous sparse matrices, and I/O is streamed in very large block sizes. From our Linux server, we were able to sustain streaming reads off of the MegaRam-2000 on the order of 100MB per second, which essentially saturated our 1Gb SAN.
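The throughput figures quoted above are easy to cross-check. The sketch below converts megabytes per second into megabits per second and compares the result with the payload ceiling of a 1Gb Fibre Channel link. The 1.0625 Gbaud line rate and the 8b/10b encoding are background facts about 1Gb Fibre Channel that are not taken from this article, and framing overhead is ignored.

```python
def mbytes_to_mbits(mb_per_s):
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


windows_mb, linux_mb = 40, 75          # simultaneous benchmark results
total_mb = windows_mb + linux_mb
print(total_mb, mbytes_to_mbits(total_mb))

# Assumed link parameters for 1Gb Fibre Channel (not stated in the article).
LINE_RATE_BAUD = 1.0625e9              # raw signalling rate
DATA_BITS, LINE_BITS = 8, 10           # 8b/10b: 8 data bits per 10 line bits

payload_mb = LINE_RATE_BAUD * DATA_BITS / LINE_BITS / 8 / 1e6
print(payload_mb)                      # ceiling in MB per second before framing
print(round(100 / payload_mb, 2))      # share of that ceiling used at 100MB per second
```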

The ability to stream reads or writes from a server is also critical in tape backup scenarios. For this reason, a SAN is a natural environment for a tape library. To this end we changed the interface on the Overland Neo Series library, which we had previously tested as a SCSI device, and reran our benchmarks on the Fibre Channel SAN.

 
As with streaming throughput off of the MegaRam, streaming data to the Quantum SDLT drive was more effective with the Linux server on the SAN. With the Fibre Channel interconnect, data throughput improved 10% on the SAN for Linux.
         

 

The Neo Series utilizes Quantum’s SDLT drives, which have a very fast transport speed—116 inches per second when writing—and therefore thrive only when a maximal amount of data can be streamed to the device. Starving an SDLT of data to write can be very costly in repositioning time.

With the SDLT drives, as when streaming data off of the MegaRam-2000, our Linux server showed a distinct edge in performance. With the Linux server using 128KB blocks, the drive’s throughput over Fibre Channel on the SAN improved by 10% relative to SCSI. Windows 2000, which utilizes 64KB blocks, showed no throughput improvement versus SCSI.

The purpose of a SAN is to create a network fabric of storage devices. The goal is to provide multiple high-speed paths to optimally access devices and maintain a high level of availability. To achieve this, multiple switches are absolutely necessary.

As at most sites that begin building a SAN, the immediate first need will likely be to expand the number of user ports beyond the eight available in the QLogic chassis packaged with the SAN Connectivity Kit 1000. When planning for expansion, there are three basic multi-chassis topologies that can be built using SANbox switches. These topologies are the basic cascade and mesh, along with what QLogic dubs “Multistage.”

 
Drilling in on the switch during the tape benchmarks gives an interesting look at the need to stream data to the SDLT and the results of streaming incompressible data. Data going from the HP Netserver (port 7 green) and into the Neo Series library (port 2 blue) shows a steady flow of data frames and no frames dropped (yellow) or errors (red). Please also note that the scales for frames in, out, dropped, and in error are all independent on each port. In our test, the scale for frames in on port 7 (green) from the server is two orders of magnitude greater than that for the frames going out (blue) back to the server. This scaling is reversed on port 2, which is passing data to the drive. When incompressible data is sent to the drive, the drive is unable to stream smoothly and the frame traffic becomes much more erratic.

The critical caveat is that you cannot mix the topologies in the same fabric. As a result, expansion needs to be planned. Here, as in any network, the issues are bandwidth between switches, routing over a minimum number of switched paths to minimize latency, and efficient utilization of the number of physical ports.

The simplest multi-switch topology to implement is a cascade. As the name implies, in a cascade configuration switch chassis are conceptually connected in a row, one after the next, much as Ethernet hubs and switches are cascaded. Not surprisingly for a Fibre Channel SAN, the cascade configuration can optionally sport a connection from the last switch back to the first to form a continuous loop. Among its advantages, a loop provides an alternate failover path when only single-port connections are used between switches.

The fundamental problem for a site implementing a cascade topology, which is only partially alleviated with a looped cascade, is dealing with the latency that can be induced by excessive routing. In a cascade topology, each switch will route traffic in the direction of the least number of switch hops. Latency to any port on the same switch is defined as 1 switch hop. Latency to any port on an adjacent switch is 2 hops, again counting the source switch.

As a result, the furthest device in a fabric with n cascaded switches may require n hops from switch-to-switch. Adding a simple loop reduces that number to (n+1)/2 or (n/2)+1, depending on whether n is odd or even. Nonetheless, with a large number of switches, even that reduced number of hops could easily introduce some complicated latency issues.
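The hop counts above can be checked by brute force. The sketch below is purely illustrative: it models switches as numbered nodes, builds plain and looped cascades, and uses a breadth-first search to find the worst-case hop count, counting the source switch as the first hop as the text does. The function names are invented for the example.

```python
from collections import deque


def cascade_links(n, looped=False):
    """Links for n switches in a row, optionally closing the loop back to the first."""
    links = [(i, i + 1) for i in range(n - 1)]
    if looped and n > 2:
        links.append((n - 1, 0))
    return links


def max_hops(n, links):
    """Worst-case hop count over all switch pairs, counting the source switch as hop 1."""
    neighbors = {i: set() for i in range(n)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    worst = 1
    for start in range(n):
        hops = {start: 1}
        queue = deque([start])
        while queue:
            here = queue.popleft()
            for nxt in neighbors[here]:
                if nxt not in hops:
                    hops[nxt] = hops[here] + 1
                    queue.append(nxt)
        worst = max(worst, max(hops.values()))
    return worst


for n in range(2, 9):
    plain = max_hops(n, cascade_links(n))
    looped = max_hops(n, cascade_links(n, looped=True))
    print(n, plain, looped)
```

Both cases of the looped formula reduce to the integer expression n // 2 + 1, which is what the search reports.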

To overcome these routing issues, a mesh fabric can be woven by connecting each switch to every other switch. In a mesh topology the maximum number of routing hops to any device is always two. It should be noted that in a fabric with only two or three chassis, a looped cascade and a mesh topology are exactly the same. This was the approach taken by openBench Labs.
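The equivalence of the two topologies for small fabrics can be confirmed by comparing their link sets. This is a minimal sketch with invented function names; switches are simply numbered from zero, and each link is represented as an unordered pair.

```python
from itertools import combinations


def mesh_links(n):
    """Every switch linked directly to every other switch."""
    return {frozenset(pair) for pair in combinations(range(n), 2)}


def looped_cascade_links(n):
    """Switches in a row, with the last linked back to the first."""
    links = {frozenset((i, i + 1)) for i in range(n - 1)}
    if n > 2:
        links.add(frozenset((n - 1, 0)))
    return links


for n in range(2, 6):
    same = mesh_links(n) == looped_cascade_links(n)
    print(n, len(mesh_links(n)), len(looped_cascade_links(n)), same)
```

Because a mesh links every pair of switches directly, no route ever needs more than the source switch and the destination switch, which is where the two-hop maximum comes from.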

Whether in a cascade or mesh SAN topology, any port on a SANbox can be either a user port (in QLogic parlance, a port connected to a user device such as a server or a tape library) or a T_Port, which is used to connect one switch to another. Each port on the SANbox switch will detect whether it is connected to a device or another SANbox port and automatically configure itself as either a user port or a T_Port. When ports are configured as T_Ports, the SANbox guarantees in-order delivery of packets with any number of T_Port links between switches.

A mesh topology immediately addresses the issue of device latency brought about by hopping from switch to switch in the SAN. There are, however, the twin issues of bandwidth between switches and efficient utilization of the number of physical ports, which we conveniently ignored up to this point.

Each T_Port link between directly connected SANbox switches provides 100MB per second of bandwidth between those switches. In the case of the openBench Labs SAN, we had two Linux servers and one Windows 2000 server connected to a single 8-port switch. Each server has a single QLogic QLA2200 Fibre Channel HBA capable of providing 100MB per second of throughput. In theory, and as later demonstrated in practice, we should be able to push 200MB per second of total throughput through the SAN.

For our SAN topology, the worst-case scenario is therefore the situation where two servers are connected to one switch and simultaneously each tries to access a device that is connected to a second switch. In order to avoid a bottleneck in throughput between the switches, we need to provide for 200MB per second of bandwidth between those two switches. In other words, we must devote two ports on each of the switches as T_Ports to provide as much bandwidth between interconnected chassis as would be available were the devices and servers all connected to a local switch.
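The sizing rule in this worst case amounts to a single division rounded up. The sketch below is a hypothetical helper, not anything supplied by QLogic; it assumes every server on a switch may drive its HBA at full rate toward a device on the neighboring switch at the same time.

```python
import math


def t_port_links_needed(servers_on_switch, hba_mb_per_s=100, link_mb_per_s=100):
    """T_Port links required so inter-switch bandwidth matches the servers' combined rate."""
    return math.ceil(servers_on_switch * hba_mb_per_s / link_mb_per_s)


# Two servers at 100MB per second each need two 100MB-per-second T_Port links.
print(t_port_links_needed(2))
```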

In the openBench Labs scenario, this severely limits the scalability of the SAN mesh fabric. To provide consistent 200MB-per-second bandwidth for our servers, two ports on each switch must be devoted to each interconnection. A mesh fabric with four switches requires each switch to reserve six ports for T_Port connections to the other three switches. With our current 8-port SANbox switches, that scheme effectively creates the analog of a single, but geographically distributed, 8-port switch, as each of the four switches contributes just two user ports.
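The port budget for a mesh can be tabulated directly. The following sketch uses an invented helper and takes the 8-port switch and two links per interconnection from the scenario above as defaults; other values are free to vary.

```python
def mesh_port_budget(switches, ports_per_switch=8, links_per_pair=2):
    """Return (T_Ports per switch, user ports per switch, total user ports) for a mesh."""
    t_ports = links_per_pair * (switches - 1)
    user_ports = ports_per_switch - t_ports
    if user_ports < 0:
        raise ValueError("not enough ports on each switch for this mesh")
    return t_ports, user_ports, user_ports * switches


for k in range(1, 6):
    print(k, mesh_port_budget(k))
```

Under these assumptions the total tops out at twelve user ports with two or three switches, falls back to eight with four switches, and reaches zero with five.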

While a looped cascade topology does not have the scalability issue of a geometrically expanding number of T_Ports, there is a unique bandwidth problem for such a topology. A switch in a looped cascade topology divides its interconnection bandwidth, effectively directing half of the bandwidth in each direction around the loop. That’s because the routing algorithm strictly looks at the least number of hops to the desired destination. For a small SAN with just two or three switches, the topology isomorphism between mesh and looped cascade makes these bandwidth differences moot.

In upcoming issues, openBench Labs will continue the foray into weaving a more complex SAN fabric. We’ll also move up to a 2Gb SAN and integration of InfiniBand switches into a SAN environment.