I/O, I/O: A NEW CLASS OF LINUX THROUGHPUT

Mix 2 parts Hyper-Threading with 1 part PCI-X and you have an I/O accelerant that delivers 3,400 IOPS and 266MB of data per second!
by Jack Fegreus, April 28, 2003
|
At a computer conference a few months ago, I found myself arguing the merits of Linux with a group of very intractable naysayers. Just getting them to accept Linux for web and e-mail serving was a significant challenge. The mere thought of using Linux for enterprise-class database-driven applications drew condescending scorn. It would be a very long wait, they said, before Linux could drive thousands of I/O operations per second. Try as I might, there was simply no way to convince these Unix advocates that a new class of hardware was on the way to revolutionize PC servers. Nonetheless, the wait is over.

One of the leading candidates for poster child of this new class of servers is none other than the 1U Appro 1224Xi server, which openBench Labs began testing just weeks ago. Built on Intel's E7501 chipset with its 533MHz front side bus, and sporting dual Intel Xeon CPUs, up to 12GB of ECC DDR266 memory, two gigabit Intel Ethernet ports, and a 64-bit 133MHz PCI-X expansion slot, this server has sufficient processing power to cause a seismic disturbance, let alone analyze the disturbance data. All of this processing power is built on Intel's NetBurst microarchitecture, which provides an astonishing theoretical direct memory access (DMA) bandwidth of 4.27GB per second, and on multiple PCI/PCI-X bridges that can individually provide 1.07GB per second of peak I/O bandwidth.

When testing the processing power of the Appro 1224Xi server, we quickly learned that exploiting the new Hyper-Threading Technology (HTT), which enables a Xeon CPU to execute instructions from different threads in parallel as if it were two independent processors, required that we recompile our benchmarks with Intel's C++ compiler for Linux, version 7.0. This compiler can detect patterns of sequential data accesses by the same instruction and transform that code for Single Instruction Multiple Data (SIMD) execution by a Xeon CPU.
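Both peak figures quoted above fall out of simple bus arithmetic: clock rate times transfers per clock times bus width. The sketch below reproduces them under two assumptions that the text does not spell out: the 533MHz front side bus is treated as a 133.33MHz clock making four 64-bit transfers per cycle, and the PCI-X slot as a 64-bit bus making one transfer per 133.33MHz cycle. The function name is purely illustrative.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_bits, transfers_per_clock=1):
    """Theoretical peak in decimal GB/s: clock x transfers per clock x width in bytes."""
    bytes_per_transfer = bus_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# Assumption: 533MHz FSB = 133.33MHz clock, four transfers per clock, 64-bit data path.
fsb_gb_s = peak_bandwidth_gb_s(133.33, 64, transfers_per_clock=4)

# Assumption: 64-bit PCI-X at 133.33MHz, one transfer per clock.
pcix_gb_s = peak_bandwidth_gb_s(133.33, 64)

print(f"FSB peak:   {fsb_gb_s:.2f} GB/s")
print(f"PCI-X peak: {pcix_gb_s:.2f} GB/s")
```

These are ceilings, not measurements; sustained rates on real hardware will be lower.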
|
In this new series of tests, we went beyond CPU processing power and looked at the ability of the dual-Xeon Appro server to handle large amounts of I/O. To this end, we installed an Adaptec 39320 series Ultra320 SCSI controller in the server's open PCI-X slot and connected the controller to four Maxtor Atlas 10K IV drives. These drives feature Maxtor's second-generation Ultra320 SCSI interface. Of particular note, that interface incorporates Maxtor's MaxAdapt circuitry, which dynamically attempts to improve signal quality on the SCSI bus by amplifying signal frequencies while simultaneously filtering noise. Such attention should not be too surprising, as the quality of electrical signals in any Ultra320 SCSI installation is of the utmost importance.

Ultra320 SCSI introduces major protocol changes to reduce overhead and improve performance enough to support a data burst rate of 320MB per second. One of the more interesting improvements comes in the form of a SCSI packet protocol. Dubbed "packetized SCSI," this enables transferring multiple commands in a single connection along with data and status information. In contrast, Ultra160 SCSI transfers data during a synchronous phase at 160MB per second, but transfers commands and status information during slower asynchronous phases. What's more, command and status information are restricted to a single transfer per connection. As a result, Ultra320 SCSI does a much better job of maximizing bus utilization and minimizing command overhead. These performance improvements in SCSI controllers, however, introduce a very real performance problem for systems: Ultra320 SCSI's faster I/O performance rapidly saturates a standard 64-bit PCI bus. Don't even think about trying to use a 32-bit PCI bus.
As openBench Labs discovered in early attempts to test 2Gbit-per-second Fibre Channel devices, using 64-bit 66-MHz PCI controllers is a great way to feel good about implementing new technology while running no faster than the equipment being replaced. If you want to reap the benefits of Ultra320 SCSI, you'll need to use a server with at least one 100-MHz PCI-X slot. That means you'll need a server with a Xeon CPU running a version of Linux based on nothing earlier than the Linux 2.4.18 kernel. The important takeaway here is that you need a new release of the operating system with the latest drivers.

Just consider the Appro 1224Xi server, which has by far the simplest internal architecture of any Xeon-based server that openBench Labs has yet tested. With a very standard Ultra100 ATA I/O subsystem, it's difficult to find a new Linux distribution that will not install on this server. Where you may run into trouble is in not having a network driver for the Intel 82546 Gbit-LAN controller included in your distribution. That can easily be solved, however, by putting a 100Mbit-LAN controller in the system and downloading the latest Intel e1000 driver from SourceForge.

All of the new technology being introduced into these Xeon-based servers, however, introduces a very interesting conundrum. All of this advanced hardware is driving sophisticated enterprise-level capabilities into what has become known as the "commodity-server" market, the buzzword for IA32 servers. Now, combine the pace of hardware improvements with the pace of improvements in Linux and you have the reason why many "enterprise-class" software products are being ported to Linux.
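To put rough numbers on the bus-saturation argument, the sketch below compares the 320MB-per-second Ultra320 burst rate with the theoretical peak of several PCI and PCI-X bus types. The clock and width values in the table are the nominal ones for each bus type and are assumptions here, as is the `channels` parameter, which simply lets you model more than one SCSI channel bursting at once.

```python
ULTRA320_BURST_MB_S = 320  # per-channel burst rate quoted in the article

# (clock in MHz, width in bits): nominal values, assumed for illustration
BUSES = {
    "PCI 32-bit/33MHz": (33.33, 32),
    "PCI 64-bit/33MHz": (33.33, 64),
    "PCI 64-bit/66MHz": (66.66, 64),
    "PCI-X 64-bit/100MHz": (100.0, 64),
    "PCI-X 64-bit/133MHz": (133.33, 64),
}


def peak_mb_s(clock_mhz, bus_bits):
    """Theoretical peak in MB/s, one transfer per clock."""
    return clock_mhz * bus_bits / 8


def bus_share(channels=1):
    """Fraction of each bus's peak consumed by `channels` Ultra320 bursts."""
    demand = channels * ULTRA320_BURST_MB_S
    return {name: demand / peak_mb_s(*spec) for name, spec in BUSES.items()}


for name, share in bus_share().items():
    verdict = "saturated" if share >= 1 else "headroom"
    print(f"{name:22s} {share:5.0%} of peak ({verdict})")
```

Peak figures are shared by every device on a bus segment and ignore protocol overhead, so real headroom is smaller than these ratios suggest.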
|
While the pace of change may be attracting the attention of enterprise-software companies, the pace of change is also the bane of their existence. Every time a change takes place in the infrastructure on which their software sits, these companies must then qualify that change. So while fast change is attracting enterprise-software vendors to Linux, the first thing these vendors want to do is slow the pace of that change. Enter the "more stable" enterprise server versions of Linux distributions. More often than not, these Linux distributions support multiple architectures including IA32, IA64, Alpha, PowerPC, and now AMD's Opteron. More often than not, these enterprise-server distributions will be updated only annually. More often than not, however, these servers will absolutely need the latest device drivers.
|
Fortunately for IA32-based servers, standard "desktop" Linux distributions remain free to sport the latest-and-greatest improvements for systems built on 32-bit Intel architecture at intervals of six months or less. Thanks to the convergence of low-end servers with high-end workstations, this should remain the case for some time to come. Only because of this phenomenon were we able to load Linux on all of our Xeon-based systems. All of these servers came with an installation CD for Windows 2000 Server, which would otherwise be lost in a forest of unrecognizable internal hardware. None of our servers came with a CD to help load Linux.
|
Even the Adaptec Ultra320 SCSI controller came with drivers for Windows 2000, Windows XP, Netware, OpenUNIX, and Unixware, but no Linux driver. As it turned out, that was the only driver needed for these tests that was not included in the SuSE 8.2 distribution. To obtain that driver, we had to surf over to Adaptec's web site and download it. While essential for openBench Labs, finding a driver for Emulex's PCI-X Fibre Channel host bus adaptor included in the distribution will not likely thrill the typical customer for version 8.2 of SuSE Linux Professional. Nonetheless, the example of its presence goes a long way to demonstrate the readiness of this distribution for enterprise applications. There are several other areas, however, where the changes in the new version of SuSE should find broader appeal.
|
First and foremost, there are the very visible improvements to the KDE desktop in version 3.1. Not only is the look and feel more solidly professional, but menu structures and feature integration are now second to none. Everything is right there at the fingertips. Of particular note is the KDE Control Center, which now includes all of SuSE's YaST2 modules for system configuration. Access to these modules requires knowledge of the root password. For many, the most important new feature will be the integration of Samba for Windows-style desktop sharing. With KDE 3.1 (as with Gnome on Red Hat 8.0), Windows and Samba file shares can now be discovered and mounted without invoking a special Samba client application, such as LinNeighborhood. Installing the KDE Desktop sharing framework, which must be explicitly installed even when the Samba package is selected, integrates the command smb://<workgroup/domain name or node name> into Konqueror's desktop command lexicon.
|
To begin the task at hand, which was to stress the I/O capabilities of the Appro 1224Xi server, we started by recompiling our benchmarks with version 3.3 of GNU C. This provided an improvement of about 9% in the CPU performance measured by the oblCPU benchmark compared to what we had measured using version 3.2 of GNU C.
|
The first benchmark that we ran for I/O measured memory bandwidth. Memory latency has become a major bottleneck for achieving high performance in many applications. To this end, front side bus (FSB) speeds have risen at a rapid rate in an attempt to take advantage of DDR memory technology. Given that astonishing theoretical memory bandwidth of 4.27GB per second, which is derived from the clock speed of the front side bus, we fully anticipated that to meet its potential, we would have to recompile our benchmark with the new version of the Intel C++ compiler.

Our hunch was correct. On 4-byte strides through memory, throughput was over 35% greater using the benchmark compiled with Intel C++ version 7.0 when compared to the version compiled with GNU C 3.3. As stride size increased, the difference between the benchmark versions rapidly collapsed. By the time strides were greater than 32 bytes, the difference was statistically insignificant. Compared to a dual-processor system based on AMD's 760MP chipset clocked at 1.2GHz and also equipped with 1GB of PC2100 DDR memory, the dual Xeon-based Appro 1224Xi with Hyper-Threading was in a class by itself when it came to memory bandwidth. Extrapolating these results to disk I/O, we fully expected that DMA transfers from the Adaptec Ultra320 disk controller would also be improved when running a benchmark compiled with Intel C++ version 7.0.
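For readers unfamiliar with stride tests, the sketch below shows the kind of access pattern involved: one read every `stride` bytes across a buffer, timed per stride size. It is not the openBench Labs benchmark, whose source the article does not show, and all names are hypothetical. In Python the interpreter overhead dominates, so the rates it reports say nothing about memory bandwidth or compiler vectorization; only the shape of the loop is the point.

```python
import time


def stride_read(buf, stride):
    """Read one byte every `stride` bytes; return (reads, checksum)."""
    view = memoryview(buf)
    checksum = 0
    reads = 0
    for offset in range(0, len(view), stride):
        checksum += view[offset]
        reads += 1
    return reads, checksum


def time_strides(size_bytes=1 << 20, strides=(4, 8, 16, 32, 64)):
    """Time one pass per stride size; return {stride: reads per second}."""
    buf = bytearray(size_bytes)
    rates = {}
    for stride in strides:
        start = time.perf_counter()
        reads, _ = stride_read(buf, stride)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates[stride] = reads / elapsed
    return rates
```

Calling `time_strides()` returns one rate per stride size, which is the raw material for a throughput-versus-stride curve like the one discussed above.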
|
If throughput improved when our benchmarks were compiled with Intel C++, we couldn't measure it. For both our disk I/O throughput benchmark and our transaction-processing load benchmark, the results when compiled with GNU C and Intel C++ were statistically identical.

Configuring the Adaptec controller's BIOS on our Linux server, we did run into one significant problem. The Adaptec controller has two operational modes: normal SCSI and "host-based" RAID. Under host-based RAID, a RAID level 0 or 1 array is created in the controller's firmware. Whenever we enabled host-based RAID in the BIOS, the Linux driver failed to recognize the controller. This function worked under Windows 2000 Server, and maximum throughput from a host-based RAID 0 array was measured at 245MB per second using 64KB reads.
|
The results of the disk throughput benchmark were an interesting parallel to earlier benchmarks performed on a dual-CPU HP NetServer 1000. That system had dual 1.26GHz P-III CPUs and used an external Adaptec RAID subsystem with Ultra160 SCSI drives. The relative performance of arrays formatted with ext3FS, ReiserFS, XFS, and JFS was similar this time. There were no real surprises in the way performance scaled. The surprise was in the near linear doubling of performance, which was right on spec. In particular, the Maxtor Atlas drives are rated at a sustained throughput of 72MB per second, which gives four drives an ideal aggregate of 288MB per second. Off of our 4-drive stripe set, when formatted as XFS, we peaked with a sustained throughput rate of 266MB per second, or about 92% of that ideal.

The other key specification for the Atlas drives is a fast 4.3ms seek time. That plus the low latency of the Adaptec controller would be critical for our final benchmark: I/O loading. This benchmark runs multiple parallel threads which simulate user requests for data in a database environment with varying transaction rates. The ability of a system to complete more I/O requests as more I/O process threads are loaded is dependent upon memory caching and thread load balancing. The results of this test are frequently disappointing on Linux, because the current kernel has no specialized support for asynchronous I/O, which is a strong point of Windows 2000. During the test, background thread processes are dispatched that issue their own unique 8KB data requests in a manner that simulates the access patterns of a relational database. This continues until the average access time exceeds 100 milliseconds. The results of oblLoad can then be evaluated in terms of total I/Os per second versus the total number of I/O-issuing processes, or in terms of the response time versus the number of I/Os per second being processed.
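The load-ramp procedure just described can be sketched in a few lines. This is a simplified illustration and not oblLoad itself: the function names, the one-thread-at-a-time ramp, the uniformly random offsets, and the number of requests per worker are all assumptions. It uses ordinary synchronous reads in threads, and on a small test file most reads will be served from the page cache, so it shows the control loop and not real disk behavior.

```python
import os
import random
import threading
import time

BLOCK = 8 * 1024   # 8KB requests, as in the article
LIMIT_S = 0.100    # stop once average access time exceeds 100 ms


def run_step(path, workers, requests_per_worker=50):
    """Run `workers` threads of random 8KB reads; return (iops, avg_latency_s)."""
    blocks = os.path.getsize(path) // BLOCK
    latencies = []
    lock = threading.Lock()

    def worker(seed):
        rng = random.Random(seed)
        local = []
        with open(path, "rb", buffering=0) as f:
            for _ in range(requests_per_worker):
                offset = rng.randrange(blocks) * BLOCK
                t0 = time.perf_counter()
                f.seek(offset)
                f.read(BLOCK)
                local.append(time.perf_counter() - t0)
        with lock:
            latencies.extend(local)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(latencies) / elapsed, sum(latencies) / len(latencies)


def ramp(step, max_workers=64, limit_s=LIMIT_S):
    """Add one worker at a time until average latency exceeds `limit_s`.

    `step(n)` must return (iops, avg_latency_s) for n workers.
    Returns a list of (workers, iops, avg_latency_s) tuples.
    """
    results = []
    for n in range(1, max_workers + 1):
        iops, avg = step(n)
        results.append((n, iops, avg))
        if avg > limit_s:
            break
    return results
```

A run against a real file would look like `ramp(lambda n: run_step("/path/to/testfile", n))`, where the path is a placeholder. The resulting tuples can be plotted either as IOPS versus worker count or as response time versus IOPS, the two views mentioned above.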
|
On the Appro server, with no hardware boost from the Adaptec controller (remember, the Linux driver could not recognize the device when we enabled host-based RAID on the controller), we were able to process an I/O load (IOPS) that was more than double the largest I/O load previously tested. What's more, the response times, which averaged about 18ms, were half the maximum response times measured in earlier tests with dual P-III systems and Ultra160 SCSI hardware-based RAID. A year ago, these results would only have been possible on a very large, very expensive, and very proprietary server. Now the combination of Hyper-Threading on the Xeon CPU and 133MHz PCI-X buses on "commodity servers" dramatically opens the possibilities for hosting seriously large and seriously mission-critical applications on a seriously low-cost Linux server. Just watch out for floating-point underflow when calculating the TCO assessment.
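One way to get a feel for those numbers is Little's law, which relates throughput, response time, and the number of requests in flight. The sketch below applies it to the 3,400 IOPS headline figure and the roughly 18ms average response time. It assumes the system was in steady state and that both figures describe the same load point, neither of which the article states.

```python
def outstanding_requests(iops, avg_response_s):
    """Little's law: mean requests in flight = throughput x mean response time."""
    return iops * avg_response_s


# Assumption: 3,400 IOPS and 18 ms were observed at the same load level.
in_flight = outstanding_requests(3400, 0.018)
print(f"about {in_flight:.0f} requests in flight")
```

Under those assumptions, the server was keeping on the order of 60 I/O requests outstanding at once across the four-drive stripe set.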