BATTLE OF THE I/O HEAVYWEIGHTS

Can a server with a narrow, low-power I/O bus designed to work well in embedded and desktop systems go toe-to-toe with a Xeon server? To find out, openBench Labs puts two servers, one Opteron and one Xeon, head-to-head.

by Jack Fegreus, December 29, 2003

Xeon-based computers today represent the quintessential IA-32 server system. As CPUs like Intel's Xeon follow Moore's Law and become exponentially more powerful, system architects have to deal with greater integration of very fast, high-volume inter-processor data traffic along with the integration of I/O functions such as USB and Ethernet into core logic components.

Currently there are two chipsets from Intel for building a Xeon-based system around what Intel dubs the "NetBurst microarchitecture": the E7501 and the E7505. Another family of chipsets that is popular among systems vendors, including IBM and HP, is the ServerWorks Grand Champion series from Broadcom. At the heart of all of these chipsets for Intel's 32-bit architecture (IA-32), there are three major components: a Memory Controller Hub (MCH), an I/O Controller Hub for legacy I/O, and 64-bit PCI/PCI-X hubs for PCI/PCI-X bus expansion.

The E7501 chipset, which is used in the Appro 1224Xi server tested by openBench Labs, is tailored for dual-processor configurations. At the heart of this chipset, the MCH provides a 533MHz front side bus interface for the processors; a memory controller, which can deliver 4.27GB per second using 266MHz (PC2100) DDR memory; a hub interface for legacy I/O; and three high-performance hub interfaces for PCI/PCI-X bridges, each of which provides 1.066GB per second of peak I/O bandwidth for a total of 3.2GB per second.

Bigger, better, faster is all very cool for IT buyers; however, it creates a seriously hot problem for system architects and board-level designers. All of this higher component integration increases the number of power and ground pins needed to provide sufficient current to bring all of the buses in and out of chipset packages. That's a big problem, because hand-in-hand with high pin counts come increased power consumption, increased RF radiation, and all of the accompanying issues that make it difficult to meet FCC and VDE equipment certification requirements.

What's more, multi-drop bus architectures such as PCI exhibit significant electrical and bandwidth degradation when more hardware devices are added. For example, the maximum supported clock speed of 133MHz in the PCI-X 1.0 spec must be reduced to 100MHz when a second PCI-X device is attached.

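The peak figures quoted above all follow from the same arithmetic: transfers per second times bus width. The short Python sketch below reproduces them under stated assumptions. The helper name `peak_gb_per_s` is ours, the 64-bit widths are the standard ones for these buses, and the two-channel memory configuration is an assumption chosen because it yields the 4.27GB per second figure cited for PC2100 memory.

```python
def peak_gb_per_s(mega_transfers_per_s, width_bytes, channels=1):
    """Peak bandwidth in GB/s (10^9 bytes) for a parallel bus."""
    return mega_transfers_per_s * 1e6 * width_bytes * channels / 1e9

# Front side bus: 533 million transfers/s on a 64-bit (8-byte) data path.
fsb = peak_gb_per_s(533.33, 8)

# PC2100 DDR: 266 million transfers/s per 64-bit channel.
# Two channels is an assumption that matches the quoted 4.27GB per second.
memory = peak_gb_per_s(266.67, 8, channels=2)

# Three hub interfaces at 1.066GB per second each, as quoted in the text.
hub_total = 3 * 1.066

# 64-bit PCI-X at full speed, and derated for a second device on the bus.
pcix_133 = peak_gb_per_s(133.33, 8)
pcix_100 = peak_gb_per_s(100.0, 8)
pcix_loss_pct = (1 - pcix_100 / pcix_133) * 100

print(f"FSB peak:        {fsb:.2f} GB/s")
print(f"Memory peak:     {memory:.2f} GB/s")
print(f"Hub total:       {hub_total:.1f} GB/s")
print(f"PCI-X at 133MHz: {pcix_133:.3f} GB/s")
print(f"PCI-X at 100MHz: {pcix_100:.3f} GB/s ({pcix_loss_pct:.0f}% lower)")
```

Seen this way, attaching a second PCI-X device costs roughly a quarter of the slot's peak bandwidth before any arbitration overhead is counted.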
To address these problems, HyperTransport technology departs from the longstanding Northbridge/Southbridge interconnect model. In its place, HyperTransport technology I/O links introduce the construct of flow-through tunnels, which can be daisy-chained with up to 31 tunnels in a chain. These tunnels transfer data between devices at up to 12.8GB per second using two unidirectional point-to-point links, which has the added benefit of eliminating the arbitration overhead of a shared bus. The result is a narrow, high-speed, low-power I/O bus that is capable of handling clock speeds ranging from 200 to 800MHz. Furthermore, by connecting bridges to tunnels, switch and star HyperTransport topologies are also possible.

Not only does HyperTransport technology eliminate the Northbridge role, which is played by the MCH in Intel's E7501 and E7505 chipsets, but the Opteron CPU also incorporates its own memory controller. Such a direct interface to memory can significantly reduce access latency for the processor. What's more, memory access speeds should scale directly with processor frequency. In theory, this puts potential memory throughput at the magical 12.8GB per second. Closer to reality, the potential for PC2700 DDR memory is 5.3GB per second, which is 24% greater than the 4.27GB per second theoretical throughput ceiling for the E7501 with PC2100 DDR memory.

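For readers who want to check the HyperTransport numbers, here is a minimal Python sketch of the arithmetic. It assumes a 32-bit link in each direction with two transfers per clock, which is how the 12.8GB per second aggregate is normally derived for an 800MHz link; the function name and its parameters are ours, not part of any HyperTransport tool or API. The memory comparison uses only the two figures quoted in the text.

```python
def ht_aggregate_gb_per_s(clock_mhz, link_width_bits,
                          transfers_per_clock=2, directions=2):
    """Aggregate peak bandwidth of a pair of unidirectional links, in GB/s."""
    per_direction = (clock_mhz * 1e6 * transfers_per_clock
                     * link_width_bits / 8 / 1e9)
    return per_direction * directions

# Assumed widest configuration: 32 bits each way at the top 800MHz clock.
ht_max = ht_aggregate_gb_per_s(800, 32)

# Memory ceilings quoted in the article, in GB per second.
opteron_pc2700 = 5.3
e7501_pc2100 = 4.27
advantage_pct = (opteron_pc2700 / e7501_pc2100 - 1) * 100

print(f"HyperTransport aggregate peak: {ht_max:.1f} GB/s")
print(f"PC2700 ceiling over PC2100:    {advantage_pct:.0f}%")
```

Note that the aggregate figure counts both directions at once, so traffic flowing one way sees at most half of it.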
For our Xeon-based test system featuring Intel's NetBurst microarchitecture, we used an Appro 1224Xi server. Our test server sported dual 2.4GHz Xeon CPUs and was built on a Tyan Tiger motherboard. This motherboard utilizes the Intel E7501 chipset, supports up to 12GB of PC2100 DDR memory, and integrates a number of peripherals on the board including an ATA 100 disk controller, two Intel gigabit Ethernet NICs, an ATI RageXL graphics controller, and a full-speed 133MHz PCI-X slot for I/O expansion.

To test an Opteron-based system with AMD's HyperTransport microarchitecture, we used an Appro 1122Hi server. Our test server featured dual 1.4GHz Opteron CPUs and was built on an MSI K8D-Master motherboard. The MSI K8D-Master is not to be confused with the popular MSI K8T-Neo, which uses VIA chipsets for desktop systems. The MSI K8D-Master utilizes the AMD 8131 and AMD 8111 chipsets to deliver server-class I/O. The AMD 8131 provides an I/O bus tunnel for two 64-bit 133MHz PCI-X buses. On our test system, only one was usable given the physical constraints of a 1U chassis. The AMD 8111 implements an I/O bus hub for typical desktop I/O buses including ATA 133, USB 2, and gigabit Ethernet. An ATI RageXL graphics controller and slots for up to 12GB of PC2700 DDR memory are also integrated on the MSI K8D-Master board.

We began our testing by running version 3.0 of the oblCPU benchmark suite. This suite contains 34 calculation-intense kernels that are rich in floating point arithmetic. The benchmark suite was compiled with version 7.1 of the Intel C/C++ compiler for Intel 32-bit architecture. As a result, the benchmark suite ran on the Opteron CPU via the IA-32 compatibility library that is part of the SUSE 9.0 Professional distribution with 64-bit AMD support.

Not surprisingly, the Opteron CPU performed a great deal more work per clock cycle than the Xeon CPU. While the Xeon CPU was clocked roughly 71% faster than the Opteron CPU, the geometric mean of the Xeon CPU's performance, measured relative to a 1GHz P-III processor, was less than 12% higher than that of the Opteron CPU. Interestingly, running our benchmarks via the compatibility library, the Opteron CPU's performance scaled very similarly to that of a 32-bit Athlon CPU from AMD. In general, openBench Labs has measured Athlon performance to be 10-to-15% higher than a P-III CPU at the same clock speed. In these tests, the Opteron CPU performed 16% higher than what we would expect from a P-III CPU at 1.4GHz.

In the same vein, the Opteron CPU demonstrated less variation in performance over all of the kernels in the suite than did the Xeon CPU, which ran through a few of the benchmark kernels in astonishingly fast order. This can be seen in the relatively even distribution of the 95%-confidence interval above and below the geometric mean for both the P-III and Opteron processors. The Xeon CPU, on the other hand, has most of its 95%-confidence interval above the geometric mean, indicating a number of high-end outlier data points.

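To make the work-per-clock comparison concrete, the following Python sketch shows how a geometric mean of per-kernel ratios is computed and then turns the percentages quoted above into per-GHz figures. Two things here are assumptions rather than measurements: the `example_ratios` list is made up purely to exercise the function, and the per-GHz calculation assumes that P-III performance scales linearly with clock speed, so that "16% higher than a P-III at 1.4GHz" corresponds to a score of 1.4 x 1.16 relative to the 1GHz baseline.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive per-kernel performance ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Made-up per-kernel ratios versus a baseline, for illustration only.
example_ratios = [1.2, 1.5, 2.0, 1.1]
example_mean = geometric_mean(example_ratios)

opteron_ghz, xeon_ghz = 1.4, 2.4
clock_gap_pct = (xeon_ghz / opteron_ghz - 1) * 100

# Scores relative to a 1GHz P-III, assuming linear P-III clock scaling.
opteron_score = opteron_ghz * 1.16
# "Less than 12% higher" gives an upper bound for the Xeon score.
xeon_score_upper = opteron_score * 1.12

opteron_per_ghz = opteron_score / opteron_ghz
xeon_per_ghz_upper = xeon_score_upper / xeon_ghz

print(f"Example geometric mean:  {example_mean:.3f}")
print(f"Xeon clock advantage:    {clock_gap_pct:.0f}%")
print(f"Opteron score per GHz:   {opteron_per_ghz:.2f}")
print(f"Xeon score per GHz (<):  {xeon_per_ghz_upper:.2f}")
```

Under these assumptions the Opteron delivers at least half again as much work per clock as the Xeon on this suite, which is the point the clock-speed and geometric-mean percentages make together.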
We next ran oblMemBench, which measures memory bandwidth. Memory latency has become a major bottleneck in achieving high performance for many applications. In response, front side bus (FSB) speeds on IA-32 CPUs have risen at a rapid rate to take advantage of DDR memory technology. New Intel P4 CPUs now feature an 800MHz FSB. With a theoretical memory bandwidth of 4.27GB per second, our Appro 1224Xi server with its Xeon CPUs and E7501 chipset performed better than any previous IA-32 server that we have tested.

The Opteron CPU with its integrated memory controller in the Appro 1122Hi server, however, demonstrated even greater memory throughput. Even accounting for the PC2700 memory in the Appro 1122Hi, which is clocked 25% faster than the PC2100 memory in the Appro 1224Xi, the difference is significant: memory throughput on the Appro 1122Hi exceeded that of the Appro 1224Xi by considerably more than 25%. Clearly, integrating the memory controller into the Opteron CPU was having a positive effect in lowering memory access latency.

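For readers who want a feel for what a memory bandwidth test does, here is a very rough Python sketch. It is not oblMemBench and makes no attempt to match its methodology; the function name, buffer size, and pass count are all our own choices. It simply times a large buffer-to-buffer copy and reports the best pass.

```python
import time

def copy_bandwidth_mb_s(buffer_bytes=16 * 1024 * 1024, passes=5):
    """Best-of-N bandwidth for copying one large buffer into another."""
    src = bytearray(buffer_bytes)
    dst = bytearray(buffer_bytes)
    best = float("inf")
    for _ in range(passes):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    # Each pass reads the source and writes the destination once.
    return 2 * buffer_bytes / best / 1_000_000

measured = copy_bandwidth_mb_s()
print(f"Copy bandwidth: {measured:.0f} MB/s")
```

The buffer is sized well beyond the CPU caches of the servers tested here so that the copy has to go out to main memory; on a machine with larger caches the size would need to grow accordingly.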
The results of our memory benchmark, which stresses memory bandwidth, were impressive on the Appro 1122Hi. Nonetheless, these results stemmed directly from AMD's integration of memory controller functionality into the CPU chip. We still had not stressed the external I/O capabilities involving either of the AMD HyperTransport chipsets: the AMD 8111 and the AMD 8131.

We began our I/O testing by focusing on the AMD 8111 I/O bus hub, which handles the standard desktop I/O sources including Ethernet. Using one of the server's gigabit Ethernet ports, we connected to a POPnetserver 4600 from First Internet Array. This NAS device features a 2GHz P4 processor running FreeBSD and hot-swap Ultra ATA drives. The openBench Labs test POPnetserver sported two Intel 10/100/1000 Gigabit Ethernet ports and four hot-swap 180GB parallel Ultra ATA drives, which were configured as a single RAID Level 5 volume.

To test Ethernet throughput, we imported a directory tree from the POPnetserver that was shared over NFS and ran our oblDisk benchmark to read and write a 2GB file. The 2GB file size was chosen to minimize any local caching effects. The results from these tests were even more dramatic than the results from oblMemBench. The Opteron-based Appro 1122Hi was able to drive NFS I/O throughput over a gigabit Ethernet connection for a single read process at 200% of the throughput rate of the Xeon-based Appro 1224Xi and at 125% of the rate measured on the Appro 1224Xi for a single write process. With four read processes, running the benchmark on the Appro 1122Hi essentially saturated the capacity of a single gigabit connection.

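A sequential file throughput test of the kind described above can be sketched in a few lines of Python. This is only an illustration of the general approach, not oblDisk itself: the function name `write_then_read`, the 1MB chunk size, and the small 8MB demo file are all our own assumptions. The constant at the top gives the raw ceiling of a gigabit link before any Ethernet, IP, or NFS overhead.

```python
import os
import tempfile
import time

# Raw gigabit line rate in MB/s, before protocol overhead.
GIGABIT_CEILING_MB_S = 1_000_000_000 / 8 / 1_000_000

def write_then_read(path, total_bytes, chunk_bytes=1 << 20):
    """Return (write_mb_s, read_mb_s) for one sequential pass over a file."""
    chunk = b"\0" * chunk_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        written = 0
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # include the time to push data to storage
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    read = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                break
            read += len(data)
    read_secs = time.perf_counter() - start

    megabytes = total_bytes / 1_000_000
    return megabytes / write_secs, megabytes / read_secs

# Small demo run in a temporary directory; point `path` at an NFS mount
# and raise total_bytes to test a network share instead.
with tempfile.TemporaryDirectory() as tmp:
    write_mb_s, read_mb_s = write_then_read(
        os.path.join(tmp, "throughput.dat"), 8 * 1_000_000)

print(f"Gigabit ceiling: {GIGABIT_CEILING_MB_S:.0f} MB/s")
print(f"Write: {write_mb_s:.0f} MB/s, read: {read_mb_s:.0f} MB/s")
```

With a file this small the read pass will almost certainly be served from the local page cache, which is exactly the effect a 2GB test file is meant to minimize.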
While the results were not quite as dramatic, when we turned to Ultra320 SCSI disk throughput using each server's 133MHz PCI-X expansion slot, the Appro 1122Hi with its HyperTransport microarchitecture once again rose to the forefront. For these I/O tests we used an Adaptec 39320D-R controller and created a 4-drive RAID 0 stripe set using Maxtor Atlas 10K IV drives. During reads, I/O throughput on the Opteron-based system was consistently greater for both single and multithreaded read requests. While running oblDisk, we measured the average read throughput to be about 20% higher on the Appro 1122Hi compared to the Appro 1224Xi. On writes, however, the tables were turned. On a single write process, the Xeon-based system was the one to provide about 20% greater throughput.

The bottom line throughout all of the tests was a consistently measurable edge in throughput with the Opteron-based Appro 1122Hi server in a pure IA-32 applications environment.