BATTLE OF THE I/O HEAVYWEIGHTS

Can a server with a narrow low-power I/O bus designed to work well in embedded and desktop systems go toe-to-toe with a Xeon server? To find out, openBench Labs puts two servers—one Opteron, one Xeon—head-to-head.

   
 
by Jack Fegreus

December 29, 2003
   
     
 

Xeon-based computers today represent the quintessential IA-32 server system. As CPUs like Intel's Xeon follow Moore's Law and become exponentially more powerful, system architects have to deal with greater integration of very fast, high-volume inter-processor data traffic along with the integration of I/O functions such as USB and Ethernet into core logic components.

 
         
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
I/O throughput using dual-processor Opteron and Xeon servers

WHAT WE TESTED

Appro 1122Hi 1U Server

Dual 1.4GHz AMD Opteron CPUs
AMD 8131 and 8111 chipsets
2GB PC2700 DDR memory
2 Broadcom 1Gbit NICs
100MHz PCI-X expansion slot

Appro 1224Xi 1U Server
Dual 2.4GHz Intel Xeon CPUs
Intel E7501 chipset
1GB PC2100 DDR memory
2 Intel 1Gbit NICs
133MHz PCI-X expansion slot


HOW WE TESTED

SUSE Linux 9.0 Professional for AMD 64
Linux Kernel 2.4.21
GCC 3.3.1
Support for AMD Opteron and Athlon-64

Intel C++ for Linux v7.1
Free non-commercial license





Adaptec 39320D-R
133MHz PCI-X
Ultra320 SCSI


Maxtor Atlas 10K IV drive
Ultra320 SCSI
4.3ms seek time
72MB/s data throughput



FIA POPnetserver 4600
FreeBSD
1U 2GHz P4
(4) Hot-swap Ultra ATA drives
(2) Intel 1Gbit NICs
RAID Levels 0, 1, 5


Benchmarks:
oblCPU v3.0
oblMemBench v2.0
oblDisk 2.0


KEY FINDINGS

Using DDR memory clocked 25% faster, memory throughput on the Appro 1122Hi was on the order of 50% greater.
NFS I/O throughput over gigabit Ethernet was 100% greater on reads and 25% greater on writes using the Appro 1122Hi.
Using PCI-X for Ultra320 SCSI I/O, throughput on reads was 25% greater using the Appro 1122Hi.
 

Currently there are two chipsets from Intel that implement what Intel dubs "NetBurst microarchitecture" for building a Xeon-based system: the E7501 and the E7505. Another family of chipsets that is popular among systems vendors, including IBM and HP, is the ServerWorks Grand Champion series from Broadcom.

At the heart of all of these chipsets for Intel's 32-bit architecture (IA-32), there are three major components: a Chipset Memory Controller Hub (MCH), an I/O Controller Hub for legacy I/O, and PCI/PCI-X 64-bit Hubs for PCI/PCI-X bus expansion.

The E7501 chipset, which is used in the Appro 1224Xi server tested by openBench Labs, is tailored for dual-processor configurations. At the heart of this chipset, the MCH provides a 533MHz front side bus interface for the processors; a memory controller that can deliver 4.27GB per second using 266MHz (PC2100) DDR memory; a hub interface for legacy I/O; and three high-performance hub interfaces for PCI/PCI-X bridges. Each of those three interfaces provides 1.066GB per second of peak I/O bandwidth, for a total of 3.2GB per second.
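The bandwidth figures quoted for the E7501 follow from simple bus arithmetic. The sketch below reproduces them; the two-channel, 64-bit memory layout and the 266.67 million transfers per second figure are assumptions that the article does not state, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(transfers_per_sec, bytes_per_transfer, channels=1):
    """Peak bandwidth in GB/s, taking 1 GB as 10**9 bytes."""
    return transfers_per_sec * bytes_per_transfer * channels / 1e9

# Assumption: PC2100 DDR moves about 266.67 million 8-byte transfers per
# second per channel, and the E7501 memory controller drives two channels.
e7501_memory = peak_bandwidth_gb_s(266.67e6, 8, channels=2)

# From the article: three PCI/PCI-X hub interfaces at 1.066 GB/s each.
e7501_io_total = 3 * 1.066

print(f"E7501 memory peak:       {e7501_memory:.2f} GB/s")
print(f"E7501 PCI/PCI-X hub sum: {e7501_io_total:.1f} GB/s")
```

These are theoretical ceilings; measured throughput will be lower.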

Bigger, better, faster is all very cool for IT buyers; however, it creates a seriously hot problem for system architects and board-level designers. All of this higher component integration increases the number of power and ground pins needed to provide sufficient current to bring all of the buses in and out of chipset packages. That's a big problem, because hand-in-hand with high pin counts come increased power consumption, increased RF radiation, and all of the accompanying issues that make it difficult to meet FCC and VDE equipment certification requirements.

What's more, multi-drop bus architectures such as PCI exhibit significant electrical and bandwidth degradation when more hardware devices are added. For example, the maximum clock speed of 133MHz supported by the PCI-X 1.0 spec must be reduced to 100MHz when a second PCI-X device is attached.
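To put a number on that degradation, the following sketch computes the peak bandwidth of a 64-bit PCI-X bus at both clock speeds. It uses the nominal 133MHz and 100MHz clocks, so the single-device result lands slightly below the 1.066GB per second quoted for the hub interfaces; the helper name is illustrative.

```python
BUS_WIDTH_BYTES = 8  # a 64-bit PCI-X bus moves 8 bytes per clock

def pcix_peak_mb_s(clock_mhz):
    """Peak PCI-X bandwidth in MB/s for a given bus clock in MHz."""
    return clock_mhz * BUS_WIDTH_BYTES

one_device = pcix_peak_mb_s(133)    # bus clock with a single device
two_devices = pcix_peak_mb_s(100)   # bus clock once a second device is added
loss_pct = (1 - two_devices / one_device) * 100

print(f"One device:  {one_device} MB/s")
print(f"Two devices: {two_devices} MB/s ({loss_pct:.0f}% less, shared by both)")
```

The two devices also have to share the reduced total, which is the multi-drop penalty the paragraph describes.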

 
     
 

To address this problem, AMD debuted a new microarchitecture, dubbed "HyperTransport technology," when it launched the 64-bit Opteron CPU. In creating this new I/O scheme, AMD focused on developing a universal electronic signaling technology for interprocessor data traffic. The goal was to make it easy to build HyperTransport motherboards with current printed circuit board technology for an entire spectrum of systems, ranging from simple embedded devices all the way up to supercomputers like Red Storm for Sandia National Labs. The idea was to provide system architects with lower pin counts, low-latency responses, fewer buses, and more bandwidth, while maintaining compatibility with legacy PC buses and transparency to their devices for operating system drivers.

To achieve this end, HyperTransport technology departs from the longstanding Northbridge/Southbridge interconnect model. In its place, HyperTransport I/O links introduce the construct of flow-through tunnels, which can be daisy-chained with up to 31 tunnels in a chain. Each tunnel transfers data between devices at up to 12.8GB per second using two unidirectional point-to-point links, which has the added benefit of eliminating the arbitration overhead of a shared bus. The result is a narrow, high-speed, low-power I/O bus that is capable of handling clock speeds ranging from 200 to 800MHz. Furthermore, by connecting bridges to tunnels, switch and star HyperTransport topologies are also possible.
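The 12.8GB per second figure can be reconstructed from the link parameters. This sketch assumes the widest link configuration (32 bits in each direction) and double data rate signaling (two transfers per clock); neither detail is given in the article, so treat both defaults as assumptions.

```python
def ht_link_gb_s(clock_mhz, width_bits=32, transfers_per_clock=2):
    """Return (per-direction, aggregate) HyperTransport bandwidth in GB/s.

    Assumptions: width_bits is the link width in each direction, and data
    is transferred twice per clock. Aggregate counts both one-way links.
    """
    per_direction = clock_mhz * 1e6 * transfers_per_clock * width_bits / 8 / 1e9
    return per_direction, 2 * per_direction

for clock in (200, 800):
    one_way, both_ways = ht_link_gb_s(clock)
    print(f"{clock}MHz: {one_way:.1f} GB/s each way, {both_ways:.1f} GB/s aggregate")
```

Narrower links scale the result down proportionally, which is how the same signaling can serve both embedded devices and servers.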

Not only does HyperTransport technology eliminate the Northbridge role, which is played by the MCH in Intel's E7501 and E7505 chipsets, but the Opteron CPU also incorporates its own memory controller. Such a direct interface to memory can significantly reduce access latency for the processor. What's more, memory access speeds should scale directly with processor frequency. In theory, this puts potential memory throughput at the magical 12.8GB per second. Closer to reality, the potential for PC2700 DDR memory is 5.3GB per second, which is 24% greater than the 4.27GB per second theoretical throughput ceiling for the E7501 with PC2100 DDR memory.
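The 24% figure is a ratio of the two theoretical ceilings quoted above, as this short check shows. The 333MHz effective clock for PC2700 is an assumption based on the standard DDR333 rating; the 266MHz figure for PC2100 comes from the article.

```python
pc2100_gb_s = 4.27  # E7501 ceiling with PC2100, from the article
pc2700_gb_s = 5.3   # Opteron ceiling with PC2700, from the article

throughput_gain_pct = (pc2700_gb_s / pc2100_gb_s - 1) * 100

# Assumption: PC2700 runs at an effective 333MHz versus 266MHz for PC2100.
clock_gain_pct = (333 / 266 - 1) * 100

print(f"Theoretical throughput gain: {throughput_gain_pct:.0f}%")
print(f"Memory clock gain:           {clock_gain_pct:.0f}%")
```

The small gap between the two results comes from rounding 5.33GB per second down to 5.3.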

To test this system design theory, openBench Labs put two servers from Appro, one Xeon-based and one Opteron-based, head to head. In both cases, the servers were 1U dual-processor configurations.

For our Xeon-based test system featuring Intel's NetBurst microarchitecture, we used an Appro 1224Xi server. Our test server sported dual 2.4GHz Xeon CPUs and was built on a Tyan Tiger motherboard. This motherboard utilizes the Intel E7501 chipset, supports up to 12GB of PC2100 DDR memory, and integrates a number of peripherals on the board including an ATA 100 disk controller, two Intel gigabit Ethernet NICs, an ATI RageXL graphics controller, and a full-speed 133MHz PCI-X slot for I/O expansion.

 
         
 

To test an Opteron-based system with AMD's HyperTransport microarchitecture, we used an Appro 1122Hi server. Our test server featured dual 1.4GHz Opteron CPUs and was built on an MSI K8D-Master motherboard. The MSI K8D-Master is not to be confused with the popular MSI K8T-Neo, which uses VIA chipsets for desktop systems. The MSI K8D-Master utilizes the AMD 8131 and AMD 8111 chipsets to deliver server-class I/O.

The AMD 8131 provides an I/O bus tunnel for two 64-bit 133MHz PCI-X buses. On our test system, only one was usable given the physical constraints of a 1U chassis. The AMD 8111 implements an I/O bus hub for typical desktop I/O buses including ATA 133, USB 2, and gigabit Ethernet. An ATI RageXL graphics controller as well as slots to install up to 12GB of PC2700 DDR memory are also integrated on the MSI K8D-Master board.

We began our testing by running version 3.0 of the oblCPU benchmark suite. This suite contains 34 calculation-intense kernels that are rich in floating point arithmetic. The benchmark suite was compiled with version 7.1 of the Intel C/C++ compiler for Intel 32-bit architecture. As a result, the benchmark suite ran on the Opteron CPU via the IA-32 compatibility library that is part of the SUSE 9.0 Professional distribution with 64-bit AMD support.

 
Using our oblCPU benchmark compiled with Intel C/C++ v7.1, we tested both the Xeon- and Opteron-based servers. In overall performance, our 1.4GHz Opteron CPU proved quite comparable to our 2.4GHz Xeon CPU, even though the Xeon was clocked nearly 72% faster. Surprisingly, the Opteron CPU performed with greater consistency across all of the benchmark suite's 34 kernels. This is reflected in the spread of the 95% statistical confidence region, which is more evenly distributed above and below the geometric mean performance.
 
     
 

Not surprisingly, the Opteron CPU performed a great deal more work per clock cycle than the Xeon CPU. Although the Xeon CPU was clocked nearly 72% faster than the Opteron CPU, the geometric mean of the Xeon CPU's performance, measured relative to a 1GHz P-III processor, was less than 12% higher than that of the Opteron CPU.
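The per-clock advantage can be bounded from the figures in the paragraph above. This is only a back-of-the-envelope sketch: it treats "less than 12% higher" as an upper bound of 1.12 on the Xeon-to-Opteron performance ratio, and the variable names are illustrative.

```python
xeon_ghz, opteron_ghz = 2.4, 1.4

clock_ratio = xeon_ghz / opteron_ghz   # how much faster the Xeon is clocked
max_perf_ratio = 1.12                  # Xeon geometric mean was < 12% higher

# If the Xeon needs clock_ratio times the cycles to deliver at most
# max_perf_ratio times the work, the Opteron does at least this much
# more work per clock cycle:
min_per_clock_advantage = clock_ratio / max_perf_ratio

print(f"Xeon clock advantage:       {(clock_ratio - 1) * 100:.0f}%")
print(f"Opteron per-clock advantage: at least {(min_per_clock_advantage - 1) * 100:.0f}%")
```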

Interestingly, running our benchmarks via the compatibility library, the Opteron CPU's performance scaled very similarly to a 32-bit Athlon CPU from AMD. In general, openBench Labs has measured Athlon performance to be 10 to 15% higher than a P-III CPU at the same clock speed. In these tests, the Opteron CPU performed 16% higher than what we would expect from a P-III CPU at 1.4GHz. In the same vein, the Opteron CPU demonstrated less variation in performance over all of the kernels in the suite than did the Xeon CPU, which ran through a few of the benchmark kernels in astonishingly fast order.

This can be seen in the relatively even distribution of the 95%-confidence interval above and below the geometric mean for both the P-III and Opteron processors. The Xeon CPU, on the other hand, has most of the 95%-confidence interval above the geometric mean, indicating a number of high-end outlier data points.
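One way to quantify how an interval sits around the geometric mean is sketched below. The article does not say how oblCPU derives its 95% region, so the empirical 2.5th and 97.5th percentiles used here are an assumption, and the two input lists are made-up placeholder values, not measurements.

```python
import statistics

def spread_about_geomean(ratios):
    """Return (geomean, low, high, share_above) for per-kernel ratios.

    low/high are empirical 2.5th/97.5th percentiles (an assumed method);
    share_above is the fraction of that interval lying above the geomean.
    """
    gm = statistics.geometric_mean(ratios)
    cuts = statistics.quantiles(ratios, n=40, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return gm, low, high, (high - gm) / (high - low)

# Hypothetical per-kernel speedups, for illustration only.
even_profile = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
outlier_profile = [1.2, 1.25, 1.3, 1.35, 1.4, 1.45, 3.5, 4.0]

for name, data in (("even", even_profile), ("outliers", outlier_profile)):
    gm, low, high, share = spread_about_geomean(data)
    print(f"{name}: geomean {gm:.2f}, interval {low:.2f}-{high:.2f}, "
          f"{share:.0%} above the geomean")
```

A share near one half corresponds to the even distribution described for the P-III and Opteron; a share well above one half corresponds to the high-end outliers described for the Xeon.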

 
         
 

We next ran oblMemBench, which measures memory bandwidth. Memory latency has become a major bottleneck for achieving high performance for various applications. In response, front side bus (FSB) speeds on IA-32 CPUs have risen at a rapid rate to take advantage of DDR memory technology. New Intel P4 CPUs now feature an 800MHz FSB.

With a theoretical memory bandwidth of 4.27GB per second, our Appro 1224Xi server with its Xeon CPUs and E7501 chipset performed better than any previous IA-32 server that we have tested. The Opteron CPU with its integrated memory controller in the Appro 1122Hi server, however, demonstrated even greater memory throughput.

The PC2700 memory in the Appro 1122Hi is clocked 25% faster than the PC2100 memory in the Appro 1224Xi. Even accounting for that difference, memory throughput on the Appro 1122Hi exceeded that of the Appro 1224Xi by considerably more than 25%. Clearly, integrating the memory controller into the Opteron CPU was having a positive effect in lowering memory access latency.

 
Even after accounting for faster memory, the Appro 1122Hi measurably outpaced its Xeon-based sibling when stressing memory bandwidth. 
 

 

         
 

The results of our memory benchmark, which stresses memory bandwidth, were impressive on the Appro 1122Hi. Nonetheless, these results stemmed directly from AMD's integration of memory controller functionality into the CPU chip. We still had not stressed the external I/O capabilities involving either of the AMD HyperTransport chipsets: the AMD 8111 and the AMD 8131.

We began our I/O testing by focusing on the AMD 8111 I/O bus hub, which handles the standard desktop I/O sources including Ethernet. Using one of the server's gigabit Ethernet ports, we connected to a POPnetserver 4600 from First Internet Array. This NAS device features a 2GHz P4 processor running FreeBSD and hot-swap Ultra ATA drives. The openBench Labs test POPnetserver sported two Intel 10/100/1000 Gigabit Ethernet ports and four hot-swap 180GB parallel Ultra ATA drives, which were configured as a single RAID Level 5 volume.

To test Ethernet throughput, we imported a directory tree from the POPnetserver that was shared over NFS and ran our oblDisk benchmark to read and write a 2GB file. The 2GB file size was chosen to minimize any local caching effects. The results from these tests were even more dramatic than the results from oblMemBench.
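For readers who want to try a comparable measurement, the sketch below times a sequential write followed by a sequential read of one file. It is not oblDisk and makes no claim about how oblDisk works; the function name, block size, and the small 8MB demo file are all illustrative assumptions.

```python
import os
import tempfile
import time

def sequential_throughput(path, total_bytes, block_bytes=1 << 20):
    """Write, then read, a file sequentially.

    Returns (write_mb_s, read_mb_s, bytes_read).
    """
    block = b"\0" * block_bytes
    blocks = total_bytes // block_bytes

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # count the time needed to push data out
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(block_bytes):
            bytes_read += len(chunk)
    read_secs = time.perf_counter() - start

    megabytes = blocks * block_bytes / 1e6
    return megabytes / write_secs, megabytes / read_secs, bytes_read

# Demo on a small temporary file; point `path` at an NFS mount and raise
# total_bytes well past RAM size to reduce caching effects.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "throughput.bin")
    write_mb_s, read_mb_s, demo_bytes = sequential_throughput(demo_path, 8 << 20)
    print(f"write {write_mb_s:.0f} MB/s, read {read_mb_s:.0f} MB/s")
```

With a file this small the read is served largely from the local page cache, which is exactly the effect the 2GB file size in the test was chosen to minimize.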

The Opteron-based Appro 1122Hi was able to drive NFS I/O throughput over a gigabit Ethernet connection for a single read process at 200% of the throughput rate of the Xeon-based Appro 1224Xi and at 125% of the rate measured on the Appro 1224Xi for a single write process. With four read processes, running the benchmark on the Appro 1122Hi essentially saturated the capacity of a single gigabit connection.
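Two small conversions help when reading these numbers: "200% of" a rate is the same as "100% greater," and a gigabit link tops out at 125MB per second before any protocol overhead. The sketch below shows both; the helper name is illustrative, and the 125MB per second figure ignores Ethernet, IP, and NFS framing.

```python
def percent_greater(percent_of_baseline):
    """Convert 'X% of the baseline' into 'Y% greater than the baseline'."""
    return percent_of_baseline - 100

# Raw line rate of gigabit Ethernet in MB/s (8 bits per byte).
GIGABIT_MB_S = 1e9 / 8 / 1e6

print(f"Reads:  {percent_greater(200)}% greater")
print(f"Writes: {percent_greater(125)}% greater")
print(f"Gigabit Ethernet ceiling: {GIGABIT_MB_S:.0f} MB/s before overhead")
```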

 

The most dramatic throughput differences between the Opteron-based and Xeon-based Appro servers occurred during throughput tests over gigabit Ethernet. Using an FIA POPnetserver to host an NFS volume, I/O throughput, as measured by running oblDisk over the network, on the Opteron-based Appro server was roughly double that of its Xeon-based counterpart.

 

         
 

The results were not quite as dramatic when we turned to Ultra320 SCSI disk throughput using each server's 133MHz PCI-X expansion slot, but once again the Appro 1122Hi with its HyperTransport microarchitecture rose to the forefront. For these I/O tests we used an Adaptec 39320D-R controller and created a 4-drive RAID 0 stripe set using Maxtor Atlas 10K IV drives.

Once again during reads, I/O throughput on the Opteron-based system was consistently greater when performing both single and multithreaded read requests. While running oblDisk, we measured the average throughput to be about 20% higher on the Appro 1122Hi compared to the Appro 1224Xi. On writes, however, the tables were reversed. On a single write process, the Xeon-based system was the one to provide about 20% greater throughput.

The bottom line throughout all of the tests was a consistently measurable edge in throughput with the Opteron-based Appro 1122Hi server in a pure IA-32 applications environment.

 

Only when we performed local writes to our RAID 0 volume did throughput on the Appro 1122Hi fall below that of the Appro 1224Xi. On reads, the Opteron-based Appro 1122Hi demonstrated a consistent edge of about 20% until we reached the capacity of the PCI-X channel.