BE EXTREME
Leveraging AMD's dual-core CPUs and the latest in HyperTransport ASIC technology, Appro goes Xtreme with a new line of servers poised to shake up high-end datacenters trying to cope with the demands of server consolidation and virtualization.
Appro has built a solid foundation in the high-performance computing niche for applications such as data mining, simulation, and rendering. Now Appro has unveiled its “XtremeServer” product line. These new servers are uniquely well positioned to resolve the issues faced by CIOs embarking on server consolidation and virtualization projects in their datacenters. The XtremeServers sport the latest server technology options, including dual-core AMD Opteron CPUs. Dual-core CPUs can double computational performance while maintaining the power and heat footprint of a single-core CPU, which yields superior performance per watt. The XtremeServer also combines direct-connect HyperTransport microarchitecture with switched PCI Express (PCIe) I/O, which bridges internal I/O subsystems such as the SATA disk controller, to provide an exceptional level of I/O scalability.
To examine the potential performance of the Appro XtremeServers in a datacenter environment, openBench Labs set up a 1U 2-way server and a 3U 4-way server. The 2-way and 4-way XtremeServers feature AMD Opteron 280 and 880 dual-core CPUs and can be configured with up to 64GB and 128GB of DDR memory respectively. Each of our test servers was configured with 8GB of DDR memory clocked at 400MHz.
For comparison, we also set up a 4-way server that featured single-core CPUs, an HP ProLiant DL580 G3 server. Like the Appro XtremeServers, the HP ProLiant DL580 G3 server was configured with 8GB of memory, in this case DDR2 clocked at 400MHz. The HP ProLiant DL580 sported 3.33GHz Intel Xeon CPUs with Extended Memory 64 Technology (EM64T). EM64T CPUs allow the system to address up to 1TB of combined virtual and physical memory via a flat 64-bit virtual address space, 64-bit pointers, 64-bit general-purpose registers, and 64-bit integer support.
On each of the three servers, openBench Labs installed SUSE Linux 10. Unlike SUSE Linux Enterprise Server (SLES), which has a guaranteed 5-year life cycle for support, SUSE Linux is released every six months with an extensive set of emerging-technology applications for both server and desktop environments. SUSE Linux is intended as a technical preview of what Linux production environments will likely be running in the near future.
Every component of the SUSE Linux 10.0 distribution has been compiled and checked with GCC 4.0, the latest version of the GNU Compiler Collection. The most extensive and important change in GCC 4.0 may be the overhaul given to the language-independent optimizers. SLES 10, which will be released in late spring of this year, will be built with the same compiler collection and will target mission-critical use.
We started our assessment by compiling single-threaded 64-bit versions of our oblCPU v4.5 benchmark that were optimized for execution on their target platforms: either AMD64 or EM64T CPUs. In addition to using GNU C 4.0, we used the Intel C v9.0 compiler to generate code for optimal execution on the Xeon EM64T-based server. Using the new GNU C 4.0 compiler, AMD64 Opteron CPU performance scaled with clock speed relative to our baseline, an Intel Pentium III CPU. This was not the case for the superpipelined Xeon EM64T CPU. In the jump from the Pentium III to the Pentium 4 processor architecture, from which Xeon CPUs have evolved, Intel instituted a superpipelined architecture to exploit instruction-level parallelism without requiring any special coding by the programmer. With GNU C 4.0, the 3.33GHz Xeon EM64T CPU provided only 75% of the CPU performance of the 2.4GHz Opteron 880 CPU. Only by using the Intel C 9.0 compiler were we able to properly exploit the Xeon’s superpipelined architecture and approach the performance of the AMD64 Opteron CPU.
While superpipelining is transparent to the applications programmer, this is decidedly not the case for the developers of compilers. The structure of compiled code plays an essential role in avoiding the costly stalls that prevent the exploitation of instruction-level parallelism and the realization of the associated speedup that processor pipelines can provide. In particular, the executable code generated with the Intel C compiler executed 30% faster than the GNU C 4.0 executable on the Xeon-based HP ProLiant DL580 G3 server.
Both 32- and 64-bit versions of oblCPU v4.0, oblDisk v3.0, and oblMemBench are available for download at the openBench Labs site. Over the coming weeks, we will be adding more of our benchmarks to that repository.
The AMD64 Opteron CPU also implements floating-point and integer pipelines; however, AMD64 pipelines have half the number of stages found in the corresponding pipelines on a Xeon CPU. As a result, using the GNU C 4.0 compiler, the performance of the oblCPU benchmark on an AMD64 Opteron CPU scales far more predictably with the CPU’s clock speed in comparison to our 1GHz Pentium III. On the other hand, using the GNU C 4.0 compiler, a Xeon EM64T CPU with a 38% faster clock speed provided only 75% of the processing power of an AMD64 Opteron CPU.
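The clock-speed comparison above can be restated as a rough per-clock figure. The following is a minimal sketch that uses only the numbers quoted in this article (3.33GHz versus 2.4GHz, 75% relative performance with GNU C, and a 30% gain from the Intel C compiler); the function and constant names are illustrative and are not part of oblCPU.

```python
# Figures quoted in the article; all names below are illustrative only.
XEON_CLOCK_GHZ = 3.33        # Xeon EM64T in the HP ProLiant DL580 G3
OPTERON_CLOCK_GHZ = 2.4      # Opteron 880 in the 4-way XtremeServer
XEON_VS_OPTERON_GCC = 0.75   # Xeon performance relative to Opteron, GNU C 4.0
INTEL_C_SPEEDUP = 1.30       # Intel C executable vs. GNU C executable on the Xeon


def clock_advantage(clock_a_ghz, clock_b_ghz):
    """Fractional clock-speed advantage of CPU A over CPU B."""
    return clock_a_ghz / clock_b_ghz - 1.0


def per_clock_ratio(relative_perf, clock_a_ghz, clock_b_ghz):
    """Performance of CPU A relative to CPU B after dividing out clock speed."""
    return relative_perf / (clock_a_ghz / clock_b_ghz)


gcc_per_clock = per_clock_ratio(XEON_VS_OPTERON_GCC, XEON_CLOCK_GHZ, OPTERON_CLOCK_GHZ)
icc_relative = XEON_VS_OPTERON_GCC * INTEL_C_SPEEDUP

print(f"Xeon clock advantage:             {clock_advantage(XEON_CLOCK_GHZ, OPTERON_CLOCK_GHZ):.1%}")
print(f"Xeon per-clock performance (GCC): {gcc_per_clock:.2f} of Opteron")
print(f"Xeon performance with Intel C:    {icc_relative:.3f} of Opteron")
```

By this arithmetic, the GNU C build on the Xeon delivers a little more than half of the Opteron's work per clock cycle, and the Intel C build brings overall performance to within a few percent of the Opteron, which is consistent with the "approach" wording used earlier.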
That difference directly affects the issue of performance per watt. One factor that works to increase a CPU's core voltage rating, and with it power consumption, is the clock speed. When Intel moved up the clock speed of the Pentium 4 to 2.6GHz, Intel also found it necessary to raise the core voltage rating of the Pentium 4 from 1.5V to 1.525V.
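To put the voltage change in perspective, the sketch below applies the standard first-order approximation for CMOS dynamic power, P ≈ C·V²·f. That formula is an assumption brought in for illustration and does not come from the openBench Labs measurements; only the 1.5V and 1.525V figures are taken from the text, and the function name is hypothetical.

```python
def relative_dynamic_power(v_old, v_new, f_old=1.0, f_new=1.0):
    """Ratio of new to old dynamic power under the P ~ C * V^2 * f approximation.

    Switched capacitance C is assumed unchanged, so it cancels out.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Voltage figures from the article; clock speed held constant to isolate the effect.
voltage_only = relative_dynamic_power(1.5, 1.525)
print(f"Dynamic power at equal clock speed: {voltage_only:.4f}x")
```

Under this model, the voltage bump alone adds roughly 3.4% to dynamic power, and any increase in clock speed multiplies on top of that.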
To assess the scalability of SMP load performance, we launched multiple copies of our single-threaded oblCPU benchmark on each of the three servers. The results were quite dramatic. In the cases of the 2-way XtremeServer and the HP ProLiant DL580 G3, the difference in net SMP load performance between the two servers paralleled the difference in performance between single instances of oblCPU. As a result, the 2-way XtremeServer with dual-core AMD Opteron 280 CPUs supported a greater net SMP processing load than the 4-way HP ProLiant DL580 G3 server with four single-core Intel Xeon EM64T CPUs and hyperthreading. In these openBench Labs tests, job scheduler effects were the same on each system, as all ran the same build of SUSE Linux 10 on the 2.6.13-15 Linux kernel. When we normalized the net SMP load performance, the SMP load profiles for the 2-way Appro XtremeServer and the 4-way HP ProLiant DL580 G3 emerged as being identical. While hyperthreading makes the 4-way HP ProLiant DL580 G3 with Xeon EM64T CPUs appear to the Linux operating system to be an 8-way server, our SMP load benchmark results demonstrated that its scaling profile was nothing like that of the 4-way Appro XtremeServer, for which dual-core CPUs provide true 8-way SMP load scalability.
An interesting SMP scalability pattern arose for all systems. CPU performance scaling was linear for the first n oblCPU processes (where n is the number of CPUs). Load scalability continued to increase, albeit more slowly, reaching a peak with 2n-1 instances of oblCPU running. Increasing the number of processes to 2n puts an equal load on each CPU, which prevents the scheduler from leveraging a free CPU. As a result, normalized SMP scalability falls back to the number of CPUs. As the number of oblCPU processes continues to increase, this pattern repeats; but as overhead increases, peak SMP load performance slowly falls back to the value n.
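For readers who want to try a similar load test, here is a minimal sketch of the method: run k copies of a single-threaded CPU-bound task at once, measure net throughput, and normalize to the single-copy result. The workload `cpu_task` is a hypothetical stand-in and is not oblCPU, and the timing includes process start-up overhead, so the numbers it produces will not match the openBench Labs results.

```python
import multiprocessing as mp
import time


def cpu_task(iterations):
    """Hypothetical single-threaded CPU-bound workload (a stand-in, not oblCPU)."""
    total = 0
    for i in range(iterations):
        total += (i * i) % 7
    return total


def net_throughput(copies, iterations):
    """Run `copies` identical tasks concurrently; return tasks completed per second."""
    start = time.perf_counter()
    with mp.Pool(processes=copies) as pool:
        pool.map(cpu_task, [iterations] * copies)
    elapsed = time.perf_counter() - start
    return copies / elapsed


def normalized_profile(throughputs):
    """Scale each net throughput by the single-copy throughput (first entry)."""
    base = throughputs[0]
    return [t / base for t in throughputs]


def measure_profile(max_copies, iterations=2_000_000):
    """Measure normalized SMP load for 1..max_copies concurrent copies."""
    raw = [net_throughput(k, iterations) for k in range(1, max_copies + 1)]
    return normalized_profile(raw)
```

A typical run would call `measure_profile(2 * mp.cpu_count())` from inside an `if __name__ == "__main__":` guard and compare the resulting curve against the number of CPUs. Whether the 2n-1 peak described above shows up will depend on the kernel scheduler and the workload.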
Not only does HyperTransport Technology eliminate the shared Northbridge frontside bus, but each Opteron CPU also incorporates its own memory controller, which has a 6.4GB per second throughput rate for local memory. On dual-core CPUs, the two cores share one controller. Such a direct embedded controller interface significantly reduces memory access latency and enables memory bandwidth to scale directly with the number of processors and their frequency. That makes a non-uniform memory architecture (NUMA) an easy configuration for multiprocessor Opteron systems to support. In such a configuration, each CPU is directly attached to its own bank of memory. For a CPU to access memory outside of the bank under its direct control, it must communicate directly with the CPU attached to that memory over an inter-processor HyperTransport link. That CPU will access the required data and pass it back to the initiator CPU over a HyperTransport inter-processor link.
With each AMD Opteron 880 CPU in the 4-way Appro XtremeServer sporting three HyperTransport inter-processor links and each AMD Opteron 280 CPU in the 2-way Appro XtremeServer sporting one inter-processor link, the potential memory bandwidth for the Appro XtremeServers is 25.6GB per second for the 4-way server (four memory controllers at 6.4GB per second each) and 12.8GB per second for the 2-way server (two controllers). From a microarchitecture perspective, a NUMA server architecture represents a smart way to create a highly scalable configuration.
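The bandwidth totals follow from simple multiplication, sketched below. The breakdown of the 6.4GB per second figure into a dual-channel, 64-bit-per-channel DDR400 interface is an assumption added for illustration; the article itself states only the per-controller rate and the totals. All names are hypothetical.

```python
# Assumed breakdown of the per-controller figure (not stated in the article):
TRANSFERS_PER_SECOND = 400e6   # DDR400: 400 million transfers per second
BYTES_PER_TRANSFER = 8         # one 64-bit channel
CHANNELS_PER_CONTROLLER = 2    # dual-channel memory controller


def controller_bandwidth_gb_s():
    """Peak bandwidth of one on-chip memory controller, in GB per second."""
    return TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER * CHANNELS_PER_CONTROLLER / 1e9


def aggregate_bandwidth_gb_s(sockets):
    """Potential aggregate bandwidth when every socket reads its own local memory."""
    return sockets * controller_bandwidth_gb_s()


for sockets in (2, 4):
    print(f"{sockets}-way: {aggregate_bandwidth_gb_s(sockets):.1f} GB/s")
```

These are peak figures that assume every CPU is streaming from its own local bank at the same time; any access that crosses a HyperTransport link to a remote bank will see less.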
To realize all of that potential bandwidth, however, it will be necessary to access multiple banks of memory under the control of multiple CPUs. For application programmers (not to mention all of the existing application programs), that smart microarchitecture has the potential to become a complex programming nightmare. To avoid just such a disaster, AMD64 technology makes the intricacies of NUMA transparent at the operating system level. At boot time, the AMD64 multiprocessor system BIOS calls on each processor to report its local memory. The system BIOS then maps each of the local memory pools into a single globally addressable physical memory space. What's more, the AMD64 processors automatically maintain cache coherency across that global address space. As a result, programs access any data belonging to any processor with normal memory operations to the global address space. For programmers and system administrators, the only difference between NUMA and a traditional SMP system is the speed. With that level of performance and the ability to support up to 64GB (2-way) and 128GB (4-way) of memory, the Appro XtremeServers are as well suited for memory-intensive business tasks such as OLAP business intelligence and data mining as they are for performing computational physics and engineering simulations.
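The boot-time mapping described above can be pictured with a toy model. The sketch below assumes each processor's local pool is placed contiguously, one after another, in the global physical address space; a real BIOS may interleave or leave holes, so treat the layout and every name here as hypothetical.

```python
from bisect import bisect_right

GIB = 1024 ** 3


def build_memory_map(local_sizes):
    """Return (base address of each node's pool, total size of the global space)."""
    bases = []
    next_base = 0
    for size in local_sizes:
        bases.append(next_base)
        next_base += size
    return bases, next_base


def home_node(address, bases, total):
    """Return the index of the CPU whose local memory holds this global address."""
    if not 0 <= address < total:
        raise ValueError("address outside the global physical address space")
    return bisect_right(bases, address) - 1


# Assumed example: 8GB split evenly across the four CPUs of a 4-way server.
bases, total = build_memory_map([2 * GIB] * 4)
print(home_node(0, bases, total))            # address in the first pool
print(home_node(5 * GIB, bases, total))      # address in the third pool
```

A program simply issues a load or store to a global address; whether the access is served locally or over a HyperTransport link depends only on which node `home_node` would report, which is why the difference shows up as latency and not as a change in the programming model.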
For peripheral I/O, both internal and external, the Appro XtremeServers feature the latest in PCI Express (PCIe) I/O technology. For connecting I/O subsystems, PCIe, like HyperTransport, replaces an old shared-bus I/O architecture, in this case PCI-X, with a silicon-based switched I/O architecture modeled on supercomputers and mainframes. As such, PCIe represents a significant advance in I/O throughput.
For PCIe-switched I/O, the 4-way Appro XtremeServer features nVidia’s nForce Professional 2200 and 2050 ASICs for PCIe. The nVidia nForce Professional 2200 also acts as a bridge for internal I/O connections including SATA, for which this server provides six hot-swap bays. We populated those bays with Western Digital 10K Raptor drives: one was reserved for SUSE Linux 10 Professional and the remaining five were combined into a high-throughput software RAID 0 device. Running our benchmark for disk throughput, oblDisk v3.0, we pegged the SATA array as capable of sustaining extraordinary levels of continuous throughput, which averaged 155MB per second on writes and 252MB per second on reads. We also configured the four hot-swap bays on the HP ProLiant DL580 with 15K Ultra320 SCSI drives, which were connected to an HP StorageWorks 6i SmartArray controller. As with the Appro XtremeServer, we configured one drive for the OS and put the remaining drives into a hardware-based RAID 0 array. As with the previous CPU performance and memory bandwidth benchmarks, the 4-way Appro XtremeServer held throughput advantages on reads and writes that were on the order of 3.5- and 1.5-to-1.
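RAID 0 gets its throughput by spreading consecutive chunks of data across all member drives so that large transfers keep every spindle busy. The following is a simplified sketch of that striping arithmetic for a five-drive array like the one tested here; the chunk size is an assumed value, the function name is hypothetical, and an actual software RAID driver may lay out data differently.

```python
def locate_block(logical_block, num_disks, chunk_blocks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    chunk_index = logical_block // chunk_blocks      # which chunk of the array
    disk = chunk_index % num_disks                   # chunks rotate across disks
    chunk_on_disk = chunk_index // num_disks         # chunks already on that disk
    offset = chunk_on_disk * chunk_blocks + logical_block % chunk_blocks
    return disk, offset


NUM_DISKS = 5       # five drives in the software RAID 0 set, per the article
CHUNK_BLOCKS = 16   # assumed chunk size in blocks, for illustration only

for block in (0, 16, 79, 80, 85):
    print(block, locate_block(block, NUM_DISKS, CHUNK_BLOCKS))
```

With these assumed parameters, blocks 0 through 79 land on five different drives in turn, and block 80 wraps back to the first drive, so a long sequential read or write is serviced by all five drives in parallel.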