BE EXTREME

Leveraging AMD's dual-core CPUs and the latest in HyperTransport ASIC technology, Appro goes Xtreme with a new line of servers poised to shake up high-end datacenters trying to cope with the demands of server consolidation and virtualization.

   
 


by Jack Fegreus

April 3, 2006

   
     
 
For IT, the fundamental question has long been how to determine "the right size box." That question is fueled by the need to handle changes in workloads that affect CPU, memory, and I/O resource utilization. Historically, IT had only one option. That option was to "scale up" and add resources to an up-sized box. The problem with that strategy was the lack of a fine-grained structure for handling temporal changes. With the introduction of low-cost rack-mounted servers, an alternative "scale-out" strategy arose to decompose applications and deploy their functions across multiple networked servers. This set the stage for a fine-grained Service-Oriented Architecture (SOA) based on asymmetric provisioning of application infrastructures.

Nonetheless, too many distributed servers can easily trigger a number of cascading negative effects. Rack-mount servers are relatively inexpensive to acquire and easy to configure; however, a high-density server deployment can quickly add up to considerable costs in floor space, management, networking, power, and heat. While chip areas have fallen and computational power has increased, power consumption and heat dissipation have become more problematic. Increasing the number of servers also increases variability and inconsistency in the data stored on these systems, which can create serious problems for IT when attempting to develop services that rely on data from different systems.

With business-organization gurus obsessed with the “customer experience,” IT strategies are once again turning to server consolidation schemes. Accelerating that trend are the Total Cost of Ownership (TCO) models that point to the costs of operating a server as dwarfing the costs of acquiring a server. The Robert Frances Group pegs the acquisition cost of a server to be only 20% of its TCO. This puts the spotlight on hardware resource utilization as the key to cost containment. It also puts IT under closer scrutiny to improve resource efficiency.

At the same time, many CEOs, who must deal with issues such as shrinking product life cycles and the pursuit of global markets, have begun to institute real-time sense-and-respond strategies for business operations. Those strategies elevate the premium on server performance and leave CIOs on the horns of a dilemma: To drive down costs, they must increase server utilization; but to meet the processing requirements of a sense-and-respond business strategy, they must increase real-time performance. Successful resolution of those conflicting issues requires a sophisticated hybrid approach to traditional scale-up and scale-out strategies and a new paradigm in server design.

 
         
 
OPENBENCH LABS SCENARIO

 
UNDER EXAMINATION:  Dual-core AMD64 Opteron Servers

 WHAT WE TESTED:

Appro 1U XtremeServer
(2) Dual-Core AMD 280 CPUs
8GB DDR 400MHz ECC registered RAM
PCIe: (1)16x; PCI-X: (1)133MHz

Appro 3U XtremeServer
(4) Dual-Core AMD 880 CPUs
8GB DDR 400MHz ECC registered RAM
PCIe: (2)16x, (1)4x; PCI-X: (3)133MHz

nVidia PCIe ASIC
(6) Hot-swap SATA drive bays

  HOW WE TESTED:

HP ProLiant DL580 G3 Server
(4) 3.3GHz Xeon EM64T CPUs
8GB DDR-2 ECC registered RAM





SUSE Linux 10.0
GNU Compiler Collection version 4.0
Xen 3 virtualization




Intel C++ Compiler for Linux v9.0
Free license for non-commercial development
Advanced optimization supporting EM64T architecture (v9.0.030)
Eclipse-based IDE (32-bit systems)

Benchmarks:
oblCPU v4.0

oblMemBench v4.0
oblDisk v3.0

 KEY FINDINGS:

Both the 2-way and 4-way Appro XtremeServers ran our oblCPU benchmark faster than the 4-way HP ProLiant DL580 G3, even though the DL580's Xeon EM64T CPUs were clocked 38% faster.

The 2-way Appro XtremeServer and the 4-way HP ProLiant DL580 G3 server exhibited the identical SMP load scaling profile even though the 4-way Xeon-based DL580 presents itself to the operating system as an 8-way server.

The 4-way Appro XtremeServer provided twice the memory bandwidth of the 4-way HP ProLiant DL580 G3.

Using hot-swap SATA JBOD drives and software RAID, the 4-way Appro XtremeServer provided twice the I/O throughput on reads and three times the I/O throughput on writes versus the Ultra320 SCSI RAID subsystem on the HP ProLiant DL580 G3.

 

Appro has built a solid foundation in the high-performance computing niche for applications such as data mining, simulation, and rendering. Now Appro has unveiled its “XtremeServer” product line. These new servers are uniquely well positioned to resolve the issues faced by CIOs embarking on server consolidation and virtualization projects in their datacenters.

The XtremeServers sport the latest server technology options, including dual-core AMD Opteron CPUs. Dual-core CPUs can double computational performance while maintaining the power and heat footprint of a single-core CPU for superior per-watt performance. The XtremeServer also combines direct-connect HyperTransport microarchitecture with switched PCI Express (PCIe) I/O, which bridges internal I/O subsystems such as the SATA disk controller, to provide an exceptional level of I/O scalability.

To examine the potential performance of the Appro XtremeServers in a datacenter environment, openBench Labs set up a 1U 2-way server and a 3U 4-way server. The 2-way and 4-way XtremeServers feature AMD 280 and 880 dual-core CPUs and can be configured with up to 64GB and 128GB of DDR memory, respectively. Each of our test servers was configured with 8GB of DDR memory clocked at 400MHz.

 
     
 

The dual-core AMD 280 and 880 CPUs are clocked at 2.4GHz. Differences between the two are a direct result of the direct-connect HyperTransport microarchitecture utilized by AMD. HyperTransport technology replaces the 20-year-old Northbridge/Southbridge shared-bus model with high-bandwidth, low-latency, flow-through tunnels that pass data at up to 12.8GB per second using two unidirectional point-to-point links. Connecting bridges to tunnels provides a mechanism to create switch and star HyperTransport topologies. For 2-way systems, AMD 280 CPUs provide for one HyperTransport inter-processor connection, and for 4- or 8-way systems, the 880 CPUs provide for four inter-processor connections.

For comparison, we also set up a 4-way server that featured single-core CPUs, an HP ProLiant DL580 G3 server. Like the Appro XtremeServers, the HP ProLiant DL580 G3 server was configured with 8GB of DDR2 memory clocked at 400MHz. The HP ProLiant DL580 sported 3.33GHz Intel Xeon CPUs with Extended Memory 64 Technology (EM64T). EM64T CPUs allow the system to address up to 1TB of combined virtual and physical memory via a flat 64-bit virtual address space, 64-bit pointers, 64-bit general-purpose registers, and 64-bit integer support.

On each of the three servers, openBench Labs installed SUSE Linux 10. Unlike SUSE Linux Enterprise Server (SLES), which has a guaranteed 5-year life cycle for support, SUSE Linux is released every six months with an extensive set of emerging-technology applications for both server and desktop environments. SUSE Linux is intended as a technical preview of what Linux production environments will likely be running in the near future.

To help our assessment of Appro’s XtremeServers, the most important new technology introduced in SUSE 10 is version 4.0 of GCC, which now designates the GNU Compiler Collection. Historically, the moniker "GCC" explicitly stood for the GNU C Compiler; however, the acronym now properly designates the complete distribution—from language front ends, to language-independent optimizers, through to support libraries of GNU programming languages, including C, C++, Objective-C, Objective-C++, Java, Fortran, and Ada.

The most extensive and important change may be the overhaul given to the language-independent optimizers. Every component of the SUSE Linux 10.0 distribution has been compiled and checked with GCC 4.0. SLES 10, which will be released in late spring of this year, will be built upon the latest version of the new GNU Compiler Collection and will target mission-critical use.

 
         
 

We started our assessment by compiling single-threaded 64-bit versions of our oblCPU v4.5 benchmark that were optimized for execution on their target platforms: either AMD64 or EM64T CPUs. In addition to using GNU C 4.0, we used the Intel C v9.0 compiler to generate code for optimal execution on the Xeon EM64T-based server.

Using the new GNU C 4.0 compiler, AMD64 Opteron CPU performance scaled along clock-speed lines set by an Intel Pentium III CPU. This was not the case for the superpipelined Xeon EM64T CPU. In the jump from Pentium III to Pentium 4 processor architecture, from which Xeon CPUs have evolved, Intel instituted a superpipelined architecture to exploit instruction level parallelism without involving any special coding by the programmer.

With GNU C 4.0, the 3.3GHz Xeon EM64T CPU provided only 75% of the CPU performance of the 2.4GHz Opteron 880 CPU. Only by using the Intel C 9.0 compiler were we able to properly exploit the Xeon’s superpipelined architecture and approach the performance of the AMD64 Opteron CPU.

 
Normalizing the raw execution time for each kernel and then calculating the geometric mean over all of the kernels gives the best single number representation of overall CPU performance. We also calculate a statistical 95% confidence interval for performance, which reflects the spread in performance measured for individual kernels.
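The scoring described in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the actual oblCPU scoring code: the function name is hypothetical, the reference times are assumed to come from a baseline system, and the 95% interval is assumed to be a normal approximation computed on the logarithms of the per-kernel ratios.

```python
import math
import statistics


def geometric_mean_score(reference_times, test_times):
    """Return (geomean, low, high) for per-kernel speedups over a reference.

    Each kernel's raw time is normalized as reference_time / test_time,
    so a value above 1.0 means the tested system ran that kernel faster.
    Needs at least two kernels so that a spread can be computed.
    """
    ratios = [ref / test for ref, test in zip(reference_times, test_times)]
    logs = [math.log(r) for r in ratios]
    mean_log = statistics.fmean(logs)
    # Assumption: normal-approximation 95% interval on the mean log-ratio.
    half_width = 1.96 * statistics.stdev(logs) / math.sqrt(len(logs))
    return (
        math.exp(mean_log),
        math.exp(mean_log - half_width),
        math.exp(mean_log + half_width),
    )
```

Working in log space keeps the interval multiplicative, which matches the geometric mean: a kernel that runs twice as fast and one that runs half as fast cancel out exactly.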
 
     
 

While superpipelining is transparent to the applications programmer, this is decidedly not the case for the developers of compilers. The structure of compiled code plays an essential role in avoiding the costly stalls that prevent the exploitation of instruction level parallelism and the realization of the associated performance speedup that processor pipelines can provide. In particular, the executable code generated with the Intel C compiler executed 30% faster than the GNU C 4.0 executable on the Xeon-based HP ProLiant DL580 G3 server.

 
             

 
openBench Labs

Both 32- and 64-bit versions of oblCPU v4.0, oblDisk v3.0, and oblMemBench are available for download at the openBench Labs site. Over the coming weeks, we will be adding more of our benchmarks to that repository.
   

The AMD64 Opteron CPU also implements floating-point and integer pipelines; however, AMD64 pipelines have half the number of stages that are found in corresponding pipelines on a Xeon CPU. As a result, using the GNU C 4.0 compiler, the performance of the oblCPU benchmark on an AMD64 Opteron CPU scales far more predictably in terms of the CPU’s clock speed in comparison to our 1GHz Pentium III. On the other hand, using the GNU C 4.0 compiler, a Xeon EM64T CPU with a 38% faster clock speed provided only 75% of the processing power of an AMD64 Opteron CPU.
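To put the clock-speed comparison in per-cycle terms, the arithmetic can be sketched as follows. The helper name is hypothetical; the inputs are the 3.33GHz and 2.4GHz clock rates and the 75% relative performance figure quoted in this article for GNU C 4.0 code.

```python
def per_clock_ratio(relative_performance, clock_ghz, baseline_clock_ghz):
    """Relative work done per clock cycle versus a baseline CPU.

    relative_performance is the measured performance of the CPU divided
    by that of the baseline CPU (0.75 means 75% of the baseline).
    """
    clock_ratio = clock_ghz / baseline_clock_ghz
    return relative_performance / clock_ratio


# Xeon EM64T at 3.33GHz versus Opteron at 2.4GHz, GNU C 4.0 results.
xeon_vs_opteron = per_clock_ratio(0.75, 3.33, 2.4)
print(f"Xeon work per clock relative to Opteron: {xeon_vs_opteron:.2f}")
```

Under those figures, the Xeon completes a little over half as much benchmark work per clock cycle as the Opteron when both run GNU C 4.0 code.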

 
     
 

That difference directly affects the issue of performance per watt. One factor that works to increase a CPU's core voltage rating, and with it power consumption, is the clock speed. When Intel moved up the clock speed of the Pentium 4 to 2.6GHz, Intel also found it necessary to raise the core voltage rating of the Pentium 4 from 1.5V to 1.525V.

 
         
 

To assess the scalability of SMP load performance, we launched multiple copies of our single-threaded oblCPU benchmark on each of the three servers. The results were quite dramatic.

In the cases of the 2-way XtremeServer and the HP ProLiant DL580 G3, the difference in the net SMP load performance between the two servers paralleled the difference in performance between single instances of oblCPU. As a result, the 2-way XtremeServer with dual-core AMD 280 CPUs supported a greater net SMP processing load compared to the 4-way HP ProLiant DL580 G3 server with four single-core Intel Xeon EM64T CPUs and hyperthreading.

In these openBench Labs tests, job scheduler effects were the same on each system as all ran the same build of SUSE Linux 10 on the 2.6.13-15 Linux kernel.

When we normalized the net SMP load performance, the SMP load profile for the 2-way Appro XtremeServer and the 4-way HP ProLiant DL580 G3 emerged as being identical. While hyperthreading makes the 4-way HP ProLiant DL580 G3 with Xeon EM64T CPUs appear to the Linux operating system to be an 8-way server, our SMP load benchmark results demonstrated that its scaling profile was nothing like that of the 4-way Appro XtremeServer, for which dual-core CPUs provide true 8-way SMP load scalability.

 
By plotting the net performance achieved by all of the individual instances of oblCPU, we were able to get a measure of the total CPU performance throughput that each of these servers was able to deliver. By normalizing net SMP performance load to the performance of a single instance of oblCPU, it becomes clear that the net SMP performance load can be viewed as a function of the performance of a single instance of oblCPU on that system, the number of CPUs, and the job scheduler of the underlying OS.
 
 

An interesting SMP scalability pattern arose for all systems. CPU performance scaling was linear for the first n oblCPU processes (n = the number of CPUs). Load scalability continued to increase, albeit more slowly, reaching a peak with 2n-1 instances of oblCPU running. Increasing the number of processes to 2n puts an equal load on each CPU, preventing the scheduler from leveraging a free CPU. As a result, normalized SMP scalability falls back to the number of CPUs. As the number of oblCPU processes continues to increase, this pattern repeats; but as overhead increases, peak SMP load performance slowly falls back to the value n.
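The normalization behind that scalability pattern can be sketched as a short calculation. This is an illustration only, not the openBench Labs harness: the function name is hypothetical, and it assumes that the performance of an instance is the reciprocal of its elapsed time.

```python
def normalized_smp_load(single_instance_time, concurrent_times):
    """Net throughput of concurrent runs, in units of one solo instance.

    single_instance_time: elapsed time of one instance running alone.
    concurrent_times: elapsed time of each instance in the concurrent run.
    """
    if single_instance_time <= 0 or any(t <= 0 for t in concurrent_times):
        raise ValueError("elapsed times must be positive")
    # Each instance contributes its own performance relative to a solo run.
    return sum(single_instance_time / t for t in concurrent_times)


# Four instances on four CPUs, each as fast as a solo run: load of 4.0.
print(normalized_smp_load(10.0, [10.0] * 4))
# Eight instances on four CPUs, each taking twice as long: still 4.0.
print(normalized_smp_load(10.0, [20.0] * 8))
```

The second call mirrors the 2n case described above: with every CPU carrying two processes, the normalized load falls back to n.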

 
 
HyperTransport Technology microarchitecture enables a great deal of innovation in board design. The 1U XtremeServer tested by openBench Labs had two dual-core AMD 280 CPUs and 4GB of memory connected to each CPU. The 3U XtremeServer used AMD 880 dual-core CPUs with four HyperTransport inter-processor links each to scale up to 8 CPUs. Appro exploits that scale-up mechanism to put either 2 or 4 CPUs in the 3U XtremeServer. The motherboard has 2 CPUs with memory. A daughter card also has 2 CPUs and memory and links to the motherboard's CPUs via HyperTransport connections.
 

Not only does HyperTransport Technology eliminate the shared Northbridge frontside bus, but each Opteron CPU also incorporates its own memory controller, which has a 6.4GB per second throughput rate for local memory. On dual-core CPUs, the two cores share one controller. Such a direct embedded controller interface significantly reduces memory access latency and enables memory bandwidth speeds to scale directly with the number of processors and their frequency.

That makes a non-uniform memory architecture (NUMA) an easy configuration for multiprocessor Opteron systems to support. In such a configuration, each CPU is directly attached to its own bank of memory. For a CPU to access memory outside of the bank under its direct control, it must communicate directly with the CPU attached to that memory over an inter-processor HyperTransport link. That CPU will access the required data and pass it back to the initiator CPU over a HyperTransport inter-processor link.

 
 

With each AMD 880 CPU in the 4-way Appro XtremeServer sporting four HyperTransport inter-processor links and each AMD 280 CPU in the 2-way Appro XtremeServer sporting one inter-processor link, the potential memory bandwidth for the Appro XtremeServers is 25.6GB per second for the 4-way server and 12.8GB per second for the 2-way server. From a microarchitecture perspective, a NUMA server architecture represents a smart way to create a highly scalable configuration.
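Those two totals are consistent with multiplying the 6.4GB per second rate of each CPU's on-chip memory controller by the number of CPU sockets. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
PER_CONTROLLER_GB_PER_S = 6.4  # local memory rate per Opteron controller


def peak_memory_bandwidth(num_cpus):
    """Aggregate local-memory bandwidth when every CPU uses its own bank."""
    return num_cpus * PER_CONTROLLER_GB_PER_S


for sockets in (2, 4):
    print(f"{sockets}-way: {peak_memory_bandwidth(sockets):.1f}GB per second")
```

The totals count sockets rather than cores, because the two cores of a dual-core Opteron share one memory controller.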

 
         
 

To realize all of that potential bandwidth, however, it will be necessary to access multiple banks of memory under the control of multiple CPUs. For application programmers—not to mention all of the existing application programs—that smart microarchitecture has all the potential to become a complex programming nightmare.

To avoid just such a disaster, AMD64 technology makes the intricacies of NUMA transparent at the operating system level. At boot time, the AMD64 multiprocessor system BIOS calls on each processor to report its local memory. The system BIOS then maps each of the local memory pools into a single globally-addressable physical memory space. What's more, the AMD64 processors automatically maintain cache coherency across that global address space. As a result, programs access any data belonging to any processor with normal memory operations to the global address space. For programmers and system administrators, the only difference between NUMA and a traditional SMP system is the speed.
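The mapping of local memory pools into one global address space can be pictured with a small model. This is a conceptual sketch only: the function names are hypothetical, the pools are assumed to be laid out contiguously in CPU order, and the 2GB-per-CPU sizes are illustrative. A real BIOS may order or interleave the pools differently.

```python
from bisect import bisect_right
from itertools import accumulate

GB = 1 << 30


def build_memory_map(local_pool_sizes):
    """Return (base, limit, node) ranges for pools laid out back to back."""
    limits = list(accumulate(local_pool_sizes))
    bases = [0] + limits[:-1]
    return list(zip(bases, limits, range(len(local_pool_sizes))))


def owning_node(memory_map, address):
    """Return the CPU node whose local pool holds a physical address."""
    limits = [limit for _, limit, _ in memory_map]
    index = bisect_right(limits, address)
    if address < 0 or index == len(memory_map):
        raise ValueError("address is outside the global address space")
    return memory_map[index][2]


# Hypothetical 4-way system with 2GB of local memory on each CPU.
memory_map = build_memory_map([2 * GB] * 4)
print(owning_node(memory_map, 5 * GB))  # falls in the third pool (node 2)
```

A program simply issues a load or store to an address; which node owns the address determines only whether the access is local or takes a hop over a HyperTransport link.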

With that level of performance and the ability to support up to 64GB (2-way) and 128GB (4-way) of memory, the Appro XtremeServers are as well suited for memory-intensive business tasks, such as OLAP business intelligence and data mining, as they are for performing computational physics and engineering simulations.

 
Using our memory bandwidth benchmark, throughput easily scaled to match the theoretical specs for both the 2-way and 4-way XtremeServers. By testing on a 6GB block of memory, we guaranteed that oblMemBench would touch each local memory pool. As in the case of our oblCPU performance benchmark, memory bandwidth performance on the 2-way Appro XtremeServer closely paralleled performance on the 4-way HP ProLiant DL580 G3, while the 4-way Appro XtremeServer nearly doubled that level of throughput.
 
     
 

For peripheral I/O, both internal and external, the Appro XtremeServers feature the latest in PCI Express (PCIe) I/O technology. For connecting I/O subsystems, PCIe, like HyperTransport, replaces an old shared-bus I/O architecture, in this case PCI-X, with a silicon-based switched I/O architecture modeled on supercomputers and mainframes. As such, PCIe represents a significant advance in I/O throughput.

 
         
 

For PCIe-switched I/O, the 4-way Appro XtremeServer features nVidia’s nForce Professional 2200 and 2050 ASICs for PCIe. The nVidia nForce Professional 2200 also acts as a bridge for internal I/O connections including SATA, for which this server provides six hot-swap bays. We populated those bays with Western Digital 10K Raptor drives: one was reserved for SUSE Linux 10 Professional, and the remaining five were combined as a high-throughput software RAID 0 device.
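A RAID 0 device gains throughput by striping consecutive chunks of data across all of its member drives, so that large sequential transfers keep every drive busy at once. The mapping can be sketched as follows. The function name and the 64KB chunk size are assumptions for illustration; the chunk size used on the test system is not stated.

```python
def raid0_locate(byte_offset, num_drives=5, chunk_size=64 * 1024):
    """Map a logical byte offset to (drive index, byte offset on that drive).

    Chunks are dealt out round-robin: chunk 0 to drive 0, chunk 1 to
    drive 1, and so on, wrapping back to drive 0 after the last drive.
    """
    chunk, within_chunk = divmod(byte_offset, chunk_size)
    stripe, drive = divmod(chunk, num_drives)
    return drive, stripe * chunk_size + within_chunk


# The first five chunks land on five different drives.
for chunk in range(5):
    print(raid0_locate(chunk * 64 * 1024))
```

With five drives, a read or write that spans five or more consecutive chunks touches every drive in the array, which is where the aggregate throughput comes from.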

Running our benchmark for disk throughput, oblDisk v3.0, we pegged the SATA array as capable of sustaining extraordinary levels of continuous throughput, which averaged 155MB per second on writes and 252MB per second on reads. We also configured the four hot-swap bays on the HP ProLiant DL580 with 15K Ultra320 SCSI drives, which were connected to an HP StorageWorks 6i SmartArray controller. As on the Appro XtremeServer, we configured one drive for the OS, and the remaining drives were put into a hardware-based RAID 0 array. As with previous CPU performance and memory bandwidth benchmarks, the 4-way Appro XtremeServer held throughput advantages on reads and writes that were on the order of 3.5- and 1.5-to-1.

 
Using KSysGuard, we monitored I/O throughput while running oblDisk v3.0. On the 4-way XtremeServer, SATA I/O is handled via a PCIe-to-HyperTransport bridge ASIC from nVidia. Throughput on reads was double, and throughput on writes triple, that of the 4-way HP ProLiant DL580 with Ultra320 SCSI.
 
     
 

That performance led to SuperComputing Online choosing the XtremeServer as Product of the Year. The Product of the Year Awards Program provides IT professionals with recommended innovative products to evaluate for addressing data-intensive computing challenges. The products were judged by the editors in conjunction with a team of high-performance computing experts. The products were evaluated based on five criteria: innovation, performance, value, ease of use and manageability, ease of integration into existing environments, and functionality.