AMD64 NUMAOLOGY

With the introduction of the Linux 2.6 kernel, multiprocessor systems now come in two flavors: traditional SMP and NUMA. The differences are neither transparent nor trivial.

   
 


by Jack Fegreus

June 2, 2005

   
     
 
Non-Uniform Memory Access, or NUMA, has a long history of use in niche MPP (massively parallel processing) systems. For general computing applications, conventional wisdom held Symmetric Multi-Processing, or SMP, to be the architecture of choice for traditional business computing loads. Fortunately, end users were shielded from most of the controversy, since NUMA architecture was beyond the scope of off-the-shelf operating systems to support.

In an SMP system, all processors access a shared pool of memory over a central shared bus. As a result, as far as the operating system is concerned the computer system looks and behaves the same (symmetric) from any CPU. Better yet for IT, running, managing, and developing software for an SMP system requires no additional training or expertise. For any new technology, the surest route for IT adoption is through transparency.

For the systems designer, SMP is far from transparent. The big issue is how to contend with multiple CPUs that are attempting to access a common pool of memory and a common pool of devices.

 
         
 
OPENBENCH LABS SCENARIO

UNDER EXAMINATION:  Quad-processor Opteron NUMA Server

WHAT WE TESTED:
Appro 1142H Opteron Server
(4) AMD64 Opteron CPUs
848 Series CPUs at 2.2GHz
2GB DDR400 ECC registered memory per CPU
Dual 1000-Mbit Broadcom Ethernet ports
133MHz PCI-X card slot
Dual Ultra320 Hot-Swap SCSI Drives

HOW WE TESTED:

SUSE LINUX Enterprise Server 9
AMD64 version
Linux Kernel 2.6 with support for:
Non-Uniform Memory Access (NUMA)
64GB RAM
Gnu C 3.3

Intel C++ Compiler for Linux v8.1
Free license for non-commercial development
Advanced optimization supporting EM64T architecture
Eclipse-based IDE


Emulex LP10000 HBA
Full duplex 2Gb/s Fibre Channel
133/100/66 MHz PCI-X and PCI compatibility

Onboard hardware context cache for high transaction performance

nStor 4520 Storage System
(2) WahooXP RAID controllers
(15) Seagate Cheetah drives
15000 rpm

Appro 1224Xi-21 1U Server
Dual 2.4GHz Intel Xeon CPUs
1GB DDR memory
Dual 10/100/1000-Mbit Ethernet ports
133MHz PCI-X expansion slot

Benchmarks:
oblCPU v3.0
oblMemBench v2.0

oblFilePerf v1.0

KEY FINDINGS:

The highest consistent computational performance for both the IA-32 and AMD64 servers occurred running oblCPU compiled with the Intel C v8.1 compiler.

The Eclipse IDE for the Intel compiler would not install on the AMD64-based server.
Benchmarks compiled on the AMD64 server with Gnu C frequently encountered runtime memory segmentation faults.
Our oblCPU computational benchmark consistently ran on a single processor with no auto parallelization on the NUMA architecture of the AMD64 server.
Our oblMemBench bandwidth benchmark ran child processes on multiple CPUs on the NUMA server, resulting in an upper and lower bound on performance.
 

For Intel IA-32 architecture, the solution was the “Northbridge” Memory Controller Hub (MCH). All processors compete for fixed-memory and front-side bus bandwidth. That makes system scalability directly dependent on the implementation of the shared bus and MCH.

From the start, SMP systems hit the scalability wall at 8 CPUs. The practical off-the-shelf operating system limit became four processors and the SMP market sweet spot quickly centered on dual-processor systems.

With today’s top processors clocking in at around 3GHz and sporting front-side bus speeds of 800MHz, the idea of eliminating that central control switch to improve scalability has been revisited by a number of CPU chip designers including AMD.

What's more, multi-drop bus architectures such as that for PCI-X exhibit significant electrical and bandwidth degradation as more hardware devices are added. As a result, high-speed devices have to be located on separate PCI-X busses to maintain full 133MHz clocking.

While CPU scalability is an admirable goal, the market demand for single systems that sport 32 or more processors is not exactly overwhelming. There is, however, another interesting aspect of NUMA beyond scalability: system cost reduction.

 
     
 

To address this problem, AMD debuted a new microarchitecture, dubbed "HyperTransport technology" (HTT), when it launched the 64-bit Opteron CPU. HyperTransport is quite different from Intel's Hyper-Threading Technology, which is also referred to by the initials HTT. In creating this new I/O scheme, AMD focused on developing a universal electronic signaling technology for interprocessor data traffic that would make it easy to build motherboards for an entire spectrum of systems, ranging from simple embedded devices all the way up to supercomputers such as Red Storm for Sandia National Labs.

The idea was to provide system architects with lower pin counts, fewer buses, low-latency responses, and more bandwidth, while maintaining compatibility with legacy PC buses. What's more, AMD was able to depart from the Northbridge/Southbridge interconnect model by replacing traditional bridges and hubs with the construct of flow-through HyperTransport tunnels for I/O links. These tunnels transfer data between devices at up to 12.8GB per second using two unidirectional 6.4GB per second point-to-point links. The tunnels have the added benefit of eliminating the arbitration overhead necessary on a shared bus.

Following that construct of decreasing cost through greater integration, every 64-bit AMD (AMD64) Opteron CPU is designed with its own integrated low-latency memory controller. That in turn makes a simple NUMA architecture a natural fit for Opteron-based multiprocessor systems. Each CPU is given its own bank of DDR memory and all inter-processor communications are passed over a HyperTransport link. In theory, memory access speeds should scale directly with processor frequency, with potential memory throughput at the magical 12.8GB per second. What's more, by reducing the number of silicon components that are necessary, AMD's NUMA architecture also makes it easier to design a processor board that includes more connections for the larger memory space that a 64-bit system can support.

Included among those inter-processor communications being passed around are memory access requests. Such a scheme means memory access times will not be uniform: accessing memory attached to a processor's own memory controller will naturally be faster than accessing memory attached to another processor. To provide the lowest latency for overall memory access, the operating system needs to be NUMA-aware and geared to expect different response times when a processor requests information from memory.

To make it easier for operating system developers to port their OS to multiprocessor Opteron-based servers, AMD64 technology attempts to make this issue transparent. At boot time, the BIOS for multiprocessor AMD64 systems calls on each processor to locate its local memory and then maps that into a single globally addressable physical memory space. What’s more, AMD64 processors automatically maintain cache coherency across that global address space. As a result, SUSE was able to support dual-processor Opteron configurations under SUSE LINUX Enterprise Server (SLES) 8, which was built on the 2.4 Linux kernel. Software was able to access any data belonging to any processor with a normal memory operation to the global address space as if it were a traditional SMP system.
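The boot-time mapping described above can be pictured with a small model. What follows is a minimal sketch, not the actual BIOS logic: it assumes each node's local memory is laid out contiguously, one node after another, in the global physical address space, and it uses the 2GB-per-CPU configuration of our test server. The function names `build_address_map` and `home_node` are hypothetical and exist only for illustration.

```python
from bisect import bisect_right

GIB = 1 << 30  # bytes in one gibibyte


def build_address_map(local_sizes):
    """Lay out each node's local memory back to back in one flat space.

    Returns a list of (base, limit, node) tuples. Contiguous, non-interleaved
    placement is an assumption made for this sketch.
    """
    ranges = []
    base = 0
    for node, size in enumerate(local_sizes):
        ranges.append((base, base + size, node))
        base += size
    return ranges


def home_node(ranges, address):
    """Return the node whose memory controller owns a physical address."""
    bases = [r[0] for r in ranges]
    i = bisect_right(bases, address) - 1
    if i < 0 or address >= ranges[i][1]:
        raise ValueError("address outside installed memory")
    return ranges[i][2]


# Four nodes with 2GB each, as in the tested quad-Opteron configuration.
address_map = build_address_map([2 * GIB] * 4)
```

In this model, software sees one 8GB address space, yet every address still has a single home node. An access is local only when the requesting CPU is that home node, which is precisely the detail that the hardware hides and a NUMA-aware operating system tries to exploit.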

 
         
 

Optimal performance on NUMA systems, however, requires that processes be located on processors that are as close as possible to the memory that the process accesses. This requires greater NUMA awareness within the scheduler to support what’s dubbed “locality of processes to memory.” As a result, optimal performance will most often be obtained for a process by allocating all memory for that process from the same processor and dispatching all child processes on the same processor through the life of the parent process.
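An application can also ask for this kind of locality explicitly. The sketch below is an illustration only and was not part of our test procedure: it uses the CPU affinity calls exposed by Python's `os` module on Linux to pin the calling process to a single CPU. The helper name `pin_to_cpu` is hypothetical, and the assumption that memory touched after pinning is allocated near that CPU depends on the kernel's memory policy.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling process to one CPU.

    Returns the resulting affinity set, or None when the platform does not
    expose CPU affinity or the requested CPU is not available to the process.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are Linux-specific
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if cpu not in allowed:
        return None
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

Child processes inherit the affinity mask of the parent, so pinning the parent before it forks keeps the whole process family on one processor, which is the placement the paragraph above describes as optimal.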

The Linux scheduler in the 2.4 kernel had a single-runqueue design, which limited throughput and increased lock contention on systems with multiple CPUs. To improve the scheduler and make it NUMA-aware, a new multi-queue scheduler with a runqueue per processor was developed.
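The difference between one shared runqueue and a runqueue per processor is easier to see in a toy model. The following is a minimal sketch under loud assumptions: it is not the Linux scheduler's algorithm, the class name `MultiQueueScheduler` and its balancing threshold are invented for illustration, and real schedulers also weigh priorities, time slices, and node distance.

```python
from collections import deque


class MultiQueueScheduler:
    """Toy model of a scheduler that keeps one runqueue per CPU."""

    def __init__(self, n_cpus):
        self.runqueues = [deque() for _ in range(n_cpus)]

    def enqueue(self, task, last_cpu=None):
        """Queue a task, preferring the CPU it last ran on (warm cache)."""
        if last_cpu is None:
            # No history: pick the CPU with the shortest queue.
            last_cpu = min(range(len(self.runqueues)),
                           key=lambda c: len(self.runqueues[c]))
        self.runqueues[last_cpu].append(task)
        return last_cpu

    def balance(self, threshold=2):
        """Migrate tasks from the busiest queue to the idlest one.

        Stops once the two queues differ by fewer than `threshold` tasks and
        returns the number of migrations performed.
        """
        if threshold < 2:
            raise ValueError("threshold must be at least 2")
        moved = 0
        while True:
            busiest = max(self.runqueues, key=len)
            idlest = min(self.runqueues, key=len)
            if len(busiest) - len(idlest) < threshold:
                return moved
            idlest.append(busiest.pop())
            moved += 1
```

Because each CPU normally works only from its own queue, there is no single lock for every processor to contend over, and a task tends to stay where its cache and memory already are. Migration happens only when the imbalance grows large enough to justify the cost.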

 
The Tyan K8QS Pro motherboard in the Appro 1142H Server has distinct local memory banks next to each processor. PCI-X slots for devices such as the Emulex 10000 M2 Fibre Channel HBA are handled by an AMD 8131 HTT tunnel. Serial I/O passes through an AMD 8111 HTT processor.
 
     
 

The new scheduler changed the load balancing logic and facilitates dispatching processes on the same processor to take advantage of cache warmth. This makes the execution of individual processes more efficient and produces much lower CPU-to-CPU memory access traffic than what a single shared-memory bus would experience.

To investigate how well theory balances with practicality, openBench Labs tested a 1U quad-processor server from Appro. The outstanding thermal management and innovative mechanical and air duct design of the Appro 1142H server provide efficient cooling for four AMD64 Opteron 848 processors, each clocked at 2.2GHz. This server is designed to deliver the highest power-to-density ratio of any 1U server available for mission-critical technical applications such as video streaming, computational clustering, and serving J2EE and database applications.

We began evaluation by installing the 64-bit version of SUSE LINUX Enterprise Server 9 for AMD64 architecture on the Appro 1142H. Included with SLES 9 for AMD64 architecture is a 64-bit version of the Gnu C compiler (gcc v3.3).

In addition, we installed Intel’s 8.1 C/C++ compilers, which support the new 64-bit Extended Memory 64 Technology (EM64T) found on the new Xeon processors and which is also supported by the full 64-bit Opteron CPUs. In particular, EM64T adds a 64-bit flat virtual address space, which enables up to 1TB of memory; 64-bit wide general-purpose registers; 64-bit pointers; and 64-bit integer support. According to Intel, these changes should give a new Xeon CPU a 40-to-60% edge over a similarly clocked early generation Xeon processor.

In addition to EM64T support, version 8.1 of the Intel C/C++ compiler suite also provides the option to install an integrated Eclipse-based IDE. Unfortunately, when we ran the install script for the Intel C/C++ compiler on the Opteron system, the option to install the Eclipse IDE did not appear. For a baseline comparison, we also installed Intel C/C++ on a dual-Xeon Appro 1224Xi server, which sported 32-bit Xeon processors clocked at 2.4GHz and was configured with SUSE LINUX Professional 9.4. On this server we had no problem installing the Eclipse IDE.

 
         
 

Having previously tested 32-bit versions of our benchmarks on Opteron CPUs, we wanted to investigate the bounds on true 64-bit performance. Attempting to resolve that issue, however, only managed to raise more issues. Whether compiled with Intel C or Gnu C, execution of our computational benchmark, oblCPU, was a study in contrasts. At run time, both the Gnu C and PathLink 64-bit executables generated memory segmentation faults in the same kernels of the suite. The only version of oblCPU to compile and run successfully on the Opteron server was compiled with Intel C v8.1.

With the version 8.1 Intel compiler on both the Xeon- and Opteron-based servers, we invoked the “-xW -ip -O3 -parallel” optimization switches for aggressive optimization and automatic parallelization. These settings generated a significant degree of automatic parallel execution, which was observed with the KDE SystemGuard monitor. We pegged the geometric mean performance of the 33 computational kernels in the oblCPU suite as 3.15 times faster than that of the oblCPU suite compiled with Gnu C 3.2 running on a 1GHz P-III processor.
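For readers unfamiliar with the statistic, the geometric mean of a set of speed ratios can be computed as follows. This is a minimal sketch: the function name `geometric_mean` is ours for illustration, and any ratios passed to it in an example are hypothetical rather than the per-kernel results of the oblCPU suite.

```python
import math


def geometric_mean(ratios):
    """Return the geometric mean of positive speed ratios.

    Each ratio is (baseline time / measured time) for one kernel, so a value
    of 2.0 means the kernel ran twice as fast as on the baseline system.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # Averaging logarithms avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

The geometric mean is the usual choice for summarizing ratios because one kernel that speeds up enormously cannot dominate the result the way it would in an arithmetic average.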

 
The best compiler for harnessing existing code on the multiprocessor Opteron-based system with its NUMA architecture is Intel’s v8.1. Both our computational and I/O benchmarks suffered from runtime memory segmentation faults when compiled with 64-bit Gnu C on the Opteron system. All programs compiled for 32-bit X86 systems worked flawlessly.
 
         
 

We then compiled the oblCPU suite with Gnu C 3.4 and set the optimization switches at “-O3 -ffast-math -mfpmath=sse -funroll-loops -fpeel-loops.” Once again we observed significant processing activity on multiple processors. On the dual-Xeon, throughput was measured at 1.85 times faster than on our 1GHz P-III standard. This gave the Intel C compiler a 70% performance edge on the dual-Xeon system.

 
On IA-32 SMP architecture, oblCPU was automatically parallelized to run across multiple processors when compiled with either Gnu C or Intel C.
 
         
 

Next we ran both versions of oblCPU on the Appro quad-Opteron server. The Gnu C version of the oblCPU suite improved 58% as performance jumped to 2.92 times that of our 1GHz P-III system. 

The Intel C version showed a minimal 10% improvement. What’s more, when we recompiled oblCPU using the Intel C v8.1 compiler on the quad-Opteron server, performance was statistically equivalent to performance of the code compiled on the dual-Xeon server.

On the Opteron system, this put the Intel-compiled code edge at 19%. Nonetheless, this edge evaporated for the computational kernels that did not incur runtime memory segmentation faults. For those kernels, performance of executables compiled with 64-bit Gnu C (we added the “-march=k8” optimization switch) was on a par with the Intel C executable.
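The percentages quoted for oblCPU follow directly from the measured ratios against our 1GHz P-III baseline. The short sketch below reproduces the arithmetic. One figure in it is derived rather than reported: the Intel C result on the quad-Opteron is assumed to be the dual-Xeon figure of 3.15 increased by the stated 10%. The helper name `edge_percent` is ours.

```python
def edge_percent(faster, slower):
    """Return how much faster one result is than another, in percent."""
    return (faster / slower - 1.0) * 100.0


# Speed relative to the 1GHz P-III baseline, as reported in the text.
intel_xeon = 3.15   # Intel C v8.1 on the dual-Xeon
gnu_xeon = 1.85     # Gnu C on the dual-Xeon
gnu_opteron = 2.92  # Gnu C on the quad-Opteron

# Assumption: "a minimal 10% improvement" is measured against the
# dual-Xeon Intel C figure.
intel_opteron = intel_xeon * 1.10

print(f"Intel C vs Gnu C on dual-Xeon:    {edge_percent(intel_xeon, gnu_xeon):.0f}%")
print(f"Gnu C, quad-Opteron vs dual-Xeon: {edge_percent(gnu_opteron, gnu_xeon):.0f}%")
print(f"Intel C vs Gnu C on quad-Opteron: {edge_percent(intel_opteron, gnu_opteron):.0f}%")
```

Under that assumption, the three results round to 70%, 58%, and 19%, matching the figures cited for the two servers.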

 
On the AMD64 NUMA architecture, oblCPU executed on just a single processor. This limited the performance improvement garnered by running a single instance of oblCPU on a quad-Opteron rather than a dual-Xeon system. This behavior, however, enabled linear scalability for the execution of four instances of oblCPU.
 
     
 

Far more interesting was the disappearance of parallelization from the benchmark suite on the NUMA quad-Opteron server. Apparently to preserve locality of reference for the data, both the Gnu C- and Intel C-compiled versions of oblCPU ran strictly on one processor. As a result, the Intel C-compiled version did not take advantage of any parallelization and performance was capped at that of a single processor. For this reason, performance for a single instance of our oblCPU benchmark improved a mere 10% when we moved from the dual-Xeon system to the quad-Opteron system.

While performance improvement for a single instance of oblCPU was minimal, a very different picture was provided when we ran four simultaneous instances of oblCPU. In this scenario, the quad-Opteron provided perfectly linear scaling as each instance of the benchmark suite was given an entire processor.

 
         
 

Given the performance of the oblCPU benchmark suite, we chose to launch our oblMemBench benchmark for memory bandwidth so that it would run only a single process at a time. Our normal procedure when running oblMemBench is to run the benchmark simultaneously on all available CPUs to determine overall bandwidth by measuring cumulative throughput. This choice to focus in on the throughput of a single process generated a number of surprising results.

On both the Xeon-based SMP system and the Opteron-based NUMA system, child processes jumped processors. This behavior was not surprising for SMP, but it certainly was for NUMA. Even more surprising, we observed upper and lower bounds on memory throughput on both the SMP and the NUMA system.

On the Opteron-based NUMA system, the differences in memory throughput are readily attributable to the differences in overhead for CPUs when accessing memory locally versus remotely. Remote access requires intervention by the CPU for which the desired memory address is local.

 
To measure memory bandwidth, oblMemBench thrashes a fixed block of memory by accessing data locations using increasingly larger strides. Unlike oblCPU, however, oblMemBench launched child processes on different processors.
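The access pattern behind a strided memory test can be sketched in a few lines. This is not the oblMemBench source, which is not reproduced here; the function names `strided_pass` and `sweep`, the block size, and the stride range are all assumptions chosen for illustration. Interpreter overhead dominates the timings in Python, so the sketch shows the pattern rather than real bandwidth.

```python
import time


def strided_pass(block, stride):
    """Read every stride-th byte of the block once; return the read count."""
    touched = 0
    for i in range(0, len(block), stride):
        _ = block[i]  # the read itself is the work being measured
        touched += 1
    return touched


def sweep(block_size=1 << 20, max_stride=1 << 12):
    """Walk one fixed block with strides of 1, 2, 4, ... up to max_stride.

    Returns a list of (stride, reads, elapsed_seconds) tuples.
    """
    block = bytearray(block_size)
    results = []
    stride = 1
    while stride <= max_stride:
        start = time.perf_counter()
        touched = strided_pass(block, stride)
        elapsed = time.perf_counter() - start
        results.append((stride, touched, elapsed))
        stride *= 2
    return results
```

In a compiled implementation, small strides reuse each cache line many times, while strides larger than a cache line force nearly every read out to memory. That is why throughput measured this way exposes the cost of reaching memory, whether local or remote.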
 
         
 

On the Xeon-based SMP system, however, there is no notion of a local or remote node. What the Xeon does have is Hyper-Threading Technology, which creates the construct of a virtual CPU in addition to each physical CPU. That’s why both Windows and Linux report the dual-processor Xeon as a quad-processor system. Furthermore, the differences between real and virtual Hyper-Threading processors introduce an environment on SMP Xeon systems that in some ways parallels the NUMA environment of the Opteron system when it comes to accessing memory.

Even more interesting from an I/O throughput perspective, however, was the performance of oblMemBench when compiled on the Opteron system with the 64-bit version of Gnu C. In this case, there were no runtime memory segmentation faults generated by the 64-bit compiled code. Even more important, memory throughput measured running the benchmark created with the Gnu C 64-bit compiler doubled in comparison to the Intel C version.

What’s more, accesses at the shortest strides were served entirely from cache. As a result, memory access appeared virtually instantaneous at the resolution of the system clock. This result has interesting implications for I/O and DMA transfers.

To delve deeper into the I/O issue, we attempted to compile and run our disk benchmarks with the 64-bit version of Gnu C. Once again, however, these efforts were thwarted by the occurrence of runtime errors in the guise of memory segmentation faults. As a result, we turned to compiling the benchmarks on the quad-Opteron system using Intel C v8.1.

In earlier tests of iSCSI and Fibre Channel I/O using benchmarks compiled on a 32-bit Xeon system, the Appro 1142H sustained 86.2MB per second using a single logical Fibre Channel drive and scaled to just under 150MB per second using three logical drives exported by an nStor 4520 array.

 
Memory throughput on both the quad-Opteron and dual-Xeon systems was characterized by upper and lower bounds. On the NUMA system, this asymmetry was a result of differences in overhead for memory access between processors that are local and processors that are remote vis-à-vis the memory block. On the SMP dual-Xeon system, the asymmetry was caused by virtual processors created by Hyper-Threading Technology.
 
         
 

Using executables compiled with Intel C v8.1, we had measured a 20% edge in memory bandwidth for the quad-Opteron system over the dual-Xeon system. We therefore fully anticipated that I/O throughput using PCI-X expansion slots would demonstrate a similar edge on the quad-Opteron system. It did not.

 For this test we used an Emulex 10000 M2 Fibre Channel HBA, which supports a PCI-X speed of 133MHz. We ran our oblFileRead benchmark which opens and reads all of the files in a target directory.
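A benchmark of this type is simple in outline. The sketch below is a rough stand-in rather than the oblFileRead source: the function names, the 1MB read size, and the choice to walk subdirectories are assumptions made for illustration.

```python
import os
import time


def read_directory(target_dir, chunk_size=1 << 20):
    """Open and read every file under target_dir.

    Returns (total_bytes_read, elapsed_seconds).
    """
    total_bytes = 0
    start = time.perf_counter()
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break  # end of file
                    total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed


def throughput_mb_per_s(total_bytes, elapsed):
    """Convert a byte count and duration into megabytes per second."""
    if elapsed <= 0:
        return float("inf")
    return total_bytes / 1e6 / elapsed
```

A run of this kind reports average throughput only. Capturing instantaneous peak rates would require sampling the byte count at fixed intervals during the read, and a test data set much larger than system memory helps keep the file cache from inflating the numbers.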

 
Reversing the results of both our oblCPU and oblMemBench tests, average PCI-X I/O throughput measured with our oblFileRead benchmark was marginally higher on the dual-Xeon system. More interesting, instantaneous peak throughput was significantly higher on the Xeon-based system.
 
     
 

In this case, we ran the benchmark against our standard backup-test directory, which contains 10GB of data. This directory was stored on a 4-disk RAID0 array built with 15,000 rpm Seagate FC drives on an nStor 4520 Storage Array.

The results of our I/O test were quite different from all of the other benchmarks. Using the Intel C compiler to create executables on each system, CPU performance and memory throughput on the quad-Opteron with its HyperTransport design and NUMA architecture consistently demonstrated a 10-to-20% edge over the IA-32 dual Xeon system.

That paradigm was entirely reversed with the results of our oblFileRead benchmark. Running this benchmark, the dual-Xeon system held a slight 4% edge in overall throughput. More remarkably, instantaneous I/O burst rates over our Fibre Channel storage network were typically 20% greater on the IA-32 Xeon-based system than on the HyperTransport Opteron-based system.

In coming weeks we will round out these tests of 64-bit processing with examinations of both 64-bit Xeon and Itanium servers.