AMD64 NUMAOLOGY

With the introduction of the Linux 2.6 kernel, multiprocessor systems now come in two flavors: traditional SMP and NUMA. The differences are neither transparent nor trivial.
For Intel IA-32 architecture, the solution was the “Northbridge” Memory Controller Hub (MCH). All processors compete for a fixed amount of memory and front-side bus bandwidth, which makes system scalability directly dependent on the implementation of the shared bus and MCH. From the start, SMP systems hit the scalability wall at 8 CPUs. The practical off-the-shelf operating system limit became four processors, and the SMP market sweet spot quickly centered on dual-processor systems.

With today’s top processors clocking in at around 3GHz and sporting front-side bus speeds of 800MHz, the idea of eliminating that central control switch to improve scalability has been revisited by a number of CPU chip designers, including AMD. What's more, multi-drop bus architectures such as PCI-X exhibit significant electrical and bandwidth degradation as more hardware devices are added. As a result, high-speed devices have to be located on separate PCI-X buses to maintain full 133MHz clocking.

While CPU scalability is an admirable goal, the market demand for single systems that sport 32 or more processors is not exactly overwhelming. There is, however, another interesting aspect of NUMA beyond scalability: system cost reduction.
To address this problem, AMD debuted a new I/O architecture, dubbed "HyperTransport technology" (HTT), when it launched the 64-bit Opteron CPU. It is quite different from Intel's Hyper Threading Technology, which is also referred to with the initials HTT. In creating this new I/O scheme, AMD focused on developing a universal electronic signaling technology for interprocessor data traffic that would make it easy to build motherboards for an entire spectrum of systems, ranging from simple embedded devices all the way up to supercomputers like Red Storm for Sandia National Labs.

Following that construct of decreasing cost through greater integration, every 64-bit AMD (AMD64) Opteron CPU is designed with its own integrated low-latency memory controller. That in turn makes a simple NUMA architecture a natural for Opteron-based multiprocessor systems. Each CPU is given its own bank of DDR memory, and all inter-processor communications are passed over a HyperTransport link. In theory, memory access speeds should scale directly with processor frequency, with potential memory throughput at the magical 12.8GB per second. What's more, by reducing the number of silicon components that are necessary, AMD's NUMA architecture also makes it easier to design a processor board that includes more connections for the larger memory space that a 64-bit system can support.

Included among those inter-processor communications are memory access requests. Such a scheme naturally means memory access times will not be uniform, because accessing memory attached to a processor's own memory controller will be faster than accessing memory attached to another processor. To provide the lowest latency for overall memory access, the operating system needs to be NUMA-aware and geared to expect different response times when a processor requests information from memory.

To make it easier for operating system developers to port their OS to multiprocessor Opteron-based servers, AMD64 technology attempts to make this issue transparent. At boot time, the BIOS for multiprocessor AMD64 systems calls on each processor to locate its local memory and then maps that memory into a single globally addressable physical memory space. What’s more, AMD64 processors automatically maintain cache coherency across that global address space. As a result, SUSE was able to support dual-processor Opteron configurations under SUSE LINUX Enterprise Server (SLES) 8, which was built on the 2.4 Linux kernel. Software was able to access any data belonging to any processor with a normal memory operation to the global address space, as if it were a traditional SMP system.
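The boot-time mapping described above can be pictured with a minimal Python sketch. It assumes a hypothetical four-processor system with 2GB of local memory per CPU, laid end to end with no holes or interleaving. Real BIOS memory maps are more involved, and every function name here is illustrative rather than part of any AMD or Linux interface.

```python
from bisect import bisect_right

GIB = 1 << 30  # bytes per gigabyte (binary)


def build_memory_map(node_sizes):
    """Lay each node's local memory end to end in one physical address space.

    Returns a list of (start, end, node) tuples, with end exclusive.
    The contiguous layout is an assumption made for illustration.
    """
    memory_map = []
    base = 0
    for node, size in enumerate(node_sizes):
        memory_map.append((base, base + size, node))
        base += size
    return memory_map


def home_node(memory_map, address):
    """Return the node whose memory controller owns a physical address."""
    starts = [start for start, _end, _node in memory_map]
    index = bisect_right(starts, address) - 1
    if index < 0:
        raise ValueError("address outside installed memory")
    start, end, node = memory_map[index]
    if not (start <= address < end):
        raise ValueError("address outside installed memory")
    return node


def is_local(memory_map, cpu_node, address):
    """True when the access stays on the requesting CPU's own controller."""
    return home_node(memory_map, address) == cpu_node


if __name__ == "__main__":
    # Hypothetical configuration: four CPUs with 2GB each.
    memory_map = build_memory_map([2 * GIB] * 4)
    for address in (0, 3 * GIB, 8 * GIB - 1):
        node = home_node(memory_map, address)
        print(hex(address), "-> node", node, "| local to CPU 0:", is_local(memory_map, 0, address))
```

Software sees one flat address space either way. The only thing that changes from one address to the next is whether the request is served by the local controller or has to travel over a HyperTransport link.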
Optimal performance on NUMA systems, however, requires that processes be located on processors that are as close as possible to the memory that the process accesses. This requires greater NUMA awareness within the scheduler to support what’s dubbed “locality of processes to memory.” As a result, optimal performance will most often be obtained for a process by allocating all memory for that process from the same processor and dispatching all child processes on the same processor through the life of the parent process.

The Linux scheduler in the 2.4 kernel had a single-runqueue design, which limited throughput and increased lock contention on systems with multiple CPUs. To improve the scheduler and make it NUMA-aware, a new multi-queue scheduler with a runqueue per processor was developed.
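A process can also request this kind of locality for itself. The sketch below is an illustration, not part of our benchmark code. It uses Python's `os.sched_setaffinity`, a wrapper around the Linux CPU affinity call, to restrict the calling process to a single CPU. The helper name `pin_to_cpu` is hypothetical. We also assume the kernel's default policy of allocating memory on the node where a process is running, so that memory touched after pinning tends to be local.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU.

    Returns the resulting affinity set, or None on platforms that do not
    expose the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        print("allowed CPUs before pinning:", sorted(allowed))
        print("allowed CPUs after pinning:", sorted(pin_to_cpu(min(allowed))))
    else:
        print("CPU affinity calls are not available on this platform")
```

Child processes inherit the affinity mask, which matches the goal of keeping children on the same processor as the parent.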
The new scheduler changed the load-balancing logic and facilitates dispatching processes on the same processor to take advantage of cache warmth. This makes the execution of individual processes more efficient and produces much lower CPU-to-CPU memory access traffic than what a single shared-memory bus would experience.

To investigate how well theory balances with practicality, openBench Labs tested a 1U quad-processor server from Appro. The outstanding thermal management and innovative mechanical and air duct design of the Appro 1142H server provide efficient cooling for four AMD64 Opteron 848 processors, each clocked at 2.2GHz. This server is designed to deliver the highest power-to-density ratio of any 1U server available for mission-critical technical applications such as video streaming, computational clustering, and serving J2EE and database applications.

We began our evaluation by installing the 64-bit version of SUSE LINUX Enterprise Server 9 for AMD64 architecture on the Appro 1142H. Included with SLES 9 for AMD64 architecture is a 64-bit version of the Gnu C compiler (gcc v3.3). In addition, we installed Intel’s 8.1 C/C++ compilers, which support the new 64-bit Extended Memory 64 Technology (EM64T) found on the new Xeon processors and which is also supported by the full 64-bit Opteron CPUs. In particular, EM64T adds a 64-bit flat virtual address space, which enables up to 1TB of memory; 64-bit wide general-purpose registers; 64-bit pointers; and 64-bit integer support. According to Intel, these changes should give a new Xeon CPU a 40-to-60% edge over a similarly clocked early generation Xeon processor.

In addition to EM64T support, version 8.1 of the Intel C/C++ compiler suite also provides the option to install an integrated Eclipse-based IDE. Unfortunately, when we ran the install script for the Intel C/C++ compiler on the Opteron system, the option to install the Eclipse IDE did not appear. For a baseline comparison, we also installed Intel C/C++ on a dual-Xeon Appro 1224Xi server, which sported 32-bit Xeon processors clocked at 2.4GHz and was configured with SUSE LINUX Professional 9.4. On this server we had no problem installing the Eclipse IDE.
Having previously tested 32-bit versions of our benchmarks on Opteron CPUs, we wanted to investigate the bounds on true 64-bit performance. Attempting to resolve that issue, however, only managed to raise more issues. Whether compiled with Intel C or Gnu C, execution of our computational benchmark, oblCPU, was a study in contrasts. At run time, both the Gnu C and PathLink 64-bit executables generated memory segmentation faults in the same kernels of the suite. The only version of oblCPU to compile and run successfully on the Opteron server was compiled with Intel C v8.1.

With the version 8.1 Intel compiler on both the Xeon- and Opteron-based servers, we invoked the “-xW -ip -O3 -parallel” optimization switches for aggressive optimization and automatic parallelization. These settings generated a significant degree of automatic parallel execution, which was observed with the KDE SystemGuard monitor. We pegged the geometric mean performance of the 33 computational kernels in the oblCPU suite at 3.15 times that of the oblCPU suite compiled with Gnu C 3.2 running on a 1GHz P-III processor.
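For readers unfamiliar with the metric, a geometric mean of per-kernel speedups can be computed as in the following sketch. The timings in the example are placeholders chosen for easy arithmetic, not oblCPU measurements, and the function names are our own.

```python
import math


def speedup_ratios(baseline_seconds, measured_seconds):
    """Per-kernel speedup: baseline run time divided by measured run time."""
    return [b / m for b, m in zip(baseline_seconds, measured_seconds)]


def geometric_mean(ratios):
    """Geometric mean, computed via logarithms to avoid overflow on long lists."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Placeholder timings for three imaginary kernels (seconds).
    baseline = [8.0, 2.0, 4.0]   # reference system
    measured = [2.0, 1.0, 4.0]   # system under test
    ratios = speedup_ratios(baseline, measured)
    print("per-kernel speedups:", ratios)
    print("geometric mean speedup: %.2f" % geometric_mean(ratios))
```

The geometric mean is the usual choice for ratios like these because one kernel with an outsized speedup cannot dominate the result the way it would in an arithmetic mean.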
We then compiled the oblCPU suite with Gnu C 3.4 and set the optimization switches at “-O3 -ffast-math -mfpmath=sse -funroll-loops -fpeel-loops.” Once again we observed significant processing activity on multiple processors. On the dual-Xeon, throughput was measured at 1.85 times that of our 1GHz P-III standard. This gave the Intel C compiler a 70% performance edge on the dual-Xeon system.
Next we ran both versions of oblCPU on the Appro quad-Opteron server. The Gnu C version of the oblCPU suite improved 58% as performance jumped to 2.92 times that of our 1GHz P-III system. The Intel C version showed a minimal 10% improvement. What’s more, when we recompiled oblCPU using the Intel C v8.1 compiler on the quad-Opteron server, performance was statistically equivalent to that of the code compiled on the dual-Xeon server. On the Opteron system, this put the Intel-compiled code edge at 19%. Nonetheless, this edge evaporated for the computational kernels that did not incur runtime memory segmentation faults. For those kernels, performance of the executables compiled with 64-bit Gnu C (we added the “-march=k8” optimization switch) was on a par with the Intel C executable.
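The percentages quoted for oblCPU follow from the relative performance figures, as the short check below shows. It assumes that the 3.15 figure refers to the Intel C build on the dual-Xeon system. The Opteron figure for Intel C is derived here from the quoted 10% improvement and is not a number reported in the text.

```python
# Relative performance figures quoted in the text
# (baseline: Gnu C 3.2 on a 1GHz P-III = 1.0).
XEON_INTEL_C = 3.15             # dual-Xeon, Intel C v8.1 (assumed to be the Xeon figure)
XEON_GNU_C = 1.85               # dual-Xeon, Gnu C 3.4
OPTERON_GNU_C = 2.92            # quad-Opteron, Gnu C
INTEL_C_GAIN_ON_OPTERON = 0.10  # "minimal 10% improvement"


def edge_percent(faster, slower):
    """Percentage by which one relative score exceeds another."""
    return (faster / slower - 1.0) * 100.0


# Derived value, not quoted in the article.
opteron_intel_c = XEON_INTEL_C * (1.0 + INTEL_C_GAIN_ON_OPTERON)

xeon_edge = edge_percent(XEON_INTEL_C, XEON_GNU_C)
gnu_gain = edge_percent(OPTERON_GNU_C, XEON_GNU_C)
opteron_edge = edge_percent(opteron_intel_c, OPTERON_GNU_C)

if __name__ == "__main__":
    print("Intel C edge on dual-Xeon:        %.0f%%" % xeon_edge)
    print("Gnu C gain, Xeon to Opteron:      %.0f%%" % gnu_gain)
    print("Intel C edge on quad-Opteron:     %.0f%%" % opteron_edge)
```

Rounded to whole percentages, the three results agree with the 70%, 58%, and 19% figures cited above.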
Far more interesting was the disappearance of parallelization from the benchmark suite on the NUMA quad-Opteron server. Apparently to preserve locality of reference for the data, both the Gnu C- and Intel C-compiled versions of oblCPU ran strictly on one processor. As a result, the Intel C-compiled version did not take advantage of any parallelization, and performance was capped at that of a single processor. For this reason, performance for a single instance of our oblCPU benchmark improved a mere 10% when we moved from the dual-Xeon system to the quad-Opteron system.

While performance improvement for a single instance of oblCPU was minimal, a very different picture emerged when we ran four simultaneous instances of oblCPU. In this scenario, the quad-Opteron provided perfectly linear scaling, as each instance of the benchmark suite was given an entire processor.
Given the performance of the oblCPU benchmark suite, we chose to launch our oblMemBench benchmark for memory bandwidth so that it would run only a single process at a time. Our normal procedure when running oblMemBench is to run the benchmark simultaneously on all available CPUs to determine overall bandwidth by measuring cumulative throughput. This choice to focus on the throughput of a single process generated a number of surprising results.

On both the Xeon-based SMP system and the Opteron-based NUMA system, child processes jumped processors. This behavior was not surprising for SMP, but it certainly was for NUMA. Even more surprising, we observed upper and lower bounds on memory throughput on both the SMP and the NUMA systems. On the Opteron-based NUMA system, the differences in memory throughput are readily attributable to the differences in overhead for CPUs when accessing memory locally versus remotely. Remote access requires intervention by the CPU for which the desired memory address is local.
On the Xeon-based SMP system, however, there is no notion of a local or remote node. Instead, the Xeon's Hyper Threading Technology creates the construct of a virtual CPU in addition to the physical CPU. That’s why both Windows and Linux report the dual-processor Xeon as a quad-processor system. Furthermore, the differences between real and virtual Hyper Threading processors introduce an environment on SMP Xeon systems that in some ways parallels the NUMA environment of the Opteron system when it comes to accessing memory.

Even more interesting from an I/O throughput perspective, however, was the performance of oblMemBench when compiled on the Opteron system with the 64-bit version of Gnu C. In this case, there were no runtime memory segmentation faults generated by the 64-bit compiled code. Even more important, memory throughput measured running the benchmark created with the Gnu C 64-bit compiler doubled in comparison to the Intel C version. What’s more, performance for the shortest strides was entirely cached. As a result, memory access was virtually instantaneous for the system clock. This result has interesting implications for I/O and DMA transfers.

To delve deeper into the I/O issue, we attempted to compile and run our disk benchmarks with the 64-bit version of Gnu C. Once again, however, these efforts were thwarted by the occurrence of runtime errors in the guise of memory segmentation faults. As a result, we turned to compiling the benchmarks on the quad-Opteron system using Intel C v8.1. In earlier tests of iSCSI and Fibre Channel I/O using benchmarks compiled on a 32-bit Xeon system, the Appro 1142H sustained 86.2MB per second using a single logical Fibre Channel drive and scaled to just under 150MB per second using three logical drives exported by an nStor 4520 array.
While testing memory bandwidth on both systems using executables compiled with Intel C v8.1, we measured a 20% throughput edge for the quad-Opteron system over the dual-Xeon system. We therefore fully anticipated that I/O throughput using PCI-X expansion slots would demonstrate a similar edge on the quad-Opteron system. It did not.

For this test we used an Emulex 10000 M2 Fibre Channel HBA, which supports a PCI-X speed of 133MHz. We ran our oblFileRead benchmark, which opens and reads all of the files in a target directory.
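The general shape of such a file-read test can be sketched in a few lines of Python. This is not the oblFileRead source. The function names, the 1MB read size, and the decision to descend into subdirectories are all assumptions made for illustration.

```python
import os
import sys
import time


def read_directory(target_dir, block_size=1 << 20):
    """Open and read every file under target_dir, returning
    (file_count, total_bytes, elapsed_seconds)."""
    total_bytes = 0
    file_count = 0
    start = time.perf_counter()
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                while True:
                    chunk = handle.read(block_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
            file_count += 1
    elapsed = time.perf_counter() - start
    return file_count, total_bytes, elapsed


def throughput_mb_per_s(total_bytes, elapsed_seconds):
    """Average throughput in megabytes (10**6 bytes) per second."""
    if elapsed_seconds <= 0:
        return float("inf")
    return total_bytes / 1e6 / elapsed_seconds


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    count, size, seconds = read_directory(directory)
    print("%d files, %d bytes, %.1f MB per second"
          % (count, size, throughput_mb_per_s(size, seconds)))
```

A test like this measures the storage path only if the data set is much larger than system memory or the file cache is cold. Otherwise repeat runs are served largely from RAM.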
In this case, we ran the benchmark against our standard backup-test directory, which contains 10GB of data. This directory was stored on a 4-disk RAID0 array built with 15,000 rpm Seagate FC drives on an nStor 4520 Storage Array.

The results of our I/O test were quite different from all of the other benchmarks. Using the Intel C compiler to create executables on each system, CPU performance and memory throughput on the quad-Opteron with its HyperTransport design and NUMA architecture consistently demonstrated a 10-to-20% edge over the IA-32 dual-Xeon system. That paradigm was entirely reversed with the results of our oblFileRead benchmark. Running this benchmark, the dual-Xeon system held a slight 4% edge in overall throughput. More remarkably, instantaneous I/O burst rates over our Fibre Channel storage network were typically 20% greater on the IA-32 Xeon-based system than on the HyperTransport Opteron-based system.

In coming weeks we will round out these tests of 64-bit processing with examinations of both 64-bit Xeon and Itanium servers.