DUAL DUO HYPER-THREADING

If you are serious about performance, Intel Xeon CPUs and PCI-X peripheral slots are de rigueur for servers, but not sufficient!

   
 
by Jack Fegreus

March 27, 2003
   
     
 

When openBench Labs began testing an extraordinary enterprise-class 1U server from Appro, we fully expected the tests to be simple and uncomplicated. After all, this 30lb 1U server packs one of the fastest, most powerful CPU cores available: dual Intel Xeon CPUs paired with the E7501 chipset and its 533MHz front side bus. Rounding out our Appro 1224Xi server were 1GB of PC2100 ECC DDR memory (the Tyan motherboard in the server supports up to 12GB), an embedded Adaptec Ultra320 SCSI controller, two Intel gigabit Ethernet ports, and a 64-bit 133MHz PCI-X expansion slot.

With all that processing power in a 1U system, it should come as no surprise that Appro servers target applications like financial modeling, digital rendering, and seismic analysis, all of which require the highest floating-point and memory-bandwidth performance. And that's just what the new generation of Intel Xeon processors with 512KB L2 cache and a 533MHz system bus delivers.

 
         
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Dual-processor Xeon server

WHAT WE TESTED

Appro 1224Xi-21 1U Server

Dual 2.4GHz Intel Xeon CPUs
1GB DDR memory
Embedded Ultra320 SCSI
(2) Intel 10/100/1000-Mbit Ethernet ports
133MHz PCI-X expansion slot
ATI Rage graphics controller


Intel C++ for Linux version 7.0
Free non-commercial license
 

HOW WE TESTED
Red Hat Linux 8.0 (kernel 2.4.18-27)
SMP Kernel 2.4.18-27
GNU C 3.2

KEY FINDINGS

The Linux kernel must be at least 2.4.17
With application code that does not take advantage of HTT, a 2.4GHz Xeon CPU benchmarked like a 1.8GHz P-III
Benchmarks compiled with GNU C 3.2 or Visual Studio 6.0 on Windows did not exploit HTT
 

The E7501 chipset is tailored for dual-processor configurations using what Intel dubs "NetBurst microarchitecture." At the heart of this scheme are three component interconnects, each called a Hub Interface 2.0 (HI2.0). Each HI2.0 provides 1.066 GB/s of peak I/O bandwidth for PCI/PCI-X bridges, which adds up to 3.2 GB/s of peak bandwidth for high-speed I/O. That much I/O traffic could put a serious crimp in memory resources. To avoid any memory bottleneck, the E7501 chipset features a 533MHz front side bus capable of delivering 4.27 GB/s of data from DDR266 DIMMs.
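The bandwidth figures above can be checked with simple arithmetic. The sketch below is only an illustration: the 133.33MHz base clock, the four transfers per clock on an 8-byte-wide front side bus, and the two channels of DDR266 are our assumptions about how the quoted 4.27 GB/s arises, and the function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Peak transfer rate in decimal GB/s:
 * clock (MHz) x transfers per clock x bytes per transfer x channels / 1000. */
double peak_gb_per_s(double clock_mhz, int transfers_per_clock,
                     int bytes_per_transfer, int channels)
{
    assert(clock_mhz > 0.0 && channels > 0);
    return clock_mhz * transfers_per_clock * bytes_per_transfer * channels
           / 1000.0;
}

/* Aggregate peak bandwidth of several identical I/O links. */
double aggregate_gb_per_s(int links, double per_link_gb_per_s)
{
    return links * per_link_gb_per_s;
}

/* Print the bandwidth budget discussed in the text. */
void print_bandwidth_budget(void)
{
    /* Assumed: 133.33MHz base clock, 4 transfers per clock, 8-byte bus. */
    double fsb = peak_gb_per_s(133.333, 4, 8, 1);
    /* Assumed: two channels of DDR266 (133.33MHz, 2 transfers per clock). */
    double mem = peak_gb_per_s(133.333, 2, 8, 2);
    /* From the text: three HI2.0 links at 1.066 GB/s each. */
    double io = aggregate_gb_per_s(3, 1.066);

    printf("Front side bus: %.2f GB/s\n", fsb);
    printf("DDR266 memory:  %.2f GB/s\n", mem);
    printf("HI2.0 I/O:      %.2f GB/s\n", io);
}
```

Under these assumptions the three figures come out near 4.27, 4.27, and 3.20 GB/s, so the front side bus and memory are matched to each other and both sit above the combined peak I/O load.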

The front side bus specification is very important, given that memory latency has become a major bottleneck to achieving high performance in many applications. To this end, we have seen a number of new hardware technologies introduced to speed memory access. At the same time, a number of multithreading techniques have emerged to hide memory latency.

One of the most promising ways to enhance the performance of applications is to exploit thread-level parallelism. This is just what Intel has been doing in its Xeon CPU chips and has now added to the latest generation of Pentium 4 CPUs. To make this work, two critical pieces of technology are absolutely necessary. The first piece is a bit of chip-level legerdemain that Intel calls "Hyper-Threading Technology" (HTT). This new architecture permits a single CPU chip to execute instructions from different threads in parallel. As a result, a single hyper-threading processor is able to manage instructions as if it were two independent processors.

To do this, all of the new Xeon (server) and Pentium 4 (workstation) processors sport a characteristic 533MHz front side bus and a rich set of performance-enabling features, including distinctively large register caches and two sets of Streaming SIMD Extensions: SSE and SSE2. These features are what make vectorization pay off. In a nutshell, vectorization detects patterns of sequential data accesses by the same instruction and transforms that code for Single Instruction Multiple Data (SIMD) execution. As a result, the SSE2 instructions and large registers are the keys to enabling multiple functional units to operate simultaneously on packed data elements, which are used to represent short vectors. The Intel compilers use a number of esoteric optimization technologies, including alignment optimizations, to vectorize loops that perform a single operation on multiple elements.
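To make the idea of vectorization concrete, here is a minimal sketch of the kind of loop a vectorizing compiler looks for, next to a hand-written picture of what the transformation amounts to. The function names are ours, and the four-at-a-time version is only an illustration in plain C of packing four single-precision values into one 128-bit register; it is not output from any particular compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar form: one multiply-add per loop iteration. The same operation is
 * applied to sequential elements, which is the pattern vectorizers detect. */
void scale_add_scalar(float *y, const float *x, float a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Illustration of the vectorized shape: four elements per step, as if they
 * were packed into one 128-bit register, then a scalar remainder loop. */
void scale_add_by_four(float *y, const float *x, float a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n is not a multiple of 4 */
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result. With a vectorizing compiler, the programmer writes only the scalar form and the compiler decides whether the packed form is safe and worthwhile.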

 
     
 

While these restructuring techniques have been around for some time, parallelizing a loop can result in slower execution if the overhead of dispatching threads, scheduling those threads, and sharing resources is significant compared to the total workload performed by the loop. As a result, the most important job of the Intel compiler is to examine all the operations in the loop body, estimate the grain size of each loop iteration on the targeted CPU microarchitecture, and from that estimate the total workload of the loop to determine whether the loop should be parallelized. It is this process, dubbed "intra-register vectorization" by Intel, that enables these new HTT CPUs to effectively exploit multi-level parallelism.
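The trade-off between thread overhead and loop workload can be pictured with a toy cost model. This is a hypothetical sketch, not Intel's actual heuristic: the structure, the field names, and any cycle counts plugged into it are invented purely to show why a small loop can run slower when parallelized.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost model; every number fed into it is illustrative. */
typedef struct {
    double cycles_per_iteration;   /* estimated grain size of the loop body */
    double thread_overhead_cycles; /* dispatch and scheduling cost per thread */
    int    threads;                /* threads the loop would be split across */
} loop_cost_model;

/* Estimated cost of running the whole loop on one thread. */
double serial_cycles(const loop_cost_model *m, long iterations)
{
    return m->cycles_per_iteration * (double)iterations;
}

/* Estimated cost when the work is split evenly and each thread pays overhead. */
double parallel_cycles(const loop_cost_model *m, long iterations)
{
    assert(m->threads > 0);
    return serial_cycles(m, iterations) / m->threads
           + m->thread_overhead_cycles * m->threads;
}

/* Parallelize only when the estimate says it is faster. */
bool should_parallelize(const loop_cost_model *m, long iterations)
{
    return parallel_cycles(m, iterations) < serial_cycles(m, iterations);
}
```

With an assumed 10 cycles per iteration, 50,000 cycles of overhead per thread, and two threads, a 100-iteration loop stays serial while a million-iteration loop is worth splitting.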

 
         
 

As to the details of how and why this all works, we leave that to a year-long seminar in compiler architecture. Nonetheless, we have introduced the all-important C-word: compilers. Hyper-Threading Technology-enabled processors significantly increase the performance of application programs if, and only if, those programs exhibit a high degree of parallelism. That means all of the potential performance gains can only be obtained when an application is efficiently multithreaded. And that leaves the choice to do it manually by a very smart code wizard or automatically by the deus ex machina of parallelization: a very smart compiler.

 
 
         
 

As far as operating systems are concerned, the Linux 2.4.x kernel became aware of HTT with the release of version 2.4.17. Windows 2000 Server becomes HTT-aware with the application of Service Pack 3. We used Red Hat 8.0, which uses the 2.4.18-27 kernel, in our first testing of CPU processing capabilities. This kernel knows about HTT logical processors and treats each HTT-enabled CPU as two distinct physical processors. The same is true for Windows 2000 Server SP3.
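One quick way to see what the operating system believes about the processor count is to ask it directly. The sketch below uses the sysconf call with _SC_NPROCESSORS_ONLN, which is available on Linux; the helper names and the assumption of two logical processors per physical package are ours.

```c
#include <assert.h>
#include <unistd.h>

/* Number of processors the OS currently reports as online, or -1 on error.
 * On an HTT-aware kernel, each logical processor is counted separately. */
long online_logical_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Physical packages implied by a logical count, assuming every package
 * exposes the same number of logical processors (2 with HTT enabled). */
long implied_physical_cpus(long logical_cpus, long threads_per_package)
{
    assert(threads_per_package > 0);
    return logical_cpus / threads_per_package;
}
```

On a dual-Xeon server with HTT enabled, the first call would be expected to report four processors, which the second helper maps back to two physical CPUs.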

Nonetheless, the scheduler used in the stock 2.4 kernel is still considered naive because it does not distinguish between logical processors and physical processors when it comes to resource contention. In other words, the scheduler is just as likely to distribute two threads on CPU0 and CPU1, which are the logical processors associated with the first physical CPU in the Appro 1224Xi server, as it is to distribute those two threads on CPU0 and CPU2, which are logical processors on each of the two physical CPUs in the server.
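The distinction the scheduler misses can be sketched in a few lines. The code below assumes the numbering described above, with CPU0 and CPU1 on the first physical CPU and CPU2 and CPU3 on the second; real systems may enumerate logical processors differently, and the function names are hypothetical. It is an illustration of an HTT-aware placement choice, not the kernel's algorithm.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption from the text: two logical processors per physical CPU,
 * numbered consecutively (CPU0/CPU1 on package 0, CPU2/CPU3 on package 1). */
#define THREADS_PER_PACKAGE 2

/* Physical package that a logical CPU belongs to under that numbering. */
int physical_package_of(int logical_cpu)
{
    assert(logical_cpu >= 0);
    return logical_cpu / THREADS_PER_PACKAGE;
}

/* True when two logical CPUs compete for the same physical resources. */
bool share_physical_cpu(int a, int b)
{
    return physical_package_of(a) == physical_package_of(b);
}

/* HTT-aware choice of a CPU for a second thread: prefer a logical CPU on a
 * different package, fall back to a sibling, return -1 if none exists. */
int pick_partner_cpu(int first_cpu, int logical_cpus)
{
    int fallback = -1;
    for (int cpu = 0; cpu < logical_cpus; cpu++) {
        if (cpu == first_cpu)
            continue;
        if (!share_physical_cpu(first_cpu, cpu))
            return cpu;             /* separate physical CPU: no contention */
        if (fallback < 0)
            fallback = cpu;         /* sibling on the same physical CPU */
    }
    return fallback;
}
```

A naive scheduler treats CPU1 and CPU2 as equally good partners for a thread on CPU0. The sketch picks CPU2 on a four-CPU system and only settles for the sibling when nothing else is available.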

 
The KDE System Guard utility came up on our dual-Xeon Appro 1224Xi server showing four CPUs, with a minimal amount of kernel overhead activity on all "four" CPUs. When we ran our oblCPU benchmark, however, all of the activity appeared on just one logical CPU. When we ran a copy of oblCPU that had been automatically parallelized with the Intel 7.0 compiler, CPU activity appeared on two logical CPUs, albeit on the same physical CPU. When we ran the multithreaded oblMemBench, process threads were distributed equally across all four CPUs and we measured throughput rates on the order of 3GB per second.
 
     
 

This scheduler issue is being very actively worked on for the Linux 2.5 kernel. Application performance on HTT-enabled CPUs could likely experience boosts on the order of 50%. In the meantime, in-house-developed applications can benefit greatly from a simple recompile with the Intel 7.0 Linux C++ compiler, which is now available for free under a non-commercial development license.

Our first step was to run version 2.0 of oblCPU, which normalizes all results to a 600MHz Pentium-III (value of 100) and was compiled with gcc v3.0. The benchmark, which correctly pegs a 1.26GHz P-III in an HP NetServer as 2.06 times as powerful as our 600MHz CPU, assessed our 2.4GHz Xeon processor in the Appro 1224Xi to be only about 2.5 times faster than a 600MHz P-III. We simply could not believe the results. Something was radically wrong.

To eliminate both Red Hat and the Appro 1224Xi as the root causes of our anomalous results, we ran Windows 2000 Server on an IBM xSeries 235, which is also Xeon-powered. The results were strikingly similar. The super-powered IBM xSeries Server was testing like a 1.5GHz Pentium-III.
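The "testing like a 1.5GHz Pentium-III" comparison follows directly from the normalization to a 600MHz P-III scored at 100. The sketch below spells out that arithmetic; the function names are ours, and the only inputs are figures quoted in the text.

```c
#include <assert.h>

#define BASELINE_MHZ   600.0   /* 600MHz Pentium-III reference system */
#define BASELINE_SCORE 100.0   /* score assigned to the reference system */

/* How many times faster than the baseline a normalized score represents. */
double relative_speed(double score)
{
    assert(score >= 0.0);
    return score / BASELINE_SCORE;
}

/* Clock speed of a hypothetical P-III that would earn the same score,
 * assuming P-III performance scales linearly with clock speed. */
double equivalent_piii_mhz(double score)
{
    return relative_speed(score) * BASELINE_MHZ;
}

/* Raw clock ratio against the baseline, for comparison with the score. */
double clock_ratio(double cpu_mhz)
{
    return cpu_mhz / BASELINE_MHZ;
}
```

A score of 250 (2.5 times the baseline) corresponds to a 1,500MHz P-III equivalent, while the 2.4GHz Xeon's raw clock ratio is 4.0. For the 1.26GHz P-III, the measured 2.06 sits right next to its clock ratio of 2.1, which is why the Xeon figure looked so wrong.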

 
         
 

If the problem wasn't the Appro server or the Red Hat Linux distribution, we were left with just one very painful conclusion: there was a problem with our benchmarks. The problems fell into two categories: operational and structural.

On the operational side, the oblCPU 2.0 suite had been compiled with GNU C 3.0 on Linux, Visual Studio 6.0 on Windows, and Intel C++ on both Linux and Windows. In all cases, nothing had been done in the compilation process to take advantage of HTT-enabled processors. We began to address this by recompiling the benchmarks with GNU C 3.2, adding the -msse2 switch to the -funroll-loops and -O3 optimization switches that we had previously used. The results for GNU C 3.2 showed only a marginal improvement over the previous 3.0 results.

Still, the results did reveal a structural problem: a small but statistically significant number of kernels were now spending more time moving numbers around in overhead functions than they were doing calculations. Essentially, the computational speed of the new processors outstripped the useful work done in these benchmark kernels.

 
Using GNU C 3.2 to test the Appro 1224Xi server, the initial results vis-à-vis a dual-processor HP NetServer were less than impressive, even after we had taken out the kernels that had structural overhead constraints.
 

     
 

Our next step was to install the latest version of the Intel C compiler, which is now up to version 7.0. In our previous tests with version 6.0, we had measured a 36% performance improvement over GNU C 3.0.4. Unfortunately, when it comes to the Xeon processors, 36% better than dreadful is still dismal. So when we recompiled the new oblCPU 2.5 suite with the Intel 7.0 compiler, we changed our switch options from -ip, -xK, and -O3 to -ip, -axW, and -O3.

For the Intel C++ compiler, the -ip switch introduces interprocedural optimization, which improves performance in programs that frequently call small or medium-sized functions by "inlining" the function's code. This eliminates the setup of parameters for a call as well as the branch itself, both of which were part of the structural problem we found in some of our benchmarks. The -O3 switch provides loop unrolling similar to the -funroll-loops switch for GNU C and enables data prefetching, which intelligently puts data in cache before the program requires it.
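To show what inlining removes, here is a minimal sketch of a loop that calls a small helper once per element, next to the form the loop takes once the helper's body has been substituted in place. The functions are invented for this illustration and are not taken from the oblCPU suite.

```c
#include <assert.h>
#include <stddef.h>

/* A small helper of the kind that interprocedural optimization can inline. */
static double square(double v)
{
    return v * v;
}

/* As written: each element costs parameter setup plus a call and return. */
double sum_of_squares_calls(const double *data, size_t n)
{
    assert(n == 0 || data != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += square(data[i]);
    return total;
}

/* After inlining: the helper's body replaces the call, leaving only the
 * arithmetic inside the loop. */
double sum_of_squares_inlined(const double *data, size_t n)
{
    assert(n == 0 || data != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += data[i] * data[i];
    return total;
}
```

Both functions return the same value. In a benchmark kernel where the helper does very little work, the call overhead in the first form can rival the calculation itself, which is the structural problem described earlier.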

 
         
 

What would bring the big jump in performance, however, was the introduction of the -axW switch. This switch introduces the advanced Streaming-SIMD-Extensions SSE2 for Pentium 4 and Xeon processors.

As expected, the version 7 Intel C++ compiler added 35% to the performance of the HP NetServer over GNU C 3.2. Recall that version 6 of the Intel C compiler added 36% to the performance of that NetServer when compared to oblCPU compiled with GNU C 3.0. The well-anticipated shock came when we ran the Intel 7.0 version of oblCPU: performance on our Appro 1224Xi jumped 84%.

What's more, we measured the level of improvement when it came to memory bandwidth. Those results will follow in the next issue of Open when we delve into I/O bandwidth.

Click to download a zip of oblCPU compiled with GNU C v3.2
Click to download a zip of oblCPU statically compiled with Intel C++ 7.0

Only when we compiled our oblCPU benchmark suite with version 7.0 of the Intel C++ compiler did we get the performance level we expected from the 2.4GHz Xeon CPU.