DUAL DUO HYPER-THREADING

If you are serious about performance, Intel Xeon CPUs and PCI-X peripheral slots are de rigueur for servers, but not sufficient!
by Jack Fegreus, March 27, 2003
|
The E7501 chipset is tailored for dual-processor configurations using what Intel dubs "NetBurst microarchitecture." At the heart of this scheme are three component interconnects, each called a Hub Interface 2.0 (HI2.0). Each HI2.0 provides 1.066 GB/s of peak I/O bandwidth for PCI/PCI-X bridges, which adds up to 3.2 GB/s of peak bandwidth for high-speed I/O. That much I/O traffic could put a serious crimp in memory resources. To avoid any memory bottleneck, the E7501 chipset features a 533 MHz front side bus capable of delivering 4.27 GB/s of data from DDR266 DIMMs.

That latter specification is very important, given that memory latency has become a major bottleneck to achieving high performance in many applications. To this end, we have seen a number of new hardware technologies introduced to speed memory access. At the same time, a number of multithreading techniques have emerged to hide memory latency. One of the most promising ways to enhance application performance is to exploit thread-level parallelism. This is just what Intel has been doing in its Xeon CPU chips and has now added to the latest generation of Pentium 4 CPUs.

To make this work, two pieces of technology are absolutely necessary. The first is a bit of chip-level legerdemain that Intel calls "Hyper-Threading Technology" (HTT). This architecture permits a single CPU chip to execute instructions from different threads in parallel. As a result, a single hyper-threading processor is able to manage instructions as if it were two independent processors. To do this, all of the new Xeon (server) and Pentium 4 (workstation) processors sport a characteristic 533 MHz front side bus and have a rich set of performance-enabling features that include distinctively large register caches and two sets of Streaming SIMD Extensions: SSE and SSE2.
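The bandwidth figures quoted above for the E7501 can be checked with a few lines of arithmetic. The sketch below is ours, not Intel's: the function names are made up for illustration, and the 8-byte (64-bit) bus width and the 533.33 million transfers per second behind the nominal "533 MHz" are assumptions that the text does not state.

```c
#include <stdio.h>

/* Peak aggregate bandwidth, in GB/s, of `links` identical interconnects. */
double aggregate_gbs(int links, double gbs_per_link) {
    return links * gbs_per_link;
}

/* Peak bus bandwidth in GB/s: million transfers per second times the
 * number of bytes moved per transfer. The 8-byte width used below is an
 * assumption; the article does not state the bus width. */
double bus_gbs(double mega_transfers_per_sec, int bytes_per_transfer) {
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9;
}

/* Print both computed figures next to the values quoted in the text. */
void print_bandwidth_check(void) {
    printf("I/O: %.2f GB/s (article: 3.2)\n", aggregate_gbs(3, 1.066));
    printf("FSB: %.2f GB/s (article: 4.27)\n", bus_gbs(533.33, 8));
}
```

Calling `print_bandwidth_check()` from a `main` function shows how closely three 1.066 GB/s hub interfaces and an assumed 8-byte-wide bus match the quoted peaks.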
In a nutshell, vectorization detects patterns of sequential data accesses by the same instruction and transforms that code for Single Instruction Multiple Data (SIMD) execution. As a result, the SSE2 instructions and large registers are the keys to enabling multiple functional units to operate simultaneously on packed data elements, which are used to represent short vectors. The Intel compilers use a number of esoteric optimization technologies, including alignment optimizations, to vectorize instruction loops that perform a single operation on multiple elements.
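To make the idea concrete, here is a minimal sketch of the kind of loop a vectorizer looks for, alongside a hand-written packed version. The function names `add_scalar` and `add_packed` are hypothetical; the intrinsics come from the standard `emmintrin.h` SSE2 header, and the packed path is compiled only when the compiler defines `__SSE2__`. This is an illustration of the technique, not the code any particular compiler generates.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar form: the same addition applied to sequential elements,
 * the access pattern a vectorizing compiler tries to detect. */
void add_scalar(const double *a, const double *b, double *out, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Packed form: each 128-bit SSE2 register holds two doubles, so one
 * instruction adds two elements at a time. Unaligned loads and stores are
 * used so the sketch works for any array address. */
void add_packed(const double *a, const double *b, double *out, size_t n) {
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    /* Leftover element (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce identical results; the difference lies only in how many elements each instruction processes.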
|
While these restructuring techniques have been around for some time, parallelizing a loop can result in slower execution if the overhead of dispatching threads, scheduling those threads, and sharing resources is significant compared to the total workload performed by the loop. As a result, the most important job of the Intel compiler is to examine all the operations in the loop body, estimate the grain size per loop iteration on the targeted CPU microarchitecture, derive from that the total workload of the loop, and determine whether the loop should be parallelized. It is this process, dubbed "intra-register vectorization" by Intel, that enables these new HTT CPUs to effectively exploit multi-level parallelism.
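The trade-off described above can be sketched as a toy cost model. Everything here is hypothetical: the `LoopEstimate` structure, the function names, and the idea of expressing overhead as a single cycle count are our own simplifications and do not describe how the Intel compiler actually makes its decision.

```c
/* Hypothetical per-loop estimate; not taken from any real compiler. */
typedef struct {
    long iterations;              /* trip count of the loop */
    double cycles_per_iteration;  /* estimated grain size of one iteration */
} LoopEstimate;

/* Total workload of the loop, in estimated cycles. */
double total_workload(const LoopEstimate *loop) {
    return loop->iterations * loop->cycles_per_iteration;
}

/* Returns 1 if splitting the loop across `threads` threads is expected to
 * beat serial execution once a fixed threading overhead (dispatch,
 * scheduling, resource sharing) is charged, and 0 otherwise. */
int should_parallelize(const LoopEstimate *loop, int threads,
                       double overhead_cycles) {
    double serial = total_workload(loop);
    double parallel = serial / threads + overhead_cycles;
    return parallel < serial;
}
```

With an assumed overhead of 5,000 cycles and two threads, a loop of 100 iterations at 10 cycles each is better left serial, while a loop of 1,000,000 such iterations is worth splitting.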
|
As to the details of how and why this all works, we leave that to a year-long seminar in compiler architecture. Nonetheless, we have introduced the all-important C-word: compilers. Hyper-Threading Technology-enabled processors significantly increase the performance of application programs if, and only if, those programs exhibit a high degree of parallelism. That means the potential performance gains can be obtained only when an application is efficiently multithreaded. And that leaves a choice: have it done manually by a very smart code wizard or automatically by the deus ex machina of parallelization, a very smart compiler.
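As one illustration of the manual route, the sketch below marks a loop for threading with an OpenMP directive. OpenMP is our choice for this example; the article does not say how any application was threaded, and the function name `sum_of_squares` is hypothetical. When the compiler is not told to enable OpenMP, the directive is ignored and the loop simply runs serially.

```c
#include <stddef.h>

/* Sum of squares of n values. With OpenMP enabled, the iterations are
 * divided among threads and each thread's partial sum is combined by the
 * reduction clause; without OpenMP the pragma is ignored and the loop
 * runs on one thread. */
double sum_of_squares(const double *v, size_t n) {
    double sum = 0.0;
    long i;  /* signed loop index, accepted by all OpenMP versions */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

The programmer still has to know that the iterations are independent and that the sum is a reduction; that knowledge is exactly what an auto-parallelizing compiler has to work out for itself.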
|
This is being very actively worked on for the Linux 2.5 kernel. Application performance on HTT-enabled CPUs could likely experience boosts on the order of 50%. In the meantime, in-house-developed applications can benefit greatly from a simple recompile with the Intel 7.0 Linux C++ compiler, which is now available for free under a non-commercial development license.

Our first step was to run version 2.0 of oblCPU, which normalizes all results to a 600MHz Pentium-III (value of 100) and was compiled with gcc v3.0. The benchmark, which correctly pegs a 1.26GHz P-III in an HP NetServer as 2.06 times as powerful as our 600MHz CPU, assessed our 2.4GHz Xeon processor in the Appro 1224Xi to be only about 2.5 times faster than a 600MHz P-III. We simply could not believe the results. Something was radically wrong. To eliminate both Red Hat and the Appro 1224Xi as the root causes of our anomalous results, we ran Windows 2000 Server on an IBM xSeries 235, which is also Xeon-powered. The results were strikingly similar. The super-powered IBM xSeries server was testing like a 1.5GHz Pentium-III.
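The "testing like a 1.5GHz Pentium-III" remark follows directly from the normalization. The sketch below assumes the index is a simple ratio of run times scaled so that the 600MHz reference scores 100; the text does not spell out how oblCPU computes its index, and both function names are hypothetical.

```c
#include <assert.h>

/* Index relative to the reference machine (reference = 100), assuming the
 * index is the ratio of reference run time to measured run time. */
double normalized_index(double reference_seconds, double measured_seconds) {
    assert(measured_seconds > 0.0);
    return 100.0 * reference_seconds / measured_seconds;
}

/* Clock speed, in MHz, of a Pentium-III that would earn the given index if
 * performance scaled linearly with clock from the 600MHz reference. */
double equivalent_p3_mhz(double index) {
    return 600.0 * index / 100.0;
}
```

An index of 250 (2.5 times the reference) maps to 600 x 2.5 = 1500 MHz, which is why the 2.4GHz Xeon looked like a 1.5GHz Pentium-III. By the same arithmetic, the NetServer's 2.06 ratio maps to about 1236 MHz, close to its actual 1.26GHz clock.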
|
If the problem wasn't the Appro server or the Red Hat Linux distribution, we were left with just one very painful conclusion: there was a problem with our benchmarks. The problems fell into two categories: operational and structural. On the operational side, the oblCPU 2.0 suite had been compiled with GNU C 3.0 on Linux, Visual Studio 6.0 on Windows, and Intel C++ on both Linux and Windows. In all cases, nothing had been implemented in the compilation process to take advantage of HTT-enabled processors. We began to address this by recompiling the benchmarks with GNU C 3.2 and, for optimization, adding the -msse2 switch to the -funroll-loops and -O3 switches that we had previously used. The results for GNU C 3.2 showed only a marginal improvement over the previous 3.0 results. Still, the results did reveal a structural problem: a small but statistically significant number of kernels were now spending more time moving numbers around in overhead functions than they were doing calculations. Essentially, the computational speed of the new processors outstripped the useful work done in these benchmark kernels.
|
Our next step was to install the latest version of the Intel C compiler, which is now up to version 7.0. In our previous tests with version 6.0, we had measured a 36% performance improvement over GNU C 3.0.4. Unfortunately, when it comes to the Xeon processors, 36% better than dreadful is still dismal. So when we recompiled the new oblCPU 2.5 suite with the Intel 7.0 compiler, we changed our switch options from -ip, -xK, and -O3 to -ip, -axW, and -O3.

For the Intel C++ compiler, the -ip switch introduces interprocedural optimization, which improves performance in programs that frequently call small or medium-sized functions by "inlining" the function's code. This eliminates setting up parameters for a call as well as the branch itself, which was part of the structural problem we found in some of our benchmarks. The -O3 switch provides for loop unrolling similar to the -funroll-loops switch for GNU C and enables data prefetching, which intelligently puts data in cache before the program requires it.
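The inlining described above matters most for code shaped like the following sketch: a tiny helper called once per loop iteration. The function names and the file name in the comment are hypothetical, the compiler driver names `gcc` and `icc` are our assumption, and the switches are the ones quoted in the text.

```c
#include <stddef.h>

/* Hypothetical build lines for a file named kernel.c, using the switches
 * quoted in the article:
 *   gcc -O3 -funroll-loops -msse2 -c kernel.c
 *   icc -O3 -ip -axW -c kernel.c
 */

/* Small helper: one multiply and one add. Called out of line, the cost of
 * passing three parameters and branching can rival the arithmetic itself. */
static double scale_and_shift(double x, double scale, double shift) {
    return x * scale + shift;
}

/* Kernel that calls the helper once per element. When the helper is
 * inlined, the loop body reduces to the multiply-add and the accumulation,
 * with no call overhead left. */
double kernel_sum(const double *v, size_t n, double scale, double shift) {
    double total = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        total += scale_and_shift(v[i], scale, shift);
    }
    return total;
}
```

The result is the same with or without inlining; only the time spent on call setup changes, which is the overhead that had come to dominate some of our benchmark kernels.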
|
What would bring the big jump in performance, however, was the introduction of the -axW switch. This switch introduces the advanced Streaming SIMD Extensions, SSE2, for Pentium 4 and Xeon processors. As expected, the version 7 Intel C++ compiler added 35% to the performance of the HP NetServer over GNU C 3.2. Recall that version 6 of the Intel C compiler added 36% to the performance of that NetServer when compared to oblCPU compiled with GNU C 3.0. The well-anticipated shock came when we ran the Intel 7.0 version of oblCPU: performance on our Appro 1224Xi jumped 84%. What's more, we measured the level of improvement when it came to memory bandwidth. Those results will follow in the next issue of Open when we delve into I/O bandwidth.