NOTE: The benchmarks have been updated for Intel C version 7 (see link at bottom)
HIGH CMANSHIP
   
 

Intel's C++ v5 Compilers for Linux and Windows smoke GNU C and MS Visual C++ in number crunching benchmarks.

   
  by Keith Walls and Jack Fegreus    
     
 

Very few technologies excite the dyed-in-the-wool techie as much as a new compiler. There’s something very heady about the feeling that your code is driving the hardware to new heights. Add to that the thought of all that technological power behind the deceptively simple “build” button or “make” command. Whatever the motivation, there is no doubt that technology thrill seekers will marvel at the new Intel C/C++ compiler. And for serious old-line number crunching applications, Intel has a Fortran compiler which employs the same optimization techniques.

On a more prosaic plane, Linux ISVs whose software products are computationally intense, such as 3D graphics rendering, will garner an enormous performance boost on both single and multiprocessor systems. Performance improvements on the order of 30% or more are naturally of great importance. When they come with a simple recompile, they can be a godsend.

 
       
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Intel C++ for Linux, Windows
http://developer.intel.com/software/products/compilers/

HOW WE TESTED

BUILD PLATFORM
733 MHz Intel Pentium III
SuSE Linux version 7.3 Professional
Microsoft Windows 2000 Professional, SP2

TEST PLATFORM
HP Omnibook 6000
700 MHz Intel Mobile Pentium III
http://www.hp.com
SuSE Linux 7.3
http://www.suse.com
Windows XP Pro
www.microsoft.com
OBLcpu benchmark

KEY FINDINGS
 On Linux, numerically intense CPU performance improved by 47% compared to GNU C v2.95
 On Windows, numerically intense CPU performance improved by 37% compared to MS Visual Studio.
 CPU performance for Intel on Linux and MS Visual Studio.NET on Windows is in a dead heat.
 AMD Athlon-based systems benefited equally with Pentium III-based systems.

 

For a long time, OpenBench Labs has measured a substantial difference in CPU performance between Linux and Windows 2000 systems using our CPU benchmark suite of 34 kernels. While there was a decided improvement with the introduction of the Linux 2.4 kernel, there was never any doubt that the Microsoft Visual Studio and GNU C compilers were generating substantially different code. That fact alone muddied any performance comparison.

When we first examined Windows XP and SuSE 7.3 on the HP Omnibook 6000 using the OBLcpu benchmark compiled with gcc v2.95 and MS Visual C++ v6.0, the results pegged CPU performance under Linux about 18% less than CPU performance under Windows XP. The factor we could not distinguish adequately was the degree to which the compiler rather than the underlying operating system was the culprit in the laggard performance of Linux.

 
         
 

The new Intel compiler suite answers a lot of our questions. For the highly technical Open Source development community it does raise a modest issue: this compiler itself is not Open Source. It is the fruit of a great deal of work conducted at the Intel Microcomputer Software Labs, along with two prestigious companies recently acquired by Intel: Kuck and Associates. Nonetheless, the magnitude of the performance differential in numerically intense applications is such that only the most dramatic sort of improvement in the long-awaited version 3 of the GNU C/C++ compiler will stay the hammer that drives a stake through the fibrillating heart of the aging technology behind the GNU C compiler. (A "beta" of v3 is included with the SuSE 7.3 distribution, but gcc v2.95 is installed by default.) May it rest in peace.

Target-specific object code and executable file formats notwithstanding, what really makes the Intel compiler so very interesting is the fact that it delivers the same instruction sequences on both Linux and Windows platforms. The differences in performance across the two operating system platforms are thus precisely that: differences in the operating systems. That includes, to some degree, the file formats, as well as the image activation, scheduling, and interrupt interference of the individual operating systems. As a result, the new Intel compiler affords a unique opportunity to discover the baseline performance of the raw system hardware with superbly optimized code, and to measure how much of that performance Linux and Windows can each deliver from the same hardware.

 
In our initial tests of the OBLcpu benchmark on Windows XP and SuSE 7.3, there was much less variance in the performance of the version compiled with MS Visual C++ when compared to the version compiled with GNU C. This is reflected in the narrow margin of error when looking at the 95% statistical confidence region about the geometric mean. In addition, the confidence intervals above and below the Windows XP geometric mean are virtually symmetrical.
 

In essence, the Intel compiler attempts to maximally exploit the 128-bit Streaming Single-Instruction-Multiple-Data (SIMD) extensions found in the Pentium III and Pentium 4 CPUs. These extensions operate on packed integers and floating-point numbers, which enables fine-grained code parallelization. The instructions were introduced on Pentium III CPUs to provide floating-point operations on four single-precision floating-point numbers at a time, alongside access to the 64-bit integer technology of MMX. The Pentium 4 extends this with floating-point operations on two double-precision floating-point numbers and 128-bit versions of the MMX integer operations.
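
As a rough illustration of what a packed single-precision operation looks like, the sketch below adds two float arrays four elements at a time using the SSE intrinsics from xmmintrin.h. The function name add_arrays is ours and does not come from the benchmark or from Intel; the code is hand-written for illustration, is not what the Intel compiler emits, and falls back to a plain scalar loop when SSE is not available.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if HAVE_SSE
    // One 128-bit register holds four single-precision floats,
    // so each _mm_add_ps performs four additions at once.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar loop handles the leftover elements (or everything without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store forms are used here so the sketch works on any array; the alignment optimizations mentioned in the article exist precisely to let a compiler use the faster aligned forms instead.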

To do this, the Intel compilers use a number of esoteric optimization technologies, such as alignment optimizations and advanced instruction selection, in order to vectorize loops that perform a single operation on multiple elements in a data set. What’s more, the majority of scientific, engineering, and multimedia applications can easily take advantage of these streaming SIMD extensions. These programs are characterized by control flow that is data independent, regular and recurring memory access patterns, and localized recurring operations performed on the data.
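
To make those characteristics concrete, here is a minimal sketch contrasting a loop that fits the description with one that does not. Both functions and their names are our own illustrations, not kernels from the OBLcpu suite, and whether any particular compiler actually vectorizes the first loop depends on its options and its alias analysis.

```cpp
#include <cassert>
#include <cstddef>

// Good candidate: no branches that depend on the data, unit-stride access,
// and every iteration touches only element i, so iterations are independent.
void scale_and_add(float k, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = k * x[i] + y[i];
    }
}

// Poor candidate: iteration i needs the value just written by iteration i - 1
// (a loop-carried dependence), so the iterations cannot simply run side by side.
void running_sum(float* x, std::size_t n) {
    assert(x != nullptr);
    for (std::size_t i = 1; i < n; ++i) {
        x[i] = x[i] + x[i - 1];
    }
}
```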

The Intel compilers magically do all of this work to exploit the implicit parallelism of any application’s source code in the background as they analyze the original source code, restructure the code, and finally translate the restructured code into SIMD instructions. In the program analysis phase, the compiler performs a series of progressively more complex tests to root out all data dependence problems. The restructuring phase then focuses on converting the input program, through traditional techniques such as idiom recognition, loop interchange, and loop distribution, into a form that is more amenable to the Pentium III and Pentium 4 CPUs.
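
Loop interchange, one of the restructuring techniques named above, can be sketched by hand as follows. The example is ours and only assumes a row-major matrix stored in a flat vector; the function names are hypothetical, and the compiler performs the equivalent transformation internally rather than on source text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: the inner loop walks down a column, jumping cols elements
// through memory on every step.
std::vector<double> column_sums_before(const std::vector<double>& m,
                                       std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<double> sums(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sums[c] += m[r * cols + c];
        }
    }
    return sums;
}

// After interchange: the inner loop walks along a row, so both m and sums
// are read with unit stride, the access pattern packed loads are built for.
std::vector<double> column_sums_after(const std::vector<double>& m,
                                      std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    std::vector<double> sums(cols, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sums[c] += m[r * cols + c];
        }
    }
    return sums;
}
```

The interchange is legal here because each column total still accumulates its rows in the same order, which is exactly the kind of fact the dependence tests in the analysis phase have to establish first.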

While these restructuring techniques have been around for some time, parallelizing a loop can result in slower execution if the overhead of dispatching threads, scheduling those threads, and sharing resources is significant compared to the total workload performed by the loop. As a result, the most important job of the Intel C++/Fortran compiler is to examine all the operations in the loop body, estimate the grain size per loop iteration on the targeted microarchitecture, use that to estimate the total workload of the loop, and determine whether the loop should be parallelized.
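
The trade-off can be pictured with a toy cost model. The sketch below is entirely hypothetical: the structure, field names, and formula are our own simplification and do not describe Intel's actual heuristic, which the article does not detail.

```cpp
#include <cassert>

// Hypothetical estimate for one loop; all costs share one arbitrary time unit.
struct LoopEstimate {
    double iterations;          // trip count of the loop
    double cost_per_iteration;  // estimated grain size of one iteration
    double startup_overhead;    // fixed cost of dispatching and scheduling
    int workers;                // parallel workers available
};

// Parallelize only if overhead plus the divided work beats the serial time.
bool worth_parallelizing(const LoopEstimate& e) {
    assert(e.workers >= 1);
    const double serial = e.iterations * e.cost_per_iteration;
    const double parallel = e.startup_overhead + serial / e.workers;
    return parallel < serial;
}
```

With a fixed overhead of 1000 units and two workers, a loop of 100 unit-cost iterations is better left serial (1050 versus 100), while a loop of one million iterations clearly benefits (501,000 versus 1,000,000).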

The alternative to all of this compiler wizardry is to explicitly exploit an application’s parallelism at the source code level, which is a cumbersome and error-prone task. More importantly, it is a very expensive task in that it greatly complicates program development and maintenance. Manually rolling your own parallel code requires the use of inline assembly to generate SIMD instructions, considerable understanding of thread libraries, and a considerable understanding of the application on a macro scale.

Once the compiler is installed, some juggling ensues to get the script variables pointed at the correct directories for the binaries and the license data files. At this point, we encountered a bug in the Linux initialization script that was shipped with the compiler kit. Intel’s professional support identified the bug quickly and shipped a replacement script file in short order.

In order to have the compiler available at the command line for a user, one must manually edit the login scripts to define the script variables and include the path to the compiler binaries. With that accomplished, icc (the Intel compiler is generally invoked with the ‘icc’ command) and gcc are almost at parity at the command line. A quick edit of the makefile yields the recompiled code after a ‘clean’ of the application binary target directory and a new ‘make’.

At this point, it’s also important to note that the Intel compiler on both platforms is a little more pedantic than the default settings for either the GNU or the Microsoft compilers. Marginal error conditions that are dismissed by the other compilers are reported as warning level issues by the Intel compiler.

Having compiled and tested all four executables on our development system, we next distributed them to our primary test bed: an HP Omnibook 6000 running Windows XP Professional and SuSE Linux 7.3. As with the installation of the compilers, program distribution is a bit more complex on Linux. Along with the Linux executables, you’ll need to distribute the Intel run-time library, which you are completely free to do under the Intel license. So before running the Intel-compiled benchmarks, we had to drop two library modules in the /lib directory.

Once the library modules were copied onto the target system, we were ready to run our OBLcpu benchmark. The performance results were beyond being unquestionably superior: they were staggering. On SuSE 7.3, the geometric mean performance for our 34 kernels soared by 47%. In particular, 30 kernels were dramatically faster when compiled with the Intel C/C++ compiler; 3 kernels were marginally slower; and only 1 of the 34 kernels was significantly slower with the Intel compiler. As we noted, the real trick to exploiting implicit parallelism in highly-serial source code is to avoid slowing programs down with complex parallel instruction streams that simply add execution overhead.
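
For readers unfamiliar with the statistic, a geometric mean of per-kernel speedup ratios can be computed as in the sketch below. The function is our own illustration, not code from the OBLcpu suite, and the sample values in the note that follows are made up to show the arithmetic rather than taken from our results.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive ratios, computed in log space:
// exp of the average of the logarithms.
double geometric_mean(const std::vector<double>& ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        assert(r > 0.0);  // ratios must be strictly positive
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

A kernel that runs twice as fast (ratio 2.0) and one that runs at half speed (ratio 0.5) yield a geometric mean of 1.0, meaning no net change, whereas an arithmetic mean would report a misleading 1.25. That property is why the geometric mean is the usual choice for summarizing a suite of speed ratios.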

With the Intel C++ compilers on both Windows and Linux, performance jumped very dramatically. Performance variance among kernels on either Linux or Windows much more closely reflects the performance differences among kernels when compiled under GNU C, especially at the high end. The differential between the geometric mean of Linux and Windows performance also narrows to within 10%.

On our Omnibook test platform running Windows XP Professional, the performance speedup versus MS Visual C++ 6.0 was not quite as dramatic as when running Linux. Here the improvement was on the order of 35%. This was still enough to keep the performance of the Intel-compiled OBLcpu benchmark on Windows XP Pro marginally faster—about 10%—than the performance of the OBLcpu when compiled under Intel and run on SuSE 7.3 Linux.

That becomes even more interesting when Microsoft Visual Studio.NET is brought into the equation. While still a beta product, the new compiler has so far proven in our tests to be about 25% faster than the current MS Visual C++. That puts the final performance numbers for OBLcpu compiled under Intel on Linux and OBLcpu compiled under Visual Studio.NET on Windows 2000 dead even. Furthermore, while it is not a reasonable comparison until the new Microsoft Windows .NET servers are shipping and Visual Studio .NET is in production, the installation of the Intel compiler is far faster and easier than the installation of Visual Studio .NET.
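
The "dead even" claim can be checked with back-of-the-envelope arithmetic. The sketch below uses only the rounded percentages quoted in this article, indexes everything to MS Visual C++ 6.0 on Windows XP as 1.00, and reaches the Intel-on-Linux figure by two routes; the structure and function names are ours, and because the inputs are rounded the results are only approximate.

```cpp
#include <cassert>

// Applies a percentage change to a baseline index (the change may be negative).
double apply_percent(double base, double percent) {
    assert(base > 0.0);
    return base * (1.0 + percent / 100.0);
}

// Performance indices relative to MS Visual C++ 6.0 on Windows XP (= 1.00).
struct RelativeScores {
    double gcc_linux;            // about 18% less than Visual C++ 6.0
    double intel_linux_via_gcc;  // 47% above gcc on Linux
    double intel_windows;        // about 35% above Visual C++ 6.0
    double intel_linux_via_win;  // Intel on Windows is about 10% faster
    double vs_net;               // beta is about 25% above Visual C++ 6.0
};

RelativeScores relative_scores() {
    RelativeScores s;
    s.gcc_linux = apply_percent(1.00, -18.0);
    s.intel_linux_via_gcc = apply_percent(s.gcc_linux, 47.0);
    s.intel_windows = apply_percent(1.00, 35.0);
    s.intel_linux_via_win = s.intel_windows / 1.10;
    s.vs_net = apply_percent(1.00, 25.0);
    return s;
}
```

Both routes put Intel on Linux at roughly 1.21 to 1.23 on this index, against 1.25 for the Visual Studio.NET beta, a gap of a few percent that is consistent with calling the two a dead heat.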

   
  Within the OBLcpu benchmark suite, the performance of 30 out of 34 kernels improved using the Intel C++ compiler. The level of improvement varied dramatically, with many kernels clocking in 100% faster. Overall, the geometric mean improvement was 47%. Only 4 kernels exhibited any signs of performance degradation, and of those only 1, gamsim, showed a significant decline in performance.  
     
 

To round out our single-processor CPU tests of the Intel compilers, we turned our attention finally to a system running an AMD Athlon CPU. In all previous tests with both the GNU C and MS Visual C++ compilers, AMD Athlon CPUs have consistently performed about 20% faster than a comparably clocked Intel Pentium III CPU, measured as CPU processing power per MHz. When we ran the Intel-compiled version of OBLcpu on the Athlon-powered system, the percent improvement was virtually identical to the results on the Omnibook.

 
         
   