CING PARALLEL
WITH NO PAIN

V6 of Intel's C/C++ Compiler crushes MS .NET and KAPs it off by automatically generating parallel code.

   
  by Jack Fegreus and Keith Walls        
     
 

This is the year when the pundits proclaim Java will surpass C as the most popular programming language. Perhaps, but in the meantime, Intel is making the performance issues between C and Java even more difficult to sidestep. With the release of version 6.0 of the Intel C/C++ compilers for Linux and Windows, C and C++ on Linux become a number-crunching force that cannot be ignored. For IT managers still insisting that hefty license fees are a mark of software quality, the day of reckoning has arrived.

Back in January, OpenBench Labs tested version 5.0 of Intel's C/C++ compilers for Linux and Windows. The results were nothing short of spectacular. On SuSE 7.3, the geometric mean performance for our 34 kernels soared by 47% over the version compiled with gcc 2.95. In particular, 30 kernels were dramatically faster when compiled with the Intel C/C++ compiler; 3 kernels were marginally slower; and only 1 of the 34 kernels was significantly slower with the Intel compiler. For ISVs whose software products are computationally intense, such as 3D graphics rendering, a performance improvement on the order of 30% or more is naturally of great importance. When it comes with a simple recompile, it's a godsend.

 
         
 

For a long time, OpenBench Labs had measured a substantial difference in CPU performance between Linux and Windows 2000 systems using our CPU benchmark suite of 34 kernels. When we first examined Windows XP and SuSE 7.3 on the HP Omnibook 6000 using the OBLcpu benchmark compiled with gcc v2.95 and MS Visual C++ v6.0, the results pegged CPU performance under Linux about 18% less than CPU performance under Windows XP. Nonetheless, the performance differential introduced with the Intel C/C++ compiler was enough to vault Linux C performance ahead of MS Visual Studio V6 and into a dead heat with the then beta release of Visual Studio .Net on Windows 2000. But that was so four months ago.

 
 
       
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Intel C++ for Linux,
 http://developer.intel.com/software/products/compilers/

REFERENCE PLATFORM
HP Vectra Workstation
600MHz Intel Pentium III
SuSE Linux version 8.0 Professional
http://www.suse.com
GNOME 1.4 desktop
gcc 3.0.4
Microsoft Windows 2000 Professional, SP2
Microsoft Visual Studio .Net

SERVER PLATFORM
HP NetServer LP 1000r
Dual 1.26 GHz Intel Pentium III (Tualatin)

http://www.hp.com
SuSE Linux version 8.0 Professional
Windows 2000 Server
www.microsoft.com
OBLcpu benchmark v2.0

KEY FINDINGS
 On Linux, numerically intense CPU performance improved by 62% compared to GNU C v3.0.4
 CPU performance for Intel C/C++ on SuSE Linux exceeded the performance of MS Visual Studio .NET on Windows 2000 by 19%
 Using auto-parallelization, we effortlessly generated multithreaded code for our SMP server

 

In the intervening months, the Intel compilers have begun to gather a lot of serious interest among developers. Microsoft has launched its .Net development tools. And the version 3 GNU C compiler has moved much closer to the mainstream, although SuSE version 8 still considers version 3.0.4 "experimental."

With Intel's release of version 6.0 of their compilers this month, OpenBench Labs deemed it worth revisiting the performance issues among the three major C/C++ compilers: GNU, Intel and Microsoft.

Given the initial performance lead Intel demonstrated with v5, the goal for v6 was complete compatibility with GNU C. A perfect 10 for Intel was to enable the building of the Linux kernel with the Intel compiler. It looks like they scored about 9.2. Using some unpublished workarounds, Intel has been able to internally build a stable kernel. The rest of us will have to wait for the next release.

Nonetheless, v6 of the Intel C compiler has added substantial compatibility features with the GNU C compiler. These features include support for GNU inline assembly on IA-32, GNU C language extensions, binary compatibility with GNU C object files, and the use of the glibc library. Like the GNU C compiler, the Intel compiler supports the evolving C++ ABI. When fully implemented by all compiler vendors, this ABI should allow C++ object files and libraries to be compatible across different compilers.

 
     

On the MS Windows side, v6 of the Intel C++ compiler is source and object compatible with Microsoft Visual C++ and plugs into the Microsoft Visual Studio IDE. For Microsoft Visual Studio .Net, the 32-bit Intel C/C++ compiler has partial integration. Projects can be moved from Visual Studio 6.0 into Visual Studio .Net and then built with the Intel compilers.

Our first step was to revise our CPU benchmark suite. Thanks to improved code analysis on the part of the Intel compiler, we had to drop one of the kernels, which was correctly analyzed by the Intel compiler and optimized out of existence. For the remaining 33 benchmark kernels, we needed to increase their execution time to stabilize the results for faster CPUs and more aggressive compiler options. To this end, we made our reference system a 600MHz Pentium III-based HP Vectra Workstation running SuSE v8.0, the GNOME v1.4 desktop environment, and gcc v3.0.4.

We designate this workstation as having a Linux Performance Index (LPI) of 100 and compare the actual runtimes of the kernels on any test system to the runtimes on our reference system. Previously the reference system had been a 300MHz Pentium III-based system running Windows 2000. In switching from Windows to Linux as our reference platform, the most important thing to note is the designation of the GNOME desktop environment for the reference system. In testing with both the GNU and Intel compilers, we consistently measured a 7.5% improvement in performance when running the benchmarks under GNOME v1.4 as opposed to KDE v3.0.

Since our goal is to optimize performance on the widest range of new PC-based server systems, we chose to allow both the GNU and Intel compilers to optimize code for Pentium III processors, but not for Pentium 4. For our reference gcc platform, this meant invoking the -funroll-loops and -O3 switches. On the Intel compiler, we set the -ip, -xK, and -O3 switches as our default. We also tested the Profile-Guided Optimization (PGO) environment and the -parallel switch, which introduces the Kuck and Associates parallel processing (KAP) technology to the Intel compiler.

         
 

For the Intel C/C++ compiler the -O3 automatic optimization switch provides for loop unrolling similar to the -funroll-loops switch for gcc. Loop unrolling reduces the number of iterations in a loop by replicating the code within a loop. While creating a larger executable image, loop replication significantly reduces overhead with large numbers of iterations.

As a simple example, consider the initialization of an array of 1,000 numbers to a constant value. For a developer, the simple thing to do is write a loop that repeats the assignment 1,000 times. Better for the computer is a loop that runs 500 times and assigns the constant value to two array elements on each pass, which halves the loop-control overhead. Using just the -O3 switch on Intel gave us the exact same performance on Linux as Visual Studio .Net provided on Windows 2000 Server: a 36% performance improvement over gcc 3.0.4.
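The array example can be sketched in C as follows. This is a hand-written illustration of the transformation, not output from either compiler; the function names and the unroll factor of two are chosen here only to mirror the description above.

```c
#include <stddef.h>

#define N 1000  /* array length used in the example above */

/* Straightforward version: one assignment per pass, 1,000 passes. */
void fill_simple(double a[N], double value)
{
    for (size_t i = 0; i < N; i++) {
        a[i] = value;
    }
}

/* Hand-unrolled by two: two assignments per pass, 500 passes.
 * The loop test and counter update run half as often.
 * Correct as written only because N is even. */
void fill_unrolled(double a[N], double value)
{
    for (size_t i = 0; i < N; i += 2) {
        a[i]     = value;
        a[i + 1] = value;
    }
}
```

Both functions leave the array in the same state. An optimizing compiler that unrolls loops would normally pick the factor itself and add cleanup code for lengths that are not a multiple of it.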

The next big jump in performance came from two sources: data prefetching, which is activated by the -O3 switch, and vectorization, which for P-III processors is turned on by the -xK switch. Data prefetching reduces memory latency and improves application performance by intelligently putting data in cache before the program requires it. The prefetch instructions overlap memory accesses with other computations.
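To make the prefetching idea concrete, here is a minimal sketch of doing by hand what the compiler does automatically. It assumes a GCC-compatible compiler, since `__builtin_prefetch` is a GCC builtin rather than anything specific to the Intel switches discussed here, and the look-ahead distance `PREFETCH_AHEAD` is a hypothetical value that a real compiler would tune to the target CPU.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* hypothetical look-ahead distance, in elements */

/* Sum an array while hinting that data a few iterations ahead
 * should be brought into cache before it is needed. */
double sum_with_prefetch(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        }
        total += a[i];  /* the addition overlaps with the pending memory fetch */
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it, and only the timing of the memory traffic changes.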

Vectorization detects patterns of sequential data accesses by the same instruction, and transforms the code for Single Instruction Multiple Data (SIMD) execution. The Intel compiler attempts to maximally exploit the 128-bit SIMD extensions found in the Pentium III and Pentium 4 CPUs, which operate on packed integers and floating point numbers and enable fine-grained code parallelization.

 
In our initial tests of the OBLcpu v2 benchmark on Win2000 Server and SuSE 8.0, performance from our standard gcc 3.0.4 code demonstrated the most variability on the high end of performance. This is reflected in the larger margin of error above the geometric mean for the 95% statistical confidence region. It is interesting to note that while the Intel-compiled code solidly surpassed the performance of code compiled with Visual Studio .Net, its essential characteristics were the same as those of the Visual Studio .Net code. Note the symmetry in the performance of code compiled with gcc 2.95 in comparison to the gcc 3.0.4 standard.

 

To vectorize code, the Intel compilers use a number of esoteric optimization technologies, such as alignment optimizations and advanced instruction selection, on instruction loops that perform a single operation on multiple elements in a data set. The Intel compiler performs a series of progressively more complex tests to root out all data dependence problems during the analysis phase of compilation. It then restructures the code and translates the restructured code into SIMD instructions.
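The data dependence question can be illustrated with two small loops. This is a sketch with hypothetical function names, not code from the OBLcpu benchmark, and it assumes a C99 compiler for the `restrict` qualifier.

```c
#include <stddef.h>

/* Each pass reads and writes only element i, so the passes are
 * independent and several can be mapped onto one SIMD instruction.
 * `restrict` promises the compiler that the arrays do not overlap. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = k * a[i] + b[i];
    }
}

/* Pass i needs the value produced by pass i-1 (a loop-carried
 * dependence), so the passes cannot simply be executed side by side. */
void running_sum(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        x[i] = x[i] + x[i - 1];
    }
}
```

A dependence test has to prove that a loop looks like the first function before the compiler may emit SIMD code for it; the second function is the kind of case it must detect and leave alone or restructure.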

While these restructuring techniques have been around for some time, parallelizing a loop can result in slower execution if the overhead of dispatching threads, scheduling those threads, and sharing resources is significant compared to the total workload performed by the loop. As a result, the most important job of the Intel compiler is to examine all the operations in the loop body, estimate the grain size of each loop iteration on the targeted CPU microarchitecture, and from that estimate the total workload of the loop to determine whether the loop should be parallelized.

The final step, which added about 7% more performance and brought our 600MHz workstation to a 162 LPI, was to introduce interprocedural optimization via the -ip switch. Interprocedural optimization improves performance in programs that frequently call small or medium-sized functions. It can be especially beneficial when those functions are called within loops. The idea is to reduce call overhead by "inlining" the function's code, that is, directly adding the code at the location of the call and eliminating the call. This eliminates setting up parameters for a call as well as the branch itself.
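As a rough illustration of inlining, the following sketch shows a small function called inside a loop, next to a hand-written version of what the loop looks like once the call has been replaced by the function body. The function names are hypothetical and the second version is written by hand, not taken from compiler output.

```c
#include <stddef.h>

/* Small helper: a natural candidate for inlining. */
static double square(double x)
{
    return x * x;
}

/* As written by the developer: one call to square() per element. */
double sum_of_squares(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += square(v[i]);
    }
    return total;
}

/* What inlining effectively produces: the body of square() replaces
 * the call, so no parameters are set up and no branch is taken. */
double sum_of_squares_inlined(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += v[i] * v[i];
    }
    return total;
}
```

Both functions compute the same result. The gain comes purely from removing the per-iteration call overhead, which is why the effect is largest when the called function is small and the loop count is high.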

Often such interprocedural optimization becomes quite complex, especially in large programs with a plethora of logic branches. Nonetheless, it is often the case that many of the logic branches are seldom if ever taken. Needless to say, this discussion has little connection with benchmarks, which are deliberately written to always execute in the exact same manner. For programs that are highly dependent on data to define how the logic executes, Intel provides Profile-Guided Optimization (PGO). The PGO process involves two compilation steps separated by a learning period during which the program is executed several times to produce runtime profiles.

After generating a number of profiles, the program is recompiled with the profiles used to give hints to the compiler about how to optimize the code. It came as no surprise to any of us when our invariant runtime profiles simply replicated the results of our manually directed optimization commands.

In our final set of tests, we moved our benchmark program onto an HP NetServer LP 1000r. This server has dual 1.26 GHz Tualatin P-III processors. These hot CPUs sport a .13-micron design core in contrast to the .18-micron Coppermine design and feature a 512KB Level 2 cache on the chip. Given the Tualatin's larger cache along with the compiler's microarchitecture-specific switches for exploiting parallelization and prefetching (-xK) at the chip level, we expected to see a performance level slightly greater than the 2X that clock speed alone would project. We were not disappointed.

With our maximally optimized single-threaded benchmark code running on the HP NetServer LP 1000r, CPU performance was pegged at 347 LPI. In our build, we then added the -parallel switch, which implements the first pieces of KAP parallel processing technology automagically. Without our having to rewrite the code and manually insert OpenMP directives or pragma statements, the compiler was able to detect which loops could benefit from multi-threaded execution and generated the appropriate threading calls automatically.

As a result, 4 kernels were identified by the compiler as good targets for multi-threading. These kernels typically ran upwards of 2-to-3 times faster with the multi-threaded code executing in parallel on both CPUs. By just introducing multi-threaded code in 4 of the kernels, the geometric mean of overall performance increased 12% to 386 LPI. In addition the spread of the 95% confidence interval expanded greatly due to these four outlier points.
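For a sense of what the manual alternative looks like, here is a minimal sketch of a loop with independent iterations, first as plain serial code and then with the OpenMP directive a developer would otherwise add by hand. The three-point smoothing kernel and the function names are hypothetical and are not taken from the OBLcpu benchmark.

```c
/* Serial loop: every out[i] depends only on the read-only input
 * array, so the iterations can run in any order or in parallel. */
void smooth(const double *in, double *out, int n)
{
    for (int i = 1; i < n - 1; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}

/* The same loop with a hand-inserted OpenMP directive. An
 * auto-parallelizing compiler aims to reach the same result from the
 * serial version above, without the pragma. */
void smooth_omp(const double *in, double *out, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```

When OpenMP support is enabled in the compiler, the pragma asks for the iterations to be divided among threads; when it is not, compilers typically ignore the pragma and the loop simply runs serially with the same result.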

Now for the bad news for those IT managers who don't feel comfortable without spending prodigious amounts of money. If you are not writing software for commercial sale and you can live without premium support, which has a guaranteed response time, the Intel C/C++ compiler can be downloaded for free. Unfortunately, on Windows there is one catch: You'll need a valid version of MS Visual C++ for Intel C/C++ to install.

With the auto-parallelization feature turned on, the Intel compiler generated multi-threaded code for our dual processor server which doubled the performance of these kernels and boosted overall performance by 12%.

         
   