SUSE LINUX 10's GNU
POWER DIMENSION

As Linux moves into the role of a core rather than an edge technology, server utilization and efficiency are no longer just the domain of high-performance computing. To meet the growing demands on processor power, SUSE Linux 10 has fully embraced the new GNU Compiler Collection.

   
 


by Jack Fegreus

January 9, 2006

   
     
 
SUSE Linux has developed a considerable following in the open source community, in part because of its extensive integrated set of the latest applications in a stable, well-documented distribution. Unlike SUSE Linux Enterprise Server, which has a guaranteed 5-year life cycle for support, SUSE Linux is released every six months with an extensive set of emerging-technology applications for both server and desktop environments. Packed with bleeding-edge technology, SUSE Linux is therefore not intended for deployment in a production environment. What SUSE Linux does provide developers and corporate IT is a technical preview of what Linux production environments will likely become in the near future.

A prime example of a key emerging server technology introduced in SUSE Linux is the Xen virtual machine monitor (VMM). First packaged with SUSE Linux 9.3, Xen was developed by the Systems Research Group at the University of Cambridge Computer Laboratory, with a strong contribution from IBM. Using Xen, multiple independent Linux systems can run on the same machine with close-to-native performance as they share hardware resources.

 
         
 
OPENBENCH LABS SCENARIO

 
UNDER EXAMINATION:  GNU/ Linux with GCC version 4.0

 WHAT WE TESTED:

SUSE Linux 10.0

GNU Compiler Collection version 4.0
Xen 3 virtualization
Beagle search tool
AppArmor security tools for application firewalls
Mono
Firefox 1.0
OpenOffice.org 2.0
VoIP
 

 HOW WE TESTED:

Appro 1142H Opteron Server
(4) AMD64 2.2GHz 848 Series Opteron CPUs
NUMA architecture
8GB  DDR400 ECC registered RAM
2GB memory per CPU


HP ProLiant DL580 G3 Server
(4) 3.3GHz Xeon EM64T CPUs
SMP architecture
8GB PC2-3200 DDR-2 ECC registered RAM



HP ProLiant ML350 G3 server
(2) 2.2GHz Intel Xeon CPUs
SMP architecture
2GB PC2100 DDR ECC registered RAM




Intel C++ Compiler for Linux v9.0
Free license for non-commercial development
Advanced optimization supporting EM64T architecture (v9.0.030)
Eclipse-based IDE (32-bit systems)


Benchmarks:

oblCPU v4.0


 KEY FINDINGS:

Our oblCPU v3.0 benchmark suite ran 10% faster on SUSE Linux 10 without having to recompile the code.

After recompiling with GNU C 4.0, the oblCPU 3.0 benchmark suite ran 32% faster than it had on SUSE Linux 9.3 compiled with GNU C 3.4.
On both Intel P3 and AMD Opteron processors, oblCPU 4.0 ran about 10% faster when compiled with GNU C 4.0 than when compiled with Intel C/C++ v9.0.
Updates to Linux for NUMA architecture improve CPU performance on the order of 10-15%.
With both the GNU C and Intel C/C++ compilers Opteron performance scaled with clock speed vis à vis P3 processors.
With both the GNU C and Intel C/C++ compilers Xeon EM64T performance scaled with clock speed vis à vis 32-bit Xeon processors.
With processor-specific optimizations, performance of oblCPU 4.0 on Xeon and Xeon EM64T CPUs was 30% faster when compiled with Intel C/C++.
With both the GNU C and Intel C/C++ compilers 64-bit versions of oblCPU v4.0 performed no faster than their respective 32-bit versions.
 

To do this, Xen is architected as a "para-virtualization" scheme. The VMM, also dubbed the hypervisor, presents each virtual machine with a hardware abstraction that is similar, but not identical, to the underlying system architecture. As a result, for Xen to host an operating system, one of two things must occur:

The OS must be modified.
The CPU must support virtualization.

For guest Linux systems there is a specialized Xen Linux micro-kernel. YaST, the SUSE Linux system management foundation, has a module group that simplifies the installation and configuration of virtual hosts running SUSE Linux 10. To run Windows on Xen, it will be necessary to have a system with Intel's VT or AMD's Pacifica processors.

The modified Linux micro-kernel does not change the application binary interface (ABI). So in theory, no changes are required to guest applications. In reality things can be a bit more painful. In particular, Linux support for the Native POSIX Thread Library (NPTL) requires Thread Local Storage (TLS), and for Xen this can be a problem.

On x86-32 architecture, Xen uses memory segmentation to protect the memory used by the hypervisor in a way that is incompatible with the way the TLS library uses memory segmentation. As a result, it is often necessary to disable the TLS library. We found this necessary in order to run the oblCPU benchmark suite on SUSE Linux 9.3 with Xen 2.6.

 
     
 

Version 3.0 of Xen, which is bundled with SUSE Linux 10.0, makes a number of significant advances over the 2.6 version, including support for x86-64 architecture, which encompasses both Intel EM64T and AMD Opteron processors. In particular, supporting the x86-64 environment introduces a number of significant differences for Xen.

The most important difference is the lack of memory segment limits, which prohibits segmentation from being used to protect the hypervisor. On x86-64 CPUs, page-level protection is employed in a manner that is particularly suitable to the Opteron architecture. Over the coming weeks, openBench Labs will be doing considerable testing of Xen 3.0 on SMP Opteron systems sporting dual-core CPUs.

For those who have grown long in the tooth while waiting for Microsoft's Longhorn OS, the most innovative new desktop technology is clearly Beagle, which provides users with a powerful desktop search utility. Beagle is actually a spinout from a visionary project called "dashboard," which is designed to set up implicit query processing while users work. The idea is that as a user works on a project or receives a message during the day, the user can be automatically presented with everything on the computer related to whatever is the focus of their attention.

The futuristic notion of implicit queries calls for a very powerful search mechanism. As that mechanism, Beagle is equally at home today fetching data in response to explicit queries.

 
         
Beagle is architected in a classic client-server manner. On the backend there is the Beagle daemon, which builds an index for each user in the background using the Apache project's Lucene search engine as a technology base. Requests come from Beagle client programs, which call on the Beagle search services. The primary client tool is a query tool dubbed "Best," which is an acronym for "bleeding edge search tool." Best can search a wide range of document formats including source code, MS Office, OpenOffice, Web caches, images, audio files, email, RSS feeds, and instant messages.

Because Beagle is considered a Gnome desktop application, SUSE 10 will not automatically install it with the KDE desktop. Nonetheless, Best and a number of other applications will work nicely on a KDE desktop once the Beagle daemon is manually launched; you'll need to edit the .profile script to start Beagle automatically. We did, however, run into a problem with kmail, which Beagle documents as being supported. Beagle was easily able to find email messages, addresses, and calendar data related to a Best query when that data was stored within the Evolution personal information manager. Data in email messages stored in kmail went undetected by Beagle.

 
With background indexing, Beagle searches are fast and exhaustive. Using Best, queries can be targeted at specific classes of data, such as email or recent web pages, or default to all data sources. The client-server architecture of Beagle permits the embedding of Beagle client capabilities in applications such as the Epiphany Web browser. By clicking on the browser's Find button, users can search a Web page for content.
 
     
 

For IT operations, SUSE Linux 10 contains a light-weight version of Novell's AppArmor package. Integrated as a YaST module, AppArmor is a mainframe-class security package that essentially provides systems administrators with a way to create application-specific firewalls. This is accomplished through the scripting of application profiles that specify the files that the program may read, write, and execute.

This technique of explicitly defining which files and program modules an application can access has long been used to ensure the accuracy of batch processes in mainframe-class IT operations. More importantly, such a scheme extends nicely into today's open networking environment as a means to prevent intruders from exploiting a back door that might exist in any application and thereby gaining access to business-critical files.
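The idea of an application profile can be illustrated with a minimal sketch. This is not AppArmor's profile syntax or API; the profile contents, the paths, and the `is_allowed` helper are all hypothetical, and Python's `fnmatch` patterns only approximate real path globbing. The point is simply that any access not explicitly granted is refused.

```python
from fnmatch import fnmatchcase

# Hypothetical profile for an imaginary log-handling program:
# path pattern -> permitted modes (r = read, w = write, x = execute).
PROFILE = {
    "/var/log/app/*.log": "rw",
    "/etc/app/config": "r",
    "/usr/bin/gzip": "x",
}


def is_allowed(profile, path, mode):
    """Grant access only if some rule matches the path and includes the mode.

    Anything not listed in the profile is denied by default.
    """
    return any(
        fnmatchcase(path, pattern) and mode in modes
        for pattern, modes in profile.items()
    )


print(is_allowed(PROFILE, "/var/log/app/today.log", "w"))
print(is_allowed(PROFILE, "/etc/passwd", "r"))
```

Under this default-deny model, an intruder who hijacks the program still cannot reach files outside the list, which is the property the profile approach relies on.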

The most important new technology in SUSE 10, however, is the inclusion of version 4.0 of GCC, which now officially designates the GNU Compiler Collection. Historically, the moniker "GCC" stood explicitly just for the “GNU C Compiler.” That use of the GCC acronym commonly continues when discussing C language compilation. Nonetheless, the acronym now properly designates the complete integrated distribution—from language front ends, to language-independent optimizers, through to support libraries—of GNU programming languages, including C, C++, Objective-C, Objective-C++, Java, Fortran, and Ada.

 
         
 

Gone too is the artifact of implementing language compilers, such as FORTRAN, to first generate code in another high level language, such as C. GCC compilers now generate machine code directly. This should not be confused with the C preprocessor, which is an integral feature of the C, C++, Objective-C and Objective-C++ languages.

More importantly, the GCC language-independent optimizers have been extensively overhauled. Michael Matz, GCC Team leader for SUSE Linux at Novell, cautions: “Users may see some significant changes in performance for the better; however, changes in the optimizer can also degrade performance in some cases. Every component of SUSE Linux 10.0 distribution has been compiled and checked with GCC 4.0, to provide users with improved responsiveness.”

 
 
     
 

When we loaded SUSE Linux 10.0, there was indeed a perceived difference in system performance. Standard configuration tasks seemed quicker. To verify these qualitative perceptions, we ran our purely quantitative oblCPU benchmark suite.

We began by running version 3.0 of the oblCPU benchmark suite. Version 3.0 has 34 computational kernels, which report execution time normalized to that of a 1GHz Pentium 3 processor—a legacy HP Netserver is reserved for that task. For many, that choice of a reference system raises a number of red flags. Nonetheless, normalizing oblCPU results to an older generation P3 CPU has several interesting advantages.

First, the P3 remains a popular well-known system as many older P3 systems still remain in use. More importantly, in the jump from P3 to P4—from which the Xeon evolved—Intel designed much deeper pipelines into the P4 processor's architecture. That radical departure puts AMD's families of processors more closely aligned with the P3 today than Intel's.

Pipelining is designed to exploit parallelism in instructions. Instructions are executed as they pass through the pipeline in stages. The number of stages defines the depth of the pipeline. All of the stages run in lockstep during a machine cycle. The longest time required to move an instruction through a pipe stage defines the length of time for a machine cycle, which is then applied to every pipe stage.

For an individual instruction, the overhead of pipe stages slows execution. For multiple instructions, the effect can be quite the opposite. The trick is to overlap instructions so that a new instruction enters the first stage of the pipeline on every machine cycle. As a result, a completed instruction pops out of the last stage of the pipeline on every machine cycle.

If a pipeline has ten stages, a single instruction will still take ten machine cycles to pass through the pipeline. Nonetheless, in ten machine cycles, ten instructions—rather than one—will be completed. More importantly, all of this magic is totally masked from the high-level language programmer. If you think that sounds like alchemy, you're right.

The "miracle of the instructions" is predicated on the assumption that the pipeline is kept full. That assumption is a pretty tall order. Certain events, dubbed hazards, can make it impossible for an instruction to execute within the proper pipeline stage on the designated machine cycle. These hazards can occur for numerous reasons, including an unavailable hardware unit, required data not ready for access, or an unexpected branch in logic. When such hazards occur, the pipeline stalls as machine cycles pass with no processing done. Worse yet, a branch hazard could force the pipeline to be flushed and restarted. As a result, the structure of compiled code is essential for exploiting processor pipelines and avoiding costly stalls.
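The pipeline arithmetic described above can be sketched in a few lines. This is a simplified counting model, not a processor simulation: it assumes one instruction enters the pipe per cycle, and the 5% flush rate and the cost of each flush are hypothetical numbers chosen only to show how quickly stalls erode the ideal figure.

```python
def cycles_unpipelined(instructions, depth):
    """Each instruction passes through every stage before the next one starts."""
    return instructions * depth


def cycles_pipelined(instructions, depth, stall_cycles=0):
    """The first instruction takes `depth` cycles to fill the pipe; every later
    instruction completes one cycle after its predecessor, plus any stalls."""
    if instructions == 0:
        return 0
    return depth + (instructions - 1) + stall_cycles


depth = 10
count = 1000

ideal = cycles_pipelined(count, depth)

# Hypothetical hazard rate: 1 instruction in 20 forces a flush,
# and each flush costs a refill of depth - 1 cycles.
flushes = count // 20
with_flushes = cycles_pipelined(count, depth, stall_cycles=flushes * (depth - 1))

print(cycles_unpipelined(count, depth), ideal, with_flushes)
```

With these assumed numbers the flushed pipeline needs roughly 45% more cycles than the ideal one, which is why code generation matters more as pipelines get deeper.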

 
         
 

We started our assessment by loading SUSE Linux 10 on our reference HP Netserver with its 1GHz P3 CPU. Next we ran version 3.0 of oblCPU, which had been compiled with GNU C 3.4 and normalized to the CPU time taken with our reference server running SUSE Linux 9.3. On SUSE Linux 10.0, the geometric mean for all of the computational kernels proved to be 10% faster than running on SUSE Linux 9.3. The statistical 95% confidence interval for performance of the 34 kernels ranged between 105 and 120.
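For readers who want to summarize their own normalized scores the same way, the following sketch computes a geometric mean and a 95% interval. The article does not say how oblCPU derives its interval, so the method here (a normal-approximation interval on the logarithms of the scores) is an assumption, and the eight scores are made-up placeholders rather than measured results.

```python
import math
from statistics import NormalDist, mean, stdev


def geometric_mean(scores):
    """Geometric mean, computed in log space."""
    return math.exp(mean([math.log(s) for s in scores]))


def log_confidence_interval(scores, level=0.95):
    """Interval for the geometric mean: a normal-approximation interval
    on the log scores, mapped back with exp(). Assumed method."""
    logs = [math.log(s) for s in scores]
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(logs) / math.sqrt(len(logs))
    centre = mean(logs)
    return math.exp(centre - half_width), math.exp(centre + half_width)


# Hypothetical normalized kernel scores (100 = reference run); not measured data.
scores = [98.0, 104.0, 110.0, 115.0, 121.0, 108.0, 112.0, 117.0]

gm = geometric_mean(scores)
low, high = log_confidence_interval(scores)
print(f"geometric mean {gm:.1f}, 95% interval {low:.1f} to {high:.1f}")
```

The geometric mean is the usual choice for normalized ratios because no single kernel with an outsized score can dominate the summary.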

Next we recompiled oblCPU version 3.0 using GNU C 4.0. This produced an executable version that was 5% smaller in size and 32% faster in execution. We then installed Intel C/C++ v9.0 and recompiled oblCPU v3.0 with similar options. Quite surprisingly, performance lagged measurably behind that of GNU C 4.0. More importantly, this advantage for the GNU C-compiled versions of oblCPU was also reflected in tests of AMD64 CPUs, which also eschew superpipelining.

We finished our initial testing by adding compiler options to exploit the P3 architecture. With GNU C, this amounted to adding two options: -mfpmath=sse and -march=pentium3. With both the GNU and Intel compilers, this change garnered about 17% more in CPU performance. What's more, these optimizations similarly improved performance on all of the systems tested and added virtually nothing to the size of the executable image. As a result, we included P3 optimization as a standard compilation option for oblCPU v4.0.
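As a rough illustration, the two options named above can be assembled into a compiler command line as follows. Only `-mfpmath=sse` and `-march=pentium3` come from the article; the `-O2` level, the file names, and the `gcc_command` helper are assumptions made for the sake of a complete example. The sketch only builds and prints the command; it does not invoke the compiler.

```python
import shlex


def gcc_command(source, output, march="pentium3", fpmath="sse", extra=("-O2",)):
    """Build an argument list for gcc.

    The list can be passed to subprocess.run(cmd, check=True) on a system
    where gcc is installed. The -O2 default is an assumption.
    """
    return ["gcc", *extra, f"-march={march}", f"-mfpmath={fpmath}", "-o", output, source]


# Hypothetical source and output names.
cmd = gcc_command("kernel.c", "kernel_p3")
print(shlex.join(cmd))
```

Keeping the target architecture as a parameter makes it easy to rebuild the same source for another processor family and compare the results.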

 
Without recompiling, we ran our oblCPU v3.0 benchmark on SUSE Linux 10 and measured a 10% boost in performance. Recompiling under GNU C 4.0 increased performance by 32%. Consistently on the P3, performance of Intel-compiled versions of oblCPU measurably trailed GNU C versions.
 
 

The Intel C/C++ package on Linux remains free for sites that are developing software for their own use and not commercial sale. The standard package includes a compiler and debugger for 32-bit systems; a compiler and debugger for 64-bit Xeon CPUs with Extended Memory 64 Technology (EM64T), with AMD Opteron CPUs treated as Xeon EM64T processors; and an Eclipse GUI for 32-bit CPUs. The installation script detects the type of CPU and installs the appropriate options: on 64-bit CPUs both the 32-bit and the 64-bit compilers will be installed. It's also worth noting that the Intel compiler requires that both GNU C and GNU C++ (a.k.a. g++) are installed in order to function.

Our testing took on several dimensions:

Compiled code performance: GNU vs. Intel.
Code optimization: P3 base optimization vs. CPU-specific optimization.
CPU architecture: Intel Pentium 3, Xeon, Xeon EM64T, and AMD64 Opteron.
32-bit vs. 64-bit compiler performance.

Testing was done across a range of servers. Representing an entry-level SMB file and print server, we used a 2-way HP ProLiant ML350 G3 server with 2.2GHz Xeon CPUs. For an enterprise-class server suitable for large database-driven applications, we chose a 4-way HP ProLiant DL580 G3 server with 3.3GHz Xeon EM64T processors. Finally, for a high performance computing (HPC) server suitable for a clustered configuration, we chose a 4-way Appro 1142H with 2.2GHz Opteron 848 CPUs.

 
             

 
openBench Labs

Both the 32-bit and the 64-bit versions of oblCPU v4.0 are available for download at the openBench Labs site. Over the coming weeks, we will be adding more of our benchmarks to that repository.
 

A key element of this testing was to generate a new version of oblCPU normalized to GNU C 4.0 performance. We also wanted this new version to compile as a 64-bit application for AMD64 Opteron and Intel Xeon EM64T systems. In previous attempts with earlier compilers, a significant number of kernels generated segmentation faults. This time with GNU C 4.0, all of the kernels once again compiled and ran as 32-bit applications; however, two kernels related to LU matrix decomposition ran dramatically slower on both Opteron and Xeon EM64T systems. Removing those two kernels from the suite was all that was needed to compile and run 64-bit GNU and Intel versions of oblCPU.

 
     
 
We created a new version of oblCPU, which we normalized to the performance of a 1GHz P3 processor running SUSE 10. We then used this benchmark to compare CPU architectures and compilers. The bottom line turned out to be superpipelining: Intel employs this technique of boosting processor power in its P4 and Xeon families. Superpipelining is transparent to high-level language programmers, but is highly dependent on compilers to generate compatible code. It is here that the Intel compiler shines with a 30% advantage. Nonetheless, on P3 and AMD64 Opteron architectures, the GNU C compiler holds a distinct 10% edge over the Intel compiler.
 
     
 

Next we generated 32-bit executables of oblCPU v4.0 using GNU C 4.0 and Intel C/C++ v9.0. Using GNU C, we generated four executables, optimized for P3, P4, EM64T, and AMD64 (K8) architectures. Using the Intel compiler, we were limited to creating three versions, one for each of the Intel architectures. The results of these benchmarks painted a remarkable picture for anyone trying to maximize Linux performance.

With the GNU C compiler, our substitution of processor-specific architecture optimizations for P3 architecture optimizations changed the results of all of the kernels; however, while these changes were also reflected in the 95% confidence interval for kernel performance, the geometric means showed virtually no difference in performance. This phenomenon was also measured in the performance of our benchmark suite when compiled with Intel C/C++ v9.0 and run on the Opteron CPU.

Interestingly, Opteron performance scaled linearly with clock speed vis à vis P3 performance using both GNU C 4.0 and Intel C/C++ v9.0. Given the performance of oblCPU 4.0 on a P3 after being compiled with GNU C (100) and Intel C/C++ (94), the relative performance of oblCPU on the 2.2 GHz Opteron compiled with GNU C (220) and Intel C/C++ (199) was reasonably in line.
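The scaling claim is easy to check against the quoted scores. The sketch below assumes the simplest possible model, in which performance is directly proportional to clock speed, and compares that prediction with the four figures given in the paragraph above; the `expected_score` helper is introduced here purely for illustration.

```python
def expected_score(reference_score, reference_ghz, target_ghz):
    """Score predicted if performance scales linearly with clock speed."""
    return reference_score * target_ghz / reference_ghz


# Figures quoted in the article: (P3 score at 1GHz, Opteron score at 2.2GHz).
results = {"GNU C 4.0": (100, 220), "Intel C/C++ v9.0": (94, 199)}

for compiler, (p3_score, opteron_score) in results.items():
    predicted = expected_score(p3_score, 1.0, 2.2)
    ratio = opteron_score / predicted
    print(f"{compiler}: predicted {predicted:.1f}, measured {opteron_score}, ratio {ratio:.2f}")
```

The GNU C result matches the linear prediction, while the Intel result lands about 4% below it, which is consistent with the article's description of the figures as reasonably in line.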

Performance scaling on the Intel Xeon family of 32- and 64-bit CPUs was a study in contrasts. With both compilers, the 64-bit Xeon EM64T scaled linearly with clock speed when compared to the 32-bit Xeon. Nonetheless, similarities in performance characteristics ended with that comparison. Starting with our base P3 optimization scheme, performance of oblCPU using the Intel C/C++ compiler provided roughly a 13% advantage in CPU performance compared to using the GNU C compiler.

More importantly, when we substituted P4/Xeon optimization for P3 optimization using the Intel C/C++ compiler, we increased benchmark performance by more than another 13%. With that compile-time substitution, the Intel C/C++ version of oblCPU had a 30% performance edge over the standard GNU C-compiled version of oblCPU. Since the Xeon EM64T scaled with clock speed vis à vis the Xeon with both compilers, our benchmark sported the same 13% and 30% advantages using Intel C/C++ on the Xeon EM64T CPU. The cost of these dramatic improvements came in the form of significantly larger binary executables, on the order of 50% larger.
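Because the two improvements multiply rather than add, it is worth working through how a 13% edge and a further gain of "more than 13%" combine into roughly 30%. The sketch below uses only the percentages quoted above; the helper names are illustrative.

```python
def compound(*gains):
    """Combine successive fractional gains multiplicatively."""
    total = 1.0
    for gain in gains:
        total *= 1.0 + gain
    return total - 1.0


def second_gain_needed(first_gain, total_gain):
    """Second-step gain required to reach total_gain after first_gain."""
    return (1.0 + total_gain) / (1.0 + first_gain) - 1.0


print(f"two 13% steps combined: {compound(0.13, 0.13):.1%}")
print(f"second step needed for 30% overall: {second_gain_needed(0.13, 0.30):.1%}")
```

Two steps of exactly 13% compound to just under 28%, so the quoted 30% overall edge implies a second step of about 15%, in line with the "more than another 13%" wording.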

Having run all of our 32-bit versions of oblCPU across all systems, we turned our attention to compiling oblCPU on both of our 64-bit servers. It is most important to note that we made no changes to the code that we had used in our 32-bit versions of the benchmark. As will be the case at most sites, our initial foray into 64-bit computing was a simple recompilation effort. In our first 64-bit computing tests, we were looking for any transparent boost that would be a result of the underlying CPU architecture.

Given all of the hoopla surrounding 64-bit computing, we expected to measure a significant difference. Intel's marketing for Xeon EM64T processors touts larger caches, greater memory access, and 64-bit registers as having the potential to push performance by 30-to-40%. In our case, potential faded into stark similarity. The most distinguishing performance characteristic of a 64-bit version of oblCPU was its inability to run on a 32-bit system.

While performance measurements showed remarkable uniformity in the 32-bit and 64-bit versions of oblCPU, there was one anomaly worth highlighting. The 64-bit version of oblCPU generated with the Intel C/C++ v9.0 compiler exploited the Xeon EM64T architecture just as well as the 32-bit version. Nonetheless, the size of the 64-bit binary executable generated by the Intel C/C++ compiler was no larger than the executables generated by GNU C.

The bottom line for the IT manager looking to maximize Linux performance is that all roads lead to Rome, but the roads are decidedly one way. Currently the fastest Opteron and Xeon processors are clocked at 2.8GHz and 3.8GHz respectively. Based on the performance of oblCPU, both of those processors should deliver about 2.75 times the processing power of a 1GHz P3, assuming programs are compiled with GNU C or Intel C/C++ respectively. Use the GNU C compiler on that 3.8 GHz Xeon and program performance will drop to about 2.3 times that of a 1GHz P3. Run a program compiled with Intel C/C++ on that 2.8GHz Opteron and performance will drop to about 2.3 times that of a 1GHz P3.
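The figures in this summary can be restated as performance per gigahertz and as the penalty for pairing a processor with the wrong compiler. The sketch below uses only the clock speeds and relative-performance numbers quoted in the paragraph above; the helper names are illustrative and nothing here is a new measurement.

```python
# From the paragraph above: (clock in GHz, performance relative to a 1GHz P3).
cases = {
    ("Opteron", "GNU C"): (2.8, 2.75),
    ("Opteron", "Intel C/C++"): (2.8, 2.3),
    ("Xeon", "Intel C/C++"): (3.8, 2.75),
    ("Xeon", "GNU C"): (3.8, 2.3),
}


def per_ghz(clock_ghz, relative_performance):
    """Relative performance delivered per GHz of clock speed."""
    return relative_performance / clock_ghz


def penalty(best, other):
    """Fractional loss from choosing the slower compiler."""
    return 1.0 - other / best


for (cpu, compiler), (clock, perf) in cases.items():
    print(f"{cpu} with {compiler}: {per_ghz(clock, perf):.2f} per GHz")

print(f"penalty for the mismatched compiler: {penalty(2.75, 2.3):.0%}")
```

On these numbers either mismatch costs about 16% of the available performance, and the Opteron delivers noticeably more work per clock cycle than the Xeon even when each is paired with its better compiler.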