SUSE LINUX 10's GNU POWER DIMENSION

As Linux moves into the role of a core rather than edge technology, server utilization and efficiency are no longer just the domain of high-performance computing. To meet the growing demands on processor power, SUSE Linux 10 has fully embraced the new GNU Compiler Collection.
To do this, Xen is architected as a "para-virtualization" scheme.
The VMM, also dubbed the hypervisor, presents each virtual machine with a hardware abstraction that is
similar, but not identical, to the underlying system architecture. As a result, for Xen to host an
operating system, one of two things must occur: For guest Linux systems there is a specialized Xen Linux micro-kernel. YaST, the SUSE Linux system management foundation, has a module group that simplifies the installation and configuration of virtual hosts running SUSE Linux 10. To run Windows on Xen, it will be necessary to have a system with Intel's VT or AMD's Pacifica processors.

The modified Linux micro-kernel does not change the application binary interface (ABI), so in theory no changes are required to guest applications. In reality, things can be a bit more painful. In particular, Linux support for the Native POSIX Thread Library (NPTL) requires Thread Local Storage (TLS), and for Xen this can be a problem. On x86-32 architecture, Xen uses memory segmentation to protect the memory used by the hypervisor in a way that is incompatible with the way the TLS library uses memory segmentation. As a result, it is often necessary to disable the TLS library. We found this necessary in order to run the oblCPU benchmark suite on SUSE Linux 9.3 with Xen 2.6.
Version 3.0 of Xen, which is bundled with SUSE Linux 10.0, makes a number of significant advances over the 2.6 version, including support for x86-64 architecture, which encompasses both Intel EM64T and AMD Opteron processors. Supporting x86-64 requires a number of significant changes in Xen. The most important difference is the lack of memory segment limits, which prohibits segmentation from being used to protect the hypervisor. On x86-64 CPUs, page-level protection is employed instead, in a manner that is particularly suitable to the Opteron architecture. Over the coming weeks, openBench Labs will be doing considerable testing of Xen 3.0 on SMP Opteron systems sporting dual-core CPUs.
The futuristic notion of implicit queries calls for a very powerful search mechanism. As that mechanism, Beagle is equally at home today fetching data in response to explicit queries.
Beagle is architected in a classic client-server manner. On the backend there is the Beagle
daemon, which builds an index for each user in the background using the Apache project's Lucene search engine as a
technology base. Requests come from Beagle client programs, which call on the Beagle search services. The primary
client tool is a query tool dubbed "Best," which is an acronym for "bleeding edge search tool". Best can search a
wide range of document formats including source code, MS Office, OpenOffice, Web caches, images, audio files,
email, RSS feeds, and instant messages. Because Beagle is considered a Gnome desktop application, SUSE 10 will not automatically install it with the KDE desktop. Nonetheless, Best and a number of other applications will work nicely on a KDE desktop once the Beagle daemon is manually launched. You'll need to edit the .profile script to launch Beagle automatically.

We did, however, run into a problem with kmail, which Beagle documents as being supported. Beagle was easily able to find email messages, addresses, and calendar data related to a Best query when that data was stored within the Evolution personal information manager. Data in email messages stored in kmail went undetected by Beagle.
This technique of explicitly defining which files and program modules an application can access has long been used to ensure the accuracy of batch processes in mainframe-class IT operations. More importantly, such a scheme extends nicely into today's open networking environment as a means to prevent intruders from exploiting a back door that might exist in any application and thereby gaining access to business-critical files.
Gone too is the practice of implementing compilers for languages such as FORTRAN so that they first generate code in another high-level language, such as C. GCC compilers now generate machine code directly. This should not be confused with the C preprocessor, which is an integral feature of the C, C++, Objective-C, and Objective-C++ languages. More importantly, the GCC language-independent optimizers have been extensively overhauled. Michael Matz, GCC Team leader for SUSE Linux at Novell, cautions: “Users may see some significant changes in performance for the better; however, changes in the optimizer can also degrade performance in some cases. Every component of SUSE Linux 10.0 distribution has been compiled and checked with GCC 4.0, to provide users with improved responsiveness.”
When we loaded SUSE Linux 10.0, there was indeed a perceived difference in system performance. Standard configuration tasks seemed quicker. To verify these qualitative perceptions, we ran our purely quantitative oblCPU benchmark suite.

We began by running version 3.0 of the oblCPU benchmark suite. Version 3.0 has 34 computational kernels, which report execution time normalized to that of a 1GHz Pentium 3 processor; a legacy HP Netserver is reserved for that task. For many, that choice of a reference system raises a number of red flags. Nonetheless, normalizing oblCPU results to an older-generation P3 CPU has several interesting advantages. First, the P3 remains a popular, well-known system, as many older P3 systems are still in use. More importantly, in the jump from the P3 to the P4, from which the Xeon evolved, Intel designed much deeper pipelines into the processor's architecture. That radical departure leaves AMD's families of processors more closely aligned with the P3 today than Intel's own.

Pipelining is designed to exploit parallelism in instructions. Instructions are executed as they pass through the pipeline in stages. The number of stages defines the depth of the pipeline. All of the stages run in lockstep during a machine cycle. The longest time required to move an instruction through a pipe stage defines the length of time for a machine cycle, which is then applied to every pipe stage. For an individual instruction, the overhead of pipe stages slows execution. For multiple instructions, the effect can be quite the opposite.

The trick is to overlap instructions so that a new instruction enters the first stage of the pipeline on every machine cycle. As a result, a completed instruction pops out of the last stage of the pipeline on every machine cycle. If a pipeline has ten stages, a single instruction will still take ten machine cycles to pass through the pipeline. Nonetheless, once the pipeline is full, ten instructions, rather than one, will be completed in every ten machine cycles.
More importantly, all of this magic is totally masked from the high-level language programmer. If you think that sounds like alchemy, you're right. The "miracle of the instructions" is predicated on the assumption that the pipeline is kept full. That assumption is a pretty tall order. Certain events, dubbed hazards, can make it impossible for an instruction to execute within the proper pipeline stage on the designated machine cycle. These hazards can occur for numerous reasons, including an unavailable hardware unit, required data not being ready for access, or an unexpected branch in logic. When such hazards occur, the pipeline stalls as machine cycles pass with no processing done. Worse yet, a branch hazard could force the pipeline to be flushed and restarted. As a result, the structure of compiled code is essential for exploiting processor pipelines and avoiding costly stalls.
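To put rough numbers on the pipelining argument above, here is a minimal sketch of an idealized pipeline model. It assumes a single-issue pipeline in which the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor; the function names and the flat `stall_cycles` penalty are illustrative assumptions, not a description of any particular Pentium or Opteron design.

```python
def pipeline_cycles(instructions: int, stages: int, stall_cycles: int = 0) -> int:
    """Cycles to finish `instructions` on an idealized `stages`-deep pipeline.

    The first instruction takes `stages` cycles to fill the pipe; each later
    instruction completes one cycle after the one before it. `stall_cycles`
    is the total number of cycles lost to hazards (an assumed flat penalty).
    """
    if instructions <= 0:
        return 0
    return stages + (instructions - 1) + stall_cycles


def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles needed if each instruction must finish before the next starts."""
    return instructions * stages


def speedup(instructions: int, stages: int, stall_cycles: int = 0) -> float:
    """Ratio of unpipelined to pipelined execution time."""
    return unpipelined_cycles(instructions, stages) / pipeline_cycles(
        instructions, stages, stall_cycles
    )


if __name__ == "__main__":
    # One instruction gains nothing; a long, stall-free stream approaches
    # a speedup equal to the pipeline depth; stalls erode that gain.
    for n, stalls in [(1, 0), (10, 0), (1000, 0), (1000, 200)]:
        print(n, stalls, pipeline_cycles(n, 10, stalls), round(speedup(n, 10, stalls), 2))
```

In this model a ten-stage pipeline approaches a tenfold gain only for long instruction streams with no stalls, which is why the shape of the compiled code matters so much for deeply pipelined CPUs.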
We started our assessment by loading SUSE Linux 10 on our reference HP Netserver with its 1GHz P3 CPU. Next we ran version 3.0 of oblCPU, which had been compiled with GNU C 3.4 and normalized to the CPU time taken with our reference server running SUSE Linux 9.3. On SUSE Linux 10.0, the geometric mean for all of the computational kernels proved to be 10% faster than running on SUSE Linux 9.3. The statistical 95% confidence interval for performance of the 34 kernels ranged between 105 and 120.

Next we recompiled oblCPU version 3.0 using GNU C 4.0. This produced an executable version that was 5% smaller in size and 32% faster in execution. We then installed Intel C/C++ v9.0 and recompiled oblCPU v3.0 with similar options. Quite surprisingly, performance lagged measurably behind that of GNU C 4.0. More importantly, this advantage for the GNU C-compiled versions of oblCPU was also reflected in tests of AMD64 CPUs, which also eschew superpipelining.

We finished our initial testing by adding compiler options to exploit the P3 architecture. With GNU C, this amounted to adding two options: -mfpmath=sse and -march=pentium3. With both the GNU and Intel compilers, this change garnered about 17% more in CPU performance. What's more, these optimizations similarly improved performance on all of the systems tested and added virtually nothing to the size of the executable image. As a result, we included P3 optimization as a standard compilation option for oblCPU v4.0.
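The normalization and geometric-mean summary described above can be sketched as follows. This is only an illustration of the general technique: the article does not say how oblCPU computes its confidence interval, so the normal-approximation interval on log scores used here is an assumption, and the kernel timings are made-up placeholder values rather than measured results.

```python
import math
import statistics


def normalized_scores(reference_times, measured_times):
    """Score each kernel as 100 * reference_time / measured_time.

    A score of 100 means parity with the reference system; higher is faster.
    """
    return [100.0 * ref / t for ref, t in zip(reference_times, measured_times)]


def geometric_mean(scores):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def confidence_interval_95(scores):
    """Approximate 95% interval for the geometric mean.

    Assumption: normal approximation (z = 1.96) applied to the log scores.
    """
    logs = [math.log(s) for s in scores]
    mean = statistics.fmean(logs)
    sem = statistics.stdev(logs) / math.sqrt(len(logs))
    return math.exp(mean - 1.96 * sem), math.exp(mean + 1.96 * sem)


if __name__ == "__main__":
    # Hypothetical per-kernel CPU times in seconds (not oblCPU data).
    reference = [4.0, 2.5, 8.0, 1.2, 6.3]
    measured = [3.6, 2.3, 7.0, 1.1, 5.9]
    scores = normalized_scores(reference, measured)
    low, high = confidence_interval_95(scores)
    print(f"geometric mean: {geometric_mean(scores):.1f}")
    print(f"95% interval:   {low:.1f} to {high:.1f}")
```

A geometric mean is the usual choice for summarizing normalized ratios because it gives the same ranking no matter which system is picked as the reference.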
The Intel C/C++ package on Linux remains free for sites that are developing software for their own use and not commercial sale. The standard package includes a compiler and debugger for 32-bit systems, a compiler and debugger for 64-bit Xeon CPUs with Extended Memory 64 Technology (EM64T), and an Eclipse GUI for 32-bit CPUs. AMD Opteron CPUs are treated as Xeon EM64T processors. The installation script detects the type of CPU and installs the appropriate options: On 64-bit CPUs both the 32-bit and the 64-bit compilers will be installed. It's also worth noting that the Intel compiler requires that both GNU C and GNU C++ (a.k.a. g++) are installed in order to function. Our testing took on several dimensions:
Testing was done across a range of servers. Representing an entry-level SMB file and print server, we used a 2-way HP ProLiant ML350 G3 server with 2.2GHz Xeon CPUs. For an enterprise-class server suitable for large database-driven applications, we chose a 4-way HP ProLiant DL580 G3 server with 3.3GHz Xeon EM64T processors. Finally, for a high performance computing (HPC) server suitable for a clustered configuration, we chose a 4-way Appro 1142H with 2.2GHz Opteron 848 CPUs.
Both the 32-bit and the 64-bit versions of oblCPU v4.0 are available for download at the openBench Labs site. Over the coming weeks, we will be adding more of our benchmarks to that repository.
A key element of this testing was to generate a new version of oblCPU normalized to GNU 4.0 performance. We also wanted this new version to compile as a 64-bit application for both AMD64 Opteron and Intel Xeon EM64T processors. In previous attempts with earlier compilers, a significant number of kernels generated segmentation faults. This time with GNU 4.0, all of the kernels once again compiled and ran as 32-bit applications; however, two kernels related to LU matrix decomposition ran dramatically slower on both Opteron and Xeon EM64T systems. Removing those two kernels from the suite was all that was needed to compile and run 64-bit GNU and Intel versions of oblCPU.
Next we generated 32-bit executables of oblCPU v4.0 using GNU C 4.0 and Intel C/C++ v9.0. Using GNU C, we generated four executables, optimized for P3, P4, EM64T, and AMD64 (K8) architectures. Using the Intel compiler, we were limited to creating three versions, one for each of the Intel architectures.

The results of these benchmarks painted a remarkable picture for anyone trying to maximize Linux performance. With the GNU C compiler, our substitution of processor-specific architecture optimizations for P3 architecture optimizations changed the results of all of the kernels; however, while these changes were also reflected in the 95% confidence interval for kernel performance, the geometric means showed virtually no difference in performance. This phenomenon was also measured in the performance of our benchmark suite when compiled with Intel C/C++ v9.0 and run on the Opteron CPU.

Interestingly, Opteron performance scaled linearly with clock speed vis à vis P3 performance using both GNU C 4.0 and Intel C/C++ v9.0. Given the performance of oblCPU 4.0 on a P3 after being compiled with GNU C (100) and Intel C/C++ (94), the relative performance of oblCPU on the 2.2GHz Opteron compiled with GNU C (220) and Intel C/C++ (199) was reasonably in line.

Performance scaling on the Intel Xeon family of 32- and 64-bit CPUs was a study in contrasts. With both compilers, the 64-bit Xeon EM64T scaled linearly with clock speed when compared to the 32-bit Xeon. Nonetheless, similarities in performance characteristics ended with that comparison. Starting with our base P3 optimization scheme, performance of oblCPU using the Intel C/C++ compiler provided roughly a 13% advantage in CPU performance compared to using the GNU C compiler. More importantly, when we substituted P4/Xeon optimization for P3 optimization using the Intel C/C++ compiler, we increased benchmark performance by more than another 13%.
With that compile-time substitution, the Intel C/C++ version of oblCPU had a 30% performance edge over the standard GNU C-compiled version of oblCPU. Since the Xeon EM64T scaled with clock speed vis à vis the Xeon with both compilers, our benchmark sported the same 13% and 30% advantages using Intel C/C++ on the Xeon EM64T CPU. The cost of these dramatic improvements came in the form of significantly larger binary executables, on the order of 50% larger.

Having run all of our 32-bit versions of oblCPU across all systems, we turned our attention to compiling oblCPU on both of our 64-bit servers. It is most important to note that we made no changes to the code that we had used in our 32-bit versions of the benchmark. As will be the case at most sites, our initial foray into 64-bit computing was a simple recompilation effort.

In our first 64-bit computing tests, we were looking for any transparent boost that would be a result of the underlying CPU architecture. Given all of the hoopla surrounding 64-bit computing, we expected to measure something significant. Intel's marketing for Xeon EM64T processors touts larger caches, greater memory access, and 64-bit registers as having the potential to push performance by 30-to-40%. In our case, potential faded into stark similarity. The most distinguishing performance characteristic of a 64-bit version of oblCPU was its inability to run on a 32-bit system.

While performance measurements showed remarkable uniformity in the 32-bit and 64-bit versions of oblCPU, there was one anomaly worth highlighting. The 64-bit version of oblCPU generated with the Intel C/C++ v9.0 compiler exploited the Xeon EM64T architecture just as well as the 32-bit version. Nonetheless, the size of the 64-bit binary executable generated by the Intel C/C++ compiler was no larger than the executables generated by GNU C.

The bottom line for the IT manager looking to maximize Linux performance is that all roads lead to Rome, but the roads are decidedly one way. Currently the fastest Opteron and Xeon processors are clocked at 2.8GHz and 3.8GHz respectively. Based on the performance of oblCPU, both of those processors should deliver about 2.75 times the processing power of a 1GHz P3, assuming programs are compiled with GNU C or Intel C/C++ respectively. Use the GNU C compiler on that 3.8GHz Xeon and program performance will drop to about 2.3 times that of a 1GHz P3. Run a program compiled with Intel C/C++ on that 2.8GHz Opteron and performance will drop to about 2.3 times that of a 1GHz P3.
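As a quick sanity check on the clock-scaling claim made earlier, the sketch below compares the Opteron scores quoted in the text (220 and 199 at 2.2GHz) with what strict linear scaling from the 1GHz P3 scores (100 and 94) would predict. It also compounds the two successive 13% gains reported for Intel C/C++ on the Xeon. The helper names are illustrative, and the model assumes nothing beyond simple proportionality to clock speed.

```python
def expected_score(base_score: float, base_ghz: float, target_ghz: float) -> float:
    """Score predicted by strict linear scaling with clock speed."""
    return base_score * target_ghz / base_ghz


def deviation_pct(measured: float, expected: float) -> float:
    """Signed percentage by which a measured score departs from the prediction."""
    return 100.0 * (measured - expected) / expected


def compound_gain(*gains: float) -> float:
    """Combine successive fractional gains, e.g. 0.13 then 0.13."""
    total = 1.0
    for g in gains:
        total *= 1.0 + g
    return total - 1.0


P3_GHZ = 1.0
OPTERON_GHZ = 2.2

# (score on the 1GHz P3, score on the 2.2GHz Opteron), as quoted in the text.
quoted = {
    "GNU C 4.0": (100.0, 220.0),
    "Intel C/C++ v9.0": (94.0, 199.0),
}

if __name__ == "__main__":
    for compiler, (p3_score, opteron_score) in quoted.items():
        predicted = expected_score(p3_score, P3_GHZ, OPTERON_GHZ)
        print(f"{compiler}: predicted {predicted:.1f}, quoted {opteron_score:.1f}, "
              f"deviation {deviation_pct(opteron_score, predicted):+.1f}%")
    print(f"two successive 13% gains compound to {compound_gain(0.13, 0.13):.1%}")
```

Under these assumptions the GNU C figure matches linear scaling and the Intel figure falls a few percent short, which fits the "reasonably in line" wording. Two gains of 13% compound to a little under 28%, so a second gain of "more than another 13%" is consistent with the roughly 30% edge reported.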