|
|
HIGH CMANSHIP
REDUX: SERVER I/O |
|
HP’s 1U Tualatin-powered Netserver rocks the rack. |
||||
by
Jack Fegreus and
Keith Walls |
|
First we tested Intel’s C++ compiler by recompiling our CPU benchmark suite and running the code on a typical high-end corporate laptop system: an HP Omnibook 6000. The results were impressive enough for OpenBench Labs to make the Intel-compiled version of our benchmarks our new standard. Now we've run the benchmarks on HP's newest generation of multiprocessor servers optimized for rack density, the 1U Netserver LP 1000r. We didn't expect to encounter any major performance surprises. We did.

For the corporate survivors of the dot-gone debacle, the constraints on IT infrastructure are tougher than ever before. User sophistication has grown almost as quickly as the complexity of the technology. Thanks to the explosive growth in cable modem and DSL usage in the SOHO arena, service expectations for Internet servers have become just as demanding as the expectations for an intranet server on a corporate LAN. A server can never be too fast, but it must never be ‘not there.’ |
|
|
||||||
|
To meet these demands, server farms featuring racks of super thin servers with very phat processing capabilities flourish throughout IT. Additional hardware and software features such as RAID, load balancing and high-availability clustering are often added to these server platforms.

Within this context, OpenBench Labs began an examination of the HP Netserver LP 1000r running under Red Hat Linux 7.2. Our phat Netserver came equipped with dual Pentium III processors clocked at 1.26GHz, 512MB of PC133 SDRAM, two embedded 100Mbit-per-second NICs, an HP NetRAID-2M controller, and three hot-swappable disk bays. These hot CPUs are the new Tualatin processors, which sport a 0.13-micron core design in contrast to the 0.18-micron Coppermine design. Tualatin CPUs feature a 512KB Level 2 cache on the chip and consume less power than the earlier Coppermine P-IIIs. Nonetheless, popping open the tool-less chassis housing the Netserver LP 1000r easily confirms that everything possible has been done to maximize airflow around these chips.

We started our tests by running the OBLcpu benchmark compiled with GNU C v2.95 and Intel C++ v5.05. CPU performance with the GNU-compiled benchmark suite was right in line with our expectations based on previous P-III systems. We anticipated that performance would be about 8-to-10% less than 4 times the performance of a 300MHz P-III system running under Windows 2000. Posting a geometric mean of 369, the 1.26GHz Tualatin CPU clocked in right on target. |
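The arithmetic behind that expectation is easy to check. The sketch below assumes the 300MHz reference system scores 100 on OBLcpu, which the figures imply but the text does not state outright; the function and constant names are ours, for illustration only.

```python
# Assumption: the 300MHz P-III reference platform scores 100 on OBLcpu.
REFERENCE_SCORE = 100.0


def expected_range(multiple, low_shortfall, high_shortfall,
                   reference=REFERENCE_SCORE):
    """Return (low, high) expected scores for a system that should run
    `multiple` times faster than the reference, less a shortfall that
    lies between `low_shortfall` and `high_shortfall` (as fractions)."""
    ideal = multiple * reference
    return ideal * (1.0 - high_shortfall), ideal * (1.0 - low_shortfall)


# "8-to-10% less than 4 times the performance" of the reference system
low, high = expected_range(4, 0.08, 0.10)
measured = 369  # geometric mean reported for the GNU-compiled OBLcpu

# Shortfall of the measured score relative to the ideal 4x figure
shortfall = 1.0 - measured / (4 * REFERENCE_SCORE)

print(f"clock ratio:     {1260 / 300:.2f}x")
print(f"expected score:  {low:.0f} to {high:.0f}")
print(f"measured score:  {measured} ({shortfall:.1%} below 4x)")
```

Under that assumption the measured score sits at the top edge of the predicted window, which is what "right on target" means here.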
|
|
||||||
|
Perhaps this explains the thunderous yawn that Tualatin has received in the processor arena. The Tualatin processor has been all but lost in the lavish attention paid to the hyperactive Pentium-4 CPU. Despite all the brouhaha, however, current Pentium-4 CPUs are not enabled for multiprocessor configurations. This leaves Tualatin, with its capacious 512KB L2 cache, as the de facto low-cost, high-performance alternative to the Pentium Xeon in dual-processor SMP server configurations such as our HP Netserver LP 1000r.

As a result, we were hoping that we might see an even higher performance increase than normal with the Intel-compiled version of OBLcpu. Such an increase would be an indication that the Intel C++ compiler was able to generate an executable that could exploit the Netserver's dual CPUs via medium-grained auto-parallelization of the source code. The results, however, were quite consistent with all of the results that we have seen in our tests of single-processor systems. In particular, the Intel-compiled version of the OBLcpu suite pegged CPU performance at 5.29 times that of our reference platform, which is a 300MHz P-III system running Windows 2000.

The first significant difference in performance exposed by the new benchmark executables on the Netserver came in our memory bandwidth tests. When we ran the OBLmemband benchmark as compiled under version 2.95 of GNU C, the results were very much in line with those of all the other top-line multiprocessor PC servers that we have tested running under both Linux and Windows 2000. |
|
|
For strides of 4 bytes, throughput with a single thread peaked at around 370MB per second. Typical performance of Pentium-III systems with PC133 SDRAM and a 256KB L2 cache is in the range of 250-to-300MB per second for such small strides. Also typical of dual-processor SMP systems built using Pentium-III CPUs is a near doubling of performance for small 4-byte strides with two threads. On the HP Netserver LP 1000r, the GNU-compiled version of OBLmemband pegged peak throughput at 670MB per second with two active threads, each using a stride of 4 bytes. At 16-byte strides, memory throughput for two threads converged on the performance level of a single thread. This behavior is typical of Pentium-III systems under both Linux with GNU C and Windows 2000 with MS Visual C++.

When we ran OBLmemband as compiled with the Intel C++ compiler, memory bandwidth increased for all strides. In particular, the profile of throughput performance more closely resembled that of an AMD workstation with PC2100 DDR memory, the double-clocked memory that AMD systems use on a 266MHz front-side bus. For small 4-byte strides, throughput with one thread soared to 724MB per second as compared to 370MB per second. With two threads, there was only a modest increase to 736MB per second. Moreover, even when we were paging off of the disk with 512-byte strides, the Intel-compiled benchmark held a slight 10% performance edge with throughput at 30MB per second. |
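For readers unfamiliar with stride-based memory tests, the sketch below shows the general idea: walk a buffer, read one word every `stride` bytes, and divide the bytes read by the elapsed time. It is not OBLmemband, whose internals are not described here. The function names, the 4-byte word size and the buffer size are our assumptions, and because this is interpreted Python, the absolute numbers it prints reflect interpreter overhead and cannot be compared with the compiled benchmark's figures.

```python
import time


def strided_read(buf, stride, word=4):
    """Touch one `word`-byte location every `stride` bytes.

    Returns (bytes_read, checksum). The checksum exists only so the
    reads have a visible result.
    """
    view = memoryview(buf)
    bytes_read = 0
    checksum = 0
    for offset in range(0, len(buf) - word + 1, stride):
        checksum ^= view[offset]  # read the first byte of each word
        bytes_read += word
    return bytes_read, checksum


def throughput_mb_per_s(buf, stride):
    """Time one strided pass and report MB of data read per second."""
    start = time.perf_counter()
    bytes_read, _ = strided_read(buf, stride)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return bytes_read / elapsed / 1e6


# Hypothetical buffer size; a real test would use a buffer far larger
# than the L2 cache (512KB on Tualatin) to force traffic to main memory.
buffer_1mb = bytearray(1024 * 1024)
for stride in (4, 16, 64, 512):
    rate = throughput_mb_per_s(buffer_1mb, stride)
    print(f"stride {stride:3d} bytes: {rate:8.2f} MB/s")
```

Larger strides touch fewer bytes per cache line fetched, which is why measured throughput in tests of this kind generally falls as the stride grows.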
|
|||||
|
Analyzing the results of OBLload compiled with both the Intel C++ and GNU C compilers produced an unanticipated result: the Intel-compiled benchmark code demonstrated a distinct performance edge in transaction-processing (TP) I/O performance. Examining the total number of I/Os processed versus the number of processes issuing the I/O requests, we find that with more than 40 processes, the number of I/Os completed with the Intel-compiled code is about 15% higher. The Intel-compiled code appears to be utilizing the multiprocessor architecture of the HP Netserver LP 1000r more efficiently.

This observation is corroborated when the benchmark results are examined in terms of the average I/O response time versus the number of I/O operations being processed. With a response time of 35ms, performance of the GNU-compiled code hit a wall of 1400 I/Os per second. As the response time increased beyond 35ms, the number of I/O operations that could be completed did not increase. For a 35ms response time, the I/O completion count for the Intel-compiled benchmark was also 1400 I/O operations per second. Nonetheless, as response time lengthened, the Intel-compiled benchmark continued to complete a growing number of I/O operations. The peak I/O load grew to just under 1600 I/O operations per second before response time exceeded 100ms.

It should be noted that in both cases these performance levels are far in excess of the demands of all but the very largest TP operations. The results with the benchmarks compiled with GNU C would in themselves peg the HP Netserver LP 1000r as a superior system. The combination of the Netserver architecture and the Intel-compiled benchmarks raises that performance level to nothing short of stellar. There is still one more element that HP has added to the Netserver LP 1000r: TopTools agents for Linux. |
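The throughput and response-time figures above can be tied back to the number of I/O requests in flight using Little's law (requests in flight = throughput × response time). The sketch below is our own back-of-the-envelope check, not part of OBLload, and it assumes the quoted numbers are steady-state averages. It treats "just under 1600" as 1600, so the second result is an upper bound.

```python
def outstanding_requests(iops, response_time_ms):
    """Little's law: average requests in flight = rate x time in system."""
    return iops * response_time_ms / 1000.0


# GNU-compiled code: throughput stops growing at 1400 I/Os per second
# once response time reaches 35ms.
gnu_knee = outstanding_requests(1400, 35)

# Intel-compiled code: roughly 1600 I/Os per second at 100ms.
intel_peak = outstanding_requests(1600, 100)

# Relative gain in peak I/O rate, Intel-compiled over GNU-compiled.
peak_gain = 1600 / 1400 - 1.0

print(f"requests in flight at the GNU knee:   {gnu_knee:.0f}")
print(f"requests in flight at the Intel peak: {intel_peak:.0f}")
print(f"peak throughput gain:                 {peak_gain:.1%}")
```

On these assumptions the GNU-compiled code saturates with roughly 50 requests in flight, which squares with the observation that the two builds diverge once more than 40 processes are issuing I/O, and the gain in peak rate lands close to the 15% difference reported.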
|
The TopTools agents for Linux rely on SNMP to provide remote management capabilities to a Windows system running the TopTools management console. The management console is not yet available for Linux servers or workstations. TopTools can also be integrated into other network management frameworks such as Tivoli, HP's OpenView and CA's Unicenter. In this first version, the TopTools agents on Linux are not yet as comprehensive as the agents on Windows, which provide more robust information about the total state of the client through WBEM. Currently, system status information for a Netserver running Linux is limited to just the status of the networking subsystem. Nonetheless, the networking details, which include uptime statistics and live NIC traffic volume, provide all of the basic essentials needed to monitor, if not manage, this server. |
|
|
On our Omnibook test platform running Windows XP Professional, the performance speedup versus MS Visual C++ 6.0 was not quite as dramatic as when running Linux. Here the improvement was on the order of 35%. This was still enough to keep the performance of the Intel-compiled OBLcpu benchmark on Windows XP Pro marginally faster (about 10%) than the performance of OBLcpu when compiled under Intel and run on SuSE 7.3 Linux.

That becomes even more interesting when Microsoft Visual Studio .NET is brought into the equation. While it is still a beta product, tests of the new compiler have so far proven it to be about 25% faster than the current MS Visual C++. That puts the final performance numbers for OBLcpu compiled under Intel on Linux and OBLcpu compiled under Visual Studio .NET on Windows 2000 dead even. Furthermore, while it is not a reasonable comparison until the new Microsoft Windows .NET servers are shipping and Visual Studio .NET is in production, the installation of the Intel compiler is far faster and easier than the installation of Visual Studio .NET.

To round out our single-processor CPU tests of the Intel compilers, we turned our final attention to a system running with an AMD Athlon CPU. In all previous tests with both the GNU C and MS Visual C++ compilers, AMD Athlon CPUs have consistently performed about 20% faster than a comparably clocked Intel Pentium III CPU, measured as CPU processing power per MHz. When we ran the Intel-compiled version of OBLcpu on the Athlon-powered system, the percent improvement was virtually identical to the results on the Omnibook. |
|||
|
|