HIGH CMANSHIP REDUX:
SERVER I/O
   
 

HP’s 1U Tualatin-powered Netserver rocks the rack.

   
by Jack Fegreus and Keith Walls
     
 

First we tested Intel’s C++ compiler by recompiling our CPU benchmark suite and running the code on a typical high-end corporate laptop system: an HP Omnibook 6000. The results were impressive enough for OpenBench Labs to make the Intel-compiled version of our benchmarks our new standard. Now we've run the benchmarks on HP's newest generation of multiprocessor servers optimized for rack density, the 1U Netserver LP 1000r. That said, we didn't expect to encounter any major performance surprises.  We did.

For the corporate survivors of the dot-gone debacle, the constraints on IT infrastructure are tougher than ever before. User sophistication has grown almost as quickly as the complexity of the technology. Thanks to the explosive growth in cable modem and DSL usage in the SOHO arena, service expectations for Internet servers have become just as demanding as the expectations for an intranet server on a corporate LAN. A server can never be too fast, but it must never be ‘not there.’

 
       
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
HP Netserver LP 1000r
HP NetRAID-2M

http://hp.com
Intel C++ for Linux, Windows
http://developer.intel.com/software/products/compilers/

HOW WE TESTED
RedHat Linux 7.2
http://www.redhat.com
Cremax ICY Dock MB018
http://www.cremax.com
(4) Seagate Cheetah Ultra160 SCSI drives
http://www.seagate.com
OBLcpu benchmark
OBLmemband
OBLload

KEY FINDINGS
 Using the Intel-compiled benchmark, memory throughput on the HP Netserver closely matched that of an AMD system with double clocked DDRAM.
 Using the HP NetRAID controller the system was able to sustain 1600 I/Os per second and maintain a response time of less than 100ms.
 HP TopTools agents for Linux provide basic network performance data; however, analysis must be carried out on a Windows system.

 

To meet these demands, server farms featuring racks of super thin servers with very phat processing capabilities flourish throughout IT. To these server platforms, additional hardware and software features such as RAID, load balancing and high-availability clustering are often added. Within this context, OpenBench Labs began an examination of the HP Netserver LP 1000r running under RedHat Linux v7.2.

Our phat Netserver came equipped with dual Pentium III processors clocked at 1.26 GHz, 512MB of PC133 SDRAM, two embedded 100-Mbit per second NICs, an HP NetRAID-2M controller, and 3 hot-swappable disk bays. These hot CPUs are the new Tualatin processors which sport a .13-micron design core in contrast to the .18-micron Coppermine design. Tualatin CPUs feature a 512KB Level 2 cache on the chip and consume less power than the earlier Coppermine P-IIIs. Nonetheless, popping open the tool-less chassis housing the Netserver LP 1000r easily confirms that everything possible has been done to maximize airflow around these chips.

We started our tests by running the OBLcpu benchmark compiled with GNU C v2.95 and Intel C++ v5.05. CPU performance with the GNU-compiled benchmark suite was right in line with our expectations for previous P-III systems. We anticipated that performance would be about 8-to-10% less than 4 times the performance of a 300MHz P-III system running under Windows 2000. Posting a geometric mean of 369, the 1.26GHz Tualatin CPU clocked in right on target.
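The arithmetic behind that expectation can be sketched in a few lines. This is only an illustration: it assumes the 300MHz reference system is normalized to a score of 100, which would be consistent with the reported geometric mean of 369 but is not stated outright, and the helper names are hypothetical rather than part of OBLcpu.

```python
import math

# Assumption: the 300MHz P-III reference platform is normalized to 100.
REFERENCE_SCORE = 100.0


def geometric_mean(values):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def expected_range(multiple=4.0, low_pct=8.0, high_pct=10.0):
    """Score range for 'low_pct-to-high_pct less than multiple x reference'."""
    ideal = multiple * REFERENCE_SCORE
    return ideal * (1 - high_pct / 100.0), ideal * (1 - low_pct / 100.0)


def shortfall_pct(score, multiple=4.0):
    """How far, in percent, a score falls below multiple x reference."""
    return (1 - score / (multiple * REFERENCE_SCORE)) * 100.0


if __name__ == "__main__":
    low, high = expected_range()
    print(f"expected range: {low:.0f} to {high:.0f}")
    print(f"369 falls {shortfall_pct(369):.2f}% short of 4x the reference")
```

Under that normalization, a score of 369 sits just under 8% below four times the reference, at the top edge of the anticipated range. A geometric mean is the usual choice for combining per-test ratios because no single test with a large ratio can dominate the composite.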

 
       
 

Perhaps this explains the thunderous yawn that Tualatin has received in the processor arena. The Tualatin processor has been all but lost in the lavish attention paid to the hyperactive Pentium-4 CPU.

Despite all the brouhaha, however, current Pentium-4 CPUs are not enabled for multiprocessor configurations. This leaves Tualatin, with its capacious 512KB L2 cache, as the de facto low-cost, high-performance alternative to the Pentium Xeon in dual-processor SMP server configurations such as our HP Netserver LP 1000r.

As a result, we were hoping that we might see an even higher performance increase than normal with the Intel-compiled version of OBLcpu. Such an increase would be an indication that the Intel C++ compiler was able to generate an executable that could exploit the Netserver's dual CPUs via medium-grained auto-parallelization of the source code. The results, however, were quite consistent with all of the results that we have seen in our tests of single processor systems. In particular, the Intel-compiled version of the OBLcpu suite pegged CPU performance at 5.29 times that of our reference platform, which is a 300MHz P-III system running Windows 2000.

The first significant difference in performance exposed by the new benchmark executables on the Netserver came in our memory bandwidth tests. Running the OBLmemband benchmark as compiled under version 2.95 of GNU C, the results were very much in line with all of the other top-line multiprocessor PC servers that we have tested running under both Linux and Windows 2000.

 
The GNU and Intel versions of our CPU benchmark produced results entirely consistent with our initial tests of OBLcpu on a Pentium-III laptop. There was no indication of any performance improvement derived from exploiting the SMP architecture of our Netserver.
 

For strides of 4 bytes, throughput with a single thread peaked at around 370MB per second. Typical performance of Pentium-III systems with PC133 SDRAM and a 256KB L2 cache is in the range of 250-to-300MB per second for such small strides. Also typical of dual-processor SMP systems built using Pentium-III CPUs is a doubling of performance for small 4-byte strides with two threads.

On the HP Netserver LP 1000r, the GNU-compiled version of OBLmemband pegged peak throughput at 670MB per second with two active threads with each thread having a stride of 4 bytes. At 16-byte strides, memory throughput for two threads converged on the performance level of a single thread. This behavior is typical of Pentium-III systems under both Linux with GNU-C and Windows 2000 with MS Visual C++.

When we ran OBLmemband as compiled with the Intel C++ compiler, memory bandwidth increased for all strides. In particular, the profile of throughput performance more closely resembled that of an AMD workstation with PC2100 DDRAM, which is the AMD variant of double-clocked memory on a 266MHz front side bus.

For small 4-byte strides, throughput with one thread soared to 724MB per second as compared to 370MB per second. With two threads, there was only a modest increase to 736MB per second. Moreover, even when we were paging off the disk with 512-byte strides, the Intel-compiled benchmark held a slight 10% performance edge with throughput at 30MB per second.
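The relationships among these 4-byte stride figures are easier to see as ratios. The short sketch below uses only the throughput numbers reported above; the function names are hypothetical and nothing here reproduces OBLmemband itself.

```python
# Peak throughput in MB per second for 4-byte strides, as reported above,
# keyed by the number of active threads.
GNU = {1: 370, 2: 670}
INTEL = {1: 724, 2: 736}


def thread_scaling(results):
    """Two-thread throughput relative to one-thread throughput."""
    return results[2] / results[1]


def compiler_gain(threads):
    """Intel-compiled throughput relative to GNU-compiled throughput."""
    return INTEL[threads] / GNU[threads]


if __name__ == "__main__":
    print(f"GNU, 2 threads vs 1:   {thread_scaling(GNU):.2f}x")
    print(f"Intel, 2 threads vs 1: {thread_scaling(INTEL):.2f}x")
    print(f"Intel vs GNU, 1 thread:  {compiler_gain(1):.2f}x")
    print(f"Intel vs GNU, 2 threads: {compiler_gain(2):.2f}x")
```

Read this way, the GNU build gains roughly 1.8x from a second thread, while the Intel build gains almost nothing because a single Intel-compiled thread already delivers nearly twice the GNU single-thread figure. The second CPU therefore has little headroom left to add.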

The Intel-compiled memory benchmark behaved differently from other tests on Linux and Windows 2000. Throughput for both 1 and 2 threads did not significantly vary on small (4-to-8 byte) strides. On the Tualatin-based Netserver, throughput closely resembled that of an AMD Athlon-based workstation with double-clocked DDRAM.

         
 
SHAKE RATTLE AND ROLL
Keeping heads on track in fast high-density drives involves a lot more than just the drive.
To house the four Ultra160 SCSI drives, we used a Cremax ICY Dock, which provides for four 1-inch hot-swappable drives with 80-pin SCA connectors in a form factor that occupies 3 standard 5.25-inch drive bays. For most end users, disk enclosures may sound like a trivial issue, but disk drive power consumption and heat dissipation are just the tip of the iceberg for environmental requirements. High-performance drives are also very susceptible to vibration, which can come from other drives or components in the system.

With the explosion in storage capacity has come a dramatic increase in the number of data tracks per inch. This in turn lowers the margin of error for keeping the read/write heads on track. For large active RAID arrays, recoverable and non-recoverable error rates can dramatically increase as more drives come on line. Such tracking errors can easily prevent an array from performing at its fullest in a high-volume transaction processing environment. This makes using a proven, well-designed enclosure with solid vibration damping and heat dissipation characteristics essential.

OpenBench Labs configured its test drive array in an ICY Dock, which is designed for use with Ultra160 SCSI drives. The enclosure is all aluminum to improve heat dissipation. In addition, the dock is available with an optional door that has a very effective 92 x 92mm ball bearing cooling fan. There are audio and visual alarms for both fan failure and excessive heat. The temperature alarm can be set for 55°, 65°, or 75°F. More importantly, the ICY Dock uses low-vibration slide rails to mount the drives.

We configured RAID level 5 arrays using both 10K and 15K rpm Seagate Cheetah drives in the Cremax enclosure. We encountered no thermal or vibration resonance problems. Array performance scaled exceptionally well and throughput measurements demonstrated the synergistic effects of a well-designed track cache.

 

We also had the opportunity to test the optional HP NetRAID-2M controller in the Netserver LP 1000r. The NetRAID-2M controller was configured with a 64-MB cache, which had its own on-board battery backup. The NetRAID controller occupied the server’s one optional 64-bit PCI slot. We attached that controller to 4 external Ultra160 SCSI drives from Seagate, which were housed in an ICY Dock from Cremax and configured as a logical RAID5 volume.

When we ran our OBLload benchmark, which stresses the I/O subsystem in a transaction processing simulation, we encountered another intriguing set of results with implications for high-end server I/O. This benchmark simulates a database transaction processing (TP) environment in which a portion of the I/Os are concentrated in a ‘hot spot’ that represents the index tables and the remaining portion of the I/O is randomly distributed across the database. I/O sizes are normally distributed around a mean size of 8KB.

During the test, background thread processes are dispatched that issue their own unique data requests in the same manner. This continues until the average access time exceeds 100 milliseconds. The results of OBLload can then be evaluated in terms of total I/Os per second versus the total number of I/O-issuing processes or in terms of the response time versus the number of I/Os per second being processed.
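The shape of such a workload can be sketched as follows. This is a minimal illustration and not the OBLload source: the 8KB mean size and the 100-millisecond cutoff come from the description above, while the hot-spot share, hot-spot size, size spread, sector rounding, and the toy device model are all assumptions chosen only to make the example runnable.

```python
import random

BLOCK = 512              # assumption: I/O sizes rounded to 512-byte sectors
MEAN_SIZE = 8 * 1024     # from the text: sizes centered on a mean of 8KB
SIZE_SIGMA = 2 * 1024    # assumption: the spread is not given in the article
HOT_FRACTION = 0.5       # assumption: share of I/Os aimed at the hot spot
HOT_REGION = 0.1         # assumption: hot spot is the first 10% of the volume


def next_request(rng, volume_bytes):
    """Return (offset, size) for one simulated transaction I/O."""
    size = max(BLOCK, round(rng.gauss(MEAN_SIZE, SIZE_SIGMA) / BLOCK) * BLOCK)
    if rng.random() < HOT_FRACTION:
        end = int(volume_bytes * HOT_REGION)   # index-table hot spot
    else:
        end = volume_bytes                     # anywhere in the database
    blocks = (end - size) // BLOCK
    return rng.randrange(blocks + 1) * BLOCK, size


def ramp(measure, limit_ms=100.0, max_procs=1000):
    """Add I/O-issuing processes until average access time exceeds limit_ms.

    measure(procs) must return (ios_per_sec, avg_ms) for that process count.
    """
    results = []
    for procs in range(1, max_procs + 1):
        iops, avg_ms = measure(procs)
        results.append((procs, iops, avg_ms))
        if avg_ms > limit_ms:
            break
    return results


def toy_device(procs, service_ms=5.0, max_iops=1500.0):
    """Hypothetical stand-in for a real array: throughput saturates at max_iops."""
    iops = min(procs * 1000.0 / service_ms, max_iops)
    return iops, procs * 1000.0 / iops


if __name__ == "__main__":
    rng = random.Random(42)
    print([next_request(rng, 1 << 30) for _ in range(3)])
    for procs, iops, avg_ms in ramp(toy_device)[-3:]:
        print(procs, iops, round(avg_ms, 1))
```

Each row returned by `ramp` holds the process count, the I/O rate, and the average access time, so the same data supports both views described above: I/Os per second against process count, and response time against I/Os per second.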

 
OpenBench Labs detected measurable performance differences in the Intel- and GNU-compiled versions of its I/O loading benchmark. The benchmark runs multiple parallel threads which simulate user requests for data in a database environment with varying transaction rates. The ability of the Intel version to complete more I/O requests as more I/O process threads are loaded is related to improved memory caching and thread load balancing.
 
     
 

Analyzing the results of OBLload compiled with both Intel C++ and GNU C compilers produced an unanticipated result: The Intel-compiled benchmark code demonstrated a distinct performance edge in TP-oriented I/O performance. Examining the total number of I/Os processed versus the number of processes issuing the I/O requests, we find that with more than 40 process threads, the number of I/Os completed with the Intel-compiled code is about 15% higher. The Intel-compiled code appears to be utilizing the multiprocessor architecture of the HP Netserver LP 1000r more efficiently.

This observation is corroborated when the benchmark results are examined in terms of the average I/O response time versus the number of I/O operations being processed. With a response time of 35ms, performance of the GNU-compiled code hit a wall of 1400 I/Os per second. As the response time increased beyond 35ms, the number of I/O operations that could be completed did not increase.

For a 35ms response time, the I/O completion count for the Intel-compiled benchmark was also 1400 I/O operations per second. Nonetheless, as response time lengthened, the Intel-compiled benchmark continued to complete a growing number of I/O operations. The peak I/O load grew to just under 1600 I/O operations per second before response time exceeded 100ms. In both cases, it should be noted, these performance levels are far in excess of the demands of all but the very largest TP operations.
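One way to relate these two data points is Little's Law, which says that the average number of requests in flight equals throughput multiplied by response time. The sketch below applies it to the figures above. Treating each measurement as a steady state is an assumption made for this illustration, and the function name is hypothetical.

```python
def outstanding_ios(ios_per_sec, response_ms):
    """Little's Law: mean requests in flight = throughput x response time."""
    return ios_per_sec * response_ms / 1000.0


if __name__ == "__main__":
    # 1400 I/Os per second at 35ms: the point where the GNU build levels off.
    print(outstanding_ios(1400, 35))
    # Just under 1600 I/Os per second at the 100ms cutoff for the Intel build.
    print(outstanding_ios(1600, 100))
```

By this estimate about 49 I/Os are in flight at the 35ms mark, which is in the same neighborhood as the 40-plus process threads where the two builds begin to diverge. At the 100ms cutoff the Intel build would be carrying on the order of 160 outstanding I/Os.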

These performance results with the benchmarks compiled with GNU C would in themselves peg the HP Netserver LP 1000r as a superior system. The combination of the Netserver architecture and the Intel-compiled benchmarks raises that performance level to nothing short of stellar. There is still one more element that HP has added to the Netserver LP 1000r: TopTools agents for Linux.

 
         
 

The TopTools agents for Linux rely on SNMP to provide remote management capabilities to a Windows system running the TopTools management console. The management console is not yet available for Linux servers or workstations. TopTools can also be integrated into other network management frameworks such as Tivoli, HP's OpenView and CA's Unicenter.

In this first version, the TopTools agents on Linux are not yet as comprehensive as the agents on Windows, which provide more robust information about the total state of the client through WBEM. Currently, system status information for a Netserver running Linux is limited to just the status of the networking subsystem. Nonetheless, the networking details, which include uptime statistics and live NIC traffic volume, provide all of the basic essentials needed to monitor, if not manage, this server.

 
Within the HP TopTools Management console, drilling down to node Netzilla, our Netserver LP 1000r test server, brings up the same cursory information for the Netserver (albeit with a generic server image which has nothing in common with a 1U rack system) regardless of whether the server is running Windows 2000 or Linux. Drilling down to the detailed status of the server reveals only network interface details for Linux.
 
     
  On our Omnibook test platform running Windows XP Professional, the performance speedup versus MS Visual C++ 6.0 was not quite as dramatic as when running Linux. Here the improvement was on the order of 35%. This was still enough to keep the performance of the Intel-compiled OBLcpu benchmark on Windows XP Pro marginally faster—about 10%—than the performance of the OBLcpu when compiled under Intel and run on SuSE 7.3 Linux.

That becomes even more interesting when Microsoft Visual Studio.NET is brought into the equation. While it is still a beta product, tests of the new compiler have so far proven it to be about 25% faster than the current MS Visual C++. That puts the final performance numbers for OBLcpu compiled under Intel on Linux and OBLcpu compiled under Visual Studio.NET on Windows 2000 dead even. Furthermore, while it is not a reasonable comparison until the new Microsoft Windows .NET servers are shipping and Visual Studio .NET is in production, the installation of the Intel compiler is far faster and easier than the installation of Visual Studio .NET.

To round out our single-processor CPU tests of the Intel compilers, we turned our final attention to a system running with an AMD Athlon CPU. In all previous tests with both the GNU C and MS Visual C++ compilers, AMD Athlon CPUs have consistently performed about 20% faster than a comparably clocked Intel Pentium III CPU, measured as CPU processing power per MHz. When we ran the Intel-compiled version of OBLcpu on the Athlon-powered system, the percent improvement was virtually identical to the results run on the Omnibook.

 
       