THE G3 ON 15K:
OPENBENCH LABS INVESTIGATES HP'S ML350 G3 ENTRY SERVER

by Jack Fegreus, June 18, 2003
|
HP’s entry-level system for the realm of enterprise servers is dubbed the ProLiant ML350 G3, for 3rd generation. The server comes in both tower and rack-mountable configurations. OpenBench Labs tested the 5U rack-mount version of the HP ML350 G3, which was configured with two 2.4GHz Intel Xeon processors with Hyper-Threading Technology (HTT) sporting 512KB L2 caches and 533MHz system bus interfaces. All of this CPU firepower resides on a motherboard built upon the ServerWorks Grand Champion LE chipset. For high availability, the memory subsystem of the HP motherboard utilizes a multi-bit memory ECC algorithm, memory scrubbing, and Chipkill technology, and it supports hot-spare memory configurations to make failures invisible.

This, however, is only the start. HP ProLiant servers have an optional hardware and software module called the Remote Insight Lights-Out Edition. This option provides an embedded hardware-based graphical remote console, which is OS-independent. With this option installed, systems administrators can take full control of a remote server to shut it down, start it up, or perform any other OS function via a remote display, keyboard, and mouse. To make this scheme work, the Remote Insight Lights-Out Edition board has its own processor, an onboard Fast Ethernet chip, and its own independent Web server that provides HTML pages to any browser over a Secure Sockets Layer (SSL) connection.

Yet even without this option installed, the HP ProLiant ML350 G3 has a number of standard features to help monitor server health. For example, the activity LEDs on the front panel of the server do double duty: They change color from green to yellow when a problem develops in the subsystem that they are used to monitor. There is also a unique blue LED that is used to indicate whether the Lights-Out module is activated.
|
Nonetheless, when the specifications of the ServerWorks Grand Champion LE chipset are compared to the Intel E7501 chipset, which was used on the Tyan motherboard found in the recently tested Appro 1224Xi server, the ServerWorks chipset comes up short, at least from a theoretical perspective. Like the Appro server, the ProLiant ML350 G3 uses PC2100 registered ECC DDR RAM with its memory double-clocked at 200MHz. The difference between the two can be found in the clock speed of the Front Side Bus (FSB), which provides the processor with an interface to memory. The Intel E7501 chipset provides a 533MHz interface, which matches the FSB speed of the latest Xeon chips. As a result, the E7501 chipset is capable of delivering 4.27GB per second of data from DDR266 DIMMs. Interestingly, neither server supports PC2700 DDR RAM (double-clocked 166MHz), which is found in workstations powered by Pentium 4 CPUs having a 533MHz FSB.

In contrast, the ServerWorks chipset provides a clock rate of 400MHz, the FSB speed of the earlier generations of Xeon CPUs, for the interface to memory. This pegs the theoretical memory bandwidth of the HP ProLiant ML350 G3 at 3.2GB per second. That's a full 25% less than the theoretical memory-bandwidth potential of a motherboard using the E7501 chipset. Even so, the 3.2GB per second FSB bandwidth of the ServerWorks chipset matches the peak I/O bandwidth for the PCI/PCI-X bridges in the E7501 chipset's "NetBurst microarchitecture."

Since memory latency has become a major bottleneck for achieving high performance for various applications, this seeming anachronism in the ServerWorks motherboard would inevitably put our oblMemBench benchmark into the limelight. Still, the issue of overall I/O balance was a strong indicator that the FSB clocking difference might just be a tempest in a teapot. Equally important, we would also test the HP ProLiant ML350 with the highest-performing SCSI disk drives available: Ultra320 disks spinning at 15,000 rpm.
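
The theoretical bandwidth figures quoted above follow from multiplying the effective FSB clock by the width of the data path. The short sketch below reproduces that arithmetic. It assumes a 64-bit (8-byte) FSB data path and treats the "533MHz" bus as four transfers per clock at 133.33MHz; the function and variable names are ours, chosen purely for illustration.

```python
# Peak FSB bandwidth = effective clock (MHz) x data-path width (bytes).
# Assumption: a 64-bit (8-byte) data path on both chipsets.
BUS_WIDTH_BYTES = 8


def peak_bandwidth_gbs(effective_mhz, width_bytes=BUS_WIDTH_BYTES):
    """Return theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return effective_mhz * 1e6 * width_bytes / 1e9


serverworks = peak_bandwidth_gbs(400)       # Grand Champion LE: 400MHz FSB
e7501 = peak_bandwidth_gbs(1600 / 3)        # "533MHz" FSB = 4 x 133.33MHz
shortfall = 1 - serverworks / e7501         # fraction by which GC-LE trails

print(f"ServerWorks GC-LE: {serverworks:.2f} GB/s")
print(f"Intel E7501:       {e7501:.2f} GB/s")
print(f"Shortfall:         {shortfall:.0%}")
```

These are ceilings set by the bus clock alone; they say nothing about what a real workload will sustain, which is why the measured results matter more than the 25% gap on paper.
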
For our I/O testing, openBench Labs used four Maxtor Atlas 15K Ultra320 drives. These disk drives sport specs that include a seek time of only 3.2ms, a maximum sustained data rate of 75MB per second, and the ability to fulfill 45% more I/Os per second than the Maxtor 10K Ultra320 drives that were used when testing I/O on the Appro server. To meet the I/O demands placed on an enterprise server, our HP ML350 featured a hot-swap drive bay certified for six Ultra320 drives.

Interestingly, our server, which was configured on the cusp of an I/O upgrade for the ProLiant ML350 G3, had Ultra320 drives installed. The embedded controllers, however, were Ultra160 SCSI. New configurations of the HP ProLiant ML350 G3 server now feature 2.8GHz Xeon CPUs and have Ultra320 SCSI controllers installed on their motherboards. To get around the Ultra160 limitation of our test system, we utilized one of the four 100MHz PCI-X expansion slots provided by the HP ProLiant ML350 G3 to install an Adaptec 39320D-R Ultra320 SCSI RAID controller. This controller provides host-based RAID levels 0 or 1 under Windows; however, the current Linux driver does not support this functionality. Any RAID configuration on Linux will have to be done from within the operating system. We then disconnected the hot-swap drive cage from the embedded Ultra160 SCSI controller and connected it to the Adaptec Ultra320 controller.
|
We then proceeded to install SuSE Linux 8.2 on the server. Installation was utterly trivial. The distribution came with all of the drivers needed to deal with the ServerWorks motherboard, including the embedded Broadcom NetXtreme gigabit auto-switching dual-port NIC. In addition, the distribution also included the latest driver from Adaptec for the 39320 family of Ultra320 SCSI controllers. As a result, we did not need to use the HP SmartStart software package to encapsulate the installation of our OS.

Given the intriguing differences in the underlying chipsets that characterize the motherboards of the Appro and HP servers, the immediate question that we needed to answer was whether the difference in FSB frequency would translate into a difference in CPU processing power. To answer that question, we ran our oblCPU benchmark compiled under gcc v3.3 (oblCPU v2.0) and Intel C++ v7.0 (oblCPU v2.0i). The benchmark normalizes all results to the performance of a 600MHz Pentium-III (value of 100) and pegs a 1.26GHz P-III in an HP NetServer as 2.07 times as powerful as our 600MHz CPU.
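
To make the normalization concrete, here is a minimal sketch of how a score indexed to a baseline of 100 can be computed from run times. This is our assumption about the general approach, not the actual oblCPU code; the function name and both timing values are hypothetical, picked only so that the ratio matches the 2.07 figure quoted above.

```python
REFERENCE_INDEX = 100.0   # score assigned to the 600MHz Pentium-III baseline


def normalized_index(reference_seconds, measured_seconds):
    """Scale a run time against the baseline; faster runs score higher."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return REFERENCE_INDEX * reference_seconds / measured_seconds


# Hypothetical timings, chosen only to reproduce the 2.07x ratio in the text.
baseline_seconds = 207.0    # made-up run time on the 600MHz P-III
p3_1260_seconds = 100.0     # made-up run time on the 1.26GHz P-III

print(normalized_index(baseline_seconds, p3_1260_seconds))
```

On this scale, a system that finishes in the same time as the baseline scores 100, and one that finishes in half the time scores 200.
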
|
|
As we had found in previous tests of the Appro dual-Xeon server, recompiling our benchmark with GNU C 3.3 and adding the optimization switch -msse2 to the -funroll-loops and -O3 switches produced only a marginal improvement over the test results achieved with GNU C 3.0 or, for that matter, Visual Studio 6.0 on Windows. In both cases, nothing in the compilation process took advantage of HTT-enabled processors.
|
With our benchmark compiled with GNU C v3.3 or MS Visual C++ v6.0, our 2.4GHz Xeon processor was crunching numbers no faster than a 1.5GHz Pentium-III CPU. On the other hand, we were able to increase CPU performance by more than 180% by recompiling our benchmark with the Intel 7.0 compiler and implementing the following switch options: -axW, -ip, and -O3. For the Intel C++ compiler, the -ip switch introduces interprocedural optimization, which improves performance by "inlining" the code of frequently called functions to eliminate the overhead associated with branching. The -O3 switch unrolls loops, similar to -funroll-loops in GNU C, and enables the prefetching of data into the copious cache of the Xeon CPU.
|
|
The biggest jump in performance comes from the -axW switch, which enables the advanced Streaming SIMD Extensions 2 (SSE2) for Pentium 4 and Xeon processors. The compiler detects patterns of sequential data accesses by the same instruction and transforms that code for Single Instruction Multiple Data (SIMD) execution by the Xeon CPU. As a result, the performance of the HP ProLiant ML350 G3 jumped over 180% when running oblCPU v2.0i. More importantly, CPU performance of the HP server was statistically identical to that of the Appro server, despite the difference in FSB clock speed: 400MHz vs. 533MHz.

Having measured no CPU performance difference between the HP and Appro servers, we next focused our attention explicitly on memory bandwidth. To achieve automatic 2-way interleaving of memory on the HP ProLiant ML350 G3 server, DIMMs must be installed in identical pairs. We fully anticipated that to meet the server's memory bandwidth potential of 3.2GB per second, we would have to configure memory for 2-way interleaving.
|
Our hunch was correct. Using three 256MB DIMMs, which rules out 2-way memory interleaving, we measured a 28% drop in peak throughput from over 1.8GB per second down to just over 1.3GB per second when performing 2-byte strides through memory. There was an even greater disparity in memory throughput for small strides using a binary of the benchmark compiled with GNU C v3.3 instead of a binary compiled with Intel C++ v7.0. The gap in performance with different compilers peaked at 35%. In both of these test scenarios (2-way memory interleaving and compiler awareness of HTT), performance differences between the memory subsystems collapsed as stride size increased. By the time strides were greater than 32 bytes, there was no statistically significant difference. Clearly the double whammy of failing to run HTT-aware executables in a configuration where DIMMs are not installed in pairs dramatically degrades performance on this server.

A similar pattern of stride-size dependence arose when comparing the memory bandwidth of the HP ProLiant ML350 G3 with that of the Appro 1224Xi. Only for 2- and 4-byte strides was there a significant difference. Beyond 8-byte strides, performance differences attributable to variations in FSB clocking disappeared. What's more, for strides larger than 32 bytes, throughput on the HP server surpassed that of the Appro server. Longer strides induced page swapping, and the faster SCSI disks on the HP server became the dominating influence.
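
For readers unfamiliar with stride testing, the sketch below shows the basic idea: walk a buffer touching every Nth byte and time the pass. It is not oblMemBench, which is a compiled program; the function name, the 4MB buffer size, and the repeat count are all our own assumptions. In Python the interpreter overhead dominates, so the absolute numbers it prints say nothing about memory hardware. Only the shape of the loop carries over.

```python
import time


def strided_read_rate(buf, stride, repeats=5):
    """Return (bytes_touched, best_seconds) for one strided pass over buf."""
    view = memoryview(buf)
    touched = len(range(0, len(buf), stride))   # number of bytes visited
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sum(view[::stride])                     # visits every stride-th byte
        best = min(best, time.perf_counter() - start)
    return touched, best


if __name__ == "__main__":
    data = bytearray(4 * 1024 * 1024)           # 4MB scratch buffer (arbitrary)
    for stride in (2, 4, 8, 16, 32, 64):
        touched, secs = strided_read_rate(data, stride)
        print(f"stride {stride:3d}: {touched / secs / 1e6:8.1f} MB/s touched")
```

A compiled benchmark with the same access pattern would expose the effects discussed above: small strides stress raw sequential bandwidth, while larger strides skip over more of each cache line and shift the bottleneck elsewhere.
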
|
|
Next we began testing the server’s capability to handle large amounts of I/O. To this end, we installed an Adaptec 39320 series Ultra320 SCSI controller in one of the server's open PCI-X slots. We then connected the controller to the HP server's hot-swap drive cage. Into the cage we then put 4 Maxtor Atlas 15K Ultra320 drives. These drives feature Maxtor's 2nd-generation Ultra320 SCSI interface. Like the Atlas 10K drives tested earlier by openBench Labs, these drives incorporate Maxtor's MaxAdapt circuitry, which dynamically attempts to improve signal quality on the SCSI bus by amplifying signal frequencies while simultaneously filtering noise. Clean electrical signals are of the utmost importance in an Ultra320 SCSI configuration.
|
Ultra320 SCSI was designed to do a much better job maximizing bus utilization and minimizing command overhead. One of the more interesting improvements comes in the form of a SCSI packet protocol, which allows multiple commands to be transferred in a single connection along with data and status information. By contrast, Ultra160 SCSI handles command and status information using slower asynchronous connections. The resulting performance improvement in Ultra320 SCSI over Ultra160 SCSI does have one drawback: the throughput from an Ultra320 SCSI controller can saturate a standard 64-bit PCI bus, which is clocked at 66MHz. To effectively run Ultra320 SCSI, 2Gb Fibre Channel, or InfiniBand I/O, you'll need to use a server with a good PCI-X architecture. In the case of the HP ProLiant ML350 G3, there are 4 open PCI-X slots, all of which are clocked at 100MHz, available for expansion cards. What's more, the newest version of this server now comes with an embedded Ultra320 controller.

Before testing the Maxtor 15K drives in a RAID configuration, we first ran our disk benchmarks against a single drive and compared the results to a 15K Ultra160 drive connected to one of the server's embedded Ultra160 SCSI controllers. For a single drive, the results were eye-popping. Compared to our Ultra160 reference drive, which was formatted with ReiserFS (RFS), streaming throughput doubled when the Ultra320 drive was formatted with ReiserFS or XFS. In terms of read and write throughput, the performance of these two file systems was statistically identical. Once again the results for all four journaled file systems (ReiserFS, XFS, JFS, and ext3) followed the same patterns originally measured when openBench Labs tested an external Ultra160 SCSI subsystem from Adaptec.
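
The bus-saturation point made above comes down to simple arithmetic, sketched here. The 320MB per second per-channel figure is the Ultra320 peak rate; the dual-channel count is our assumption about the host adapter, and the function name is ours. Bus peaks are computed from nominal clocks (66MHz and 100MHz), so they are theoretical ceilings, not measured rates.

```python
ULTRA320_CHANNEL_MBS = 320   # Ultra320 peak rate per SCSI channel, in MB/s
CHANNELS = 2                 # assumption: a dual-channel host adapter


def bus_peak_mbs(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus: bytes per clock times clock rate."""
    return width_bits / 8 * clock_mhz


scsi_peak = ULTRA320_CHANNEL_MBS * CHANNELS   # combined SCSI peak in MB/s
pci_64_66 = bus_peak_mbs(64, 66)              # standard 64-bit/66MHz PCI
pcix_100 = bus_peak_mbs(64, 100)              # 64-bit/100MHz PCI-X slot

for name, peak in (("64-bit/66MHz PCI", pci_64_66),
                   ("64-bit/100MHz PCI-X", pcix_100)):
    verdict = "saturated" if scsi_peak > peak else "headroom left"
    print(f"{name}: {peak:.0f} MB/s peak vs {scsi_peak} MB/s SCSI peak -> {verdict}")
```

Under these assumptions, two busy Ultra320 channels can outrun a 66MHz PCI bus, while a 100MHz PCI-X slot still has room to spare.
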
|
|
Equally impressive for a single drive were the results measured with our I/O loading benchmark: oblLoad. This benchmark launches multiple parallel threads to simulate requests for data from multiple users in a database environment. During the test, background thread processes are dispatched to issue unique requests for 8KB of data in a manner that simulates the access patterns of relational database queries. This process continues until the average access time for all threads exceeds 100 milliseconds. |
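
The load-ramping scheme described above can be sketched in a few lines. This is only an illustration of the control flow, not oblLoad itself: the function names, the doubling schedule, the 1MB scratch file, and the use of random (rather than guaranteed-unique) block offsets are all our assumptions. Because such a small file sits entirely in the OS cache, the 100-millisecond limit will normally never be reached here; a real test would target a data set far larger than memory.

```python
import os
import random
import tempfile
import threading
import time

BLOCK_SIZE = 8 * 1024      # 8KB requests, as described in the text
LATENCY_LIMIT_S = 0.100    # stop once average access time exceeds 100ms


def timed_reads(path, file_size, count, samples, lock):
    """Worker thread: issue `count` random 8KB reads and time each one."""
    local = []
    with open(path, "rb", buffering=0) as f:
        for _ in range(count):
            # Pick a block-aligned offset somewhere in the file.
            offset = random.randrange(0, file_size - BLOCK_SIZE + 1, BLOCK_SIZE)
            start = time.perf_counter()
            f.seek(offset)
            f.read(BLOCK_SIZE)
            local.append(time.perf_counter() - start)
    with lock:
        samples.extend(local)


def average_access_time(path, n_threads, reads_per_thread=100):
    """Run n_threads concurrent readers; return mean access time in seconds."""
    file_size = os.path.getsize(path)
    samples, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=timed_reads,
                         args=(path, file_size, reads_per_thread, samples, lock))
        for _ in range(n_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(samples) / len(samples)


def ramp_load(path, max_threads=64, limit_s=LATENCY_LIMIT_S):
    """Double the thread count until mean access time exceeds limit_s."""
    results = []
    n = 1
    while n <= max_threads:
        avg = average_access_time(path, n)
        results.append((n, avg))
        if avg > limit_s:
            break
        n *= 2
    return results


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1024 * 1024))      # small 1MB scratch file
    try:
        for threads, avg in ramp_load(tmp.name, max_threads=8):
            print(f"{threads:2d} threads: {avg * 1000:.3f} ms average access time")
    finally:
        os.remove(tmp.name)
```

The quantity of interest is the load level reached when the latency limit trips: the more concurrent requests a disk subsystem absorbs before crossing 100 milliseconds, the better it will hold up under a multi-user database workload.
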