THE G3 ON 15K:

OPENBENCH LABS INVESTIGATES HP'S ML350 G3 ENTRY SERVER

   
 
by Jack Fegreus

June 18, 2003
   
     
 

HP’s entry-level system for the realm of enterprise servers is dubbed the ProLiant ML350 G3, for 3rd generation. The server comes in both tower and rack-mountable configurations. OpenBench Labs tested the 5U rack-mount version of the HP ML350 G3, which was configured with two 2.4GHz Intel Xeon processors with Hyper-Threading Technology (HTT) sporting 512KB L2 caches and 533MHz system bus interfaces. All of this CPU firepower resides on a motherboard built upon the ServerWorks Grand Champion LE chipset.

For high availability, the memory subsystem of the HP motherboard utilizes a multi-bit memory ECC algorithm, memory scrubbing, and Chipkill technology, and it supports hot-spare memory configurations to make failures invisible. This, however, is only the start.

HP ProLiant servers have an optional hardware and software module that is called the Remote Insight Lights-Out Edition. This option provides an embedded hardware-based graphical remote console, which is OS-independent. With this option installed, systems administrators can take full control of a remote server to shut it down, start it up, or perform any other OS function via a remote display, keyboard, and mouse. To make this scheme work, the Remote Insight Lights-Out Edition board has its own processor, an onboard Fast Ethernet chip, and its own independent Web Server that provides HTML pages to any browser over a Secure Sockets Layer (SSL) connection.

Yet even without this option installed, the HP ProLiant ML350 G3 has a number of standard features to help monitor server health. For example, the activity LEDs on the front panel of the server do double duty: They change color from green to yellow when a problem develops in the subsystem that they are used to monitor. There is also a unique blue LED that is used to indicate whether the Lights-Out module is activated.

 
         
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
Dual-processor Xeon server

WHAT WE TESTED

HP ProLiant ML350 G3 server
Dual 2.4GHz Intel Xeon CPUs
1GB DDR memory
(2) Embedded Ultra160* SCSI controllers
Ultra320 SCSI hot-swap drive cage
(2) Broadcom 10/100/1000-Mbit NICs
(4) 100MHz PCI-X expansion slots
ATI Rage XL graphics controller

(4) Maxtor Atlas 15K Ultra320 disk drives
3.2ms seek time
75MB/s data throughput


HOW WE TESTED

SuSE Linux Professional v 8.2
Linux kernel 2.4.20
GCC 3.3
KDE 3.1



Intel C++ for Linux v 7.0
Free license for non-commercial development



Adaptec 39320D-R
133MHz PCI-X
Ultra320 SCSI
Host RAID (0,1)



Benchmarks:
oblCPU v2.6 (zip)
oblCPU v2.6i (zip)
oblMemBench v1.0i
oblDisk v1.0i
oblLoad v1.0i

KEY FINDINGS

Performance of the ServerWorks-based HP ML350 G3 is on a par with dual-Xeon systems using the Intel E7501 chipset.
I/O throughput is measurably enhanced with 15K Ultra320 SCSI drives over 10K Ultra320 SCSI drives.
SuSE 8.2 Linux requires no additional drivers to load on this system.
 

Nonetheless, when the specifications of the ServerWorks Grand Champion LE chipset are compared to the Intel E7501 chipset, which was used on the Tyan motherboard found in the recently tested Appro 1224Xi server, the ServerWorks chipset comes up short—at least from a theoretical perspective. Like the Appro server, the ProLiant ML350 G3 uses PC2100 registered ECC DDR RAM with its memory double-clocked at 200MHz.

The difference between the two can be found in the clock speed of the Front Side Bus (FSB), which provides the processor with an interface to memory. The Intel E7501 chipset provides a 533MHz interface, which matches the FSB speed of the latest Xeon chips. As a result, the E7501 chipset is capable of delivering 4.27GB/s of data from DDR266 DIMMs. Interestingly, neither server supports PC2700 DDR RAM (double-clocked 166MHz), which is found in workstations powered by Pentium-4 CPUs having a 533MHz FSB.

In contrast, the ServerWorks chipset provides a clock rate of 400MHz, the FSB speed of the earlier generations of Xeon CPUs, for the interface to memory. This pegs the theoretical memory bandwidth of the HP ProLiant ML350 G3 at 3.2GB per second. That's a full 25% less than the theoretical memory-bandwidth potential of a motherboard using the E7501 chipset. Nonetheless, that 3.2GB per second FSB bandwidth of the ServerWorks chipset matches the peak I/O bandwidth for the PCI/PCI-X bridges in the E7501 chipset's "NetBurst microarchitecture."
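The bandwidth figures quoted here follow from simple arithmetic. The sketch below assumes a 64-bit (8-byte) FSB data path and treats the nominal "533MHz" bus as four transfers per 133.33MHz clock; both are assumptions on our part rather than figures stated in this review, and the function name is purely illustrative.

```python
def fsb_bandwidth_gb_s(fsb_mhz, bus_bytes=8):
    """Peak bandwidth in decimal GB/s for a bus moving bus_bytes per transfer."""
    return fsb_mhz * 1e6 * bus_bytes / 1e9


# Assumption: the nominal 533MHz FSB is 4 x 133.33MHz.
e7501 = fsb_bandwidth_gb_s(4 * 400 / 3)
# The ServerWorks chipset's 400MHz interface, as quoted in the text.
gc_le = fsb_bandwidth_gb_s(400)
# Fraction of the E7501's theoretical bandwidth that the 400MHz bus gives up.
deficit = 1 - gc_le / e7501

print(f"E7501 chipset:     {e7501:.2f} GB/s")
print(f"Grand Champion LE: {gc_le:.2f} GB/s")
print(f"Deficit:           {deficit:.0%}")
```

Run as written, this reproduces the 4.27GB/s and 3.2GB/s figures and the 25% gap between them.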

Since memory latency has become a major bottleneck for achieving high performance for various applications, this seeming anachronism in the ServerWorks motherboard would inevitably put our oblMemBench benchmark into the limelight. Nonetheless, the issue of overall I/O balance was a strong indicator that the FSB clocking difference might just be a tempest in a teapot.

Equally important, we would also test the HP ProLiant ML350 G3 with the highest-performing SCSI disk drives available: Ultra320 disks spinning at 15,000 rpm. For our I/O testing, openBench Labs used four Maxtor Atlas 15K Ultra320 drives. These disk drives sport specs that include a seek time of only 3.2ms, a maximum sustained data rate of 75MB per second, and the ability to fulfill 45% more I/Os per second than the Maxtor 10K Ultra320 drives that were used when testing I/O on the Appro server.

To meet the I/O demands on an enterprise server, our HP ML350 featured a hot-swap drive bay certified for 6 Ultra320 drives. Interestingly, our server, which was configured on the cusp of an I/O upgrade for the ProLiant ML350 G3, had Ultra320 drives installed. The embedded controllers, however, were Ultra160 SCSI. New configurations of the HP ProLiant ML350 G3 server now feature 2.8GHz Xeon CPUs and have Ultra320 SCSI controllers installed on their motherboards.

To get around this problem, we utilized one of the four 100MHz PCI-X expansion slots provided by the HP ProLiant ML350 G3 to install an Adaptec 39320D-R Ultra320 SCSI RAID controller. This controller provides host-based RAID levels 0 or 1 under Windows; however, the current Linux driver does not support this functionality, so any RAID configuration has to be done from within the operating system. We then disconnected the hot-swap drive cage from the embedded Ultra160 SCSI controller and connected it to the Adaptec Ultra320 controller.

 
         
 

We then proceeded to install SuSE Linux 8.2 on the server. Installation was utterly trivial. The distribution came with all of the drivers needed for the ServerWorks motherboard, including the embedded Broadcom NetXtreme gigabit auto-switching dual-port NIC. In addition, the distribution also includes the latest driver from Adaptec for the 39320 family of Ultra320 SCSI controllers. As a result, we did not need to use the HP SmartStart software package to encapsulate the installation of our OS.

Given the intriguing differences in the underlying chipsets that characterize the motherboards of the Appro and HP servers, the immediate question that we needed to answer was whether the difference in FSB frequency would translate into a difference in CPU processing power. To answer that question, we ran our oblCPU benchmark compiled under gcc v3.3 (oblCPU v2.6) and Intel C++ v7.0 (oblCPU v2.6i). The benchmark normalizes all results to the performance of a 600MHz Pentium-III (value of 100) and pegs a 1.26GHz P-III in an HP NetServer as 2.07 times as powerful as our 600MHz CPU.
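The normalization can be illustrated with a short sketch. The review does not give oblCPU's actual formula, so this assumes the score is the reference machine's run time divided by the measured run time, scaled so that the 600MHz Pentium-III scores 100. The timings are made-up placeholders, chosen only so that the faster P-III comes out 2.07 times as powerful.

```python
REFERENCE_SCORE = 100.0  # the 600MHz Pentium-III scores 100 by definition


def normalized_score(reference_seconds, measured_seconds):
    """Score a run relative to the reference machine (assumed formula)."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return REFERENCE_SCORE * reference_seconds / measured_seconds


# Hypothetical run times in seconds; not measurements from this review.
reference_time = 207.0    # 600MHz Pentium-III
netserver_time = 100.0    # faster P-III in the HP NetServer

score = normalized_score(reference_time, netserver_time)
print(f"Normalized score: {score:.0f}")
print(f"Relative power:   {score / REFERENCE_SCORE:.2f}x")
```

Under this assumed formula, a machine that finishes the work in half the reference time scores 200, so higher is always better.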

 
Dual Xeon CPUs
Up to 8GB PC2100 DDR RAM, interleaved
(4) 100MHz PCI-X expansion slots
Ultra320 hot-swap drive cage
Featuring a tool-less entry chassis, the HP ProLiant ML350 G3 is quite easy to configure and modify. There are 5 open slots for add-in cards, four of which support PCI-X cards at 100MHz. In addition, a hot-swap drive cage supports 6 Ultra320 drives, and there is room for 8GB of PC2100 registered ECC DDR RAM.
 
     
 

As we had found in previous tests of the Appro dual-Xeon server, recompiling our benchmark with GNU C 3.3 and adding the optimization switch -msse2 to the -funroll-loops and -O3 switches produced only a marginal improvement in the test results achieved with GNU C 3.0 or, for that matter, Visual Studio 6.0 on Windows. In both cases, nothing in the compilation process took advantage of HTT-enabled processors.

 
         
 

With our benchmark compiled with GNU C v3.3 or MS Visual C++ v6.0, our 2.4GHz Xeon processor was crunching numbers no faster than a 1.5GHz Pentium-III CPU. On the other hand, we were able to increase CPU performance by more than 180% by recompiling our benchmark with the Intel 7.0 compiler and implementing the following switch options: -axW, -ip, and -O3.

For the Intel C++ compiler, the -ip switch introduces interprocedural optimization, which improves performance by "inlining" the code of frequently called functions to eliminate the overhead associated with branching. The -O3 switch unrolls loops, similar to -funroll-loops in GNU C, and enables the prefetching of data into the copious cache of the Xeon CPU.

 
Statistically, the results of our CPU benchmarks on the HP ProLiant ML350 G3 were identical to the results obtained on the Appro 1224Xi server.
 
     
 

The biggest jump in performance comes from the -axW switch, which introduces the advanced Streaming SIMD Extensions (SSE2) for Pentium 4 and Xeon processors. The compiler detects patterns of sequential data accesses by the same instruction and transforms that code for Single Instruction Multiple Data (SIMD) execution by the Xeon CPU. As a result, the performance of the HP ProLiant ML350 G3 jumped over 180% when running oblCPU v2.6i. More importantly, CPU performance of the HP server was statistically identical to that of the Appro server, despite the difference in FSB clock speed: 400MHz vs. 533MHz.

Having measured no CPU performance difference between the HP and Appro servers, we next focused our attention explicitly on memory bandwidth. To achieve automatic 2-way interleaving of memory on the HP ProLiant ML350 G3 server, DIMMs must be installed in identical pairs. We fully anticipated that to meet the server's memory bandwidth potential of 3.2GB per second, we would have to configure 2-way memory interleaving.

 
         
 

Our hunch was correct.

Without 2-way memory interleaving using three 256MB DIMMs, we measured a 28% drop in peak throughput from over 1.8GB per second down to just over 1.3GB per second when performing 2-byte strides through memory. There was an even greater disparity in memory throughput for small strides using a binary of the benchmark compiled with GNU C v3.3 instead of a binary compiled with Intel C++ v7.0. The gap in performance with different compilers peaked at 35%.

In both of these test scenarios (2-way memory interleaving and compiler awareness of HTT), performance differences between the memory subsystems collapsed as stride size increased. By the time strides were greater than 32 bytes, there was no statistically relevant difference. Clearly the double whammy of failing to run HTT-aware executables in a configuration where DIMMs are not installed in pairs dramatically degrades performance on this server.

 A similar pattern of stride-size dependence arose when comparing the memory bandwidth of the HP ProLiant ML350 G3 with that of the Appro 1224Xi. Only for 2- and 4-byte strides was there a significant difference. Beyond 8-byte strides, performance differences attributable to variations in FSB clocking disappeared. What's more, for strides larger than 32 bytes, throughput on the HP server surpassed the Appro server. Longer strides induced page swapping and the faster SCSI disks on the HP server became the dominating influence. 
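For readers unfamiliar with stride tests, the access pattern can be sketched as follows. This is not the oblMemBench source, which is not reproduced in this review; it is a Python illustration in which interpreter overhead dominates, so the rates it prints say nothing about hardware memory bandwidth. Only the pattern carries over: touch one byte every `stride` bytes across a buffer and divide the number of touches by the elapsed time. The function names and the 1MB buffer size are arbitrary choices.

```python
import time


def stride_walk(buffer, stride):
    """Read one byte every `stride` bytes; return (touches, checksum)."""
    checksum = 0
    touches = 0
    for i in range(0, len(buffer), stride):
        checksum += buffer[i]
        touches += 1
    return touches, checksum


def touches_per_second(buffer, stride):
    """Time one strided pass over the buffer and return the touch rate."""
    start = time.perf_counter()
    touches, _ = stride_walk(buffer, stride)
    elapsed = time.perf_counter() - start
    return touches / elapsed if elapsed > 0 else float("inf")


# 1MB buffer of ones; a native benchmark would use a buffer far larger
# than the CPU caches so that accesses actually reach main memory.
buf = bytearray(b"\x01") * (1 << 20)

for stride in (2, 4, 8, 16, 32, 64):
    rate = touches_per_second(buf, stride)
    print(f"stride {stride:2d} bytes: {rate:,.0f} touches/s")
```

In a compiled benchmark, small strides reuse each cache line many times while large strides fetch a new line on nearly every access, which is why results for different stride sizes can diverge.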

 
When the HP ProLiant ML350 G3 was configured to provide memory interleaving, we measured significant differences in throughput between the ServerWorks-based HP ProLiant ML350 G3 and the Intel E7501-based Appro 1224Xi only for 4-byte strides.
 
         
 

Next we began testing the server’s capability to handle large amounts of I/O. To this end, we installed an Adaptec 39320 series Ultra320 SCSI controller in one of the server's open PCI-X slots. We then connected the controller to the HP server's hot-swap drive cage, into which we put 4 Maxtor Atlas 15K Ultra320 drives. These drives feature Maxtor's 2nd-generation Ultra320 SCSI interface. Like the Atlas 10K drives tested earlier by openBench Labs, these drives incorporate Maxtor's MaxAdapt circuitry, which dynamically attempts to improve signal quality on the SCSI bus by amplifying signal frequencies while simultaneously filtering noise. Clean electrical signals are of the utmost importance in an Ultra320 SCSI configuration.

 
 
         
 

Ultra320 SCSI was designed to do a much better job maximizing bus utilization and minimizing command overhead. One of the more interesting improvements comes in the form of a SCSI packet protocol, which allows multiple commands to be transferred in a single connection along with data and status information. Ultra160 SCSI handles command and status information using slower asynchronous connections.

The resulting performance improvement in Ultra320 SCSI over Ultra160 SCSI does have one drawback: the throughput from an Ultra320 SCSI controller can saturate a standard 64-bit PCI bus, which is clocked at 66MHz. To effectively run Ultra320 SCSI, 2Gb Fibre Channel, or InfiniBand I/O, you'll need to use a server with a good PCI-X architecture. In the case of the HP ProLiant ML350 G3, there are 4 open PCI-X slots, all of which are clocked at 100MHz, available for expansion cards. What's more, the newest version of this server now comes with an embedded Ultra320 controller.
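The saturation argument comes down to comparing peak bus rates. The sketch below assumes a dual-channel controller, with each Ultra320 channel rated at 320MB per second, and uses the nominal 66MHz and 100MHz clocks; the channel count and the function name are our assumptions for illustration.

```python
def bus_mb_per_s(width_bits, clock_mhz):
    """Peak transfer rate in MB/s for a parallel bus, one transfer per clock."""
    return width_bits / 8 * clock_mhz


ULTRA320_CHANNEL_MB_S = 320   # rated peak per Ultra320 channel
CHANNELS = 2                  # assumption: dual-channel controller

pci_66 = bus_mb_per_s(64, 66)        # standard 64-bit/66MHz PCI
pcix_100 = bus_mb_per_s(64, 100)     # 64-bit/100MHz PCI-X
controller_peak = ULTRA320_CHANNEL_MB_S * CHANNELS

print(f"64-bit/66MHz PCI:    {pci_66:.0f} MB/s")
print(f"64-bit/100MHz PCI-X: {pcix_100:.0f} MB/s")
print(f"Controller peak:     {controller_peak} MB/s")
print("Saturates PCI:  ", controller_peak > pci_66)
print("Fits in PCI-X:  ", controller_peak <= pcix_100)
```

With the nominal 66MHz clock the PCI figure comes out at 528MB per second (often quoted as 533MB per second using 66.67MHz); either way, two fully loaded Ultra320 channels exceed it, while a 100MHz PCI-X slot leaves headroom.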

Before testing the Maxtor 15K drives in a RAID configuration, we first ran our disk benchmarks against a single drive and compared the results to a 15K Ultra160 drive connected to one of the server's embedded Ultra160 SCSI controllers. For a single drive, the results were eye-popping.

Compared to our Ultra160 reference drive, which was formatted with ReiserFS (RFS), streaming throughput literally doubled when the Ultra320 drive was formatted with ReiserFS or XFS. In terms of read and write throughput, the performance of these two file systems was statistically identical. Once again, the results for all four journaled file systems (ReiserFS, XFS, JFS, and ext3) followed the same patterns originally measured when openBench Labs tested an external Ultra160 SCSI subsystem from Adaptec.

 
Throughput tests on the single Ultra320 drive doubled the performance of an Ultra160 drive and rivaled the performance that previously could only be achieved with a multi-drive array (ref: Adaptec Ultra160 SCSI array). Similar results occurred with I/O loading. For critical fast response times (less than 25ms), our oblLoad benchmark was able to deliver twice the number of I/O operations per second.
 
     
 

Equally impressive for a single drive were the results measured with our I/O loading benchmark: oblLoad. This benchmark launches multiple parallel threads to simulate requests for data from multiple users in a database environment. During the test, background thread processes are dispatched to issue unique requests for 8KB of data in a manner that simulates the access patterns of relational database queries. This process continues until the average access time for all threads exceeds 100 milliseconds.
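The ramp-up logic can be sketched as follows. This is not oblLoad itself: `toy_disk_model` is a hypothetical stand-in for real measurements, and its parameters (a 5ms service time with 4 requests serviced in parallel) are invented for illustration. Only the stopping rule, adding simulated users until the average access time exceeds 100ms, is taken from the description of the benchmark.

```python
LIMIT_MS = 100.0  # stop once average access time exceeds this


def toy_disk_model(threads, service_ms=5.0, parallel=4):
    """Hypothetical model: response time grows once threads exceed capacity."""
    return service_ms * max(1.0, threads / parallel)


def ramp_load(measure_avg_ms, limit_ms=LIMIT_MS, max_threads=10_000):
    """Add one thread at a time until the average access time exceeds the limit.

    measure_avg_ms is any callable that returns the average access time in
    milliseconds for a given thread count; a real harness would issue 8KB
    reads from that many threads and time them.
    """
    results = []
    for threads in range(1, max_threads + 1):
        avg_ms = measure_avg_ms(threads)
        results.append((threads, avg_ms))
        if avg_ms > limit_ms:
            break
    return results


results = ramp_load(toy_disk_model)
last_threads, last_ms = results[-1]
print(f"Stopped at {last_threads} threads, {last_ms:.2f}ms average access time")
```

Passing the measurement in as a callable keeps the stopping rule separate from whatever actually generates the I/O, so the same loop could drive a real threaded reader.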

 
         
 

The results of oblLoad can be evaluated in terms of total I/Os per second versus the total number of I/O-issuing processes or in terms of the response time versus the number of I/Os per second being processed. The ability of a system to complete more I/O requests as more I/O process threads are loaded is dependent upon data caching and thread load balancing. In the past, the results of this test were almost universally disappointing on Linux because of a lack of specialized support in the kernel for asynchronous I/O, a strong point for Windows NT and its derivatives.

On the HP ProLiant ML350 G3 server using the Adaptec Ultra320 controller, we were able to process, at critical sub-25ms response times, an I/O load (IOPS) that was more than double the I/O load supported by our reference Ultra160 SCSI subsystem. In addition, the response time associated with a particular I/O load was typically 35% faster.

In a 4-drive RAID level 0 configuration, the Atlas 15K drives performed measurably better in both streaming throughput and I/O loading tests when compared to a similar array using Atlas 10K drives. Once again, relative performance of the different file systems remained the same with the ReiserFS and XFS in a dead heat for the best performance in streaming I/O and JFS providing the best transaction processing results.

For streaming I/O, reads on the 15K drive-based RAID array posted 35% greater throughput. While not as dramatically improved, write throughput also increased typically on the order of 10%. On I/O loading, performance on our RAID array with 15K drives scaled linearly from a single drive. With average response times that were under 35ms, we were able to deliver over 2,600 I/O operations per second to over 250 simulated users.

That's some entry-level server!

 
 
Using the 15K Atlas drives, streaming performance improved on the order of 35% on reads. IOPS scaled linearly, and we were able to satisfy over 2,600 requests per second from over 250 simulated users with an average access time of less than 35ms.