TECHNICAL WORKAHOLIC
  Doubling down on dual AMD Palomino processors, DDRAM, Ultra160 SCSI, and 2.4 kernel-based Red Hat Linux 7.1, ASL brings to market a first-class technical workstation.    
         
  by Jack Fegreus      
 
 

Dubbed the Marquis K120, this workstation from ASL Inc. is built on a Tyan Thunder K7 motherboard. Back in January of this year, OpenBench Labs reviewed our first AMD-based system running on a beta kernel from Red Hat (see Open, January 2001, “No Risk RISC”). That system was built on a single 850 MHz Athlon processor sporting a superpipelined, nine-issue superscalar architecture and a double-clocked (2 x 100 MHz) system bus.

Athlon’s high-speed execution core includes multiple x86 instruction decoders, three independent 10-stage integer pipelines, and three address calculation pipelines. More importantly, the Athlon is distinguished by the first x86-compatible floating point engine with a 15-stage pipeline and multiple schedulers for superscalar, out-of-order, speculative execution of instructions.

 
   
 
OPENBENCH LABS SCENARIO
UNDER EXAMINATION
AMD Athlon-based workstation performance

WHAT WE TESTED
Marquis K120-S workstation with dual AMD “1.2 GHz” processors 
www.aslab.com

HOW WE TESTED
Red Hat Linux v7.1 running kernel 2.4.2-2smp
www.redhat.com

OBLcpu v1.0 benchmark
OBLmemband v1.0 benchmark
OBLdisk v1.0 benchmark
OBLload v1.0 benchmark
Gnofract4D

KEY FINDINGS
• The AMD-760MP chipset provided superior SMP scaling in a workstation scenario.

• The “1.2GHz” CPUs checked in at 1194 MHz and computational performance scaled perfectly from tests of an 850 MHz Athlon CPU.

• AMD’s Athlon provides about a 20% CPU-processing advantage over Pentium III-based systems running at the same clock speed.

• Memory throughput doubled with PC2100 DDRAM as compared to PC133 SDRAM.

 

The business case for all of these microprocessor pyrotechnics is simply the ability to excel at executing cutting-edge software in such areas as digital content creation for streaming over the Internet, 3D modeling, commercial desktop publishing, speech recognition, and large-scale code compilations. All of these applications significantly stress processor and system bandwidth. What’s more, Athlon’s binary compatibility with existing x86 CPUs ensures that commercial software applications that run on Intel-based systems run on Athlon-based systems.

The AMD-760 MP chipset that powers the Marquis K120 features a two-way multiprocessor core logic solution, an enhanced AMD Athlon system bus, support for DDR (Double Data Rate) memory technology, and an AGP-4X graphics interface. Each “1.2GHz” CPU (Red Hat reported both at 1194 MHz) connects to an AMD-762 North Bridge system controller over its own double-clocked (2 x 133 MHz) front side bus. In turn, the AMD-762 connects over a 266 MHz bus to four banks of DDR memory.

DDRAM (Double Data Rate RAM) evolved from SDRAM as a cost-effective alternative to Rambus memory, which is used in Pentium 4-based workstations. As with Ultra160 SCSI, DDRAM provides two data transfers per clock cycle by utilizing both the rising and falling edges of the clock signal. To implement this scheme, a DDR DIMM requires 184 pins rather than the 168 pins that mark an SDRAM DIMM. As a result, the two types of memory are not interchangeable.

The naming convention for DDR DIMMs prescribed by JEDEC, the semiconductor engineering standardization body of the Electronic Industries Alliance (EIA), references memory chips by their effective data rate and DDR DIMMs by their peak bandwidth in MB per second. Under this scheme, 266 MHz DDR memory chips are called DDR266. On the other hand, the DDR DIMMs used in the Marquis K120 are denoted by the maximum amount of data in MB that can be delivered per second. Since the DIMMs are 8 bytes wide, theoretically they deliver 8 bytes x 266 million transfers per second, or approximately 2.1 GB per second. As a result, the DIMMs, which have DDR266 chips, are dubbed PC2100. In particular, our workstation sported four PC2100 DDR DIMMs with ECC, each with a capacity of 256 MB.
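The PC2100 arithmetic can be checked with a few lines of Python. This is an illustrative sketch only: the function name is ours, the 133 MHz clock is the nominal figure, and the results are theoretical peaks, not measurements.

```python
def peak_bandwidth_mb_per_s(clock_mhz, transfers_per_clock, width_bytes):
    """Theoretical peak bandwidth in MB per second (1 MB = 10**6 bytes)."""
    return clock_mhz * transfers_per_clock * width_bytes

# PC133 SDRAM: one transfer per clock on a 133 MHz, 8-byte-wide DIMM
pc133 = peak_bandwidth_mb_per_s(133, 1, 8)
# DDR266 chips: two transfers per clock on the same nominal 133 MHz clock
pc2100 = peak_bandwidth_mb_per_s(133, 2, 8)

print(f"PC133 SDRAM peak: {pc133} MB/s")
print(f"DDR266 DIMM peak: {pc2100} MB/s")
print(f"DDR/SDR ratio:    {pc2100 / pc133:.1f}x")
```

The nominal DDR figure of 2,128 MB per second is what gets rounded to the "2100" in the PC2100 label, and the ratio of 2 is the theoretical doubling over PC133 SDRAM.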

 
     
 

The AMD-762 North Bridge system controller also connects to the graphics subsystem. In particular, it connects to a 4X AGP slot via a 66 MHz bus. ASL populates this slot with a Matrox Millennium G450 graphics accelerator with 32 MB of video memory. One of the more interesting features of this card is its ability to support two different monitors simultaneously.

The AMD-766 South Bridge peripherals controller works in tandem with the AMD-762 controller to provide access to five 64-bit PCI upgrade slots clocked at 33 MHz. This controller integrates with an Adaptec 7899 dual-channel Ultra160 SCSI controller. Up to 30 devices can be connected internally via 68-pin Ultra160 LVD/SE connectors. In addition, this controller connects to two internal 3Com 3c920 10/100 Fast Ethernet controllers.

For many readers, this dazzling array of internal peripherals may seem like a bit of overkill when it comes to a workstation. For SOHO Pentium 4 systems that’s quite true, but for high-powered commercial workstations, which are more often than not Alpha or SPARC-based, this is hardly a case of overkill. It does, however, point to a version of the Marquis K120 that will be introduced in the near term as a server. All that is really necessary is the addition of support for redundant power supplies.

   
 

To prove the mettle of this prodigious workstation, OpenBench Labs ran an equally prodigious number of benchmarks on the platform. Naturally we started with our OBLcpu benchmark, which runs 34 numerically intensive kernels with a mix of integer and floating point arithmetic done in both single and double precision. While the OpenBench Labs benchmark has an obvious relationship to the performance of scientific, mathematical, and engineering workstation applications, there is also a close relationship between floating point calculation performance and high-end graphics performance.

Extensive floating point arithmetic can be found in MPEG-2 video encoding, speech recognition, financial modeling, and trading applications. These calculations are essential for geometric calculations common in CAD/CAE processing, high-precision mathematical calculations, and 3D graphics applications involving physics, geometry, and triangle setup. With the need to deliver increasing levels of realism and detail when modeling physical objects in motion, higher-quality digital video, and a richer web experience, good floating point performance has become a requirement across the workplace.

   
       
  In our single processor test, each of the CPUs in the Marquis K120 benchmarked as expected from earlier results with an 850 MHz Athlon CPU.
   
 

When we first ran our OBLcpu benchmark on the single 850 MHz Athlon CPU, 3 out of every 4 kernels ran distinctly faster than on a server using a single 866 MHz Pentium III processor. In a few instances, the differential was as high as 200% to 300%. Nonetheless, the best way to obtain a single performance number is to take the geometric mean of all of the kernels. This measure is the least susceptible to being overweighted by data points at the extremities.
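To see why the geometric mean resists outliers, consider the minimal Python sketch below. The per-kernel ratios are hypothetical values chosen for illustration and are not OBLcpu data; the function name is likewise ours.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical speed ratios: four kernels near 1.2x and one outlier at 4.0x
ratios = [1.1, 1.2, 1.2, 1.3, 4.0]

arithmetic = sum(ratios) / len(ratios)
geometric = geometric_mean(ratios)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"geometric mean:  {geometric:.2f}")
```

With these made-up ratios, the single outlier pulls the arithmetic mean further from the typical kernel than it pulls the geometric mean.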

For our 866 MHz Intel Pentium III, the geometric mean of our 34 kernels clocked in at a processor index of 254. Our 850 MHz Athlon CPU clocked in at 300. Right on the numbers, the new 1194 MHz Athlon CPU in the Marquis K120 workstation was pegged at 411. OpenBench Labs currently normalizes performance to a standard 300 MHz Pentium III running Windows 2000, which we set to 100. More importantly, within a 95% statistical confidence region, performance for the 850 MHz AMD Athlon was between 275 and 353. In contrast, performance for the new 1194 MHz AMD processor was between 381 and 501. The bottom line for Linux users is the likelihood of a 20% performance edge when running on an Athlon as opposed to a Pentium III of equal speed. 
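The clock-for-clock comparison can be reproduced from the indexes quoted above. The sketch below is a back-of-the-envelope check, assuming that performance scales linearly with clock speed within a processor family; the variable names are ours.

```python
# Processor indexes and clock speeds as reported in the text
P3_INDEX, P3_MHZ = 254, 866
ATHLON_INDEX, ATHLON_MHZ = 300, 850
NEW_ATHLON_INDEX, NEW_ATHLON_MHZ = 411, 1194

# Index points delivered per MHz of clock speed
p3_per_mhz = P3_INDEX / P3_MHZ
athlon_per_mhz = ATHLON_INDEX / ATHLON_MHZ

# Athlon advantage over the Pentium III at equal clock speed
advantage = athlon_per_mhz / p3_per_mhz - 1

# Index projected for 1194 MHz, assuming linear scaling from the 850 MHz part
projected = ATHLON_INDEX * NEW_ATHLON_MHZ / ATHLON_MHZ

print(f"Athlon per-clock advantage: {advantage:.1%}")
print(f"Projected index at 1194 MHz: {projected:.0f} (measured {NEW_ATHLON_INDEX})")
```

Under that linear assumption, the per-clock advantage works out to roughly 20%, and the projected index lands within a few percent of the measured 411 and well inside the 381 to 501 confidence region.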

As the OBLcpu benchmark is single-threaded, the modules only run on one CPU and do not take advantage of the multiprocessor architecture or stress the scalability of the SMP implementation. To test this aspect of the ASL Marquis K120, we scripted the execution of 2, 4, and then 8 simultaneous instances of the OBLcpu program. For a single-user workstation, running 8 simultaneous CPU-intensive processes is a likely limit.

   
       
  To test the SMP characteristics of the dual processor workstation under Red Hat Linux 7.1, we ran 2, 4, and 8 simultaneous copies of our single-threaded OBLcpu benchmark. In each case we recorded and plotted the geometric mean of the observed execution time. We also plotted (insets) a box-and-whisker statistical analysis of the distribution of the observed geometric means. In each case the geometric means clustered extremely closely and corresponded with theoretically perfect scalability.
   
 

Within this relatively restricted range of 8 simultaneous processes, the Marquis K120 scaled perfectly. With just two benchmarks running, each process had its own Athlon CPU and measured a performance index of 400. In general, with a 2-way SMP system, doubling the number of processes should effectively cut the performance index measured by each process in half. This is exactly what we observed. More importantly, if the kernel is optimally scheduling the distribution of time slices among competing processes, we should measure very little difference in the performance of each process. This is exactly what occurred. With 4 simultaneous processes, the performance index measured by each process ranged from a low of 206 to a high of 211.

We then redoubled the number of processes, and once again the performance index measured by each process was halved. Moreover, we measured little variation between the processes. With 8 simultaneous CPU benchmarks running, the lowest performance index measured was 109 and the highest was 112.
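The expectation behind these results can be written as a simple model. The Python sketch below is our own idealization, assuming CPU-bound processes and a scheduler that divides the two CPUs evenly; it is not part of the OBLcpu benchmark.

```python
def ideal_index_per_process(single_cpu_index, cpus, processes):
    """Index each process should see if CPU time is shared evenly.

    Idealized model: each process gets a full CPU until processes
    outnumber CPUs, after which each gets a cpus/processes share.
    """
    if cpus < 1 or processes < 1:
        raise ValueError("cpus and processes must be positive")
    return single_cpu_index * min(1.0, cpus / processes)

# Index of 400 per process with one process per CPU, as reported in the text
for n in (2, 4, 8):
    print(f"{n} processes: ideal index {ideal_index_per_process(400, 2, n):.0f}")
```

The model yields 400, 200, and 100 for 2, 4, and 8 processes, which can be set against the measured ranges of 206 to 211 and 109 to 112.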

To test the effectiveness of the Marquis K120’s DDRAM implementation, we ran our OBLmemband benchmark, which measures memory throughput by using the CPU to move data between memory locations and measuring the time taken. Few implementation factors are as critical to overall system performance as the usable bandwidth between the CPU and memory. Whether data is being moved to/from a device or simply manipulated by the CPU, it ultimately must be brought into memory before it can be used to achieve anything interesting.

The benchmark requires a minimum allocation of 64 MB of memory (we used 756 MB on the Marquis K120) so that a broad range of physical addresses can be used and the address translation caches can be effective without dominating the behavior of the overall system. OBLmemband varies the “stride” value as it proceeds through the test, stressing the bandwidth by transferring larger chunks of data.
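The general idea of timing CPU-driven copies at different chunk sizes can be sketched in Python. This is not OBLmemband, whose internals are not described here: the function, the 1 MB buffer, and the chunk sizes are all assumptions for illustration, and interpreter overhead means the numbers say little about the underlying hardware.

```python
import time

def copy_throughput_mb_per_s(buffer_bytes, chunk_bytes):
    """Copy one buffer into another in fixed-size chunks and report MB/s."""
    src = bytearray(buffer_bytes)
    dst = bytearray(buffer_bytes)
    src_view, dst_view = memoryview(src), memoryview(dst)

    start = time.perf_counter()
    for offset in range(0, buffer_bytes, chunk_bytes):
        # Slice assignment between memoryviews copies the bytes directly
        dst_view[offset:offset + chunk_bytes] = src_view[offset:offset + chunk_bytes]
    elapsed = time.perf_counter() - start

    return buffer_bytes / elapsed / 1e6

# A small buffer keeps the sketch quick; a real test would use far more memory
for chunk in (4, 8, 16, 32, 64):
    rate = copy_throughput_mb_per_s(1 << 20, chunk)
    print(f"chunk {chunk:2d} bytes: {rate:8.1f} MB/s")
```

A compiled benchmark with a buffer much larger than the CPU caches would be needed to approach the kind of measurement described above.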

   
       
  We ran the OBLmemband benchmark to assess the workstation’s DDRAM throughput as well as the system’s ability to scale memory access over 2 processors. The results were more than a pleasant surprise, as the DDRAM doubled throughput over PC133 SDRAM as it theoretically should, and memory throughput doubled again with two threads running in parallel.
   
 

Since the Enlight workstation we tested in January used PC133 ECC SDRAM, our OBLmemband benchmark could directly compare the memory bandwidth improvements offered by PC2100 ECC DDRAM. In theory at least, we should measure twice the throughput. In this case, theory won out. On memory strides ranging from 4 to 64 bytes, this is exactly what we measured, with throughput in MB per second averaging about 80% to 100% greater for DDRAM than for SDRAM with AMD processors. In addition, when we ran two threads in parallel, performance for strides ranging from 4 to 64 bytes once again virtually doubled.

In our final tests, we examined the performance of the disk I/O subsystem of the workstation. The use of an embedded Adaptec Ultra160 SCSI controller raised our expectations for the workstation; however, the performance of a single IBM drive absolutely surprised us. 

To defeat normal disk caching, we utilized a 2 GB test file. Streaming throughput as we utilized OBLdisk to linearly access the file with varying block sizes remained fairly constant at about 24 MB per second. We then used OBLload to randomly access the disk in a database pattern making 8 KB reads. Under this scenario we were able to reach approximately 350 I/Os per second before exceeding an average access time of 100 ms.
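These disk figures can be put on a common footing with some simple arithmetic. The Python sketch below is illustrative only: it treats the 100 ms access time as the average response time per request, which is our assumption, and applies Little's law to estimate the number of requests in flight.

```python
STREAMING_MB_PER_S = 24      # sequential throughput reported in the text
RANDOM_IOPS = 350            # random 8 KB reads per second reported in the text
REQUEST_KB = 8
RESPONSE_TIME_S = 0.100      # assumed to be the average response time per request
FILE_MB = 2 * 1024           # 2 GB test file

# Data actually moved per second under the random-read load
random_mb_per_s = RANDOM_IOPS * REQUEST_KB / 1024

# Little's law: requests in flight = arrival rate x time in system
outstanding_requests = RANDOM_IOPS * RESPONSE_TIME_S

# Time for one sequential pass over the test file
full_pass_seconds = FILE_MB / STREAMING_MB_PER_S

print(f"random-read throughput: {random_mb_per_s:.2f} MB/s")
print(f"requests in flight at the 100 ms limit: {outstanding_requests:.0f}")
print(f"one streaming pass over the file: {full_pass_seconds:.0f} s")
```

Under these assumptions, the random 8 KB workload moves less than 3 MB per second, a small fraction of the 24 MB per second streaming rate.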

   
       
  Streaming throughput off of a 2 GB file from the single IBM Ultra160 SCSI drive in our workstation consistently hovered about the 24 MB per second mark.
   
 

To get a sense of the workstation’s performance on a more macro scale, we ran a few unscientific experiments with Gnofract4D, which is a very robust fractal generation program. With a few deceptively simple clicks of the mouse, we delved deep into our fractal. In fact, it was so deceptively easy that it was not until we examined the fractal’s properties that we realized we had passed through more than 16,000 iterations.

Now that is a workaholic’s workstation. 

   
   