FROM FLUID DYNAMICS TO
BUSINESS PERFORMANCE

From computational fluid dynamics simulations to the analysis of multi-dimensional sales data, the underlying mathematics is the same, and that unifies benchmarking for both high-performance and business performance computing.

   
 


by Jack Fegreus

August 16, 2006

   
     
 
The quaint notion that change is constant is dead. More than a decade of e-commerce, shrinking product lifecycles, accelerated globalization, intense corporate Darwinism, and rampant technology innovation has business thought leaders talking of "continuous discontinuities" and warning that businesses must be ready to respond as change accelerates. In a demand-driven business world, the common CEO mandate is to recast old supply chains into demand-driven supply networks. To fulfill that mandate, IT infrastructure must be ready to support computationally intensive tasks such as the creation of demand projections, supply plans, and the optimization of both supply and demand in accord with a defined business strategy.

Shrunken product lifecycles make speed-to-market critical for success and drive a need for multi-period forecast plans that proactively shape demand through sales promotions and new product introductions. To generate demand projections, software must perform a number of computationally intensive statistical analysis functions, including analytical modeling of profitability scenarios; what-if analysis for new product introductions; product lifecycle forecasting; seasonality profiles; promotion and rebate modeling; and capacity and material constraint optimization.

The next step, creating a demand-driven supply model, is even more complex. Customer, channel, supply, and product variability complicate the planning process, because any plan must account for material needs down to the lowest component of a bill-of-materials. As a result, sophisticated software and robust server technology are needed to aggregate and disaggregate plans, a process that has become a weekly or monthly event at high-performing companies and a daily effort at the world-class retail giant Wal-Mart.

For IT struggling to determine "the right size box" to support the activity of a demand-driven supply network, it helps to recognize that the mathematics driving this process is exactly the same as that underpinning applications such as business intelligence, structural engineering, and computational fluid dynamics (CFD). In terms of CPU, memory, and I/O resource utilization, handling the workload of a complex sales and operations planning process is no different from running a CFD application. In particular, all of these applications rely on large matrix algebra operations, which makes them good targets for parallel execution in both SMP and high-performance cluster configurations.

 
         
 
OPENBENCH LABS SCENARIO

 
UNDER EXAMINATION:  Computational Fluid Dynamics Benchmark

 HOW WE TESTED:
FLUENT CFD Benchmark
Nine industrial CFD benchmark problems
Linux executable built with Intel compiler

 WHAT WE TESTED:
Appro 3U XtremeServer
(4) Dual-Core AMD 880 CPUs
16GB DDR 400MHz
ECC registered RAM
PCIe: (2)16x, (1)4x; PCI-X: (3)133MHz

nVidia PCIe ASIC
(6) Hot-swap SATA drive bays



SUSE Linux 10.0

GNU Compiler Collection version 4.0




 KEY FINDINGS:

Both supply chain optimization and online analytical processing (OLAP) perform the same matrix calculations as CFD.

The four-way Appro XtremeServer with Dual-Core Opteron processors running SUSE Linux 10 with GCC v4.0 provided higher computational performance than clustered systems, and parallel processing scalability equal to theirs, on problems defined by grids with fewer than one million cells.

The 4-way Appro XtremeServer had a significant cost advantage over a cluster of HP ProLiant DL585 servers, which utilized an InfiniBand backbone for inter-processor communications.

 

Fluent is the leading supplier of CFD software for simulating complex phenomena involving turbulent, reacting, and multiphase flows. To assess the performance of the FLUENT Flow Modeling application, Fluent provides a suite of nine CFD simulations typical of those found in industry. Detailed results are maintained online to provide comprehensive and fair comparative information on the performance of FLUENT on available hardware platforms.

The performance of the benchmark suite centers on the ability of the underlying operating system and hardware to handle increasingly fine-grained parallel execution of each of the nine problems. For the operating system, the key factor will be compiler runtime optimizations. For the hardware environment, the key factor will be inter-processor communications.

 
     
 

To run Fluent's suite of nine benchmark problems, openBench Labs installed SUSE Linux 10 on a 3U four-way Appro XtremeServer. This system featured Dual-Core AMD Opteron Model 880 processors giving the system eight processor cores, each clocked at 2.4GHz. 

Our goal was not to demonstrate the ability to scale an XtremeServer environment linearly for FLUENT without regard to cost. Rather, our goal was to demonstrate the explicit capabilities of a single Appro 3U XtremeServer to scale with respect to parallel processing. In this regard, we ran up to a maximum of eight instances of Fluent in parallel on the single XtremeServer. We then compared our results running the Fluent benchmark suite on a single server with benchmark results published by Fluent for a high-performance cluster that was powered by AMD Opteron processors clocked at 2.4GHz. This cluster was built around HP ProLiant DL585 servers running Red Hat Enterprise Linux v3 (RHEL3).

The XtremeServer's ability to scale in parallel execution of the FLUENT benchmark problems would depend on the motherboard's implementation of HyperTransport technology. Unlike the 20-year-old PC shared-bus architecture, HyperTransport technology utilizes paired unidirectional point-to-point links in the form of high-bandwidth, low-latency, flow-through tunnels that pass data at up to 12.8GB per second. Connecting bridges to tunnels provides a mechanism to create switch and star HyperTransport topologies. This interconnect architecture has profound effects on standard I/O connections, as well as on memory access and inter-processor message passing.

AMD Opteron processors utilize HyperTransport links for all inter-processor communications. On a four-way server, such as the Appro XtremeServer, each AMD Opteron processor Model 880 sports three 6.4GB-per-second inter-processor links. This matters because fine-grained parallel applications, such as modeling, simulation, and our CFD benchmark, generate significant inter-process communications as they scale via parallelization. These applications are coded using a message-passing library specification, dubbed MPI, that has become the de facto standard. In a cluster, this type of communication is typically carried over a high-speed link, such as InfiniBand or Myrinet. In our tests, the XtremeServer’s implementation of AMD’s Direct Connect Architecture would support all high-speed memory and direct processor-to-processor communications within the bounds of our single system.

Each AMD Opteron processor also incorporates its own memory controller (in a dual-core processor, the two cores share a controller) for a direct connection to its own local memory, which can be accessed at 6.4GB per second. In this non-uniform memory architecture (NUMA), a processor accesses memory not under its direct control by communicating with the processor attached to that memory and requesting that it access and pass the data stored at that location. As the number of processors scales, this scheme significantly reduces memory access latency and enables memory bandwidth throughput to scale directly with the number of processors and their frequency. On the four-way Appro XtremeServers, that scheme pegs potential memory bandwidth at 25.6GB per second.
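As a rough check on that figure, the arithmetic can be sketched in a few lines of Python. This is an illustration, not a measurement: the function name and constant are our own, and the result is a theoretical peak that assumes every processor streams only from its own local memory.

```python
# Peak bandwidth of one Opteron memory controller in GB/s (figure quoted in the article).
PER_CONTROLLER_GBPS = 6.4


def aggregate_bandwidth(sockets: int, per_controller: float = PER_CONTROLLER_GBPS) -> float:
    """Theoretical peak bandwidth when each socket reads only its local memory."""
    return sockets * per_controller


# The two cores of a dual-core part share one controller, so a four-way
# server has four controllers, not eight.
four_way_peak = aggregate_bandwidth(4)
print(f"{four_way_peak:.1f} GB/s")
```

Any access that has to hop across a HyperTransport link to another processor's memory will fall short of this peak, which is why the figure is best read as an upper bound.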

AMD’s Direct Connect Architecture makes the intricacies of NUMA transparent to application programmers and to the host operating system. At boot time, AMD’s multiprocessor system BIOS calls on each Opteron processor to report its local memory, and maps the local memory pools into a single globally addressable physical memory space. What's more, AMD Opteron processors automatically maintain cache coherency across that global address space. As a result, programs access data belonging to any processor just as they would in a traditional x86 symmetric multiprocessor (SMP) environment with a shared front-side bus, shared memory controller, and single bank of memory.

Each HP ProLiant DL585 was configured as a four-way server with single-core AMD Opteron Model 850 processors clocked at 2.4GHz. Each of these servers sported 8GB of DDR-400 memory and an InfiniBand network interface card. The complete cluster configuration also included an InfiniBand switch from Voltaire, which raised infrastructure costs considerably: the InfiniBand switch and interface cards add roughly $8,000 to the cost of the cluster. As a result, the HP cluster solution was nearly four times as costly as the standalone Appro XtremeServer.

This created a very interesting price-performance dynamic. In terms of processor cores and memory capacity, both systems had identical resources: eight AMD Opteron cores clocked at 2.4GHz and 16GB of DDR-400 memory. However, there was a significant variation in the Linux distributions. By running RHEL3, the HP cluster provided a GNU C v3 runtime environment, whereas SUSE Linux 10 provided the first full version 4.0 runtime environment for the GNU Compiler Collection, which had an extensive overhaul of its language-independent optimizers. Every component of the SUSE Linux 10.0 distribution was compiled and checked with GCC 4.0. In tests of our openBench Labs CPU benchmark compiled with GNU C v3.x, simply running the benchmark without a recompile on SUSE Linux 10.0 improved performance by about 10%.

The results of the Fluent benchmarks on our Appro XtremeServer followed this pattern by regularly outpacing the results of the HP systems running RHEL3 by about 10%. In fact, the results on the Appro XtremeServer were on par with the results obtained with a Sun server using faster AMD Opteron processors.

As a technology, fluid dynamics can be applied to a surprisingly wide range of applications. From the perspective of classical mechanics, the starting point is the observation that liquids and gases behave in the same manner when they are moving. For physicists, this means they obey the same laws of motion. As a result, hydrodynamics and aerodynamics are particular instances of fluid dynamics.

This means fluid dynamics can be applied equally to gas flow problems, such as the air flow around a Formula-1 Ferrari, and to liquid flow problems, such as the flow of crude oil in a pipeline. What’s more, fluid dynamics can be readily extended to solving heat transfer problems, which extends applicable problem domains to server cooling issues and weather pattern forecasting. With some clever engineering analogies, fluid dynamics has even been applied to modeling the flow of traffic.

Nonetheless, the usefulness of a CFD benchmark does not come from any extension of the physics of the problem to a larger set of issues, but from the observation that the mathematics used to solve the problem computationally is the same as that used in many business-processing applications, from process optimization to multidimensional database analysis.

The mathematics at the heart of CFD comes out of classical Newtonian mechanics and the laws for the conservation of mass, momentum, and energy. Getting to a set of mathematical equations that can be readily solved starts with a physical description of the problem. That description can either focus on a fixed volume of fluid as it moves from point to point in space or it can focus on a fixed point in space and observe control volumes of fluid passing by. Out of that latter (Eulerian) approach comes the Reynolds transport theorem, which deals with how the contents of a control volume change over time and represents the problem as a control volume integral, which in turn can be readily approximated and solved numerically.

Using a technique known as the finite volume method, Fluent creates a numerical approximation by first deconstructing the control volume integral into surface area integrals and then into a discrete number of finite approximations. Normally, none of this methodology would be of any consequence to IT; however, by resolving the CFD problem in this manner, Fluent ends up performing the same matrix algebra calculations that are used in resolving operations research problems. That makes the performance of Fluent in a CFD benchmark a bellwether for high-level business operational planning, supply chain optimization, and multi-dimensional online analytical processing (OLAP), which together are popularly dubbed “business performance computing.”
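To see why a finite volume discretization ends in matrix algebra, consider a deliberately tiny stand-in for a real CFD problem: steady heat conduction along a one-dimensional bar with fixed end temperatures. The Python sketch below is not Fluent's solver; the function names, the uniform grid, and the boundary treatment are assumptions chosen for illustration. Balancing the fluxes across the faces of each cell yields one linear equation per cell, and together those equations form a tridiagonal system, solved here with the Thomas algorithm.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system with the Thomas algorithm."""
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_heat_1d(n_cells, t_left, t_right):
    """Cell-centered finite volume solution of steady 1D heat conduction.

    Assumes uniform conductivity and cell width, so both cancel out of
    every flux balance. Requires at least two cells.
    """
    if n_cells < 2:
        raise ValueError("n_cells must be at least 2")
    # Interior cells: flux in from the west face equals flux out of the east face,
    # which gives T[i-1] - 2*T[i] + T[i+1] = 0.
    lower = [1.0] * n_cells
    diag = [-2.0] * n_cells
    upper = [1.0] * n_cells
    rhs = [0.0] * n_cells
    # Boundary cells: the wall sits half a cell width from the cell center,
    # which doubles the coefficient of the wall flux.
    lower[0] = 0.0
    diag[0] = -3.0
    rhs[0] = -2.0 * t_left
    upper[-1] = 0.0
    diag[-1] = -3.0
    rhs[-1] = -2.0 * t_right
    return solve_tridiagonal(lower, diag, upper, rhs)


temps = solve_heat_1d(5, 100.0, 200.0)
print([round(t, 1) for t in temps])  # cell-center temperatures rise linearly
```

A production solver works on three-dimensional grids with hundreds of thousands of cells and relies on sparse iterative methods rather than a direct tridiagonal solve, but the pattern is the same: the physics is reduced to a large, sparse system of linear equations.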

 
         
 

To compare the performance of different hardware platforms running Fluent's flow solver application, Fluent devised a series of nine problems, which are increasingly complex and difficult to solve. This complexity is reflected in the increasingly large grid structures that are used to define each problem. These nine problems are further broken down into three groups: small problems, which have fewer than 100,000 cells in their defining grids; medium problems, which have fewer than 500,000 cells in their defining grids; and large problems, which have more than 800,000 cells in their defining grids. It is important to note that the greater CPU power and larger memory resources that typify today's high-performance computing systems have greatly diminished the usefulness of the small benchmarks, which will likely be dropped from Fluent's suite.

The primary metric used in representing performance associated with the Fluent Benchmark suite is dubbed the “Performance Rating.” The rating of a system represents the number of times a benchmark can be run sequentially on a given machine in a 24-hour period. It is computed by dividing the number of seconds in a day by the number of seconds required to run the benchmark. This somewhat artificial construct provides a performance measure that increases with faster performance and thereby provides a more natural comparison for general audiences.
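The rating calculation is simple enough to sketch in Python. The function name and the 432-second run time below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the published Fluent results.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def performance_rating(elapsed_seconds: float) -> float:
    """Number of back-to-back benchmark runs that fit into 24 hours."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return SECONDS_PER_DAY / elapsed_seconds


# Hypothetical example: a benchmark that completes in 432 seconds.
print(performance_rating(432))  # 200.0
```

Halving the run time doubles the rating, which is what makes the metric read naturally as "bigger is better."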

Performance rating also provides a good graphical clue for the notion of parallel “speedup.” Speedup is the ratio of the time taken to solve a problem using a single process to the time taken using multiple parallel processes, which makes the rating for n processes in parallel equal to the rating for one process multiplied by the speedup factor for n processes. To ensure that all speedup factors are maximized, Fluent’s parallel processing benchmarks always associate each Fluent process with a unique physical processor. 
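The link between rating and speedup can be checked with a short Python sketch. Speedup is taken here as single-process time divided by parallel time, and parallel efficiency as speedup divided by the process count. The function names and the timings (800 seconds on one process, 125 seconds on eight) are hypothetical values used purely for illustration, not measured results.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def rating(elapsed_seconds: float) -> float:
    """Benchmark runs per 24-hour day."""
    return SECONDS_PER_DAY / elapsed_seconds


def speedup(t_single: float, t_parallel: float) -> float:
    """Single-process time divided by parallel time; larger is better."""
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_processes: int) -> float:
    """Fraction of ideal linear speedup achieved (1.0 means perfectly linear)."""
    return speedup(t_single, t_parallel) / n_processes


# Hypothetical timings: 800 s on one process, 125 s on eight processes.
t1, t8 = 800.0, 125.0
s8 = speedup(t1, t8)
print(s8, efficiency(t1, t8, 8))

# The rating on n processes equals the single-process rating times the speedup.
assert math.isclose(rating(t8), rating(t1) * s8)
```

An efficiency above 1.0 would indicate the super-linear case, where splitting the problem lets each process keep more of its working set in cache.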

That 1-to-1 process-to-processor association makes the ideal speedup in performance linear; it can even be super-linear with data caching. To achieve linear scaling, these benchmarks are consistently run on clusters or blade servers that use InfiniBand or Myrinet to speed inter-processor communications. To further minimize internal bottlenecks, cluster members are limited to running two, or at most four, Fluent processes.

While the performance of the Appro XtremeServer is rather remarkable on its own merits, what is even more remarkable is the capacity of the XtremeServer to maintain a dramatic price-performance edge vis-à-vis high-performance computing clusters when running highly parallelizable CFD benchmarks. For small to medium-sized problems, the Appro XtremeServer was able to scale rating performance using up to eight parallel processes. Moreover, the Appro XtremeServer was able to hold a slight, but distinct, advantage in rating performance over a cluster of multiple 4-way HP ProLiant DL585 servers.

 
In the Fluent medium-size benchmark group, the M1 benchmark models 500 coal particles that are entrained in a stream of air and injected into an industrial boiler to be burned. The M2 benchmark models turbulent flow in an automotive valve port using a hybrid zonal grid or mesh. The M3 benchmark models the injection and burning of natural gas (methane, CH4) in a high-velocity gas burner. In the Fluent large-size benchmark group, the L1 benchmark models transonic air flow around the surface of a research model combat aircraft. The L2 benchmark models exterior air flow around a simplified model of a passenger sedan. The L3 benchmark models the turbulent flow of air through a duct.
 
     
 

Only as the size of the problems—measured by the number of cells in the defining surface grid—began to approach and then exceed one million cells did we begin to encounter problems supporting parallel execution across all eight processor cores. What we encountered in the largest CFD problems was a reduction in the number of parallel processes that we could support and a drop off in the linearity of performance scalability at the highest number of supportable processes.

Within the context of a price-performance evaluation, however, the performance leadership of the Appro XtremeServer is exceeded by its price-performance leadership. In the most unfavorable situation with the most complex model, the Appro server was able to scale to four parallel processes in a near-linear fashion. That matched the performance of each of the cluster members within the competing configuration. Nonetheless, while the cluster was maintaining a two-to-one advantage in scalability as measured in rating performance, the inherent value proposition of the Appro server was still maintaining a near two-to-one advantage in terms of price-performance.
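The price-performance arithmetic behind that conclusion can be sketched in Python using only the ratios reported in this article: a cluster with roughly twice the rating on the largest problem, at nearly four times the cost. The values below are normalized units rather than actual prices or ratings, and the function name is our own.

```python
def rating_per_cost(rating: float, cost: float) -> float:
    """Performance rating delivered per unit of cost."""
    return rating / cost


# Normalized units: the Appro server is the baseline for both rating and cost.
appro = rating_per_cost(rating=1.0, cost=1.0)
# Cluster: about twice the rating at nearly four times the cost (ratios from the article).
cluster = rating_per_cost(rating=2.0, cost=4.0)

advantage = appro / cluster
print(advantage)  # 2.0
```

Because the cluster's cost multiple was described as "nearly" four, the single-server advantage lands slightly under two-to-one, which matches the "near two-to-one" figure above.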