FROM FLUID DYNAMICS TO BUSINESS PERFORMANCE
From computational fluid dynamics simulations to the analysis of multidimensional sales data, the underlying mathematics is the same, and that unifies benchmarking for both high-performance computing and business performance computing.
Fluent is the leading supplier of CFD software for simulating complex phenomena involving turbulent, reacting, and multiphase flows. To assess the performance of the FLUENT Flow Modeling application, Fluent provides a suite of nine CFD simulations typical of those found in industry. Detailed results are maintained online to provide comprehensive and fair comparative information on the performance of FLUENT on available hardware platforms. The performance of the benchmark suite centers on the ability of the underlying operating system and hardware to handle increasingly fine-grained parallel execution of each of the nine problems. For the operating system, the key factor will be compiler runtime optimizations. For the hardware environment, the key factor will be inter-processor communications.
To run Fluent's suite of nine benchmark problems, openBench Labs installed SUSE Linux 10 on a 3U four-way Appro XtremeServer. This system featured Dual-Core AMD Opteron Model 880 processors giving the system eight processor cores, each clocked at 2.4GHz. Our goal was not to demonstrate the ability to scale an XtremeServer environment linearly for FLUENT without regard to cost. Rather, our goal was to demonstrate the explicit capabilities of a single Appro 3U XtremeServer to scale with respect to parallel processing. In this regard, we ran up to a maximum of eight instances of Fluent in parallel on the single XtremeServer. We then compared our results running the Fluent benchmark suite on a single server with benchmark results published by Fluent for a high-performance cluster that was powered by AMD Opteron processors clocked at 2.4GHz. This cluster was built around HP ProLiant DL585 servers running Red Hat Enterprise Linux v3 (RHEL3).
AMD Opteron processors utilize HyperTransport links for all inter-processor communications. On a four-way server, such as the Appro XtremeServer, each AMD Opteron Model 880 processor sports three 6.4GB-per-second inter-processor links. What’s more, fine-grained parallel applications, such as modeling, simulation, and other applications like our CFD benchmark, generate significant inter-process communications as they scale via parallelization. These applications are coded using a message-passing library specification, dubbed MPI, that has become the de facto standard. In a cluster, this type of communication is typically carried over a high-speed link, such as InfiniBand or Myrinet. In our tests, the XtremeServer’s implementation of AMD’s Direct Connect Architecture would support all high-speed memory and direct processor-to-processor communications within the bounds of our single system.
Each AMD Opteron processor also incorporates its own memory controller (in a dual-core processor, the two cores share a controller) for a direct connection to its own local memory, which can be accessed at 6.4GB per second. In this non-uniform memory architecture (NUMA), a processor accesses memory not under its direct control by communicating with the processor attached to that memory and requesting that it access and pass the data stored at that location. As the number of processors scales, this scheme significantly reduces memory access latency and enables memory bandwidth throughput to scale directly with the number of processors and their frequency. On the four-way Appro XtremeServers, that scheme pegs potential memory bandwidth at 25.6GB per second. AMD’s Direct Connect Architecture makes the intricacies of NUMA transparent to application programmers, PC applications, and the host operating system.
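The 25.6GB-per-second figure follows from simple arithmetic on the numbers quoted above. The snippet below is a minimal sketch of that calculation, assuming a deliberately simplified model in which every processor streams only from its own local memory; the function name `aggregate_memory_bandwidth` is ours, for illustration only.

```python
# Figures taken from the text; the linear-scaling model is a simplifying assumption.
LOCAL_BANDWIDTH_GB_S = 6.4  # bandwidth of each processor's on-chip memory controller
SOCKETS = 4                 # four-way Appro XtremeServer


def aggregate_memory_bandwidth(sockets: int, per_socket_gb_s: float) -> float:
    """Peak aggregate bandwidth if every socket reads only its local memory."""
    return sockets * per_socket_gb_s


peak = aggregate_memory_bandwidth(SOCKETS, LOCAL_BANDWIDTH_GB_S)
print(f"Potential memory bandwidth: {peak:.1f} GB/s")
```

Remote accesses that cross a HyperTransport link are not modeled here, so this is an upper bound rather than a prediction of measured throughput.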
At boot time, AMD’s multiprocessor system BIOS calls on each Opteron processor to report its local memory, and maps the local memory pools into a single globally addressable physical memory space. What's more, AMD Opteron processors automatically maintain cache coherency across that global address space. As a result, programs access data belonging to any processor just as they would in a traditional x86 symmetric multiprocessor (SMP) environment with a shared front-side bus, shared memory controller, and single bank of memory.
Each HP ProLiant DL585 in the comparison cluster was configured as a four-way server with single-core AMD Opteron Model 850 processors clocked at 2.4GHz. Each of these servers sported 8GB of DDR-400 memory and an InfiniBand network interface card. The complete cluster configuration also included an InfiniBand switch from Voltaire, which raised infrastructure costs considerably. The InfiniBand switch and interface cards add roughly $8,000 to the cost of the cluster. As a result, the HP cluster solution was nearly four times more costly than the standalone Appro XtremeServer.
The results of the Fluent benchmarks on our Appro XtremeServer followed this pattern by regularly outpacing the results of the HP systems running RHEL3 by about 10%. In fact, the results on the Appro XtremeServer were on par with the results obtained with a Sun server using faster AMD Opteron processors.
This means fluid dynamics applies equally to gas flow problems, such as the air flow around a Formula 1 Ferrari, and to liquid flow problems, such as the flow of crude oil in a pipeline. What’s more, fluid dynamics can be readily extended to solving heat transfer problems, which extends applicable problem domains to server cooling issues and weather pattern forecasting. With some clever engineering analogies, fluid dynamics has even been applied to modeling the flow of traffic. Nonetheless, the usefulness of a CFD benchmark does not come from any extension of the physics of the problem to a larger set of issues, but from an observation that the mathematics used to solve the problem computationally is the same as that used to solve many business-processing applications, from process optimization to multidimensional database analysis.
The mathematics at the heart of CFD comes out of classical Newtonian mechanics and the laws for the conservation of mass, momentum, and energy. Getting to a set of mathematical equations that can be readily solved starts with a physical description of the problem. That description can either focus on a fixed volume of fluid as it moves from point to point in space or it can focus on a fixed point in space and observe control volumes of fluid passing by. Using a technique known as the finite volume method, Fluent creates a numerical approximation by first deconstructing the control volume integral into surface area integrals and then into a discrete number of finite approximations. Normally, none of this methodology would be of any consequence to IT; however, by resolving the CFD problem in this manner, Fluent ends up performing the same matrix algebra calculations that are used in resolving operations research problems. That makes the performance of Fluent in a CFD benchmark a bellwether for high-level business operational planning, supply chain optimization, and multidimensional online analytical processing (OLAP), collectively dubbed “business performance computing.”
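To see how a finite volume discretization ends in matrix algebra, consider a toy problem far simpler than anything in the benchmark suite: steady one-dimensional heat conduction through a bar with fixed end temperatures. The sketch below is our own illustration, not Fluent's solver. The function names `solve_tridiagonal` and `steady_diffusion_1d`, the uniform grid, and the constant conductivity are all assumptions made for the example. Balancing the fluxes across the faces of each cell yields one linear equation per cell, and the resulting tridiagonal system is solved by standard forward elimination and back substitution.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[0] and upper[-1] are ignored. Inputs are left unmodified.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def steady_diffusion_1d(n_cells, t_left, t_right):
    """Cell-centered finite volume solution of d/dx(k dT/dx) = 0.

    Assumes a uniform grid and constant conductivity k, which cancels out.
    Interior cells: flux in equals flux out, so -T[i-1] + 2*T[i] - T[i+1] = 0.
    Boundary cells sit half a cell width from the wall, which gives
    3*T[0] - T[1] = 2*t_left, and likewise at the right end.
    """
    if n_cells < 2:
        raise ValueError("need at least two cells")
    lower = [-1.0] * n_cells
    diag = [2.0] * n_cells
    upper = [-1.0] * n_cells
    rhs = [0.0] * n_cells
    diag[0], rhs[0] = 3.0, 2.0 * t_left
    diag[-1], rhs[-1] = 3.0, 2.0 * t_right
    return solve_tridiagonal(lower, diag, upper, rhs)


temperatures = steady_diffusion_1d(5, 100.0, 200.0)
print([round(t, 6) for t in temperatures])
```

With five cells between walls held at 100 and 200, the cell-center values should fall on the straight line between the two walls (approximately 110, 130, 150, 170, and 190). Production CFD codes assemble far larger, sparse systems in three dimensions, but the pattern of cells, face fluxes, and a linear system is the same.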
To compare the performance of different hardware platforms running Fluent's flow solver application, Fluent devised a series of nine problems, which are increasingly complex and difficult to solve. This complexity is reflected by the increasingly larger grid structures that are used to define each problem. These nine problems are further broken down into three groups: small problems, which have fewer than 100,000 cells in their defining grids; medium problems, which have fewer than 500,000 cells in their defining grids; and large problems, which have more than 800,000 cells in their defining grids. It is important to note that the greater CPU power and larger memory resources that typify today's high-performance computing systems have greatly diminished the usefulness of the small benchmarks, which will likely be dropped from Fluent's suite.
The primary metric used in representing performance associated with the Fluent benchmark suite is dubbed the “Performance Rating.” The rating of a system represents the number of benchmarks that can be run sequentially on a given machine in a 24-hour period. It is computed by dividing the number of seconds in a day by the number of seconds required to run the benchmark. This somewhat artificial construct provides a performance measure that increases with faster performance and thereby provides a more natural comparison for general audiences. Performance rating also provides a good graphical clue for the notion of parallel “speedup.” Speedup is the ratio of the time taken to solve a problem with a single process to the time taken with multiple parallel processes, which makes the rating for n processes in parallel equal to the rating for one process multiplied by the speedup factor for n processes. To ensure that all speedup factors are maximized, Fluent’s parallel processing benchmarks always associate each Fluent process with a unique physical processor.
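The rating and speedup definitions reduce to two one-line formulas. The following sketch works through them; the elapsed times are hypothetical values chosen for round numbers, not measurements from our tests, and the function names are ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def performance_rating(elapsed_seconds: float) -> float:
    """Number of benchmark runs that fit back to back into 24 hours."""
    return SECONDS_PER_DAY / elapsed_seconds


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Single-process time divided by the time with n parallel processes."""
    return serial_seconds / parallel_seconds


# Hypothetical timings for illustration only.
t1 = 1200.0  # seconds with one process
t4 = 320.0   # seconds with four processes

rating_1 = performance_rating(t1)
rating_4 = performance_rating(t4)
s4 = speedup(t1, t4)

print(f"rating(1) = {rating_1:.1f}, rating(4) = {rating_4:.1f}, speedup = {s4:.2f}")
```

With these made-up timings, the four-process rating equals the one-process rating multiplied by the speedup, which is the relationship the definitions imply. Ideal linear scaling on four processes would give a speedup of exactly 4.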
That 1-to-1 process-to-processor association makes the ideal speedup in performance linear; it can even be super-linear with data caching. To achieve linear scaling, these benchmarks are consistently run on clusters or blade servers that use InfiniBand or Myrinet to speed inter-processor communications. To further minimize internal bottlenecks, cluster members are limited to running two, or at most four, Fluent processes.
While the performance of the Appro XtremeServer is rather remarkable on its own merits, what is even more remarkable is the capacity of the XtremeServer to maintain a dramatic price-performance edge over high-performance computing clusters when running highly parallelizable CFD benchmarks. For small to medium-sized problems, the Appro XtremeServer was able to scale rating performance using up to eight parallel processes. Moreover, the Appro XtremeServer was able to hold a slight, but distinct, advantage in rating performance over a cluster of multiple four-way HP ProLiant DL585 servers.
Only as the size of the problems (measured by the number of cells in the defining surface grid) began to approach and then exceed one million cells did we begin to encounter problems supporting parallel execution across all eight processor cores. What we encountered in the largest CFD problems was a reduction in the number of parallel processes that we could support and a drop-off in the linearity of performance scalability at the highest number of supportable processes.
Within the context of a price-performance evaluation, however, the performance leadership of the Appro XtremeServer is exceeded by its price-performance leadership. In the most unfavorable situation with the most complex model, the Appro server was able to scale to four parallel processes in a near-linear fashion. That matched the performance of each of the cluster members within the competing configuration. Nonetheless, while the cluster was maintaining a two-to-one advantage in scalability as measured in rating performance, the inherent value proposition of the Appro server was still maintaining a near two-to-one advantage in terms of price-performance.
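The closing comparison can be checked with normalized figures. The sketch below assumes round numbers taken from the text: the cluster costs about four times as much as the single server and, in the worst case, delivers about twice the rating. Both values are approximations ("nearly" and "near" in the text), and the function name is ours.

```python
def price_performance(rating: float, cost: float) -> float:
    """Benchmark rating delivered per unit of cost."""
    return rating / cost


# Normalized to the Appro server; round figures assumed from the text.
appro = price_performance(rating=1.0, cost=1.0)
cluster = price_performance(rating=2.0, cost=4.0)  # ~2x the rating at ~4x the cost

advantage = appro / cluster
print(f"Appro price-performance advantage: {advantage:.1f} to 1")
```

Under these assumptions the single server keeps a two-to-one price-performance lead even when the cluster doubles its rating.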