NAS parallel benchmark results

Eric Barszcz

1993, Parallel & Distributed Technology: Systems & Applications, IEEE

8 pages

Abstract

Benchmark results are presented for the Numerical Aerodynamic Simulation (NAS) Program at NASA Ames Research Center, which is dedicated to advancing the science of computational aerodynamics. The benchmark performance results are for the Y-MP, Y-MP EL, and C-90 systems from Cray Research; the TC2000 from Bolt Beranek and Newman; the Gamma iPSC/860 from Intel; the CM-2, CM-200, and CM-5 from Thinking Machines; the CS-1 from Meiko Scientific; the MP-1 and MP-2 from MasPar Computer; ...

NAS Parallel Benchmark Results

D. H. Bailey and E. Barszcz
NAS Applied Research Branch, NASA Ames Research Center, Moffett Field, CA 94035

L. Dagum and H. D. Simon
Computer Sciences Corp., NASA Ames Research Center, Moffett Field, CA 94035

Abstract

The NAS Parallel Benchmarks have been developed at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a “pencil and paper” fashion. This paper presents performance results of various systems using the NAS Parallel Benchmarks. These results represent the best results that have been reported to us for the specific systems listed. They represent implementation efforts performed by personnel in both the NAS Applied Research Branch of NASA Ames and in other organizations.

1. Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is dedicated to advancing the science of computational aerodynamics. One key goal of the NAS organization is to demonstrate by the year 2000 an operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours. It is currently projected that the solution of this grand challenge problem will require a computer system that can perform scientific computations at a sustained rate approximately one thousand times faster than 1990 generation supercomputers. Most likely such a computer system will employ hundreds or even thousands of processors operating in parallel.

At the present time, there are several commercial highly parallel systems available with computing power roughly competitive with conventional supercomputers (even greater on some special problems). Unfortunately, there is little reliable data on the performance of such systems on state-of-the-art computational aerophysics problems. In general, the science of performance evaluation has not kept pace with advances in parallel computer hardware and architecture. There is not even a generally accepted benchmark strategy for highly parallel supercomputers.

In our view, the best benchmarking approach for highly parallel supercomputers is the “paper and pencil” benchmark. The idea is to specify a set of problems only algorithmically. Even the input data must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a unique solution exists, and the required output has to be brief yet detailed enough to certify that the problem has been solved correctly. But the details of the implementation should be left to the programmer as far as possible.

To this end, we have devised the NAS Parallel Benchmarks (NPB). These are a set of eight benchmark problems, each of which focuses on some important aspect of highly parallel supercomputing for aerophysics applications. Some extension of Fortran or C is required for implementations, and reasonable limits are placed on the usage of assembly code and the like, but otherwise programmers are free to utilize language constructs that give the best performance possible on the particular system being studied. The choice of data structures, processor allocation and memory usage is generally left to the discretion of the implementor.

The eight problems consist of five “kernels” and three “simulated computational fluid dynamics (CFD) applications”. Each of these is defined fully in [3]. The five kernels are relatively compact problems, each of which emphasizes a particular type of numerical computation. Compared with the simulated CFD applications, they can be implemented fairly readily and provide insight as to the general levels of performance that can be expected on these specific types of numerical computations.

The simulated CFD applications, on the other hand, usually require more effort to implement, but they are more indicative of the types of actual data movement and computation required in state-of-the-art CFD application codes. For example, in an isolated kernel a certain data structure may be very efficient on a certain system, and yet this data structure would be inappropriate if incorporated into a larger application. By comparison, the simulated CFD applications require data structures and implementation techniques that are more typical of real CFD applications.

Space does not permit a complete description of these benchmark problems. A more detailed description of these benchmarks, together with the rules and restrictions associated with the benchmarks, may be found in [2]. The full specification of the benchmarks is given in [3].

Sample Fortran programs implementing the NPB on a single processor system are available as an aid to implementors. These programs, as well as the benchmark document itself, are available from the following address: NAS Systems Division, Mail Stop 258-8, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Codes. The sample codes are provided on Macintosh floppy disks and contain the Fortran source codes, “README” files, input data files, and reference output data files for correct implementations of the benchmark problems. These codes have been validated on a number of computer systems ranging from conventional workstations to supercomputers.

In the following, each of the eight benchmarks will be briefly described, and then the best performance results we have received to date for each computer system will be given in Tables 2 through 9. These tables include memory requirements, run times and performance ratios. The performance ratios compare individual timings with the current best time on that benchmark achieved on one processor of a Cray Y-MP. The run times in each case are elapsed time of day figures, measured in accordance with the specifications given in [3]. Memory requirements are currently available for only some of these implementations. We hope to have complete information for these columns in future editions of this paper.

Note that performance rates are not cited in millions of floating point operations per second (megaflops) in these tables. We suggest instead that the actual run times (or, equivalently, the performance ratios) be examined when comparing different systems and implementations. For those who wish to compute megaflops figures for the NAS Parallel Benchmarks on any system, we insist that they be computed using the standard floating point operation (flop) counts given in Table 1. Table 1 also contains megaflops rates calculated in this manner for the current fastest implementation on one processor of the Cray Y-MP.

With the exception of the Integer Sort benchmark, these standard flop counts were determined by using the hardware performance monitor on a Cray Y-MP, and we believe that they are close to the minimal counts required for these problems. In the case of the Integer Sort benchmark, which does not involve floating-point operations, we selected a value approximately equal to the number of integer operations required, in order to permit the computation of performance rates analogous to megaflops rates. We reserve the right to change these standard flop counts in the future if deemed necessary.

| Benchmark Name | Abbr. | Operation Count | Y-MP/1 Rate (megaflops) |
|---|---|---|---|
| Emb. Parallel | EP | 2.668 × 10^10 | 211 |
| Multigrid | MG | 3.905 × 10^9 | 176 |
| Conjugate Gradient | CG | 1.508 × 10^9 | 127 |
| 3-D FFT PDE | FT | 5.631 × 10^9 | 196 |
| Integer Sort | IS | 7.812 × 10^8 | 68 |
| LU Sim. CFD Appl. | LU | 6.457 × 10^10 | 194 |
| SP Sim. CFD Appl. | SP | 1.020 × 10^11 | 216 |
| BT Sim. CFD Appl. | BT | 1.813 × 10^11 | 229 |

Table 1: Standard Operation Counts and Current Y-MP/1 Megaflops Rates

Whenever possible, we have tried to credit the actual individuals and organizations who have contributed the performance results cited in the tables. In these citations, NAS denotes the NAS Applied Research Branch at NASA Ames (including both NASA civil servants and Computer Sciences Corp. contractors); RIACS denotes the parallel systems division of the Research Institute for Advanced Computer Science, which is located at NASA Ames; BBN denotes Bolt, Beranek and Newman; Boeing denotes Boeing Computer Services, Inc.; CRI denotes Cray Research, Inc.; Intel denotes the Supercomputer Systems Division of Intel Corp.; Maspar denotes Maspar Computer Corp.; Meiko denotes Meiko Scientific Corp.; and TMC denotes Thinking Machines, Inc. Where no individual citation is made for a specific model, the results are due to vendor staff.

Unfortunately, the limited space in this report does not permit discussion of the methods used in any of these implementations. However, we have included references to technical papers describing these methods whenever such papers are available. Readers are referred to these documents for full details.

This report includes a number of new results not previously published. The Cray C-90, Cray Y-MP EL, the Maspar MP-1 and MP-2, and the Meiko CS-1 results in particular have not previously been disclosed. In quite a few other instances, results are improved from previous listings, reflecting improvements both in compilers and implementations. Efforts are currently underway to port the NAS Parallel Benchmarks to other systems, and we hope to have some results in the future.

2. The Embarrassingly Parallel Benchmark

The first of the five kernel benchmarks is an “embarrassingly parallel” problem. In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudorandom numbers, which are generated according to a particular scheme that is well-suited for parallel computation. This problem is typical of many “Monte-Carlo” applications. Since it requires almost no communication, in some sense this benchmark provides an estimate of the upper achievable limits for floating point performance on a particular system.

Results for the embarrassingly parallel benchmark are shown in Table 2. Not all systems exhibit high rates on this problem. This appears to stem from the fact that this benchmark requires references to several mathematical intrinsic functions, such as the Fortran routines AINT, SQRT, and LOG, and evidently these functions are not highly optimized on some systems. The memory requirement for this benchmark was minimal on all systems.

Intel iPSC/860 results are due to J. Baugh of Intel. CM-2 and CM-200 results are due to J. Richardson of TMC. Maspar results are due to J. MacDonald of Maspar.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 4.9 | 126.2 | 1.00 |
| Y-MP | 8 | 4.9 | 15.87 | 7.95 |
| Y-MP EL | 1 | 4.9 | 550.5 | 0.23 |
| Y-MP EL | 4 | 4.9 | 141.2 | 0.89 |
| C-90 | 1 | 4.9 | 47.60 | 2.65 |
| C-90 | 4 | 4.9 | 12.37 | 10.20 |
| C-90 | 16 | 4.9 | 3.19 | 39.56 |
| TC2000 | 64 | 1 | 284.0 | 0.44 |
| iPSC/860 | 32 | 1 | 102.7 | 1.23 |
| iPSC/860 | 64 | 1 | 51.4 | 2.46 |
| iPSC/860 | 128 | 1 | 25.7 | 4.91 |
| CM-2 | 8K | 1 | 126.6 | 1.00 |
| CM-2 | 16K | 1 | 63.9 | 1.97 |
| CM-2 | 32K | 1 | 33.7 | 3.74 |
| CM-2 | 64K | 1 | 18.8 | 6.71 |
| CM-200 | 8K | 1 | 76.9 | 1.64 |
| CM-200 | 16K | 1 | 39.2 | 3.22 |
| CM-200 | 32K | 1 | 20.7 | 6.10 |
| CM-200 | 64K | 1 | 10.9 | 11.58 |
| CS-1 | 16 | | 116.8 | 1.08 |
| MP-1 | 4K | | 248 | 0.51 |
| MP-1 | 16K | | 88 | 1.43 |

Table 2: Results of the Embarrassingly Parallel (EP) Benchmark

3. The Multigrid Benchmark

The second kernel benchmark is a simplified multigrid kernel, which solves a 3-D Poisson PDE. This problem is simplified in the sense that it has constant rather than variable coefficients as in a more realistic application. This code is a good test of both short and long distance communication, although the communication patterns are highly structured (as opposed to the conjugate gradient benchmark).

Results for this benchmark, for problem size 256^3, are shown in Table 3. Intel results are due to BCS. CM-2 and CM-200 results are due to J. Richardson of TMC.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 56.7 | 22.22 | 1.00 |
| Y-MP | 8 | 56.7 | 2.96 | 7.51 |
| Y-MP EL | 1 | 56.7 | 89.19 | 0.25 |
| Y-MP EL | 4 | 56.7 | 32.11 | 0.69 |
| C-90 | 1 | 56.7 | 8.65 | 2.57 |
| C-90 | 4 | 56.7 | 2.42 | 9.18 |
| C-90 | 16 | 56.7 | 0.96 | 23.14 |
| iPSC/860 | 128 | | 8.61 | 2.58 |
| CM-2 | 16K | | 45.8 | 0.49 |
| CM-2 | 32K | | 26.0 | 0.85 |
| CM-2 | 64K | | 14.1 | 1.58 |
| CM-200 | 16K | | 30.2 | 0.74 |
| CM-200 | 32K | | 17.2 | 1.29 |
| CS-1 | 16 | | 42.8 | 0.52 |
| MP-1 | 16K | | 13.1 | 1.70 |

Table 3: Results of the Multigrid (MG) Benchmark

4. The Conjugate Gradient Benchmark

In this benchmark, a conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication and employs sparse matrix vector multiplication.

The irregular communication requirement of this benchmark is evidently a challenge for all systems. Results, for problem size 2.0 × 10^6, are shown in Table 4. Intel results are due to BCS. CM-2 results are due to J. Richardson of TMC.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 10.4 | 11.92 | 1.00 |
| Y-MP | 8 | 10.4 | 2.38 | 5.01 |
| Y-MP EL | 1 | 10.4 | 65.35 | 0.18 |
| Y-MP EL | 4 | 10.4 | 23.91 | 0.50 |
| C-90 | 1 | 10.4 | 4.56 | 2.61 |
| C-90 | 4 | 10.4 | 1.51 | 7.89 |
| C-90 | 16 | 10.4 | 0.58 | 20.55 |
| TC2000 | 40 | | 51.4 | 0.23 |
| iPSC/860 | 128 | | 8.61 | 1.38 |
| CM-2 | 8K | | 25.6 | 0.47 |
| CM-2 | 16K | | 14.1 | 0.85 |
| CM-2 | 32K | | 8.8 | 1.35 |
| CM-200 | 8K | | 15.0 | 0.79 |
| CS-1 | 16 | | 67.5 | 0.18 |
| MP-1 | 4K | | 64.5 | 0.18 |
| MP-1 | 16K | | 14.6 | 0.82 |

Table 4: Results of the Conjugate Gradient (CG) Benchmark

5. The 3-D FFT PDE Benchmark

In this benchmark a 3-D partial differential equation is solved using FFTs. This kernel performs the essence of many “spectral” codes. It is a good test of long-distance communication performance.

The rules of the NAS Parallel Benchmarks specify that assembly-coded library routines may be used to perform matrix multiplication and one-dimensional, two-dimensional or three-dimensional FFTs. Thus this benchmark is somewhat unique in that computational library routines may be legally employed.

Results, for problem size 256^2 × 128, are shown in Table 5. Intel results are due to E. Kushner of Intel. CM-2 and CM-200 results are due to J. Richardson of TMC.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 42.9 | 28.77 | 1.00 |
| Y-MP | 8 | 42.9 | 4.19 | 6.87 |
| Y-MP EL | 1 | 42.9 | 122.6 | 0.23 |
| Y-MP EL | 4 | 42.9 | 34.9 | 0.82 |
| C-90 | 1 | 42.9 | 10.28 | 2.80 |
| C-90 | 4 | 42.9 | 2.58 | 11.2 |
| C-90 | 16 | 42.9 | 0.91 | 31.6 |
| iPSC/860 | 64 | | 20.93 | 1.37 |
| iPSC/860 | 128 | | 9.72 | 2.96 |
| CM-2 | 16K | | 37.0 | 0.78 |
| CM-2 | 32K | | 18.2 | 1.58 |
| CM-2 | 64K | | 11.4 | 2.52 |
| CM-200 | 8K | | 45.6 | 0.63 |
| CS-1 | 16 | | 170.0 | 0.17 |
| MP-1 | 16K | | 19.6 | 1.47 |

Table 5: Results of the 3-D FFT PDE (FT) Benchmark

6. The Integer Sort Benchmark

This benchmark tests a sorting operation that is important in “particle method” codes. This type of application is similar to “particle in cell” applications of physics, wherein particles are assigned to cells and may drift out. The sorting operation is used to reassign particles to the appropriate cells. This benchmark tests both integer computation speed and communication performance.

This problem is unique in that floating point arithmetic is not involved. Significant data communication, however, is required. Results, for problem size 2^23, are shown in Table 6. Intel results are due to E. Kushner of Intel. CM-2 results are due to L. Dagum of NAS.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 31.1 | 11.46 | 1.00 |
| Y-MP | 8 | 31.1 | 1.85 | 6.19 |
| Y-MP EL | 1 | 31.1 | 153.9 | 0.07 |
| Y-MP EL | 4 | 31.1 | 41.5 | 0.28 |
| C-90 | 1 | 31.1 | 5.20 | 2.20 |
| C-90 | 4 | 31.1 | 1.42 | 8.07 |
| C-90 | 16 | 31.1 | 0.57 | 20.1 |
| iPSC/860 | 32 | | 25.72 | 0.45 |
| iPSC/860 | 64 | | 17.26 | 0.66 |
| iPSC/860 | 128 | | 13.59 | 0.84 |
| CM-2 | 8K | | 215.1 | 0.05 |
| CM-2 | 16K | | 111.5 | 0.10 |
| CM-2 | 32K | | 56.0 | 0.20 |
| MP-1 | 16K | | 75 | 0.15 |
| CS-1 | 16 | | 62.7 | 0.18 |

Table 6: Results of the Integer Sort (IS) Benchmark

7. The Three Simulated CFD Application Benchmarks

The three simulated CFD application benchmarks are intended to accurately represent the principal computational and data movement requirements of modern CFD applications.

The first of these is called the lower-upper diagonal (LU) benchmark. It does not perform an LU factorization but instead employs a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse, block (5 × 5) lower and upper triangular system. This problem represents the computations associated with a newer class of implicit CFD algorithms, typified at NASA Ames by the code “INS3D-LU”. This problem exhibits a somewhat limited amount of parallelism compared to the next two.

The second simulated CFD application is called the scalar pentadiagonal (SP) benchmark. In this benchmark, multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved.

The third simulated CFD application is called the block tridiagonal (BT) benchmark. In this benchmark, multiple independent systems of non-diagonally dominant, block tridiagonal equations with a 5 × 5 block size are solved.

SP and BT are representative of computations associated with the implicit operators of CFD codes such as “ARC3D” at NASA Ames. SP and BT are similar in many respects, but there is a fundamental difference with respect to the communication to computation ratio.

Performance figures for the three simulated CFD applications, for problem size 64^3, are shown in Tables 7, 8 and 9. Timings are cited as complete run times, in seconds, as with the other benchmarks. A complete solution of the LU benchmark requires 250 iterations. For the SP benchmark, 400 iterations are required. For the BT benchmark, 200 iterations are required. Intel and CM-2 results are due to S. Weeratunga, R. Fatoohi, E. Barszcz and V. Venkatakrishnan of NAS, except that BT and SP results on the Intel are due to BCS.

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 32.3 | 333.5 | 1.00 |
| Y-MP | 8 | 32.3 | 49.50 | 6.74 |
| Y-MP EL | 1 | 32.3 | 1449 | 0.23 |
| Y-MP EL | 4 | 32.3 | 522.3 | 0.64 |
| C-90 | 1 | 32.3 | 157.6 | 2.12 |
| C-90 | 4 | 32.3 | 43.94 | 7.59 |
| C-90 | 16 | 32.3 | 17.62 | 18.93 |
| TC2000 | 62 | | 3032 | 0.11 |
| iPSC/860 | 64 | 12 | 690.8 | 0.48 |
| iPSC/860 | 128 | 16 | 442.5 | 0.75 |
| CM-2 | 8K | 14 | 1307 | 0.26 |
| CM-2 | 16K | 14 | 850.0 | 0.39 |
| CM-2 | 32K | 14 | 572.0 | 0.58 |
| CS-1 | 16 | | 2937 | 0.11 |
| MP-1 | 4K | | 1958 | 0.17 |
| MP-2 | 4K | | 658 | 0.51 |

Table 7: Results for the LU Simulated CFD Application

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 9.2 | 471.5 | 1.00 |
| Y-MP | 8 | 9.2 | 64.60 | 7.30 |
| Y-MP EL | 1 | 9.2 | 2026 | 0.23 |
| Y-MP EL | 4 | 9.2 | 601.9 | 0.78 |
| C-90 | 1 | 9.2 | 184.7 | 2.55 |
| C-90 | 4 | 9.2 | 49.74 | 9.48 |
| C-90 | 16 | 9.2 | 13.06 | 36.10 |
| TC2000 | 112 | | 880.0 | 0.54 |
| iPSC/860 | 64 | | 667.3 | 0.71 |
| iPSC/860 | 128 | | 449.5 | 1.05 |
| CM-2 | 8K | | 3900 | 0.12 |
| CM-2 | 16K | | 2104 | 0.22 |
| CM-2 | 32K | | 1080 | 0.44 |
| CS-1 | 16 | | 2975 | 0.16 |
| MP-1 | 4K | | 1772 | 0.27 |
| MP-2 | 4K | | 668 | 0.71 |

Table 8: Results for the SP Simulated CFD Application

| Computer System | No. Proc. | Memory (mwords) | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| Y-MP | 1 | 42.3 | 792.4 | 1.00 |
| Y-MP | 8 | 42.3 | 114.0 | 6.95 |
| Y-MP EL | 1 | 42.3 | 4033 | 0.20 |
| Y-MP EL | 4 | 42.3 | 1208 | 0.66 |
| C-90 | 1 | 42.3 | 356.9 | 2.22 |
| C-90 | 4 | 42.3 | 96.10 | 8.25 |
| C-90 | 16 | 42.3 | 28.39 | 27.91 |
| TC2000 | 112 | | 1378 | 0.58 |
| iPSC/860 | 64 | | 714.7 | 1.11 |
| iPSC/860 | 128 | | 414.3 | 1.91 |
| CM-2 | 16K | | 3328 | 0.24 |
| CM-2 | 32K | | 1914 | 0.41 |
| CS-1 | 16 | | 2984 | 0.27 |
| MP-1 | 4K | | 2420 | 0.33 |
| MP-2 | 4K | | 870 | 0.91 |

Table 9: Results for the BT Simulated CFD Application

8. Other Results

As far as we have been able to determine, the timings presented above all represent runs that fully comply with the rules and restrictions stated in the benchmark document [3]. One of these rules is that except for a short list of mathematical functions, assembly language and assembly-coded library routines may not be used for computation. The exceptions include the standard Fortran intrinsic functions, as well as routines to perform dense matrix multiplication and fast Fourier transforms.

There are several reasons for these restrictions on assembly code. First of all, without restrictions of some sort, an entire benchmark might be implemented in assembly-level code. While such performance results might be interesting, they would hardly be indicative of the performance that a scientist could reasonably expect on a full-scale application program.

One reason that only the above-mentioned routines are allowed is that in our experience only these are generally available on new systems. For more specialized library routines, it is difficult to determine whether they are truly general purpose, i.e. not relying on a specific data layout. Furthermore, even if an assembly-coded library routine can be utilized for an inner computational kernel, this does not help the large mass of additional coding that comprises a full-scale application. In short, the tuning rules for the NPB reflect our expectation (and experience) that real scientific applications consist largely of Fortran or C code, and that usage of library routines is restricted to a handful of widely available mathematical functions.

Nonetheless, some scientists have attempted implementations of the NPB using library routines beyond the ones allowed in [3]. In particular, Thinking Machines, Inc. has obtained performance results using assembly-coded library routines for several of the NPB. Their implementation of the IS benchmark, for example, runs more than twice as fast as reported in Table 6, and their rates for the BT benchmark are nearly three times as fast as reported in Table 9. Some of these results are shown in Table 10 [4].

| Benchmark | Computer System | No. Proc. | Time (sec.) | Ratio to Y-MP/1 |
|---|---|---|---|---|
| IS | CM-2 | 16K | 35.8 | 0.32 |
| IS | CM-2 | 32K | 21.0 | 0.55 |
| IS | CM-2 | 64K | 14.9 | 0.77 |
| IS | CM-200 | 64K | 5.7 | 2.01 |
| LU | CM-2 | 16K | 868.0 | 0.38 |
| LU | CM-2 | 32K | 546.0 | 0.61 |
| SP | CM-2 | 16K | 1444 | 0.33 |
| SP | CM-2 | 32K | 917.0 | 0.51 |
| SP | CM-2 | 64K | 640.0 | 0.74 |
| BT | CM-2 | 16K | 1118 | 0.71 |
| BT | CM-2 | 32K | 634.0 | 1.25 |
| BT | CM-2 | 64K | 370.0 | 2.14 |
| BT | CM-200 | 16K | 832.0 | 0.95 |
| BT | CM-200 | 32K | 601.0 | 1.32 |

Table 10: Unofficial TMC Results Using Library Routines

9. Sustained Performance Per Dollar

One aspect of the relative performance of these systems has not been addressed so far, namely the differences in price between these systems. We should not be too surprised that the Cray C-90 system, for example, exhibits superior performance rates on these benchmarks, since its current purchase price is much higher than that of the iPSC/860 and the CM-2.

One way to compensate for these price differences is to compute sustained performance per million dollars, i.e. the performance ratio figures shown in Tables 2 through 9 divided by the purchase price in millions. Some figures of this type are shown in Table 11 for two of the benchmarks, the FT benchmark and the LU benchmark, and for several systems. They are based on 36 million, 25 million, 3 million, 5 million, 1 million, 500,000 and 300,000 U.S. dollars, respectively, for the Cray C-90, the Cray Y-MP, the Intel iPSC/860, the CM-2, the Maspar MP-1, the Maspar MP-2 and the Meiko CS-1. These are approximate current prices, obtained from vendor personnel, for complete systems with 16, 8, 128, 32K, 16K, 4K and 16 processors, respectively, with one, two, one, four, one, 0.25 and 0.5 gigabytes of main memory, respectively, and with a typical set of peripherals. Because of the approximate and changeable nature of these prices, and because the memory sizes, disk capacities and I/O performances of these systems are certainly not equivalent, the figures in the last column of Table 11 should be interpreted as only very rough indications of sustained performance per dollar.

| Benchmark | Computer System | No. Proc. | Ratio to Y-MP/1 | Perf. per million $ |
|---|---|---|---|---|
| FT | C-90 | 16 | 31.60 | 0.87 |
| FT | Y-MP | 8 | 6.87 | 0.27 |
| FT | iPSC/860 | 128 | 2.96 | 0.99 |
| FT | CM-2 | 32K | 2.52 | 0.50 |
| FT | MP-1 | 16K | 1.47 | 1.47 |
| FT | CS-1 | 16 | 0.17 | 0.57 |
| LU | C-90 | 16 | 18.93 | 0.53 |
| LU | Y-MP | 8 | 6.74 | 0.27 |
| LU | iPSC/860 | 128 | 0.75 | 0.25 |
| LU | CM-2 | 32K | 0.58 | 0.12 |
| LU | MP-2 | 4K | 0.51 | 1.02 |
| LU | CS-1 | 16 | 0.11 | 0.37 |

Table 11: Approximate Sustained Performance Per Dollar

10. Conclusions

With some algorithmic experimentation and tuning, respectable NPB performance rates have been achieved on several multiprocessor systems. The 16 processor Cray C-90 system is consistently the highest performing system tested, far surpassing any of the highly parallel systems. The Intel 128 processor iPSC/860 system and the 32K CM-2 system each show promise, but they do not yet demonstrate sustained performance comparable to full Cray systems. Instead, in both cases their rates appear to be equivalent to about one or, in some cases, two Y-MP processors. When sustained performance rates are normalized by system prices, the situation is somewhat different: the highly parallel systems are approximately on a par with the Cray systems.

The Cray NPB performance results uniformly are large fractions (in some cases over fifty percent) of the theoretical peak performance of these systems. By contrast, the NPB performance rates on the highly parallel systems are typically only two to five percent of the theoretical peak performance of these systems. Reasons for the low sustained-to-peak ratios on the highly parallel systems are not hard to identify: immature compilers, insufficient bandwidth between processors and main memory, and insufficient bandwidth between separate processing nodes. Clearly the challenge for the vendors of highly parallel systems is to alleviate these bottlenecks in future editions of their systems.

Some scientists have suggested that the answer to obtaining high performance rates on highly parallel computers is to substitute alternative algorithms that have lower interprocessor communication requirements. However, it has been the experience of the scientists in our research group that a certain amount of long-distance communication is unavoidable for these types of applications. Alternative algorithms that have higher computation rates usually require more iterations to converge to a solution and thus require more overall run time. Clearly it is pointless to employ numerically inefficient algorithms merely to exhibit artificially high performance rates on a particular parallel architecture [1].
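The derived figures in the paper (the megaflops rates of Table 1, the "Ratio to Y-MP/1" columns of Tables 2 through 9, and the per-dollar column of Table 11) all follow from simple arithmetic on the published operation counts, run times and approximate prices. The following is a minimal Python sketch of that arithmetic. The function and dictionary names are hypothetical and chosen only for illustration; the numbers are copied from Table 1 and from the one-processor Y-MP rows of Tables 2 through 9.

```python
# Illustrative sketch only: names are hypothetical, data is from the paper's tables.

# Standard operation counts from Table 1 (operations per complete run).
OP_COUNTS = {
    "EP": 2.668e10, "MG": 3.905e9, "CG": 1.508e9, "FT": 5.631e9,
    "IS": 7.812e8, "LU": 6.457e10, "SP": 1.020e11, "BT": 1.813e11,
}

# Best one-processor Cray Y-MP run times in seconds (first row of Tables 2-9).
YMP1_SECONDS = {
    "EP": 126.2, "MG": 22.22, "CG": 11.92, "FT": 28.77,
    "IS": 11.46, "LU": 333.5, "SP": 471.5, "BT": 792.4,
}


def megaflops(benchmark: str, seconds: float) -> float:
    """Rate in millions of operations per second, using the standard count."""
    return OP_COUNTS[benchmark] / seconds / 1.0e6


def ratio_to_ymp1(benchmark: str, seconds: float) -> float:
    """Performance ratio: Y-MP/1 run time divided by this system's run time."""
    return YMP1_SECONDS[benchmark] / seconds


def perf_per_million_dollars(ratio: float, price_millions: float) -> float:
    """Performance ratio divided by the purchase price in millions of dollars."""
    return ratio / price_millions


if __name__ == "__main__":
    # Megaflops rates for the one-processor Y-MP runs.
    for name, seconds in YMP1_SECONDS.items():
        print(f"{name}: {megaflops(name, seconds):.0f} megaflops on Y-MP/1")

    # FT on the 128-processor iPSC/860: 9.72 s, approximate price 3 million dollars.
    ratio = ratio_to_ymp1("FT", 9.72)
    print(f"FT iPSC/860 ratio {ratio:.2f}, "
          f"per million dollars {perf_per_million_dollars(ratio, 3.0):.2f}")
```

For the EP benchmark on one Y-MP processor, for example, 2.668 × 10^10 operations in 126.2 seconds gives about 211 megaflops, which agrees with the rate listed in Table 1. The per-dollar values inherit all the caveats the paper attaches to the approximate prices.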

References

[1] D. H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991, pp. 54-55. Also published in Supercomputer, September 1991, pp. 4-7.
[2] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks", Intl. Journal of Supercomputer Applications, v. 5, no. 3 (Fall 1991), pp. 63-73.
[3] D. Bailey, J. Barton, T. Lasinski, and H. Simon, eds., "The NAS Parallel Benchmarks", Technical Report RNR-91-02, NASA Ames Research Center, Moffett Field, CA 94035, January 1991.
[4] G. Bhanot, K. Jordan, J. Kennedy, J. Richardson, D. Sandee and M. Zagha, "Implementing the NAS Parallel Benchmarks on the CM-2 and CM200 Supercomputers", Thinking Machines Corp., Cambridge, MA 02142.
[5] S. Breit, W. Celmaster, W. Coney, R. Foster, B. Gaiman, G. Montry and C. Selvidge, "The Role of Computational Balance in the Implementation of the NAS Parallel Benchmarks on the BBN TC2000 Computer", submitted to Concurrency, April 1991.
