NAS Parallel Benchmark Results
D. H. Bailey, E. Barszcz
NAS Applied Research Branch
NASA Ames Research Center
Moffett Field, CA 94035

L. Dagum, H. D. Simon
Computer Sciences Corp.
NASA Ames Research Center
Moffett Field, CA 94035

Abstract

The NAS Parallel Benchmarks have been developed at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a “pencil and paper” fashion. This paper presents performance results of various systems using the NAS Parallel Benchmarks. These results represent the best results that have been reported to us for the specific systems listed. They represent implementation efforts performed by personnel in both the NAS Applied Research Branch of NASA Ames and in other organizations.

1. Introduction

The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is dedicated to advancing the science of computational aerodynamics. One key goal of the NAS organization is to demonstrate by the year 2000 an operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours. It is currently projected that the solution of this grand challenge problem will require a computer system that can perform scientific computations at a sustained rate approximately one thousand times faster than 1990 generation supercomputers. Most likely such a computer system will employ hundreds or even thousands of processors operating in parallel.

At the present time, there are several commercial highly parallel systems available with computing power roughly competitive with conventional supercomputers (even greater on some special problems). Unfortunately, there is little reliable data on the performance of such systems on state-of-the-art computational aerophysics problems. In general, the science of performance evaluation has not kept pace with advances in parallel computer hardware and architecture. There is not even a generally accepted benchmark strategy for highly parallel supercomputers.

In our view, the best benchmarking approach for highly parallel supercomputers is the “paper and pencil” benchmark. The idea is to specify a set of problems only algorithmically. Even the input data must be specified only on paper. Naturally, the problem has to be specified in sufficient detail that a unique solution exists, and the required output has to be brief yet detailed enough to certify that the problem has been solved correctly. But the details of the implementation should be left to the programmer as far as possible.

To this end, we have devised the NAS Parallel Benchmarks (NPB). These are a set of eight benchmark problems, each of which focuses on some important aspect of highly parallel supercomputing for aerophysics applications. Some extension of Fortran or C is required for implementations, and reasonable limits are placed on the usage of assembly code and the like, but otherwise programmers are free to utilize language constructs that give the best performance possible on the particular system being studied. The choice of data structures, processor allocation and memory usage are generally left open to the discretion of the implementer.

The eight problems consist of five “kernels” and three “simulated computational fluid dynamics (CFD) applications”. Each of these is defined fully in [3]. The five kernels are relatively compact problems, each of which emphasizes a particular type of numerical computation. Compared with the simulated CFD applications, they can be implemented fairly readily and provide insight as to the general levels of performance that can be expected on these specific types of numerical computations.

The simulated CFD applications, on the other hand, usually require more effort to implement, but they are more indicative of the types of actual data movement and computation required in state-of-the-art CFD application codes. For example, in an isolated kernel a certain data structure may be very efficient on a certain system, and yet this data structure would be inappropriate if incorporated into a larger application. By comparison, the simulated CFD applications require data structures and implementation techniques that are more typical of real CFD applications.

Space does not permit a complete description of these benchmark problems. A more detailed description of these benchmarks, together with the rules and restrictions associated with the benchmarks, may be found in [2]. The full specification of the benchmarks is given in [3].

Sample Fortran programs implementing the NPB on a single processor system are available as an aid to implementers. These programs, as well as the benchmark document itself, are available from the following address: NAS Systems Division, Mail Stop 258-8, NASA Ames Research Center, Moffett Field, CA 94035, attn: NAS Parallel Benchmark Codes. The sample codes are provided on Macintosh floppy disks and contain the Fortran source codes, “README” files, input data files, and reference output data files for correct implementations of the benchmark problems. These codes have been validated on a number of computer systems ranging from conventional workstations to supercomputers.

In the following, each of the eight benchmarks will be briefly described, and then the best performance results we have received to date for each computer system will be given in Tables 2 through 9. These tables include memory requirements, run times and performance ratios. The performance ratios compare individual timings with the current best time on that benchmark achieved on one processor of a Cray Y-MP. The run times in each case are elapsed time of day figures, measured in accordance with the specifications given in [3]. Memory requirements are currently available for only some of these implementations. We hope to have complete information for these columns in future editions of this paper.

Note that performance rates are not cited in millions of floating point operations per second (megaflops) in these tables. We suggest instead that the actual run times (or, equivalently, the performance ratios) be examined when comparing different systems and implementations. For those who wish to compute megaflops figures for the NAS Parallel Benchmarks on any system, we insist that they be computed using the standard floating point operation (flop) counts given in Table 1. Table 1 also contains megaflops rates calculated in this manner for the current fastest implementation on one processor of the Cray Y-MP.

With the exception of the Integer Sort benchmark, these standard flop counts were determined by using the hardware performance monitor on a Cray Y-MP, and we believe that they are close to the minimal counts required for these problems. In the case of the Integer Sort benchmark, which does not involve floating-point operations, we selected a value approximately equal to the number of integer operations required, in order to permit the computation of performance rates analogous to megaflops rates. We reserve the right to change these standard flop counts in the future if deemed necessary.

Benchmark             Abbr.   Operation        Y-MP
Name                          Count            Rate
Emb. Parallel         EP      2.668 × 10^10    211
Multigrid             MG      3.905 × 10^9     176
Conjugate Gradient    CG      1.508 × 10^9     127
3-D FFT PDE           FT      5.631 × 10^9     196
Integer Sort          IS      7.812 × 10^8     68
LU Sim. CFD Appl.     LU      6.457 × 10^10    194
SP Sim. CFD Appl.     SP      1.020 × 10^11    216
BT Sim. CFD Appl.     BT      1.813 × 10^11    229

Table 1: Standard Operation Counts and Current Y-MP/1 Megaflops Rates

Whenever possible, we have tried to credit the actual individuals and organizations who have contributed the performance results cited in the tables. In these citations, NAS denotes the NAS Applied Research Branch at NASA Ames (including both NASA civil servants and Computer Sciences Corp. contractors); RIACS denotes the parallel systems division of the Research Institute for Advanced Computer Science, which is located at NASA Ames; BBN denotes Bolt, Beranek and Newman; Boeing or BCS denotes Boeing Computer Services, Inc.; CRI denotes Cray Research, Inc.; Intel denotes the Supercomputer Systems Division of Intel Corp.; Maspar denotes Maspar Computer Corp.; Meiko denotes Meiko Scientific Corp.; and TMC denotes Thinking Machines, Inc. Where no individual citation is made for a specific model, the results are due to vendor staff.

Unfortunately, the limited space in this report does not permit discussion of the methods used in any of these implementations. However, we have included references to technical papers describing these methods whenever such papers are available. Readers are referred to these documents for full details.

This report includes a number of new results not previously published. The Cray C-90, Cray Y-MP EL, the Maspar MP-1 and MP-2, and the Meiko CS-1 results in particular have not previously been disclosed. In quite a few other instances, results are improved from previous listings, reflecting improvements both in compilers and implementations. Efforts are currently underway to port the NAS Parallel Benchmarks to other systems, and we hope to have some results in the future.

2. The Embarrassingly Parallel Benchmark

The first of the five kernel benchmarks is an “embarrassingly parallel” problem. In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudorandom numbers, which are generated according to a particular scheme that is well-suited for parallel computation. This problem is typical of many “Monte-Carlo” applications. Since it requires almost no communication, in some sense this benchmark provides an estimate of the upper achievable limits for floating point performance on a particular system.

Results for the embarrassingly parallel benchmark are shown in Table 2. Not all systems exhibit high rates on this problem. This appears to stem from the fact that this benchmark requires references to several mathematical intrinsic functions, such as the Fortran routines AINT, SQRT, and LOG, and evidently these functions are not highly optimized on some systems. The memory requirement for this benchmark was minimal on all systems.

Intel iPSC/860 results are due to J. Baugh of Intel. CM-2 and CM-200 results are due to J. Richardson of TMC. Maspar results are due to J. MacDonald of Maspar.

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        4.9         126.2     1.00
            8        4.9         15.87     7.95
Y-MP EL     1        4.9         550.5     0.23
            4        4.9         141.2     0.89
C-90        1        4.9         47.60     2.65
            4        4.9         12.37     10.20
            16       4.9         3.19      39.56
TC2000      64       1           284.0     0.44
iPSC/860    32       1           102.7     1.23
            64       1           51.4      2.46
            128      1           25.7      4.91
CM-2        8K       1           126.6     1.00
            16K      1           63.9      1.97
            32K      1           33.7      3.74
            64K      1           18.8      6.71
CM-200      8K       1           76.9      1.64
            16K      1           39.2      3.22
            32K      1           20.7      6.10
            64K      1           10.9      11.58
CS-1        16                   116.8     1.08
MP-1        4K                   248       0.51
            16K                  88        1.43

Table 2: Results of the Embarrassingly Parallel (EP) Benchmark

3. The Multigrid Benchmark

The second kernel benchmark is a simplified multigrid kernel, which solves a 3-D Poisson PDE. This problem is simplified in the sense that it has constant rather than variable coefficients as in a more realistic application. This code is a good test of both short and long distance communication, although the communication patterns are highly structured (as opposed to the conjugate gradient benchmark).

Results for this benchmark, for problem size 256^3, are shown in Table 3. Intel results are due to BCS. CM-2 and CM-200 results are due to J. Richardson of TMC.

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        56.7        22.22     1.00
            8        56.7        2.96      7.51
Y-MP EL     1        56.7        89.19     0.25
            4        56.7        32.11     0.69
C-90        1        56.7        8.65      2.57
            4        56.7        2.42      9.18
            16       56.7        0.96      23.14
iPSC/860    128                  8.61      2.58
CM-2        16K                  45.8      0.49
            32K                  26.0      0.85
            64K                  14.1      1.58
CM-200      16K                  30.2      0.74
            32K                  17.2      1.29
CS-1        16                   42.8      0.52
MP-1        16K                  13.1      1.70

Table 3: Results of the Multigrid (MG) Benchmark

4. The Conjugate Gradient Benchmark

In this benchmark, a conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication and employs sparse matrix vector multiplication.

The irregular communication requirement of this benchmark is evidently a challenge for all systems. Results, for problem size 2.0 × 10^6, are shown in Table 4. Intel results are due to BCS. CM-2 results are due to J. Richardson of TMC.

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        10.4        11.92     1.00
            8        10.4        2.38      5.01
Y-MP EL     1        10.4        65.35     0.18
            4        10.4        23.91     0.50
C-90        1        10.4        4.56      2.61
            4        10.4        1.51      7.89
            16       10.4        0.58      20.55
TC2000      40                   51.4      0.23
iPSC/860    128                  8.61      1.38
CM-2        8K                   25.6      0.47
            16K                  14.1      0.85
            32K                  8.8       1.35
CM-200      8K                   15.0      0.79
CS-1        16                   67.5      0.18
MP-1        4K                   64.5      0.18
            16K                  14.6      0.82

Table 4: Results of the Conjugate Gradient (CG) Benchmark

5. The 3-D FFT PDE Benchmark

In this benchmark a 3-D partial differential equation is solved using FFTs. This kernel performs the essence of many “spectral” codes. It is a good test of long-distance communication performance.

The rules of the NAS Parallel Benchmarks specify that assembly-coded library routines may be used to perform matrix multiplication and one-dimensional, two-dimensional or three-dimensional FFTs. Thus this benchmark is somewhat unique in that computational library routines may be legally employed.

Results, for problem size 256^2 × 128, are shown in Table 5. Intel results are due to E. Kushner of Intel. CM-2 and CM-200 results are due to J. Richardson of TMC.

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        42.9        28.77     1.00
            8        42.9        4.19      6.87
Y-MP EL     1        42.9        122.6     0.23
            4        42.9        34.9      0.82
C-90        1        42.9        10.28     2.80
            4        42.9        2.58      11.2
            16       42.9        0.91      31.6
iPSC/860    64                   20.93     1.37
            128                  9.72      2.96
CM-2        16K                  37.0      0.78
            32K                  18.2      1.58
            64K                  11.4      2.52
CM-200      8K                   45.6      0.63
CS-1        16                   170.0     0.17
MP-1        16K                  19.6      1.47

Table 5: Results of the 3-D FFT PDE (FT) Benchmark

6. The Integer Sort Benchmark

This benchmark tests a sorting operation that is important in “particle method” codes. This type of application is similar to “particle in cell” applications of physics, wherein particles are assigned to cells and may drift out. The sorting operation is used to reassign particles to the appropriate cells. This benchmark tests both integer computation speed and communication performance.

This problem is unique in that floating point arithmetic is not involved. Significant data communication, however, is required. Results, for problem size 2^23, are shown in Table 6. Intel results are due to E. Kushner of Intel. CM-2 results are due to L. Dagum of NAS.
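
The reassignment of particles to cells described in Section 6 amounts to ranking a list of integer keys. A minimal single-processor sketch of one way to do this is a counting sort, assuming the keys are small non-negative cell indices. The function names `rank_keys` and `apply_ranks` are hypothetical, and this is neither the reference implementation nor the key-generation procedure specified in [3].

```python
def rank_keys(keys, max_key):
    """Return the sorted position of each key (stable counting sort).

    Assumes every key is an integer in the range 0..max_key.
    """
    # Count how many times each key value occurs.
    counts = [0] * (max_key + 1)
    for k in keys:
        counts[k] += 1

    # Exclusive prefix sum: starts[k] is the number of keys smaller than k.
    starts = [0] * (max_key + 1)
    total = 0
    for k in range(max_key + 1):
        starts[k] = total
        total += counts[k]

    # Hand out consecutive slots to equal keys in their original order.
    next_slot = list(starts)
    ranks = [0] * len(keys)
    for i, k in enumerate(keys):
        ranks[i] = next_slot[k]
        next_slot[k] += 1
    return ranks


def apply_ranks(keys, ranks):
    """Place each key at its rank, producing the sorted sequence."""
    out = [None] * len(keys)
    for k, r in zip(keys, ranks):
        out[r] = k
    return out


if __name__ == "__main__":
    cells = [3, 0, 2, 3, 1, 0]          # hypothetical cell index per particle
    ranks = rank_keys(cells, max_key=3)
    print(ranks)
    print(apply_ranks(cells, ranks))
```

On a parallel system the counting and prefix-sum steps are where the data communication mentioned above would arise, since every processor needs the global key counts before it can place its own keys.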

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        31.1        11.46     1.00
            8        31.1        1.85      6.19
Y-MP EL     1        31.1        153.9     0.07
            4        31.1        41.5      0.28
C-90        1        31.1        5.20      2.20
            4        31.1        1.42      8.07
            16       31.1        0.57      20.1
iPSC/860    32                   25.72     0.45
            64                   17.26     0.66
            128                  13.59     0.84
CM-2        8K                   215.1     0.05
            16K                  111.5     0.10
            32K                  56.0      0.20
MP-1        16K                  75        0.15
CS-1        16                   62.7      0.18

Table 6: Results of the Integer Sort (IS) Benchmark

7. The Three Simulated CFD Application Benchmarks

The three simulated CFD application benchmarks are intended to accurately represent the principal computational and data movement requirements of modern CFD applications.

The first of these is called the lower-upper diagonal (LU) benchmark. It does not perform an LU factorization but instead employs a symmetric successive over-relaxation (SSOR) numerical scheme to solve a regular-sparse, block (5 × 5) lower and upper triangular system. This problem represents the computations associated with a newer class of implicit CFD algorithms, typified at NASA Ames by the code “INS3D-LU”. This problem exhibits a somewhat limited amount of parallelism compared to the next two.

The second simulated CFD application is called the scalar pentadiagonal (SP) benchmark. In this benchmark, multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved.

The third simulated CFD application is called the block tridiagonal (BT) benchmark. In this benchmark, multiple independent systems of non-diagonally dominant, block tridiagonal equations with a 5 × 5 block size are solved.

SP and BT are representative of computations associated with the implicit operators of CFD codes such as “ARC3D” at NASA Ames. SP and BT are similar in many respects, but there is a fundamental difference with respect to the communication to computation ratio.

Performance figures for the three simulated CFD applications, for problem size 64^3, are shown in Tables 7, 8 and 9. Timings are cited as complete run times, in seconds, as with the other benchmarks. A complete solution of the LU benchmark requires 250 iterations. For the SP benchmark, 400 iterations are required. For the BT benchmark, 200 iterations are required. Intel and CM-2 results are due to S. Weeratunga, R. Fatoohi, E. Barszcz and V. Venkatakrishnan of NAS, except that BT and SP results on the Intel are due to BCS.

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        32.3        333.5     1.00
            8        32.3        49.50     6.74
Y-MP EL     1        32.3        1449      0.23
            4        32.3        522.3     0.64
C-90        1        32.3        157.6     2.12
            4        32.3        43.94     7.59
            16       32.3        17.62     18.93
TC2000      62                   3032      0.11
iPSC/860    64       12          690.8     0.48
            128      16          442.5     0.75
CM-2        8K       14          1307      0.26
            16K      14          850.0     0.39
            32K      14          572.0     0.58
CS-1        16                   2937      0.11
MP-1        4K                   1958      0.17
MP-2        4K                   658       0.51

Table 7: Results for the LU Simulated CFD Application

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        9.2         471.5     1.00
            8        9.2         64.60     7.30
Y-MP EL     1        9.2         2026      0.23
            4        9.2         601.9     0.78
C-90        1        9.2         184.7     2.55
            4        9.2         49.74     9.48
            16       9.2         13.06     36.10
TC2000      112                  880.0     0.54
iPSC/860    64                   667.3     0.71
            128                  449.5     1.05
CM-2        8K                   3900      0.12
            16K                  2104      0.22
            32K                  1080      0.44
CS-1        16                   2975      0.16
MP-1        4K                   1772      0.27
MP-2        4K                   668       0.71

Table 8: Results for the SP Simulated CFD Application

Computer    No.      Memory      Time      Ratio to
System      Proc.    (mwords)    (sec.)    Y-MP/1
Y-MP        1        42.3        792.4     1.00
            8        42.3        114.0     6.95
Y-MP EL     1        42.3        4033      0.20
            4        42.3        1208      0.66
C-90        1        42.3        356.9     2.22
            4        42.3        96.10     8.25
            16       42.3        28.39     27.91
TC2000      112                  1378      0.58
iPSC/860    64                   714.7     1.11
            128                  414.3     1.91
CM-2        16K                  3328      0.24
            32K                  1914      0.41
CS-1        16                   2984      0.27
MP-1        4K                   2420      0.33
MP-2        4K                   870       0.91

Table 9: Results for the BT Simulated CFD Application

8. Other Results

As far as we have been able to determine, the timings presented above all represent runs that fully comply with the rules and restrictions stated in the benchmark document [3]. One of these rules is that, except for a short list of mathematical functions, assembly language and assembly-coded library routines may not be used for computation. The exceptions include the standard Fortran intrinsic functions, as well as routines to perform dense matrix multiplication and fast Fourier transforms.

There are several reasons for these restrictions on assembly code. First of all, without restrictions of some sort, an entire benchmark might be implemented in assembly-level code. While such performance results might be interesting, they would hardly be indicative of the performance that a scientist could reasonably expect on a full-scale application program.

One reason that only the above-mentioned routines are allowed is that in our experience only these are generally available on new systems. For more specialized library routines, it is difficult to determine whether they are truly general purpose, i.e. not relying on a specific data layout. Furthermore, even if an assembly-coded library routine can be utilized for an inner computational kernel, this does not help the large mass of additional coding that comprises a full-scale application. In short, the tuning rules for the NPB reflect our expectation (and experience) that real scientific applications consist largely of Fortran or C code, and that usage of library routines is restricted to a handful of widely available mathematical functions.

Nonetheless, some scientists have attempted implementations of the NPB using library routines beyond the ones allowed in [3]. In particular, Thinking Machines, Inc. has obtained performance results using assembly-coded library routines for several of the NPB. Their implementation of the IS benchmark, for example, runs more than twice as fast as reported in Table 6, and their rates for the BT benchmark are nearly three times as fast as reported in Table 9. Some of these results are shown in Table 10 [4].

            Computer    No.      Time      Ratio to
Benchmark   System      Proc.    (sec.)    Y-MP/1
IS          CM-2        16K      35.8      0.32
                        32K      21.0      0.55
                        64K      14.9      0.77
            CM-200      64K      5.7       2.01
LU          CM-2        16K      868.0     0.38
                        32K      546.0     0.61
SP          CM-2        16K      1444      0.33
                        32K      917.0     0.51
                        64K      640.0     0.74
BT          CM-2        16K      1118      0.71
                        32K      634.0     1.25
                        64K      370.0     2.14
            CM-200      16K      832.0     0.95
                        32K      601.0     1.32

Table 10: Unofficial TMC Results Using Library Routines

9. Sustained Performance Per Dollar

One aspect of the relative performance of these systems has not been addressed so far, namely the differences in price between these systems. We should not be too surprised that the Cray C-90 system, for example, exhibits superior performance rates on these benchmarks, since its current purchase price is much higher than that of the iPSC/860 and the CM-2.

One way to compensate for these price differences is to compute sustained performance per million dollars, i.e. the performance ratio figures shown in Tables 2 through 9 divided by the purchase price in millions. Some figures of this type are shown in Table 11 for two of the benchmarks, the FT benchmark and the LU benchmark, and for seven different systems. They are based on 36 million, 25 million, 3 million, 5 million, 1 million, 500,000 and 300,000 U.S. dollars, respectively, for the Cray C-90, the Cray Y-MP, the Intel iPSC/860, the CM-2, the Maspar MP-1, the Maspar MP-2 and the Meiko CS-1. These are approximate current prices, obtained from vendor personnel, for complete systems with 16, 8, 128, 32K, 16K, 4K and 16 processors, respectively, with one, two, one, four, one, 0.25 and 0.5 gigabytes of main memory, respectively, and with a typical set of peripherals. Because of the approximate and changeable nature of these prices, and because the memory sizes, disk capacities and I/O performances of these systems are certainly not equivalent, the figures in the last column of Table 11 should be interpreted as only very rough indications of sustained performance per dollar.

            Computer    No.      Ratio to    Perf. per
B’mark      System      Proc.    Y-MP/1      million $
FT          C-90        16       31.60       0.87
            Y-MP        8        6.87        0.27
            iPSC/860    128      2.96        0.99
            CM-2        32K      2.52        0.50
            MP-1        16K      1.47        1.47
            CS-1        16       0.17        0.57
LU          C-90        16       18.93       0.53
            Y-MP        8        6.74        0.27
            iPSC/860    128      0.75        0.25
            CM-2        32K      0.58        0.12
            MP-2        4K       0.51        1.02
            CS-1        16       0.11        0.37

Table 11: Approximate Sustained Performance Per Dollar

10. Conclusions

With some algorithmic experimentation and tuning, respectable NPB performance rates have been achieved on several multiprocessor systems. The 16 processor Cray C-90 system is consistently the highest performing system tested, far surpassing any of the highly parallel systems. The Intel 128 processor iPSC/860 system and the 32K CM-2 system each show promise, but they do not yet demonstrate sustained performance comparable to full Cray systems. Instead, in both cases their rates appear to be equivalent to about one or, in some cases, two Y-MP processors. When sustained performance rates are normalized by system prices, the situation is somewhat different: the highly parallel systems are approximately on a par with the Cray systems.

The Cray NPB performance results uniformly are large fractions (in some cases over fifty percent) of the theoretical peak performance of these systems. By contrast, the NPB performance rates on the highly parallel systems are typically only two to five percent of the theoretical peak performance of these systems. Reasons for the low sustained-to-peak ratios on the highly parallel systems are not hard to identify: immature compilers, insufficient bandwidth between processors and main memory, and insufficient bandwidth between separate processing nodes. Clearly the challenge for the highly parallel vendors is to alleviate these bottlenecks in future editions of their systems.

Some scientists have suggested that the answer to obtaining high performance rates on highly parallel computers is to substitute alternative algorithms that have lower interprocessor communication requirements. However, it has been the experience of the scientists in our research group that a certain amount of long-distance communication is unavoidable for these types of applications. Alternative algorithms that have higher computation rates usually require more iterations to converge to a solution and thus require more overall run time. Clearly it is pointless to employ numerically inefficient algorithms merely to exhibit artificially high performance rates on a particular parallel architecture [1].
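
The derived figures used in this report can be recomputed from the tabulated values. The sketch below is an illustration, not part of the benchmark rules. It computes a megaflops rate from the standard operation counts of Table 1, a performance ratio relative to the Y-MP/1 run time, and the performance per million dollars of Table 11. The function and variable names are hypothetical. The constants are copied from Table 1 and from the one-processor Y-MP rows of Tables 2 through 9, and the prices passed in are the approximate ones quoted in Section 9.

```python
# Standard operation counts (Table 1).
STANDARD_OPS = {
    "EP": 2.668e10, "MG": 3.905e9, "CG": 1.508e9, "FT": 5.631e9,
    "IS": 7.812e8, "LU": 6.457e10, "SP": 1.020e11, "BT": 1.813e11,
}

# Run times in seconds on one processor of the Cray Y-MP (Tables 2 through 9).
YMP1_SECONDS = {
    "EP": 126.2, "MG": 22.22, "CG": 11.92, "FT": 28.77,
    "IS": 11.46, "LU": 333.5, "SP": 471.5, "BT": 792.4,
}


def megaflops(benchmark, seconds):
    """Megaflops rate using the standard operation count for the benchmark."""
    return STANDARD_OPS[benchmark] / seconds / 1.0e6


def ratio_to_ymp1(benchmark, seconds):
    """Performance ratio: Y-MP/1 run time divided by this system's run time."""
    return YMP1_SECONDS[benchmark] / seconds


def perf_per_million_dollars(ratio, price_in_millions):
    """Performance ratio normalized by purchase price in millions of dollars."""
    return ratio / price_in_millions


if __name__ == "__main__":
    # Y-MP/1 megaflops rates, to compare against the last column of Table 1.
    for name, seconds in YMP1_SECONDS.items():
        print(name, round(megaflops(name, seconds)))

    # EP on a 16 processor C-90 (run time taken from Table 2).
    print(round(ratio_to_ymp1("EP", 3.19), 2))

    # FT on a 128 processor iPSC/860, priced at about 3 million dollars.
    print(round(perf_per_million_dollars(2.96, 3.0), 2))
```

Because the prices are approximate, the normalized figures carry no more precision than the two digits shown in Table 11.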
References
[1] D. H. Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, August 1991, pp. 54-55. Also published in Supercomputer, September 1991, pp. 4-7.
[2] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, “The NAS Parallel Benchmarks”, Intl. Journal of Supercomputer Applications, v. 5, no. 3 (Fall 1991), pp. 63-73.
[3] D. Bailey, J. Barton, T. Lasinski, and H. Simon, eds., “The NAS Parallel Benchmarks”, Technical Report RNR-91-02, NASA Ames Research Center, Moffett Field, CA 94035, January 1991.
[4] G. Bhanot, K. Jordan, J. Kennedy, J. Richardson, D. Sandee and M. Zagha, “Implementing the NAS Parallel Benchmarks on the CM-2 and CM-200 Supercomputers”, Thinking Machines Corp., Cambridge, MA 02142.
[5] S. Breit, W. Celmaster, W. Coney, R. Foster, B. Gaiman, G. Montry and C. Selvidge, “The Role of Computational Balance in the Implementation of the NAS parallel Benchmarks on the BBN TC2000 Computer”, submitted to Concurrency, April 1991.