14. Benchmarking
How fast does the N-body treecode run? To what degree does optimization/vectorizing help? When do programs become I/O dominated? Some of the numbers quoted below should be taken with great care, since a lot of other factors can go into the timing result.
A number of programs in NEMO have a command line parameter such as nmodel=N, nbench=N or iter=N (N is normally set to 1), which, together with help=c or prefixing the command with /usr/bin/time, will give an accurate measurement of how long the code takes to execute N loops of a particular algorithm. For some programs the respective man pages discuss a particular benchmark. At the top level we have make bench5 and make bench, the latter dynamically controlled by the scattered Benchfiles.
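The /usr/bin/time approach above can also be scripted. The following is a minimal sketch, assuming a Unix system (the `resource` module is not available elsewhere). The helper name `cpu_seconds_per_loop` is hypothetical and not part of NEMO. It sums user and system CPU time of a child process and divides by the number of loops requested.

```python
import resource
import subprocess
import sys

def cpu_seconds_per_loop(argv, nloops):
    """Run argv once and return (user + system CPU seconds) / nloops.

    Hypothetical helper, not part of NEMO; Unix only.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / nloops

# Stand-in command so the sketch runs without NEMO installed.
# With NEMO one would pass the program and its keywords instead,
# with nloops set to the N given on the command line.
per_loop = cpu_seconds_per_loop([sys.executable, "-c", "pass"], nloops=1)
print(f"{per_loop:.6f} cpu-sec/loop")
```

CPU time rather than elapsed time is used here because, as noted above, other factors on the machine can distort wall-clock timings.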
14.1. N-body integration
The standard NEMO benchmark of the treecode integration is to run hackcode1 without any parameters. It will generate a spherical stellar system in virial equilibrium with 128 particles, and integrate it for 64 timesteps (tol=1 eps=0.05). In the table below the amount of CPU time (in seconds) needed for one timestep is listed in column 2. Unless otherwise mentioned, the code used is the standard NEMO hackcode1 with default compilation on the machine quoted. Note that one can often obtain a significant performance increase by studying the native compiler and in particular its optimization options.
Modern machines are too fast to measure the 1986 example (where a single step would take around 5 sec), so we integrate longer and normalize to a single step. For example:
/usr/bin/time hackcode1 out=. freq=100 tstop=1000 > /dev/null
5.88user 0.04system 0:05.93elapsed 99%CPU
This would compute to an entry in the table below of 0.000059 sec (5.88 sec for 100,000 steps), or around 100,000 times faster than the 1986 computers.
The development machine (a Sun 3/60) ran at 20 MHz, and current (2022) clock speeds are around 5 GHz, so about 250x of this comes from clock speed. The code runs another 400x faster on top of that. Part of that comes from the improved instruction cycle, and part (probably the smaller part) is no doubt due to improved compiler technology.
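The normalization above can be checked with a few lines of arithmetic. This is a sketch under one assumption that the text does not state explicitly: that the run takes freq × tstop integration steps. All numbers come from the example and the table.

```python
# Figures quoted in the text.
user_cpu_sec = 5.88        # "user" time reported by /usr/bin/time
freq = 100                 # freq= value on the command line
tstop = 1000               # tstop= value on the command line

# Assumption: the run takes freq * tstop integration steps.
nsteps = freq * tstop
sec_per_step = user_cpu_sec / nsteps

sun360_sec_per_step = 5.4  # Sun-3/60 entry in the table
total_speedup = sun360_sec_per_step / sec_per_step

clock_ratio = 5e9 / 20e6   # 5 GHz versus 20 MHz
beyond_clock = total_speedup / clock_ratio

print(f"{sec_per_step:.6f} sec/step")
print(f"total ~{total_speedup:.0f}x, clock ~{clock_ratio:.0f}x, "
      f"remainder ~{beyond_clock:.0f}x")
```

With these inputs the total speedup comes out near 92,000x and the part not explained by clock speed near 370x, in line with the rounded 100,000x and 400x quoted in the text.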
Machine | cpu-sec/step | code | comments |
---|---|---|---|
i9-12900K @ 5.2 GHz | 0.000059 | hackcode1 | 2022 desktop |
i5-1135G6 @ 4.2 GHz | 0.000089 | hackcode1 | 2020 laptop |
i7-8550U @ 4 GHz | 0.000178 | hackcode1 | 2018 laptop |
core 2 duo @ 2.0 GHz | 0.0012 | hackcode1 | 2007 laptop |
Sun Ultra-140 | 0.012 | hackcode1 | -xO4 -xcg92 -dalign -xlibmil |
G3 PowerPC 250 MHz | 0.026 | hackcode1 | -O |
486DX4-100 | 0.068 | hackcode1 | (~1995 linux) |
Sun-4/60 Sparcstation 1 | 0.420 | | |
Sun-3/60 | 5.400 | | -fswitch (orig development) |
3b1 (10 MHz 68010) | 49.000 | | |
386SX (16 MHz) | 87.000 | | (linux) software floating point |
The entries below are from the old LaTeX table; it is still to be determined which ones will make it into the new table.
Machine | cpu-sec/step | code | comments |
---|---|---|---|
i7-3630QM @ 3.4 GHz | 0.000177 | hackcode1 | 2014 laptop |
i7-870 @ 2.93 GHz | 0.00030 | hackcode1 | 2010 desktop |
Dec-alpha | 0.0042 | hackcode1 | -O4 -fast |
Dec-alpha | 0.0048 | hackcode1 | default |
CRAY X/YMP48 | 0.0060 | TREECODE V3 | estimate (1989) |
Onyx-2 | 0.0088 | hackcode1 | default (1996) |
ETA-10 | 0.010 | TREECODE V2 | estimate (1987) |
Sun 20/62 | 0.013 | hackcode1 | default (1994) |
Cyber 205 | 0.018 | TREECODE V2 | estimate (1986) |
Sun 20/61 | 0.020 | hackcode1 | |
HP/UX 700 | 0.020 | hackcode1 | |
Sun Ultra-140 | 0.024 | hackcode1 | default |
Sun 20/?? | 0.024 | hackcode1 | -xO4 -xcg92 -dalign -xlibmil |
Sun 10/51 | 0.029 | hackcode1 | -O -fast -fsingle |
Cray-2 | 0.029 | TREECODE2 | REAL - Pitt, oct 91 |
SGI ??? | 0.030 | hackcode1 | John Wangs machine (commented out in the old table) |
DEC DS3000/400 alpha | 0.036 | hackcode1 | default compilation |
Pentium-100 | 0.038 | hackcode1 | default |
SGI Indigo | 0.045 | hackcode1 | default compilation |
CRAY YMP | 0.059 | hackcode1 | default compilation |
bootes: Sparc-10 | 0.063 | hackcode1 | using acc -cg92 (commented out in the old table) |
486DX2-66 (linux) | 0.093 | hackcode1 | -DSINGLEPREC |
Sparc-2 | 0.099 | gravsim V1 | |
IBM R/6000 | 0.109 | hackcode1 | default cc compiler |
Dec 5000/200 | 0.116 | hackcode1 | |
Sparc-2 | 0.130 | hackcode1 | -DSINGLEPREC -fsingle |
Sparc-2 | 0.180 | hackcode1 | |
Multiflow 14/300 | 0.190 | hackcode1 | |
Convex C220 | 0.290 | | |
NeXT | 0.240 | | [ganymede 68040, nov 91] |
Sparcstation1+ | 0.340 | | |
Alliant FX?? | 0.430 | gravsim V1 | |
Alliant FX4/w 3 proc’s | 0.590 | | |
VAX workstation 3500 | 0.970 | | |
Sun-4/60 Sparcstation 1 | 1.040 | treecode2 | cf. C-code @ 0.420 |
Sun-3/110 | 1.660 | hackcode1 | fpa.il |
14.1.1. Nbody0
The program is Aarseth’s simplest N-body code (contained in Binney and Tremaine, 1987; no regularization or nearest neighbors). The input is a Hubble expanding Cartesian lattice with 925 points, GMtot=1, expansion factor = 6 (omega = 1.2). The long version was followed for 60 time units, the short version for 5. Results are summarized in the table below. The first table was compiled by D. Richstone.
It seems the input data have been lost.
Machine | time1 (sec) | time2 (sec) | speed |
---|---|---|---|
Sun-4/110 (Pele - 8Mb) | 21,753 | | 0.41 |
Vaxstation 3100 (Miffy - M48, 24Meg) | | 1302 | 0.65 |
Sparc 1 | | 1023 | 0.83 |
Sparc IPC (Courage - 16 Mb) | 9,015 | 850 | 1.000 |
Sparc 2 | 4,483 | | 2.01 |
Sparc 2’ | | 417 | 2.04 |
Dec 5000/200 | | 318 | 2.67 |
Stardent (ism) | | 211 | 4.03 |
IBM Risc (Juno) | 2,117 | 198 | 4.27 |
IBM Risc (wibm01) | 2,115 | | 4.26 |
Convex | | 172 | 4.94 |
HP/UX 700 | | 26.2 | |
Cray YMP | | 19.1 | 44.5 |
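The speed column appears to be the ratio of the Sparc IPC timing to the timing of the machine in question, using whichever of the two runs is available. This is an inference from the numbers, not something stated by the original compiler of the table. A small sketch that reproduces several of the listed values (the function name `relative_speed` is illustrative only):

```python
# Reference timings for the Sparc IPC (speed = 1.000 in the table).
REF_TIME1 = 9015.0   # seconds, time1 column
REF_TIME2 = 850.0    # seconds, time2 column

def relative_speed(ref_time, machine_time):
    """Speed relative to the reference machine: larger means faster."""
    return ref_time / machine_time

# (machine, reference timing used, timing from the table)
rows = [
    ("Sun-4/110", REF_TIME1, 21753.0),
    ("Vaxstation 3100", REF_TIME2, 1302.0),
    ("Sparc 2", REF_TIME1, 4483.0),
    ("Dec 5000/200", REF_TIME2, 318.0),
    ("Cray YMP", REF_TIME2, 19.1),
]
speeds = {name: relative_speed(ref, t) for name, ref, t in rows}
for name, speed in speeds.items():
    print(f"{name:16s} {speed:6.2f}")
```

For IBM Risc (Juno), which has both timings, the two ratios are about 4.26 and 4.29, and the listed 4.27 lies between them.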
14.2. Orbit integration
The benchmark takes 100,000 leapfrog steps. For 2D optimized potentials the timing on a Sparcstation 1 is about 12 sec for log or plummer, and 23 sec for teusan85 in the core region (orbit remaining within the body of the bar). See also “make bench5”, where one of the benchmarks computes an orbit. There we take about 80M steps in 5 seconds, which corresponds to about 200M steps in the 12 seconds a Sparcstation 1 needs for 100,000 steps. This is about 2000x faster than the Sparcstation 1, or about 20,000x faster than a Sun 3/60.
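To make the notion of a leapfrog step concrete, here is a minimal pure-Python sketch of a kick-drift-kick leapfrog in a 2D Plummer potential, timed over 100,000 steps. It is not the NEMO orbit integrator. The function names, unit choices (GM = a = 1), time step and initial conditions are all assumptions made for illustration, and interpreted Python will be far slower than the compiled code quoted above.

```python
import math
import time

def accel(x, y, gm=1.0, a=1.0):
    """Acceleration in a Plummer potential, Phi = -gm / sqrt(r^2 + a^2)."""
    f = -gm / (x * x + y * y + a * a) ** 1.5
    return f * x, f * y

def energy(x, y, vx, vy, gm=1.0, a=1.0):
    """Specific orbital energy, used here only as a sanity check."""
    return 0.5 * (vx * vx + vy * vy) - gm / math.sqrt(x * x + y * y + a * a)

def leapfrog(x, y, vx, vy, dt, nsteps):
    """Advance a 2D orbit by nsteps kick-drift-kick leapfrog steps."""
    ax, ay = accel(x, y)
    for _ in range(nsteps):
        vx += 0.5 * dt * ax   # half kick
        vy += 0.5 * dt * ay
        x += dt * vx          # full drift
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax   # half kick
        vy += 0.5 * dt * ay
    return x, y, vx, vy

nsteps = 100_000
start = (1.0, 0.0, 0.0, 0.5)   # arbitrary initial conditions
t0 = time.perf_counter()
end = leapfrog(*start, dt=0.01, nsteps=nsteps)
elapsed = time.perf_counter() - t0

print(f"{nsteps / elapsed:.3g} steps/sec")
print(f"energy drift: {energy(*end) - energy(*start):.2e}")
```

The energy drift is printed because a leapfrog integrator should keep it small and bounded; a large value would indicate a bug or a time step that is too coarse.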