BENCH(5NEMO) manual page

HTML automatically generated with rman
Table of Contents

Synopsis

/usr/bin/time $NEMO/src/scripts/nemo.bench
make -f Benchfile clean all
nemobench COMMAND

Description

There are several methods to compare the performance of NEMO on different processors or environments:

1.

The command make bench from the $NEMO directory runs a standard benchmark of about a dozen selected NEMO programs. It should run in about 30 seconds on a ~2020 computer. A short table (see below in RESULTS) lists results from some recent computers.

2.

Various Benchfile filles scattered through the $NEMO source tree contain benchmarks. There is no wrapper like "make check" has for Testfile’s

3.

Selected man pages have some examples. Over time these should be moved into the Benchfile in the source directory where this program is maintained. Some are mentioned below in this man page.

Bench

Here are the results of the "make bench" benchmark. The time is the user CPU time. If two values are listed after the machine name, these are the GeekBench5 values.

Intel Core i5-1135G7    15.4    XPS13 1460 /
AMD EPYC 7302 @ 3.0GHz    23.9    lma 1000/ 28000
Intel Core i9-9900X    25.8    geminid 1214 / 9467
i5-10210U CPU @ 1.60GHz    29.4     X1Y4  1029 / 3194
Xeon(R) Silver 4114 @ 2.20GHz    37.1    terra 777 / 10213
i7-8550U CPU @ 1.80GHz    40.6     T480 (1600) 696 / 1935
Xeon E-2186G 3.80GHz    52.6     aurora 1313 / 6721
i7-3820 CPU @ 3.60GHz    ??/67.5     dante 833 / 3693
Xeon(R) CPU E3-1280 @ 3.50GHz    ??/70.3     chara 828 / 3117
i7-3630QM CPU @ 2.40GHz    76.2     T530 (1200) 762 / 3006 
Xeon(R) E5-2623 v4 @ 2.60GHz    42.1/79.4     kraken 782 / 5800
Xeon(R) X5550  @ 2.67GHz    115.4    sdp 566 / 3938

Keep in mind for most of these a default compilation was used. Some benchmarks are known to be able to improved by up to a factor of two with selected compiler and options changes.

The following tasks are run in the standard NEMO bench, for details see the src/scripts/nemo.bench script.

directcode nbody=3072 out=d0 seed=123 
hackcode1 nbody=10240  out=h0 seed=123 
mkplummer p0 10240 seed=123 
gyrfalcON p0 p1 kmax=6 tstop=2 eps=0.05
potcode p0 p2 freqout=10 freq=1000 tstop=2 potname=plummer
mkspiral s0 $nbody3 nmodel=40 seed=123 
ccdmath "" c0 ’ranu(0,1)’ size=256 seed=123
ccdpot c0 c1 
mkorbit o0 x=1 e=1 lz=1 potname=log
orbint o0 o1 nsteps=10000000 dt=0.001 nsave=100000

In addition each data file that is produced is checksummed and compared to a baseline version using bsf(1NEMO) if the argument bsf=1 is added.

The NEMOBENCH5 benchmark (cd $NEMO;make bench5) runs a number (currently 6) N-body codes all calibrated to take 5.0 seconds on an i5-1135G7 in turbo mode (4.2GHz). The NEMOBENCH5 rating is defined as 5000 divided by the average of those CPU times. Thus the i5-1135G7 has a score of 1000. This definition can be adjusted to 10s runs when the time comes that processors run too fast, there is a parameter internal to the nemo.bench script.

i9-12900H        -    n/a 1851
M1-Max            -    n/a 1785
i7-12800H        -    n/a 1654
i9-11980HK        -    n/a 1616
M1 (Macbook Air)        1269    n/a 1750 / 7xxxx
i7-11800H        -    n/a 1474
Intel Core i5-1135G7    1000    XPS13 1460 /
AMD threadripper 3960x    896    AlexBox
AMD EPYC 7302 @ 3.0GHz    630    lma 1000/ 28000
Intel Core i9-9900X    -    geminid 1214 / 9467
Xeon Gold 5218 @ 2.30GHz    761    unity/node99 1100 / 8610
i5-10210U CPU @ 1.60GHz    732     X1Y4  1029 / 3194
Xeon(R) Silver 4114 @ 2.20GHz    -    terra 777 / 10213
i7-8550U CPU @ 1.80GHz    548     T480 (1600) 696 / 1935
Xeon E-2186G 3.80GHz    -     aurora 1313 / 6721
i7-3820QM @ 2.70 GHz    498    MacBookPro (Retina, Mid 2012)
i7-3820 CPU @ 3.60GHz    414     dante 833 / 3693
Xeon(R) CPU E3-1280 @ 3.50GHz    -     chara 828 / 3117
i7-3630QM CPU @ 2.40GHz    429     T530 (1200) 762 / 3006 
Xeon E5-2623 v4 @ 2.60GHz    -    kraken 782 / 5800
Xeon X5550  @ 2.67GHz    256    sdp 566 / 3938

Bench10

Not really implemented, but this will be benchmarks orchestrated via the Benchfile’s found in the source tree.

Oldest Bench

At the inception of NEMO in 1986 there was no real benchmark, so for a while (as computers were relatively slow still) we used the default hackcode1(1NEMO) setting, where 128 particles in virial equilibrium are integrated for 64 timesteps:

      /usr/bin/time hackcode1 tstop=2  > /dev/null

On a Sun 3/50 (our development machine) this took about 5 seconds per step. Now, nearly 35 years later, my laptop runs this about 50,000 times faster. Looking in more detail at the original NEMO manual:

                       cpu/steps
sun 3/60:  20 MHz    2.28        
i5-1135G7: 4200 MHz    0.0000875

Despite that the cpu was 210 times faster, the code ran 26,000 faster. A very impressive factor of 120 improvement in chip and possibly some compiler technology. The NEMO users guide has an appendix in which this benchmark is listed for a variety of computers since 1986.

Caveats

Defining and running a benchmark can be very tricky stuff. It might be important to separate disk I/O from CPU usage. The unix time(1) command can be a help. The output from bash::time is a bit different form csh::time, and yet different from /usr/bin/time. Unless you find a special one, we prefer the csh::time, since the output clearly separates user, system and wall clock time, and also reports the I/O, viz.

   % time ls 
   0.012u 0.068s 0:00.77 9.0%    0+0k 8376+0io 0pf+0w
   2.324u 1.080s 0:09.25 36.7%    0+0k 1049384+2097160io 2pf+0w
   1.876u 0.788s 0:03.63 73.0%    0+0k 0+2097160io 0pf+0w

On linux the command

   echo 1 > > /proc/sys/vm/drop_caches

will clear the disk cache in memory, so your program will be forced to read from disk, with all possible interference from other programs

In NEMO another useful addition to the benchmark is that the output can be turned off easily, by using out=., viz.

    % sudo $NEMO/src/scripts/clearcache
    % time ccdsmooth n1 . dir=x
    0.852u 1.068s 0:12.41 15.3%    0+0k 2098312+0io 6pf+0w
    0.812u 0.400s 0:01.21 100.0%    0+0k 0+0io 0pf+0w
    0.820u 0.380s 0:01.20 100.0%    0+0k 0+0io 0pf+0w

where the last two instances were just re-running the same command, but now clearly showing the effect of reading the file from memory instead of disk. By repeating this whole series a few times, an lower bound to the wall clock time is more likely to properly account for the I/O overhead time.

Rule of thumb: always run a benchmark a few times to see if a hot CPU slows down the benchmark. If I/O is cached. Other tasks are interfering.

"others"

A few other man pages in NEMO also maintain their own list how its program compares under different compilers/options/cpu options:

CGS(1NEMO)
scfm(1NEMO)

Other industry benchmarks:

    Geekbench 5 (very wide variety of compute workloads - baseline is i3-8100)
    Linpack   (focus on floating point operations - Gflops)
    SPEC CPU 2017 ($$$) benchmark -

Tabbench

The table I/O benchmark uses a 100M row dataset with 3 columns, representing X,Y,Z of which the radius R=sqrt(X^2+Y^2+Z^2) is computed. This table is about 2.7 GB in size. Of course reading the table is all dependent on the HDD/SDD, but in the case described here this was a fast SSD, and took 2 sec to read, or just over 1000 MB/sec.

    /usr/bin/time tabgen tab3 100000000 3
    /usr/bin/time tabbench2 . mode=-1
    

this bench will need to be repeated for mode=0,1,2,3 to estimate the different
components as they
are added to the workflow. The tabgen(1NEMO) is dominated by
drawing random numbers and writing them using printf(3) , which is slow.

    80s   writing, using tabgen
     2s   reading in tabbench2
    22s   parsing in numbers  [np.loadtxt takes 748 sec!!!]
     6s   using fie(3NEMO) to compute radii
     1s   using np.sqrt(), and presumably C’s sqrt() as well

Parallel

The GNU parallel(1) tool can be of great use if your tasks are pure single core and you have enough cores (most laptops have at least 4 these days) and memory to fit your tasks. As an example, here is something contrived using mkplummer(1NEMO) that does not write to disk, so it should be highly parallizable:

    nbody=10000000
    /usr/bin/time mkplummer . $nbody
    2.80user 0.45system 0:03.26elapsed 99%CPU
    
    echo mkplummer . $nbody  > run.txt
    echo mkplummer . $nbody >> run.txt
    /usr/bin/time parallel --jobs 1 < run.txt
    5.89user 0.83system 0:06.71elapsed 100%CPU
    
    /usr/bin/time parallel --jobs 2 < run.txt
    6.00user 0.79system 0:03.44elapsed 197%CPU

which follows Amdahl’s law close to 100%!

$NEMO/src/scripts/nemo.bench    Script uses by make bench/bench5/bench10
$NEMO/data       standard repository area for (small) data files.
Benchfile    A Makefile that can orchestrate series of benchmarks
/tmp/nemobench.log    The nemobench keeps logfile

Update History

12-may-97    created      PJT
26-nov-03    finally added some data        PJT
17-feb-04    added bench0 comparison      PJT
31-mar-05    added some cygwin numbers, fixed input    PJT
6-may-11    added i7 and SHMEM/HDD comparison    PJT
27-sep-13    added caveats    PJT
6-jan-2018    updated for V4, more balanced benchmarks     PJT
27-dec-2019    nemo.bench; updated with potcode and orbint    PJT
26-jul-2020    added timings / added geekbench5     PJT

Table of Contents

Name
Synopsis
Description
Bench
Nemobench5
Bench10
Oldest Bench
Caveats
"others"
Tabbench
Parallel
Considerations
See Also
Author
Files
Update History