OpenMP test on i7

Here's a simple piece of C code (try the zipped version) for testing how to parallelize code with OpenMP. It compiles with
gcc -fopenmp -lm otest.c
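
A minimal sketch of what such a test can look like is shown below (the real otest.c is in the linked archive; the file name, the array size and the sqrt()/sin() loop body are placeholders, not the original code):

/* osketch.c -- OpenMP scaling test sketch.
   Build: gcc -std=c99 -fopenmp -O3 osketch.c -lm && ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

#define N 10000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *c = malloc(N * sizeof *c);
    if (!a || !c)
        return 1;
    for (long i = 0; i < N; i++)
        a[i] = (double)i / N;

    for (int nthreads = 1; nthreads <= 8; nthreads++) {
        omp_set_num_threads(nthreads);
        double t0 = omp_get_wtime();  /* wall-clock time */
        clock_t c0 = clock();         /* CPU time of the whole process, all threads */

        /* split the iterations of the loop across the threads */
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            c[i] = sqrt(a[i]) * sin(a[i]);

        printf("running with %d threads: runtime = %f s clock=%f\n",
               nthreads, omp_get_wtime() - t0,
               (double)(clock() - c0) / CLOCKS_PER_SEC);
    }

    /* use the result so the optimizer cannot throw the work away */
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += c[i];
    printf("sum of c = %f\n", sum);

    free(a);
    free(c);
    return 0;
}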

The CPU load while running looks like this:

[chart: CPU load during the run]

It looks like two logical CPUs never get used (the two low lines beyond "5" in the chart). The program outputs some timing information:

running with 1 threads: runtime = 17.236827 s clock=17.230000
running with 2 threads: runtime = 8.624231 s clock=17.260000
running with 3 threads: runtime = 5.791805 s clock=17.090000
running with 4 threads: runtime = 5.241023 s clock=20.820000
running with 5 threads: runtime = 4.107738 s clock=20.139999
running with 6 threads: runtime = 4.045839 s clock=20.240000
running with 7 threads: runtime = 4.056122 s clock=20.280001
running with 8 threads: runtime = 4.062750 s clock=20.299999

which can be plotted like this:
[chart: runtime and clock() time vs. number of threads]
I'm measuring the processor time spent by the program using clock(), which I hope is some kind of measure of how much work is performed. Note how the amount of work increases due to the overhead of creating threads and communicating between them: the clock() time grows from about 17.2 s with one thread to about 20.3 s with eight, while the runtime drops to about 4.1 s. Another plot shows the speedup:
[chart: speedup vs. number of threads]

The i7 uses Hyper-Threading to present 8 logical CPUs to the system with only 4 physical cores. Anyone care to run this on a real 8-core machine? 🙂

Next stop is getting this to work from a Boost.Python extension.

6 thoughts on “OpenMP test on i7”

  1. On a Q9300 (a quad-core CPU, no HT) the results look like this:
    running with 1 threads: runtime = 27.132321 s clock=26.980000
    running with 2 threads: runtime = 13.214080 s clock=26.270000
    running with 3 threads: runtime = 8.820247 s clock=26.049999
    running with 4 threads: runtime = 7.251426 s clock=26.010000
    running with 5 threads: runtime = 7.349411 s clock=26.330000
    running with 6 threads: runtime = 6.725370 s clock=26.219999
    running with 7 threads: runtime = 6.689425 s clock=26.150000
    running with 8 threads: runtime = 6.897340 s clock=25.959999

  2. The office has a quad-socket Intel(R) Xeon(R) X7350 @ 2.93 GHz machine (16 cores in total, no HT) running 64-bit Ubuntu 8.04. On it, I measure a speedup of 8.00 using 8 threads and 15.91 using 16 threads.

    1 thread: runtime 21.69 s, CPU time 21.69 s.
    16 threads: runtime 1.36 s, CPU time 21.72 s.

  3. Thanks for the test, Jeff. This trivially parallel code seems to scale well.
    A friend of mine did some testing on Windows with VS2008 and found the program to run much faster. Can the gcc implementation of OpenMP really be that bad? Comparing gcc with and without OpenMP also leads to confusing results:
    gcc -lm -O3 -o otest otest.c
    ./otest
    running with 1 threads: runtime = 4.988194 s clock=4.970000
    running with 2 threads: runtime = 4.983637 s clock=4.970000

    gcc -lm -O3 -o otest otest.c -fopenmp
    ./otest
    running with 1 threads: runtime = 27.958443 s clock=27.959999
    running with 2 threads: runtime = 14.037088 s clock=28.030001

    very strange.

  4. I played with the code some more too. By now my test code is fairly different from yours, so I've put a copy at http://media.unpythonic.net/emergent-files/sandbox/otest.c

    I did not find a big difference in performance from turning off -fopenmp, but I did find that specifying -fno-math-errno provides a huge speedup both with and without -fopenmp (a sketch of why this flag matters follows after these comments). The base runtime is a bit higher, probably because I changed the matrix size from 10000 to 128000:
    $ gcc -std=c99 -fopenmp -O3 otest.c -lm && ./a.out
    #threads wall cpu sum-of-c
    1 28.975817 28.970000 7659783.500000
    2 14.487500 28.950000 7659783.500000
    4 7.255789 28.960000 7659783.500000
    8 3.652288 29.020000 7659783.500000
    16 1.927230 28.990000 7659783.500000
    $ gcc -std=c99 -fopenmp -fno-math-errno -O3 otest.c -lm && ./a.out
    1 5.955010 5.960000 7659783.500000
    2 2.977407 5.950000 7659783.500000
    4 1.494880 5.950000 7659783.500000
    8 0.760135 5.910000 7659783.500000
    16 0.411194 5.970000 7659783.500000
    $ gcc -std=c99 -O3 otest.c -lm && ./a.out
    1 28.970354 28.970000 7659783.500000
    $ gcc -std=c99 -O3 otest.c -fno-math-errno -lm && ./a.out
    1 5.952739 5.950000 7659783.500000

  5. I tested the original code with a dual 2.0 GHz Opteron under Ubuntu 9.04; -fno-math-errno made only a small difference, but with Jeff's version there was a huge speedup.

    Original version:

    jamse@jamse-desktop:~$ gcc -fopenmp -lm -O3 -o otest otest.c
    jamse@jamse-desktop:~$ ./otest
    running with 1 threads: runtime = 28.173516 s clock=28.020000
    running with 2 threads: runtime = 14.892576 s clock=28.200001

    jamse@jamse-desktop:~$ gcc -fno-math-errno -fopenmp -lm -O3 -o otest otest.c
    jamse@jamse-desktop:~$ ./otest
    running with 1 threads: runtime = 27.736684 s clock=27.480000
    running with 2 threads: runtime = 13.913317 s clock=27.450001

    Jeff's version:

    jamse@jamse-desktop:~$ gcc -std=c99 -fopenmp -lm -O3 -o otest2 otest2.c
    jamse@jamse-desktop:~$ ./otest2
    1 36.001104 35.940000 7659783.500000
    2 18.074438 35.860000 7659783.500000

    jamse@jamse-desktop:~$ gcc -std=c99 -fno-math-errno -fopenmp -lm -O3 -o otest2 otest2.c
    jamse@jamse-desktop:~$ ./otest2
    1 3.425272 3.410000 7659783.500000
    2 1.720572 3.430000 7659783.500000

  6. I tested the original code on my dual-socket Intel E520 @ 2.85 GHz:

    twebb@saraswati:~/tmp/tmp2% gcc -fopenmp -lm -O3 otest.c
    twebb@saraswati:~/tmp/tmp2% ./a.out
    running with 1 threads: runtime = 14.035480 s clock=14.020000
    running with 2 threads: runtime = 7.038971 s clock=14.080000
    running with 3 threads: runtime = 4.702512 s clock=14.070000
    running with 4 threads: runtime = 3.524094 s clock=14.050000
    running with 5 threads: runtime = 2.891979 s clock=14.030000
    running with 6 threads: runtime = 2.400413 s clock=13.970000
    running with 7 threads: runtime = 2.059968 s clock=14.060000
    running with 8 threads: runtime = 1.781473 s clock=14.100000
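
The -fno-math-errno results in the comments above most likely come down to how GCC handles calls into libm. Below is a minimal sketch of the kind of loop the flag affects, assuming the hot loop calls sqrt() as in the sketch near the top of the post (the real otest.c may differ):

/* sqrt_loop.c -- compile once with and once without -fno-math-errno and
   compare the generated assembly:
     gcc -std=c99 -O3 -S sqrt_loop.c
     gcc -std=c99 -O3 -fno-math-errno -S sqrt_loop.c
   By default GCC has to assume that sqrt() may set errno on a domain error
   (glibc's does), so it keeps a call into libm around for that case; with
   -fno-math-errno it can use the hardware square-root instruction alone and
   vectorize the loop, which is typically where speedups of this size come
   from. */
#include <math.h>

void sqrt_all(const double *a, double *c, long n)
{
    for (long i = 0; i < n; i++)
        c[i] = sqrt(a[i]);
}

Comparing the two assembly listings shows the error-handling path that disappears when the flag is given.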
