Hello Forum members,
I run identical multithreaded (OpenMP) Fortran programs on an 'old' Windows machine, compiled with Visual Fortran Compiler XE 184.108.40.206 [Intel(R) 64] on Windows 7, and on a 'new' Red Hat Linux server, compiled with Fortran Composer_xe_2013.1.117 on Red Hat Server 6.3.
When I run my code with 1 thread only, the Linux build is faster (as expected, since it's a newer and faster machine). However, as I increase the thread count to about 20, the Windows machine executes the code faster than the Linux machine. My guess is that I made a mistake somewhere when calling the compiler under Linux; the Linux box should be faster at any thread count.
Here are some more details about these puzzling results. Run times are in seconds (W is the Windows box, L is the Linux box).
iter = 1 (parallelized section gets executed only once)
1 Thread: W = 292, L = 202 (Linux beats Windows, as expected)
2 Threads: W = 242, L = 152 (Linux beats Windows, as expected)
3 Threads: W = 208, L = 132 (Linux beats Windows, as expected)
4 Threads: W = 196, L = 123 (Linux beats Windows, as expected)
10 Threads: W = 109, L = 91 (Linux beats Windows, as expected)
20 Threads: W = 82, L = 80 (nearly tied; why has the Linux lead almost vanished?)
When the iteration count increases, more of the parallelized code gets executed. So now we have:
iter = 2
4 Threads: W = 324, L = 225 (Linux beats Windows, as expected)
10 Threads: W = 202, L = 166 (Linux beats Windows, as expected)
20 Threads: W = 138, L = 143 (why??)
iter = 3
4 Threads: W = 471, L = 332 (Linux beats Windows, as expected)
10 Threads: W = 294, L = 246 (Linux beats Windows, as expected)
20 Threads: W = 192, L = 205 (why??)
iter = 5
10 Threads: W = 473, L = 395 (Linux beats Windows, as expected)
20 Threads: W = 304, L = 350 (why??)
40 Threads: W = N/A, L = 307
60 Threads: W = N/A, L = 310
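One way to read these timings is as speedup relative to the single-thread run, which shows how well each machine scales rather than just which is faster. The following is only a small sketch using the iter = 1 numbers above; the program and variable names are my own, and speedup is taken as T(1)/T(p) with efficiency as speedup divided by thread count.

```fortran
program scaling_summary
  implicit none
  ! Timings copied from the iter = 1 list above (seconds).
  integer, parameter :: n = 6
  integer, parameter :: threads(n) = [1, 2, 3, 4, 10, 20]
  real,    parameter :: t_win(n)   = [292., 242., 208., 196., 109., 82.]
  real,    parameter :: t_lin(n)   = [202., 152., 132., 123.,  91., 80.]
  integer :: i

  print '(a)', 'threads  speedup_W  eff_W  speedup_L  eff_L'
  do i = 1, n
     ! speedup = T(1) / T(p); efficiency = speedup / p
     print '(i7, 2(f11.2, f7.2))', threads(i), &
          t_win(1) / t_win(i), t_win(1) / t_win(i) / threads(i), &
          t_lin(1) / t_lin(i), t_lin(1) / t_lin(i) / threads(i)
  end do
end program scaling_summary
```

At 20 threads this works out to roughly 3.6x for Windows (292/82) and roughly 2.5x for Linux (202/80), so the Linux build scales noticeably worse even where its absolute time is still slightly lower.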
Here are the exact details about how I compile the code on the two machines.
Does anybody know why the more powerful Linux machine is slower than a two-year-old Windows box when the thread count approaches 20? Which compiler options do I need to set to make the Linux version run faster?
Thanks a lot.
The difference in speed of my OpenMP code on 20 threads was due to private allocatable vectors inside the parallel loop. For some reason, the old compiler 220.127.116.11 with the -O3 flag under Windows handles this better than Fortran_Composer_xe_2013.1.117 with the -O1, -O2 or -O3 flag under Linux. After I replaced the allocatable vectors with alternative code that does not require allocatable arrays, the speed difference disappeared. I'm not sure whether this large speed difference with allocatable arrays inside an OpenMP loop is due to the Linux OS or to the newer compiler version. In any case, thanks for all your comments. J.
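For anyone who wants to check whether the same pattern affects their own setup, here is a minimal sketch of the two variants. It is not my actual code: the program name, loop length, workspace size and the dummy arithmetic are all made up for illustration. Variant A allocates and frees a private allocatable vector in every iteration; variant B uses a private fixed-size array instead, which is one possible way of avoiding allocatable arrays.

```fortran
program private_workspace_sketch
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_iter = 20000   ! hypothetical trip count
  integer, parameter :: n_work = 512     ! hypothetical workspace length
  real(dp), allocatable :: work_a(:)     ! variant A: private allocatable
  real(dp) :: work_b(n_work)             ! variant B: private fixed-size array
  real(dp) :: total_a, total_b, t0, t1
  integer :: i, j

  ! Variant A: every iteration allocates and frees its private workspace,
  ! so all threads call the heap allocator inside the hot loop.
  total_a = 0.0_dp
  t0 = omp_get_wtime()
  !$omp parallel do private(work_a, j) reduction(+:total_a)
  do i = 1, n_iter
     allocate(work_a(n_work))
     do j = 1, n_work
        work_a(j) = real(i, dp) * real(j, dp)
     end do
     total_a = total_a + sum(work_a)
     deallocate(work_a)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  print '(a, f10.4, a)', 'allocatable workspace: ', t1 - t0, ' s'

  ! Variant B: the workspace size is known up front, so each thread gets
  ! its own fixed-size private copy and no allocate/deallocate is needed.
  total_b = 0.0_dp
  t0 = omp_get_wtime()
  !$omp parallel do private(work_b, j) reduction(+:total_b)
  do i = 1, n_iter
     do j = 1, n_work
        work_b(j) = real(i, dp) * real(j, dp)
     end do
     total_b = total_b + sum(work_b)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  print '(a, f10.4, a)', 'fixed-size workspace:  ', t1 - t0, ' s'

  ! Both variants do the same arithmetic, so the totals should match.
  print '(a, l1)', 'results agree: ', total_a == total_b
end program private_workspace_sketch
```

Build it with OpenMP enabled (for example `gfortran -fopenmp`, or `ifort -qopenmp`; older ifort versions used `-openmp`) and vary OMP_NUM_THREADS. Whether variant A is noticeably slower will depend on the compiler, the runtime's memory allocator and the thread count, so treat this as a test harness and not as proof of the cause.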