We've built it with the following command using IFV 10.1.021 on a Core 2 Quad:
ifort -openmp -o parallel parallel.f
- parallel.f ------------------------
      program parallel
      use omp_lib
      integer i,j,count
      parameter(count=200000)
      real x(count),y(count),z(count),t
      COMMON/x/x,y,z
      integer nt,tid
      nt=omp_get_max_threads()
      write(*,*) 'threads=',nt
C     loop 1: initialize x and y in parallel
C$OMP PARALLEL PRIVATE(tid)
      tid=omp_get_thread_num()
      write(*,*) 'loop1=',tid
C$OMP DO
      do i=1,count
        x(i)=i
        y(i)=i
      end do
C$OMP END PARALLEL
C     loop 2: nested loop, every thread writes the shared array z
C$OMP PARALLEL
C$OMP DO
      do i=1,count
        do j=1,count
          z(j)=x(i)+y(j)
        end do
      end do
C$OMP END PARALLEL
C     loop 3: sum z with a reduction
      t=0.0
C$OMP PARALLEL REDUCTION(+:t)
C$OMP DO
      do i=1,count
        t=t+z(i)
      end do
C$OMP END PARALLEL
      write(*,*) t
      end
--------------------------------------------
fjeske,
The do loops you are using perform very little computation between reads and/or writes, so a single thread by itself is capable of saturating the memory bus; adding more threads just divides the same bandwidth among them.
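As a rough back-of-the-envelope check (approximate figures that ignore cache reuse): the statement z(j)=x(i)+y(j) does one floating-point add while moving on the order of 8 to 12 bytes (read y(j), write z(j), and occasionally re-read x(i)). A Core 2 core can issue an add every cycle, but a shared front-side bus delivers only on the order of a byte per core per cycle, so one or two threads already consume everything the bus can supply.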
Jim Dempsey
How do we then prove to ourselves that we are saturating the bus and not uselessly spinning the CPUs? Thread profiling?
Figure out which level of the memory hierarchy you are running from and look up the theoretical bandwidth from that level to the FPU for your hardware.
Then calculate the effective memory bandwidth for your computation: (total bytes read + total bytes written) / seconds elapsed.
If the two numbers are comparable, you're saturating.
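A minimal sketch of that measurement for the middle loop of the posted program might look like this (it assumes roughly 12 bytes of traffic per inner iteration, i.e. read x, read y, write z, and ignores cache reuse, so treat the printed number as an estimate only):
--------------------------------------------
      program bwtest
      use omp_lib
      integer i,j,count
      parameter(count=200000)
      real x(count),y(count),z(count)
      double precision t0,t1,bytes
      do i=1,count
        x(i)=i
        y(i)=i
      end do
      t0=omp_get_wtime()
C$OMP PARALLEL DO PRIVATE(j)
      do i=1,count
        do j=1,count
          z(j)=x(i)+y(j)
        end do
      end do
C$OMP END PARALLEL DO
      t1=omp_get_wtime()
C     assume ~12 bytes per inner iteration: read x, read y, write z
      bytes=12.0d0*dble(count)*dble(count)
      write(*,*) 'elapsed(s)=',t1-t0
      write(*,*) 'approx GB/s=',bytes/(t1-t0)/1.0d9
      write(*,*) z(1)
      end
--------------------------------------------
If that figure comes out within a factor of two or so of the bandwidth you looked up, you are saturating; if it is far below, the threads are losing their time to something else.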
For the small arrays you're using, though, you ought to be able to run out of the L2. Do you have multiple CPU chips (sockets) or just multiple cores? If all your processors are sharing one L2 that won't help much and it may not fit.
If the job will fit in 4x the L1 cache size but not in any one L1, you might want to force an explicit schedule choice and set processor affinity so each stripe through the arrays stays in the L1 of one or another processor.
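For instance, in the posted program that could mean giving the C$OMP DO an explicit chunk size (50000 here is simply count/4 for four threads, purely illustrative):
--------------------------------------------
C$OMP PARALLEL
C$OMP DO SCHEDULE(STATIC,50000)
      do i=1,count
        x(i)=i
        y(i)=i
      end do
C$OMP END PARALLEL
--------------------------------------------
and then pinning threads to cores before the run, e.g. with the Intel runtime's KMP_AFFINITY environment variable (export KMP_AFFINITY=compact), if your runtime supports it, or with the operating system's affinity tools, so each chunk keeps coming back to the same core's cache.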
Also, with such small arrays your thread creation overhead will be a significant fraction of the running time of the OMP job. Using one OMP PARALLEL directive around all the loops would help there, or at least it should.
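A sketch of what that restructuring might look like (one parallel region around all three loops, so the thread team is forked once instead of three times; the directive placement is one reasonable choice, not the only one):
--------------------------------------------
      program parallel
      integer i,j,count
      parameter(count=200000)
      real x(count),y(count),z(count),t
      COMMON/x/x,y,z
      t=0.0
C$OMP PARALLEL PRIVATE(i,j)
C$OMP DO
      do i=1,count
        x(i)=i
        y(i)=i
      end do
C$OMP END DO
C$OMP DO
      do i=1,count
        do j=1,count
          z(j)=x(i)+y(j)
        end do
      end do
C$OMP END DO
C$OMP DO REDUCTION(+:t)
      do i=1,count
        t=t+z(i)
      end do
C$OMP END DO
C$OMP END PARALLEL
      write(*,*) t
      end
--------------------------------------------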
Another thing to do is to use VTune to see where the pipelines are stalling. Interpreting the results takes some experience but if it's all memory I/O that should be pretty obvious.
fjeske:
C$OMP PARALLEL
C$OMP DO
      do i=1,count
        do j=1,count
          z(j)=x(i)+y(j)
        end do
      end do
C$OMP END PARALLEL
After further review, this is the only loop that should matter for the performance of the cut-down example, and it has an architectural problem: the inner loop controls the index into the destination array, so every thread ends up writing the whole of z. That kind of sharing of dirty cache lines between threads can easily make it run slower than the single-threaded version.
Invert the loops and each thread will have an essentially private output buffer. Make sure the scheduling options in effect for the loop give each thread a nice big chunk to work on; if it interleaves the j values it will still be slow.
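Concretely, the inverted version of the loop might look like this (a sketch reusing the arrays from the posted program; SCHEDULE(STATIC) hands each thread one contiguous block of j values, so apart from the block boundaries no two threads dirty the same cache lines of z):
--------------------------------------------
C$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i)
      do j=1,count
        do i=1,count
          z(j)=x(i)+y(j)
        end do
      end do
C$OMP END PARALLEL DO
--------------------------------------------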
Eliminate the OMP directives on the other loops entirely and I bet it speeds up slightly.
Now, unfortunately, I have to figure out what's wrong with the real program.