Dear all,
I am trying to use the OpenMP parallel construct, but the running time is the same as that of the serial code. I checked the number of threads used in the parallel construct and found that only the master thread is used. I have a 6-core machine; shouldn't the number of threads be 6? I tried call OMP_SET_NUM_THREADS(6), but it does not do anything.
Is there anything I should change in order to use OpenMP? If so, would you mind telling me how to do it in Microsoft Visual Studio?
Thanks a lot.
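For later readers with the same symptom: OpenMP directives are ignored unless OpenMP support is enabled at compile time (for Intel Fortran under Visual Studio this is the Process OpenMP Directives setting under Project Properties > Fortran > Language, i.e. the /Qopenmp option; the exact menu path may vary by version). A minimal sketch to verify that multiple threads actually start:

```fortran
! Minimal check that OpenMP is active. Compile with /Qopenmp
! (Windows) or -qopenmp (Linux); without that option the
! directives are treated as comments and one thread is reported.
program check_threads
   use omp_lib
   implicit none
!$omp parallel
!$omp single
   print *, 'number of threads: ', omp_get_num_threads()
!$omp end single
!$omp end parallel
end program check_threads
```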
27 Replies
Thank you for your advice. Before doing what you suggested, I ran a very simple test:
!$OMP PARALLEL private(i)
!$OMP DO
do i = 1, 100000000
   temp = matmul(temp2, temp1)
enddo
!$OMP END DO
!$OMP END PARALLEL
The result is shocking: without parallel it takes 0.14 minutes; with the parallel version it takes 2.9 minutes.
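A sketch of a timing loop written so the compiler cannot discard the work: the results are accumulated and a checksum is printed, so the matmul cannot be optimized away (array names and sizes here are illustrative, not from the original post):

```fortran
! Sketch: keep the result of every iteration live by accumulating
! into c and printing a checksum, so dead-code elimination cannot
! remove the loop body. Timed with omp_get_wtime (wall-clock).
program bench
   use omp_lib
   implicit none
   integer, parameter :: n = 200
   real :: a(n,n), b(n,n), c(n,n)
   real(8) :: t0, t1
   integer :: i
   call random_number(a)
   call random_number(b)
   c = 0.0
   t0 = omp_get_wtime()
!$omp parallel do private(i) reduction(+:c)
   do i = 1, 100
      c = c + matmul(a, b)
   end do
!$omp end parallel do
   t1 = omp_get_wtime()
   print *, 'elapsed (s): ', t1 - t0, ' checksum: ', sum(c)
end program bench
```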
The compiler realized that this program did nothing and eliminated most of it; with the parallel version it could eliminate less. You've made a very common error in writing performance tests: coding them so that the compiler can remove large chunks of the work, since the output is never used.
Dear Jim and all others,
I tried the test you suggested as follows:
integer :: sanity(12)   ! I have 12 threads
!$OMP parallel
sanity = 0
!$omp ATOMIC
sanity(i) = sanity(i) + 1
IF (sanity(i) /= 1) print *, 'wrong', i
!$omp end parallel
I got an error message: "Subscript #1 of the array SANITY is -1, lower than the lower bound of 1".
Can you tell me what is wrong here?
Also, I assume that if I parallel the following
!$omp parallel do private(i,j)
iloop: do i = 1, N
   jloop: do j = 1, Z
      ...
   enddo jloop
enddo iloop
!$omp end parallel do
Then each thread should take some iterations of iloop, and for each i the responsible thread does the entire jloop serially? That is, what the other threads do in jloop should not matter for the current thread's work in jloop?
Am I wrong?
Thank you so much for your advice.
I believe the idea was to add such a test in a parallel loop in your own code, with sanity initialized outside, e.g.
sanity = 0
!$OMP parallel
!$omp do
do i = 1, size(sanity)
!$omp ATOMIC
   sanity(i) = sanity(i) + 1
   IF (sanity(i) /= 1) print *, 'wrong', i
end do
!$omp end parallel
Apparently, you never initialized i, while you asked every thread to zero the entire sanity array.
For your case of nested loops, yes, you want each inner loop to be independent of the others so that threads don't interfere with one another. In Fortran OpenMP the DO counters (j in your example) are automatically private, so each thread has its own copy. Explicit private (or firstprivate/lastprivate) declarations are needed for other variables and arrays which should not be shared.
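To illustrate that last point, a sketch of the nested-loop pattern (the arrays a and b and the scratch variable tmp are made up for illustration): the loop counters are predetermined private in Fortran, but an ordinary temporary is shared by default and must be listed in a private clause to avoid a data race.

```fortran
! Sketch: i (worksharing loop counter) and j (a sequential DO
! counter inside the region) are predetermined private in Fortran;
! tmp is shared by default, so it must be declared private or the
! threads would race on a single copy.
!$omp parallel do private(tmp)
do i = 1, n
   do j = 1, z
      tmp = 2.0 * a(i, j)
      b(i, j) = tmp
   end do
end do
!$omp end parallel do
```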
Dear all,
Thank you all for your helpful advice. I just found out why my parallel program looked "slower" than the serial program: I used call cpu_time instead of omp_get_wtime(), and the former reported the timing in a way I did not expect!
Now my simple program runs faster in parallel; I will try the more complicated ones.
Thanks.
cpu_time is probably designed to total up the times used by all threads, so it would be exceptional for it to decrease with parallelization. omp_get_wtime() should give elapsed (wall-clock) time, as would system_clock(), except that the latter doesn't have as good an implementation on Windows x64.
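A small sketch of the difference: timed over the same parallel loop, cpu_time typically reports roughly the wall-clock time multiplied by the number of busy threads, while omp_get_wtime reports elapsed time (the reduction loop here is illustrative).

```fortran
! Sketch comparing the two timers around one parallel loop.
! cpu_time sums CPU time across threads; omp_get_wtime is
! elapsed wall-clock time, which is what a speedup test needs.
program timing_demo
   use omp_lib
   implicit none
   real :: c0, c1
   real(8) :: w0, w1, s
   integer :: i
   s = 0.0d0
   call cpu_time(c0)
   w0 = omp_get_wtime()
!$omp parallel do reduction(+:s)
   do i = 1, 100000000
      s = s + sqrt(real(i, 8))
   end do
!$omp end parallel do
   call cpu_time(c1)
   w1 = omp_get_wtime()
   print *, 'cpu_time:      ', c1 - c0
   print *, 'omp_get_wtime: ', w1 - w0, ' (sum = ', s, ')'
end program timing_demo
```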
Now I am running my whole program, but I got this message: "insufficient virtual memory". My machine has 24 GB of RAM and 12 threads. I am not sure whether this is really about memory. Thanks for any hints.
"insufficient virtual memory". My machine has 24GB, 12 threads. I am not sure whether this is really about memory. Thanks for hints.
