========================================================
program test
   use omp_lib
   implicit none
   integer i, j, k, n, np, npm
   real(8) t1
   real, allocatable :: a(:,:,:), b(:,:,:), dt(:)
   n = 500
   allocate( a(n,n,n), b(n,n,n) )
   a = 1.0
   b = 2.0
   npm = omp_get_max_threads()
   allocate( dt(npm) )
   ! time the summation with 1..npm threads
   do np=1,npm
      call omp_set_num_threads(np)
      t1 = omp_get_wtime()
      call sum_mat( n, a, b )
      dt(np) = omp_get_wtime() - t1
   enddo
   ! speedup relative to the single-thread run
   do np=2,npm
      print *, np, dt(1)/dt(np)
   enddo
end

subroutine sum_mat( n, a, b )
   implicit none
   integer n
   real a(n,n,n), b(n,n,n)
   integer i, j, k
   !$omp parallel do
   do k=1,n
      do j=1,n
         do i=1,n
            a(i,j,k) = a(i,j,k) + b(i,j,k)
         enddo
      enddo
   enddo
end
========================================================
I use IFC 10 and build the executable with the following command:
ifort /nologo /O2 /Qfpp2 /Qopenmp s5.f90
When I run this program several times on my dual-core processor, I get the following results:
2 0.9937300
2 0.9861304
2 1.004650
Why can't I get a speedup of about 2? What is incorrect in the program?
You may also be experiencing a cache-loading problem. Try running the same test repeatedly inside the same application:
do i=1,5
   print *, 'Iteration ', i
   do np=1,npm
      call omp_set_num_threads(np)
      t1 = omp_get_wtime()
      call sum_mat( n, a, b )
      dt(np) = omp_get_wtime() - t1
   enddo
   do np=2,npm
      print *, np, dt(1)/dt(np)
   enddo
end do
The 1st iteration is generally longer, as it populates the cache.
If one of the other test runs is longer than the rest, that may indicate the operating system interfering with the program.
Jim Dempsey
JimDempseyAtTheCove:
You may also be experiencing a cache-loading problem. Try running the same test repeatedly inside the same application.
The 1st iteration is generally longer, as it populates the cache.
If one of the other test runs is longer than the rest, that may indicate the operating system interfering with the program.
Thank you for the advice, but I get the same results.
>>In order to see a warm cache effect, where you get parallel speedup on the 2nd repetition, the big arrays would have to be made small enough to fit in cache.
That is correct (array sizes were not listed in the user's original post).
On the flip side, if you know your arrays are larger than the cache, break the problem into chunks that do fit within cache; you can then (potentially) work your way through the large array(s) and improve cache utilization. There are system calls to get the sizes of the L1, L2 (and L3, if present) caches. This technique goes by various names, one of which is striping.
Jim Dempsey
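
For illustration only (this sketch is not from the thread): a blocked variant of sum_mat along the lines Jim describes might look like the code below. The block sizes kb and jb are placeholders; real code would derive them from the cache sizes obtained via the system calls he mentions, so that one n*jb*kb slab of each array fits in cache. Note that for this particular a = a + b kernel each element is touched exactly once, so blocking pays off mainly when several passes reuse the same slab of data.

subroutine sum_mat_blocked( n, a, b )
   implicit none
   integer n
   real a(n,n,n), b(n,n,n)
   integer i, j, k, jj, kk
   ! Placeholder block sizes: choose so n*jb*kb*4 bytes per array fits in cache
   integer, parameter :: kb = 16, jb = 64
   !$omp parallel do private(jj,k,j,i)
   do kk = 1, n, kb
      do jj = 1, n, jb
         ! sweep one cache-sized slab of a and b at a time
         do k = kk, min(kk+kb-1, n)
            do j = jj, min(jj+jb-1, n)
               do i = 1, n
                  a(i,j,k) = a(i,j,k) + b(i,j,k)
               enddo
            enddo
         enddo
      enddo
   enddo
end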
Array sizes were listed in the original post: 500**3 = 125 million 4-byte reals, times 2 arrays, is a solid gigabyte. He's reading half a gig twice and writing it once in almost exactly 1 second, presumably on a fairly recent Intel single-socket desktop machine. Hence, he's within a factor of 2 of the memory controller's maximum sustained bandwidth.
But he ought to be able to get closer to the theoretical peak than that for this operation, even with one core ...
Oh, yeah: he's also allocating and initializing the arrays in that second. If the OS is zero-filling the memory, that's two more GB transferred, and we're at 3.5 GB/sec. You can't do better than that on his hardware no matter how many threads you use.
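
To make that arithmetic explicit, here is a back-of-the-envelope sketch (the one-second timing and the zero-fill assumption are taken from the post above):

program bandwidth_estimate
   implicit none
   integer, parameter :: n = 500
   real(8) :: gb_kernel, gb_init, t
   t = 1.0d0                                         ! measured time, ~1 second
   ! kernel traffic: read a, read b, write a (4-byte reals)
   gb_kernel = 3.0d0 * real(n,8)**3 * 4.0d0 / 1.0d9
   ! OS zero-fill plus the a=1, b=2 first-touch writes: two more passes over 1 GB
   gb_init = 2.0d0 * 2.0d0 * real(n,8)**3 * 4.0d0 / 1.0d9
   print *, 'kernel traffic (GB):', gb_kernel              ! 1.5
   print *, 'total traffic  (GB):', gb_kernel + gb_init    ! 3.5
   print *, 'implied GB/s:       ', (gb_kernel + gb_init) / t
end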