Steve, thank you for your suggestion. Now I can allocate a two-dimensional 60,000x60,000 real array. But I have a new problem: the initialization takes a long time (e.g. array(i,j)=1 when i==j, otherwise array(i,j)=0), so I am thinking about using OpenMP. I read the IVF 9.1 user's guide as well as your previous posts about OpenMP, but my OpenMP code has no effect. Here is my code:
module test_mod
  integer, parameter :: array_Index = 40000
  real, allocatable :: array(:,:)
end module test_mod

program Parallel_Test
  use dflib
  use omp_lib
  use test_mod
  implicit none
  real(8) :: res, runtime, begtime, endtime, timef
  integer error, i, j
  write(*,*) " Program started...... "C
  if (.not. ALLOCATED(array)) then
    allocate(array(array_Index,array_Index), stat=error)
    if (error.ne.0) then
      write(911,*) 'allocate array error - insufficient memory'
      stop
    endif
  endif
  begtime = timef()
  !$OMP PARALLEL DO
  do i=1,array_Index
    do j=1,array_Index
      if (i==j) then
        array(i,j)=1.0
      else
        array(i,j)=0.0
      endif
    enddo
  enddo
  !$OMP END PARALLEL DO
  endtime = timef(); runtime=endtime-begtime
  write(*,'(A20,F9.4)') " Execution time is :"C, runtime
  write(*,*) " Program terminated..."C
end program Parallel_Test

The execution times of the sequential and parallel versions for an array size of 20,000x20,000 are both around 46 sec.
Could you please help me? Thank you!
wisperless,
I haven't run your test program on my 4-way server but I can see an optimization problem.
First, you want the fastest-varying loop index to be the leftmost of the two subscripts. This is done for the following reasons: a) adjacent memory cells are likely to be in the same cache line, b) the compiler can vectorize the store, c) there is less cache interference among processors.
Also, since you are only filling the diagonal with 1's, it may be faster to zero each array slice and then set its diagonal element to 1.
!$OMP PARALLEL DO
do j=1,array_Index
do i=1,array_Index
array(i,j)=0.0
enddo
array(j,j)=1.0
enddo
!$OMP END PARALLEL DO
Jim Dempsey
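The effect of the subscript order can be measured directly. The following is a minimal sketch, not code from the thread: the program name, the reduced size n = 4000 (chosen so the demo fits in ordinary memory), and the use of omp_get_wtime for timing are all assumptions. It times the same fill with the right subscript innermost and then with the left subscript innermost.

```fortran
program loop_order_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 4000   ! assumed demo size, far smaller than the thread's arrays
  real, allocatable :: a(:,:)
  integer :: i, j
  double precision :: t0, t1, t2

  allocate(a(n,n))
  a = 0.0   ! touch every page once so neither timed loop pays the first-touch cost

  ! Right subscript innermost: consecutive stores are n elements apart in memory.
  t0 = omp_get_wtime()
  !$OMP PARALLEL DO PRIVATE(j)
  do i = 1, n
     do j = 1, n
        a(i,j) = 0.0
     end do
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  ! Left subscript innermost: consecutive stores are adjacent in memory.
  !$OMP PARALLEL DO PRIVATE(i)
  do j = 1, n
     do i = 1, n
        a(i,j) = 0.0
     end do
     a(j,j) = 1.0
  end do
  !$OMP END PARALLEL DO
  t2 = omp_get_wtime()

  write(*,'(A,F9.4)') ' right subscript innermost (s): ', t1 - t0
  write(*,'(A,F9.4)') ' left subscript innermost  (s): ', t2 - t1

  ! Sanity check: only the n diagonal elements are 1.
  if (sum(a) /= real(n)) stop 'unexpected array contents'
end program loop_order_demo
```

The absolute times depend on the machine and compiler options, so the two numbers are only meaningful relative to each other.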
Jim,
Thank you for your brilliant idea! I just read Miguel Hermanns's paper "Parallel Programming in Fortran 95 using OpenMP" and he presents the same idea as yours.
I tested your method and it dramatically reduced the program's running time from 1160 seconds to around 160 seconds. But I still wonder whether I could reduce the running time further. I tried the following:
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
array(:,1:20000)=0
!$OMP SECTION
array(:,20001:array_index)=0
!$OMP END SECTIONS
!$OMP END PARALLEL DO
But it does not work. Do you have any suggestions for reducing the running time further? My application has real-time restrictions.
Thank you again for your help!
Rachel
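One thing that stands out in the SECTIONS attempt above is that the region is opened with !$OMP PARALLEL but closed with !$OMP END PARALLEL DO; the closing directive has to match the opening one. A well-formed version might look like the sketch below, where the program name and the reduced size n = 4000 are assumptions made for illustration.

```fortran
program sections_demo
  implicit none
  integer, parameter :: n = 4000   ! assumed demo size
  real, allocatable :: array(:,:)
  integer :: half

  allocate(array(n,n))
  half = n / 2

  !$OMP PARALLEL
  !$OMP SECTIONS
  !$OMP SECTION
  array(:, 1:half) = 0.0        ! first half of the columns
  !$OMP SECTION
  array(:, half+1:n) = 0.0      ! second half of the columns
  !$OMP END SECTIONS
  !$OMP END PARALLEL

  if (any(array /= 0.0)) stop 'array not cleared'
  write(*,*) 'array cleared'
end program sections_demo
```

With only two sections, at most two threads do any work, so on a four-processor machine this form is unlikely to beat a PARALLEL DO over the columns.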
BTW, I used the default settings for my IVF 9.1 compiler's Linker, but under Fortran->Language I set "Process OPENMP directives" to "Generate Parallel Code". So my final Fortran command line is
/nologo /Zi /Qopenmp /module:"$(INTDIR)/" /object:"$(INTDIR)/" /traceback /check:bounds
/libs:dll /threads /dbglibs /c
Is there any place that I need to change my settings?
Thank you!
OpenMP won't help unless you have multiple CPUs or multiple cores, and you set the environment variables
OMP_NUM_THREADS=
and
KMP_AFFINITY=[compact|scatter]
If you are using all cores, compact should be OK.
Why aren't you setting a vectorization option, such as -QxW ?
Tim,
Thank you for your suggestions. My computer has 4 CPUs, and the CPU load is 100% when I run Jim's code. I remember that the user's guide also mentions that the number of cores can be set through environment variables, but I don't know exactly how to do it. Do I need to set those variables from Control Panel->System->Advanced, under environment variables? And how do I determine the number of cores?
Where can I set the vectorization option in the Fortran or Linker command line?
Thank you again!
Rachel
You don't need to set environment variables. OpenMP will default to match the number of threads with the number of execution units (cores, processors, etc.)
The option you want is Fortran..Optimization..Require Processor Extensions.. and then choose the highest option applicable to your processor. What is the exact processor model you have? If you aren't sure, download and run the Intel Processor Identification Utility.
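To see what the OpenMP runtime actually picks by default, the standard omp_lib query routines can be called from a short program. This is a minimal sketch; the program name is an assumption, and it needs to be compiled with OpenMP enabled (for example with the /Qopenmp option shown earlier in the thread).

```fortran
program show_threads
  use omp_lib
  implicit none

  ! Number of processors the OpenMP runtime can see.
  write(*,*) 'processors available:        ', omp_get_num_procs()

  ! Upper bound on the team size of the next parallel region
  ! (reflects OMP_NUM_THREADS if that variable is set).
  write(*,*) 'max threads for a region:    ', omp_get_max_threads()

  !$OMP PARALLEL
  !$OMP MASTER
  ! Team size actually used inside this region.
  write(*,*) 'threads in this region:      ', omp_get_num_threads()
  !$OMP END MASTER
  !$OMP END PARALLEL
end program show_threads
```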
Steve, thanks. My processor is an Intel Xeon CPU at 3.20 GHz with Intel MMX technology, supporting Streaming SIMD Extensions 2/3. So I chose "Intel Pentium 4 Processor with Streaming SIMD Extensions 3 (SSE3)" instead of "Intel Pentium 4 and compatible Intel processors" (the default). Am I right?
However, the measured running time is 177 seconds, which is a little longer than my previous run. What else could I do to improve the running speed?
Thank you for your help! ^_^
Rachel
real, parameter :: ZERO = 0.0, ONE = 1.0
array = ZERO ! all elements, let compiler / OMP do its parallel thing
forall(i=1:array_index) array(i,i) = ONE ! parallel construct
John
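A whole-array assignment such as array = ZERO is not divided among OpenMP threads on its own; whether it runs in parallel depends on the compiler. One way to request it explicitly is the OpenMP WORKSHARE construct. The sketch below is an assumption-laden illustration rather than John's method: the program name, the reduced size n = 4000, and the choice of WORKSHARE are all added here.

```fortran
program workshare_demo
  implicit none
  integer, parameter :: n = 4000   ! assumed demo size
  real, parameter :: ZERO = 0.0, ONE = 1.0
  real, allocatable :: array(:,:)
  integer :: i

  allocate(array(n,n))

  ! Ask OpenMP to divide the array assignment among the threads of the team.
  !$OMP PARALLEL WORKSHARE
  array = ZERO
  !$OMP END PARALLEL WORKSHARE

  ! Only n stores, so the diagonal is cheap to fill serially.
  forall (i = 1:n) array(i,i) = ONE

  ! Sanity check: the array now holds exactly n ones.
  if (sum(array) /= real(n)) stop 'unexpected array contents'
  write(*,*) 'identity pattern stored'
end program workshare_demo
```

Without OpenMP enabled the directives are treated as comments and the program still produces the same array.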
Rachel,
The initialization code has virtually no computation. It essentially stores 0's into a big hunk of memory. Since there is relatively little computation, the memory bus is being saturated, so throwing more processors at the problem won't improve performance. With one processor, even with HT, the initialization loop may run slower with multiple threads. You may get better performance (at the expense of less portability) by tuning the wipe to your memory architecture, i.e. figuring out if and how the memory is interleaved, figuring out how the cache is configured, and then populating the array with either one or two threads in an interleaved manner at cache-line widths. Did I mention at the expense of less portability?
Also, do not assume that the performance effects of threading on the wipe loop will be an indicator of the performance effects of threading on a computationally intensive loop.
Jim
Hey John, thank you for your method. It works fast. It took only 14 sec with array_Index=40,000!!! However, it seemed to take forever after I increased array_Index to 50,000. BTW, I did NOT use any OMP directive with your method. Do I need to?
Thank you! ^_^
Rachel
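When comparing the 40,000 and 50,000 cases it helps to know how much memory each array needs. The short calculation below is a sketch that assumes 4-byte default reals (the IVF default); the program name and the list of sizes are chosen here for illustration. It works out to roughly 6.4 GB, 10.0 GB, and 14.4 GB for 40,000, 50,000, and 60,000 respectively. If the larger sizes exceed the machine's physical memory, paging could account for a sudden jump in run time, though the thread does not say how much RAM is installed.

```fortran
program footprint
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers; n*n overflows 32 bits
  integer(i8), parameter :: bytes_per_real = 4       ! assumes 4-byte default reals
  integer(i8), parameter :: sizes(3) = (/ 40000_i8, 50000_i8, 60000_i8 /)
  integer(i8) :: n, nbytes
  integer :: k

  do k = 1, 3
     n = sizes(k)
     nbytes = n * n * bytes_per_real
     write(*,'(A,I6,A,F6.1,A)') ' n =', n, ' needs ', real(nbytes) / 1.0e9, ' GB'
  end do
end program footprint
```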
Rachel,
Regarding real time restrictions.
One of the general requirements of (near) real-time programming is to reduce latencies. In some sense this is a little bit different from reducing processing time. Let's look at how the two techniques differ for your initialization code.
Reduced processing time (in pseudo code)
! wipe array
parallel do j=1,jmax
! wipe row of array
do i=1,imax
array(i,j) =0.
end do
array(j,j) = 1.
end parallel do
Reduced latency (assuming your application can process this way)
! Process array row by row
parallel do j=1,jmax
! wipe row of array
do i=1,imax
array(i,j) =0.
end do
array(j,j) = 1.
! begin processing on row
do i=1, imax
call doRow(array, j)
end do
end parallel do
The above assumes that processing of a row can be performed independently of other rows and out of sequence. If there are row dependencies, e.g. using the closest neighbor, then the code would have to be modified to take that into account.
The above is not a solution, rather it is a design suggestion. What you want to do is to make the best use out of your processing resources. The wipe loop (as you have observed) can run as fast on one processor as on four processors. Even with specialized tuning code it is likely to be very close to being as fast. So this is a case of not optimizing a functional step to be as fast as it can. Rather it is a case that you have three processors that are waiting (or interfering with each other) in trying to do something. Your best return is to identify these situations and to find other work for the available processing power.
Also, keep this in mind.
Do not look at processor utilization (e.g. from Task Manager) as a measure of how well the program is parallelized. This is a false indication. In the array initialization case you will likely notice that 100% of four processors will accomplish what 100% of one processor can do. The true measure is performance (time to completion of application) or latency (delay time to begin process within application).
Jim Dempsey
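The reduced-latency pseudo code above can be turned into a small runnable sketch. Everything named here is an assumption: process_column is a hypothetical stand-in for the application's real per-slice work, n = 4000 is a reduced demo size, and omp_get_wtime is used so that elapsed time, rather than processor utilization, is what gets measured. Note that what the post calls a row is array(:,j), which is a column in Fortran's storage order.

```fortran
module work_mod
  implicit none
contains
  ! Hypothetical stand-in for the application's per-column processing.
  subroutine process_column(col, total)
    real, intent(in)  :: col(:)
    real, intent(out) :: total
    total = sum(col)
  end subroutine process_column
end module work_mod

program latency_demo
  use omp_lib
  use work_mod
  implicit none
  integer, parameter :: n = 4000   ! assumed demo size
  real, allocatable :: array(:,:), result(:)
  integer :: i, j
  double precision :: t0, t1

  allocate(array(n,n), result(n))

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO PRIVATE(i)
  do j = 1, n
     ! Wipe one column and set its diagonal element.
     do i = 1, n
        array(i,j) = 0.0
     end do
     array(j,j) = 1.0
     ! Begin processing this column right away, while it is still in cache,
     ! instead of waiting for the whole array to be initialized.
     call process_column(array(:,j), result(j))
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  ! The true measure is wall-clock time to completion.
  write(*,'(A,F9.4)') ' elapsed wall-clock time (s): ', t1 - t0

  ! Each column holds a single 1, so every per-column sum must be 1.
  if (any(result /= 1.0)) stop 'unexpected result'
end program latency_demo
```

This only works as written when each column can be processed independently of the others, which is the same assumption the post states.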