Hi all,
I have two questions.
1)
I have been working on a large code for a space trajectory optimization application in Intel Fortran under VS 2005.
I recently learned a little about the loop vectorization feature in Intel Fortran. When I compile my code, I notice that most of my loops are not vectorized, while simple constructs such as whole-array assignments are.
Could I get some tips on how to structure loops so that they vectorize successfully?
2)
I have an Intel Core 2 Duo processor.
At run time I can see that only 50% of the processor is being used. Can I use both cores simultaneously? I expect this would greatly increase the speed of execution, which is one of the main concerns of my research work.
Thank you all
Nittin Arora
A 50% load on your dual-core system means that you are running a single-threaded program. Windows moves the thread back and forth between both cores, so each core shows only about 50% of its capacity. On a quad-core system it would be 25%.
Try OpenMP for easy multithreading. Threading do loops is not difficult. But make sure the innermost loop runs over the leftmost index when you have a rank-2 or rank-3 array:
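! PARALLEL DO creates the thread team and shares the y iterations among it;
! a bare !$OMP DO outside a parallel region would run on one thread only.
!$OMP PARALLEL DO PRIVATE(x)
do y=1,ny
do x=1,nx
result(x,y)=factor(x,y)*...
end do
end do
!$OMP END PARALLEL DO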
I'm not very familiar with other OpenMP directives, but that should give you a good boost without changing your code. The ! before $OMP is necessary: it makes the directive a comment, so a compiler without OpenMP enabled simply ignores it. You also have to be sure that the iterations of the parallelized loop are independent. result(x,y)=result(x,y-1)... won't work, because y-1 may not have been calculated yet when y needs it (race condition).
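As a rough, self-contained sketch of the pattern described above (the array sizes and the fill values are hypothetical, chosen only so the result can be checked by hand), a complete program might look like this. It compiles with or without OpenMP enabled, since the directives are plain comments to a non-OpenMP build.

```fortran
program omp_independent_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 200, ny = 100   ! hypothetical sizes
  real(dp) :: factor(nx, ny), result(nx, ny)
  integer :: x, y

  factor = 2.0_dp

  ! Each element depends only on factor(x,y), so the iterations of the
  ! y loop are independent and can safely be divided among threads.
  !$OMP PARALLEL DO PRIVATE(x)
  do y = 1, ny
     do x = 1, nx            ! leftmost index innermost: contiguous memory
        result(x, y) = factor(x, y) * real(x + y, dp)
     end do
  end do
  !$OMP END PARALLEL DO

  ! Spot-check two corners against values worked out by hand.
  if (result(1, 1) /= 4.0_dp .or. result(nx, ny) /= 600.0_dp) then
     print *, 'unexpected result'
     stop 1
  end if
  print *, 'ok'
end program omp_independent_loop
```

If the loop body instead read a value written by a different y iteration, the same directive would silently give wrong answers, so it is worth keeping a check like the one at the end while parallelizing.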
Markus
As Markus suggests, OpenMP will help, but do not discount changes to your code that improve vectorization.
In Fortran the leftmost index varies fastest (in C/C++ it is the rightmost index), i.e. in Fortran, cells that are adjacent in the leftmost index occupy adjacent memory locations. When each iteration of a loop references the cell adjacent in memory to the previous one, the loop is a candidate for vectorization.
*** Bad form
do I=1,nI
do J=1,nJ
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do
*** Good form
do J=1,nJ
do I=1,nI
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do
*** Better form
!$OMP PARALLEL DO PRIVATE(I) DEFAULT(SHARED)
do J=1,nJ
do I=1,nI
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do
!$OMP END PARALLEL DO
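Note: to use OpenMP you must enable it with the compiler option (/Qopenmp in Intel Fortran on Windows), which compiles the directives into threaded code and links in the OpenMP library. Add USE OMP_LIB only if you call OpenMP runtime routines such as omp_get_thread_num; the directives themselves do not need it.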
First work on the vectorization as the benefit carries through to multi-threaded coding.
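To illustrate where USE OMP_LIB comes in, here is a small sketch (the program name is made up for illustration) that calls the OpenMP runtime routine omp_get_max_threads. The lines starting with "!$ " are OpenMP conditional compilation: they are compiled only when OpenMP is enabled and are ordinary comments otherwise, so the same source builds either way.

```fortran
program omp_report_threads
  !$ use omp_lib            ! needed only for the runtime routine below
  implicit none
  integer :: nthreads

  nthreads = 1               ! value reported when built without OpenMP
  !$ nthreads = omp_get_max_threads()

  print '(a,i0)', 'threads available: ', nthreads
end program omp_report_threads
```

On a dual-core machine built with OpenMP enabled, this would normally report 2 unless the OMP_NUM_THREADS environment variable says otherwise.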
Jim Dempsey
Jim and Steve,
Slightly OT: have a look at the "real world" presentation at
http://openmp.org/wp/2008/05/tutorial-slides-from-iwomp-2008/
in particular the Windows/Linux server shootout using IVF and Thread Checker on 2- and 4-core processors.
Gerry
This is the "OpenMP in the real world" presentation, yes? A few slides at the end. It was interesting to see the display from Intel Thread Profiler (not Checker) shown without attribution.
In the part of the presentation called "OpenMP on Windows, Case Study", p. 30 (.NOT. PowerPoint):
p. 30 mentions IVF 10.1
p. 31 mentions Intel Thread Checker
Gerry
The p.11 slide shows the imbalance for a C/C++ code under MPI as opposed to a Fortran code under OpenMP. I could never get ITP to do anything for f95 so I irrevocably ditched it last year.
Gerry
Slide 11 of "OpenMP in the Real World" is from Intel Trace Analyzer (an MPI analysis tool), or its predecessor Vampir. It could be used for hybrid MPI/OpenMP applications such as those mentioned in the presentation, but the OpenMP performance issues are buried. Assuming that each MPI process runs under OpenMP, thread profiling would be relevant within each process.
Referring to slide 32, the issues between Windows and Linux don't appear to be mentioned. Among them is the greater dependence of Windows on user-settable options for pinning threads to cores (ifort environment variable KMP_AFFINITY). Thread profiling is more likely to produce useful results when such options are set.
Gerry,
Good find. The chart on page 32 illustrates good scalability, but what is more interesting is what is on the prior page (page 32). Assuming "your" code is already in F90 and fairly large, incorporating only 5 parallel regions (relatively little effort considering ~91,000 lines of code) yielded reasonably good scalability. Additional attention might produce better scalability.
It would be interesting to determine what accounted for the 7% difference. Without seeing the application (and platforms) it is mere conjecture. If the hardware platforms were different, then Linux vs Windows is an apples-and-oranges comparison. In particular, even with the same CPU and RAM/FSB speeds, if one platform's motherboard uses integrated video and the other does not, the one using integrated video tends to suffer (depending on video I/O).
Everybody's "Real World" will be different (IMHO).
Jim Dempsey
One more thing came to mind: if I go multithreaded, do I need to restructure my code completely to take advantage of a dual-core processor, or would vectorization and parallelization (e.g. OpenMP) help me more? I am sorry, but my knowledge in this field is very limited, so my questions might seem a little stupid to some ( :p )
My application deals with large volumes of data and complex numerical simulation, so a lot of number crunching is involved while preserving very high precision (up to at least 16 digits). I was wondering how much slower it would be if I used quad precision variables instead of double precision variables (that would greatly increase the precision, I guess).
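One way to get a feel for the precision question is to ask the compiler what each kind provides. The sketch below is only an illustration and assumes the compiler offers a real kind with at least 30 decimal digits (Intel Fortran and gfortran on x86 do); the program name and the kind names dp and qp are made up here. On x86 the quad kind is implemented in software rather than in hardware, so it is typically much slower than double precision, and the only reliable figure is a timing of your own code.

```fortran
program compare_precision
  implicit none
  integer, parameter :: dp = kind(1.0d0)             ! double precision
  integer, parameter :: qp = selected_real_kind(30)  ! assumed quad kind
  real(dp) :: third_dp
  real(qp) :: third_qp

  third_dp = 1.0_dp / 3.0_dp
  third_qp = 1.0_qp / 3.0_qp

  ! precision() reports the guaranteed number of decimal digits of a kind.
  print '(a,i0)', 'decimal digits, double: ', precision(third_dp)
  print '(a,i0)', 'decimal digits, quad:   ', precision(third_qp)

  ! Printing 1/3 in both kinds shows where the digits stop being correct.
  print *, third_dp
  print *, third_qp
end program compare_precision
```

If the double precision line reports fewer digits than the application needs, that is an argument for using the wider kind only in the few variables where rounding error actually accumulates, rather than everywhere.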
I also have a new 45 nm Xeon machine at my disposal if needed.
Thanks again in advance.
Nittin Arora