Vectorization, dual-core implementation

TinTin_9 · ‎06-18-2008

Hi all,
I have two questions.

1)
I have been working on a large code for a space trajectory optimization application in Intel Fortran under VS 2005.

I recently learned a little about the loop vectorization feature in Intel Fortran. When I compile my code, most of my loops are not vectorized, while simple statements such as whole-array assignments are.

Could I get some tips on how to structure loops so that they vectorize successfully?

2)

I have an Intel Core 2 Duo processor.

At run time I can see that only 50% of my processor is being used. Can I use both cores simultaneously? I expect this would greatly increase execution speed, which is one of the main concerns of my research work.

Thank you all

Nittin Arora

onkelhotte · ‎06-19-2008

A 50% load on your dual-core system means that you are running a single-threaded program. Windows moves the thread between both cores, so each shows only about 50% load. On a quad-core system it would be 25%.

Try OpenMP for easy multithreading. Threading DO loops is not difficult. But make sure the innermost loop runs over the first (left-most) index when you have a rank-2 or rank-3 array:

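! PARALLEL starts the threads; a bare !$OMP DO outside a parallel region runs serially
!$OMP PARALLEL DO
do y=1,ny
do x=1,nx
result(x,y)=factor(x,y)*...
end do
end do
!$OMP END PARALLEL DO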

I'm not very familiar with other OpenMP commands, but that should give you a good boost without changing your code. The ! before $OMP is necessary. But you have to be sure that the iterations shared between threads are independent. result(x,y)=result(x,y-1)... won't work, because column y-1 may not have been calculated yet when column y needs it (race condition).
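A minimal sketch of the independence requirement, assuming hypothetical arrays factor, scaled and acc (none of these names come from the original code). What matters is that iterations handed to different threads never read a value that another thread may still be writing.

```fortran
program independence_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nx = 500, ny = 200   ! illustrative sizes
  real(dp), allocatable :: factor(:,:), scaled(:,:), acc(:,:)
  integer :: x, y

  allocate(factor(nx,ny), scaled(nx,ny), acc(nx,ny))
  factor = 1.0_dp

  ! Independent: every element uses only its own inputs, so whole
  ! columns (y values) can be handed to different threads.
  !$OMP PARALLEL DO PRIVATE(x)
  do y = 1, ny
    do x = 1, nx
      scaled(x,y) = 2.0_dp * factor(x,y)
    end do
  end do
  !$OMP END PARALLEL DO

  ! Dependent on y: column y needs the finished column y-1, so the
  ! y loop stays sequential and the threads share the x loop instead.
  acc(:,1) = factor(:,1)
  do y = 2, ny
    !$OMP PARALLEL DO
    do x = 1, nx
      acc(x,y) = acc(x,y-1) + factor(x,y)
    end do
    !$OMP END PARALLEL DO
  end do

  ! With factor = 1 the running sum in the last column must equal ny.
  if (all(scaled == 2.0_dp) .and. all(acc(:,ny) == real(ny, dp))) then
    print *, 'ok'
  else
    print *, 'mismatch'
  end if
end program independence_demo
```

Starting a parallel loop for every column has a cost of its own, so the second pattern probably only pays off when nx is large. For small loops the threaded version can end up slower than the serial one.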

Markus

jimdempseyatthecove · ‎06-19-2008

As Markus suggests, OpenMP will help, but do not discount changes to your code that improve vectorization.

In Fortran, the left-most index varies fastest (in C/C++ it is the right-most index), i.e. in Fortran, cells that are adjacent in the left-most index are in adjacent memory locations. When each iteration of a loop references the cell next to the one referenced by the previous iteration, the loop is a candidate for vectorization.
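To see the storage order directly, here is a small sketch (the array a and its size are just for illustration). reshape fills an array in memory order, so printing with the left-most index in the inner loop walks through memory one cell at a time.

```fortran
program column_major_demo
  implicit none
  integer :: a(2,3)
  integer :: i, j

  ! reshape stores the values 1..6 in consecutive memory locations.
  a = reshape((/ 1, 2, 3, 4, 5, 6 /), shape(a))

  ! Left-most index innermost: visits memory in storage order.
  do j = 1, 3
    do i = 1, 2
      write(*, '(a,i0,a,i0,a,i0)') 'a(', i, ',', j, ') = ', a(i,j)
    end do
  end do
end program column_major_demo
```

This prints a(1,1) = 1, a(2,1) = 2, a(1,2) = 3 and so on up to a(2,3) = 6, i.e. the first index moves fastest.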

*** Bad form

do I=1,nI
do J=1,nJ
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do

*** Good form

do J=1,nJ
do I=1,nI
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do

*** Better form

!$OMP PARALLEL DO PRIVATE(I) DEFAULT(SHARED)
do J=1,nJ
do I=1,nI
A(I,J) = B(I,J) * scalar + C(I,J)
end do
end do
!$OMP END PARALLEL DO

Note: to use OpenMP you must compile with the OpenMP compiler option (/Qopenmp for Intel Fortran on Windows), which also links in the OpenMP library. Add USE OMP_LIB if you call OpenMP runtime routines such as omp_get_thread_num.
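For reference, a complete program built around the loop above might look like the sketch below. The array sizes and initial values are made up for illustration. USE OMP_LIB appears here because the program calls omp_get_max_threads, so it only compiles with OpenMP enabled (for example /Qopenmp with Intel Fortran on Windows, or -fopenmp with gfortran).

```fortran
program omp_loop_demo
  use omp_lib                      ! provides omp_get_max_threads
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nI = 2000, nJ = 2000   ! illustrative sizes
  real(dp), allocatable :: A(:,:), B(:,:), C(:,:)
  real(dp) :: scalar
  integer :: I, J

  allocate(A(nI,nJ), B(nI,nJ), C(nI,nJ))
  B = 1.0_dp
  C = 2.0_dp
  scalar = 3.0_dp

  print *, 'threads available:', omp_get_max_threads()

  ! Threads share the outer J loop; the inner I loop runs over
  ! adjacent memory and stays a candidate for vectorization.
  !$OMP PARALLEL DO PRIVATE(I) DEFAULT(SHARED)
  do J = 1, nJ
    do I = 1, nI
      A(I,J) = B(I,J) * scalar + C(I,J)
    end do
  end do
  !$OMP END PARALLEL DO

  ! 1*3 + 2 = 5 for every element.
  if (all(A == 5.0_dp)) then
    print *, 'result ok'
  else
    print *, 'result wrong'
  end if
end program omp_loop_demo
```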

First work on the vectorization as the benefit carries through to multi-threaded coding.

Jim Dempsey

g_f_thomas · ‎06-20-2008

Jim and Steve,

Slightly OT, have a look at the 'real world' presentation at

http://openmp.org/wp/2008/05/tutorial-slides-from-iwomp-2008/

in particular the Windows/Linux Server shootout using IVF and ThreadChecker on 2- and 4-core processors.

Gerry

Steven_L_Intel1 · ‎06-20-2008

Gerry,

This is the "OpenMP in the real world" presentation, yes? A few slides at the end. It was interesting to see the display from Intel Thread Profiler (not Checker) shown but no attribution.

g_f_thomas · ‎06-20-2008

In the part of the presentation called OpenMP on Windows, Case Study, p. 30 (.NOT.Powerpoint)

p. 30 mentions IVF 10.1

p. 31 mentions Intel Thread Checker

Gerry

Steven_L_Intel1 · ‎06-20-2008

OK, the title shown on the web page is "OpenMP in the Real World". Yes, Thread Checker is mentioned, but the illustration on p. 11 is from Thread Profiler.

g_f_thomas · ‎06-20-2008

The p.11 slide shows the imbalance for a C/C++ code under MPI as opposed to a Fortran code under OpenMP. I could never get ITP to do anything for f95 so I irrevocably ditched it last year.

Gerry

TimP · ‎06-20-2008

OpenMP in the Real World slide 11 is from Intel Trace Analyzer (an MPI analysis tool), or its predecessor Vampir. It could be used for hybrid MPI/OpenMP applications such as are mentioned in the presentation, but the OpenMP performance issues are buried. Assuming that each MPI process runs under OpenMP, thread profiling would be relevant within the OpenMP.

Referring to slide 32, the issues between Windows and Linux don't appear to be mentioned. Among them is the greater dependence of Windows on user-settable options for pinning threads to cores (ifort environment variable KMP_AFFINITY). Thread profiling is more likely to produce useful results when such options are set.

jimdempseyatthecove · ‎06-20-2008

Gerry,

Good find. The chart on page 32 illustrates good scalability, but what is more interesting is what is on the prior page (page 31). Assuming "your" code is in F90 already, and fairly large, then incorporating only 5 parallel regions (relatively little effort considering ~91,000 lines of code) yielded reasonably good scalability. Additional attention might produce better scalability.

It would be interesting to determine what accounted for the 7% difference. Without seeing the application (and platforms) it is mere conjecture. If the hardware platforms were different, then Linux vs Windows is an apples-and-oranges comparison. In particular, even with the same CPU and RAM/FSB speeds, if one platform's motherboard uses integrated video and the other does not, then the one using integrated video tends to suffer (depending on video I/O).

Everybody's "Real World" will be different (IMHO).

Jim Dempsey

TinTin_9 · ‎06-20-2008

Thanks :) for all your quick responses. This will get me going with learning vectorization and OpenMP (I have no idea right now what it is, I just program in Fortran).

One more thing came to mind: if I go multithreaded, do I need to restructure my code completely to take advantage of a dual-core processor, or would vectorization and parallelization (like OpenMP) help me more? I am sorry, but my knowledge is very limited in this field, so my questions might seem a little stupid to some ( :p )

My application deals with large volumes of data and complex numerical simulation, so a lot of number crunching is involved while preserving very high precision (at least 16 digits). I was wondering how much slower it would be if I used quad precision variables instead of double precision (that would greatly increase the precision, I guess).
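For context on the precision question, a small sketch that asks the compiler how many decimal digits each kind carries. It assumes the compiler offers a 128-bit real kind; if it does not, selected_real_kind(30) returns a negative value and the program will not compile. The kind names dp and qp are just labels chosen here.

```fortran
program precision_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15)   ! double precision
  integer, parameter :: qp = selected_real_kind(30)   ! quad precision, if available
  real(dp) :: d
  real(qp) :: q

  d = 1.0_dp / 3.0_dp
  q = 1.0_qp / 3.0_qp

  ! precision() reports the guaranteed number of decimal digits.
  print *, 'double: decimal digits =', precision(d)
  print *, 'quad:   decimal digits =', precision(q)
  print *, d
  print *, q
end program precision_demo
```

With IEEE double and a 128-bit quad kind this reports 15 and 33 digits. x86 processors have no quad-precision arithmetic in hardware, so quad operations are carried out in software; how much slower that makes a given code is something to measure on the real application.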

I also have a new 45 nm Xeon machine at my disposal if needed.

Thanks again in advance.

Nittin Arora

TinTin_9 · ‎06-23-2008

Thanks very much. I applied this, but somehow my computation time increased instead of decreasing. I guess it might be due to the poor structure of my DO loops. Still, now I know some parallel programming :) Thanks again.