Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Same code, same compiler options, performance is poorer in Intel than in Visual

joex26
Beginner
604 Views
Hi,
After reading so much about how Intel C++ is better than others, I decided to test it (I have some real code to optimize).
I tried many option combinations (O1, O2, Ox; SSE2, SSE3, SSE4.1, SSE4.2; data alignment; IPO; auto-parallelization with the loop level set and not set). I have C++0x enabled, and I am using the restrict keyword for Intel (Visual does not recognize it). I have the optimization diagnostic level set to 3. I am compiling for x64.

And, after a dozen or so checks I can say that Intel is running 0.3 fps (which is about 4.8%) slower than Visual.

The auto-parallelizer actually makes things slower than the serial code (half the speed, to be exact). I think it is because my functions are small but called very often.
As expected, OpenMP had performance similar to the auto-parallelizer's.

gcc 3.4.6 is about 30% slower. But I will also run the tests on gcc 4.x.x and Open64.

Do you have any ideas what else could improve performance? Or why it is still slower than Visual?

I am using Intel v.11 and Visual Studio 2005.

Thanks for help.
0 Kudos
31 Replies
jimdempseyatthecove
Honored Contributor III
485 Views

Joe,

Have you run a profiler to help determine where the problem is?

The problem may not be related to the code you compile. For example, if you are compiling x64 code but using a 32-bit library for graphics, then the x64 code will perform an operation called a "thunk" each time a call is made to the library (to transition and pass args between x64 and x32 and back when necessary).

Data and code alignment may affect the performance (usually does).

Don't blame the compiler when the circumstances are beyond its control.

Jim Dempsey

0 Kudos
aazue
New Contributor I
485 Views
Quoting - joex26
Hi,
After reading so much about how Intel C++ is better than others, I decided to test it (I have some real code to optimize).
I tried many option combinations (O1, O2, Ox; SSE2, SSE3, SSE4.1, SSE4.2; data alignment; IPO; auto-parallelization with the loop level set and not set). I have C++0x enabled, and I am using the restrict keyword for Intel (Visual does not recognize it). I have the optimization diagnostic level set to 3. I am compiling for x64.

And, after a dozen or so checks I can say that Intel is running 0.3 fps (which is about 4.8%) slower than Visual.

The auto-parallelizer actually makes things slower than the serial code (half the speed, to be exact). I think it is because my functions are small but called very often.
As expected, OpenMP had performance similar to the auto-parallelizer's.

gcc 3.4.6 is about 30% slower. But I will also run the tests on gcc 4.x.x and Open64.

Do you have any ideas what else could improve performance? Or why it is still slower than Visual?

I am using Intel v.11 and Visual Studio 2005.

Thanks for help.

Hi,

Use WMI to post the (public) characteristics of your machine before running the product down without justification.
A compiler needs to be tested on several different, appropriate hardware configurations before making a generalized evaluation.
As for the GNU compiler, you would do better to use a recently upgraded version on the penguin operating system; using it on Bill's side is like eating soup with a fork.
Some people sell toothbrushes to birds, and then present the advice as an effective benefit.

Is the penguin (not an old one) a bird with teeth? Maybe it needs a flag (-bite) to show its performance...

Example of a compile with such an optimized flag:
C:\Program Files\Bill_Icc /SSE** /bite optimized_pig.cpp



Is that an objective answer????
Kind regards
0 Kudos
joex26
Beginner
485 Views
Dear all,
Thank you for your responses. Please don't take my statements as disrespect toward Intel's compiler.
I just posted real results from my tests, taken as an average of 12 runs for each option set.

I am not using any 32-bit libraries, so thunking should not be the issue here. I will send the WMI data tomorrow when I am back at work.

I did profiling using the AQTime profiler. I am aware of the hotspots in the software. The code is actually pretty well optimized already, so maybe there is not much left for the Intel compiler to do?

I am developing software based on this:
http://research.nokia.com/research/mobile3D

0 Kudos
JenniferJ
Moderator
485 Views
Quoting - joex26
I am developing software based on this:
http://research.nokia.com/research/mobile3D

See if you can put together a test case so we can investigate.

Jennifer
0 Kudos
aazue
New Contributor I
485 Views
Quoting - joex26
Dear all,
Thank you for your responses. Please don't take my statements as disrespect toward Intel's compiler.
I just posted real results from my tests, taken as an average of 12 runs for each option set.

I am not using any 32-bit libraries, so thunking should not be the issue here. I will send the WMI data tomorrow when I am back at work.

I did profiling using the AQTime profiler. I am aware of the hotspots in the software. The code is actually pretty well optimized already, so maybe there is not much left for the Intel compiler to do?

I am developing software based on this:
http://research.nokia.com/research/mobile3D

Hi joex26,
Seriously now, joking aside:
I think you would do better to test with the pair ICC 11.x and VC2008, or the latest versions.
You will probably discover approximately the same results if your code has not been written specifically to favor ICC.

See if you can change some while loops into for loops, with correct private local pointers, to make vectorization possible (a generic illustration follows below);
also see TBB and OpenMP used on other sides (SCTP), etc.
Also, if you can use one of the latest processor types, such as a Core i7, Atom, or ULV part, that is better.
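A generic illustration of that while-to-for point (my sketch, not code from this thread):

[cpp]// Sketch only. The while loop exits on a data-dependent condition, so
// its trip count is unknown in advance and the vectorizer must leave it
// alone; the for loop has a countable trip count and can be vectorized
// (the early exit is given up in exchange).
int sum_while(const unsigned char *p, int n, int limit)
{
    int s = 0, i = 0;
    while (s < limit && i < n)   // data-dependent exit: not vectorizable
        s += p[i++];
    return s;
}

int sum_for(const unsigned char *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)  // countable trip count: vectorizable
        s += p[i];
    return s;
}
[/cpp]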
You are welcome here with free choice in your assessment, and with your favorite compiler.
With or without the ICC compiler, success to all the members of your team.
Computers sized like a phone have good potential for success, now or tomorrow, I hope.
Kind regards
After reflection, I add this exchange (with flowers):
as a precaution, if I buy a Nokia phone tomorrow, may I not find a penguin inserted in it, optimized to bite my ear.

0 Kudos
joex26
Beginner
485 Views
Bustaf:
Your post is basically babbling. What did you smoke, man?
Please read my text slowly; then you will notice which compilers I used.

Rolling loops back up so they can be unrolled automatically? That must be next-century technology. I did not know about it.

The only two words in your text that make sense are OpenMP and TBB.

I tried OpenMP, but it runs slower, similarly to automatic parallelization. And TBB works similarly to OpenMP, so I skipped it.

And I don't have any connection with the Nokia company, so basically your attitude toward Nokia phones matters to me about as much as last year's snow.

And yeah, you are great - you have a brown belt!
0 Kudos
aazue
New Contributor I
485 Views
Quoting - joex26
Bustaf:
Your post is basically babbling. What did you smoke, man?
Please read my text slowly; then you will notice which compilers I used.

Rolling loops back up so they can be unrolled automatically? That must be next-century technology. I did not know about it.

The only two words in your text that make sense are OpenMP and TBB.

I tried OpenMP, but it runs slower, similarly to automatic parallelization. And TBB works similarly to OpenMP, so I skipped it.

And I don't have any connection with the Nokia company, so basically your attitude toward Nokia phones matters to me about as much as last year's snow.

And yeah, you are great - you have a brown belt!

Hi.
"I tried OpenMP but it works slower, similarly to automatic parallelization."
Then you need to learn how the source code must be written for it.
About changing while() to for():
with a while loop, the compiler has no information about where the iteration space can be cut or divided into chunks - for example, started as even/odd pairs, or from both ends with ++ and --.
About TBB: if you are working in C++, it is a very nice library for reducing or simplifying source.
Also, OpenMP is not reserved only for loops; it can be used elsewhere, for example to replace old sockets with the new SCTP,
create a group of dummy virtual addresses, allow division into asynchronous chunks, and observe the result (going out via the default gateway).
That is not -4%; it is a 25-30% improvement.
About phones such as Nokia's and other makes: I think the serious point is that this type of device can now be used as a computer, which is an opportunity to get out of the financial crisis.
(That deserves some respect.)
It is a disappointment that you do not discern the difference between fun and seriousness.
If you have a sample to show publicly, at your level, in a browser on your HTTP server, I have one too; then everyone can evaluate which of the two of us must go back to school.
I do not like using this kind of language; I was just forced into it by your aggressive answer. (And I am supposedly the smoker.)
Kind regards

0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views

Joe,

Would it be possible for you to post sample code showing the problem?

Auto-parallelization, OpenMP and TBB have different characteristics. I would not place OpenMP and TBB in the same category. While I am not promoting TBB, I suggest you not discount it so quickly by saying it works similarly to OpenMP.

You might try profiling the code to see what is happening. Intel has VTune, but you can also run AMD's CodeAnalyst (CA) using timer-based sampling on Intel processors. CA is a free download and IMHO is simpler to use. Also, Intel has a demo of Parallel Advisor that might provide insight into the bottleneck.

From the symptom description, I suspect something other than compiler optimizations is at play.

RE: thunk

Several months ago I downloaded the Havok Smoke demo (32-bit). My system runs Windows XP x64. While I can compile and run 32-bit applications, the OpenGL display drivers required thunking to transition between 32-bit and 64-bit. The result was an abysmal frame rate. When the display window was obscured (or minimized), performance was restored. There may be an option in the Performance Monitor to count thunks; that would tell you whether this is affecting your performance.

Jim Dempsey
0 Kudos
joex26
Beginner
485 Views

Jim,
I used a profiler to find the bottlenecks in the software. I am working on the encoder side. You can download the software from the link I provided to get a wider view.

The most problematic functions are findSad2_16x and findSad2_8x.
They take two-thirds of the total time. Unfortunately I cannot be precise about the timing because my evaluation period for AQTime has ended. I will continue profiling with some other profiler and send the data later on.

These two functions have similar while loops:

#ifdef INTEL
int32 findSad2_8x(u_int8 *restrict orig, u_int8 *restrict ref, int w, int blkHeight, int32 sad, int32 bestSad)
#else
int32 findSad2_8x(u_int8 *orig, u_int8 *ref, int w, int blkHeight, int32 sad, int32 bestSad)
#endif
{
#ifndef VECTORIZATION
    int j;

    j = 0;
    do {
        sad += ABS_DIFF(orig[j*MBK_SIZE+0], ref[j*2*w+0]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+1], ref[j*2*w+2]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+2], ref[j*2*w+4]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+3], ref[j*2*w+6]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+4], ref[j*2*w+8]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+5], ref[j*2*w+10]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+6], ref[j*2*w+12]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+7], ref[j*2*w+14]);
        j++;
    } while (sad < bestSad && j < blkHeight);
#endif
#ifdef VECTORIZATION
    int j, i;
    for (j = 0; j < blkHeight; ++j) {
        for (i = 0; i < 8; ++i)
            sad += ABS_DIFF(orig[j*MBK_SIZE+i], ref[j*2*w+i*2]);
        if (sad >= bestSad)
            break;
    }
#endif
    return sad;
}

where ABS_DIFF is a macro fetching results from a LUT:
#define ABS_DIFF(a, b) ((absDiff+MAX_DIFF)[(int)(a) - (int)(b)])

static const u_int8 absDiff[2*MAX_DIFF+1] = {
255,254,253,252,251,250,249,248,247,246,245,244,243,242,241,240,
......
.....
and so on....

as you can see, I tried to change the while to a for (to help the vectorizer), but it ran slower that way.
Maybe the break instruction is preventing the vectorizer from working properly. I don't know.

I also tried OpenMP on the for loops in these two functions, but as blkHeight can only be 8 or 16, there is not much work even for one thread before the function quits, so I assume that splitting into a few threads adds so much overhead that the whole program ends up running slower (see the sketch below for one mitigation).
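For reference, OpenMP's if clause can keep such a directive in place while falling back to serial execution below a threshold. A sketch using the names from the snippet above; the threshold of 64 is made up:

[cpp]// Sketch only: the if() clause runs the loop serially when the trip
// count is too small to amortize thread startup. The threshold (64) is
// invented for illustration; the early exit on bestSad is given up,
// which changes the amount of work done.
#pragma omp parallel for reduction(+:sad) if(blkHeight > 64)
for (int j = 0; j < blkHeight; ++j)
    for (int i = 0; i < 8; ++i)
        sad += ABS_DIFF(orig[j*MBK_SIZE + i], ref[j*2*w + i*2]);
[/cpp]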

Since I read that TBB is also data-parallel in nature, I concluded that it would not help me much in this case.

About thunking:
during the next profiling round I will take a look at the thunk counter.

Options for Visual:
/Ox /Oi /Ot /GL /I "C:\Program Files (x86)\boost\boost_1_39" /I ".\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_VC80_UPGRADE=0x0600" /D "_ATL_MIN_CRT" /D "_MBCS" /GF /FD /EHsc /MT /arch:SSE2 /fp:fast /GR- /Fp".\Release/MVCEncoder.pch" /Fo".\Release/" /Fd".\Release/" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt

/OUT:".Release/MVCEncoder.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"C:Program Files (x86)boostboost_1_39lib" /MANIFEST /MANIFESTFILE:"x64ReleaseMVCEncoder.exe.intermediate.manifest" /DEBUG /PDB:".Release/MVCEncoder.pdb" /SUBSYSTEM:CONSOLE /OPT:NOWIN98 /LTCG /MACHINE:X64 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

And for Intel:
/c /O1 /Og /Oi /Ot /Qipo /GA /I "C:\Program Files (x86)\boost\boost_1_39" /I ".\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_VC80_UPGRADE=0x0600" /D "_ATL_MIN_CRT" /D "_MBCS" /GF /EHsc /MT /GS /arch:SSE3 /fp:fast /Fo".\Release/" /W3 /nologo /Wp64 /Zi /TP /Quse-intel-optimized-headers /Qstd=c++0x /Qrestrict /Qopt-report:3 /Qopt-report-file:"x64\Release/MVCEncoder.rep"

kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:".\Release/MVCEncoder.exe" /INCREMENTAL:NO /nologo /LIBPATH:"C:\Program Files (x86)\boost\boost_1_39\lib" /MANIFEST /MANIFESTFILE:"x64\Release\MVCEncoder.exe.intermediate.manifest" /TLBID:1 /DEBUG /PDB:".\Release/MVCEncoder.pdb" /SUBSYSTEM:CONSOLE /OPT:NOWIN98 /IMPLIB:"C:\Users\michal\projects\VisualStudio2005\MVCTwoThreads\Release\MVCEncoder.lib" /MACHINE:X64

O1 works fastest for Intel.

Thanks for your advice.
0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views

Download CodeAnalyst from http://developer.amd.com/CPU/CODEANALYST/Pages/default.aspx
Use timer-based profiling (it works on IA32 and EM64T).

It looks like the problem is that the LUT is not vectorizing because it requires a gather operation. Therefore I do not think SSE is, or can be, used effectively with your LUT until later instruction sets (AVX and beyond) support scatter/gather.

However, consider replacing your LUT with the SSE instruction PSADBW and its intrinsic

__m128i _mm_sad_epu8(__m128i a, __m128i b)

which computes the absolute differences of the 16 unsigned 8-bit values of a and b.

Then use a horizontal add to get the increment for sad.
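A minimal sketch of that idea for the 8-wide case (my illustration, not the thread's encoder code; it assumes the reference pixels sit at even byte offsets, as in the loop above, and that the row is padded so the 16-byte load is safe):

[cpp]#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Sketch: sum of absolute differences for one 8-pixel row via PSADBW.
// The 8 reference pixels are at even offsets, so load 16 bytes, keep
// the even-offset ones, pack down to 8 bytes, then let PSADBW sum the
// absolute differences horizontally.
static inline int sad_row8(const uint8_t *orig, const uint8_t *ref)
{
    __m128i o = _mm_loadl_epi64((const __m128i *)orig);  // 8 orig bytes
    __m128i r = _mm_loadu_si128((const __m128i *)ref);   // 16 ref bytes (reads 1 past ref[14])
    r = _mm_and_si128(r, _mm_set1_epi16(0x00FF));        // keep even-offset bytes
    r = _mm_packus_epi16(r, _mm_setzero_si128());        // pack to 8 bytes
    return _mm_cvtsi128_si32(_mm_sad_epu8(o, r));        // horizontal SAD sum
}
[/cpp]

A row loop over j could then add sad_row8(orig + j*MBK_SIZE, ref + j*2*w) and keep the early exit on bestSad per row.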

Jim Dempsey

0 Kudos
JenniferJ
Moderator
485 Views
Just an update on this issue.

I checked the loop in findSad2_16x(). It is not possible to vectorize it. But the code generated by icl should not run slower, so I have filed an issue report with the compiler team.
When there is any progress, I'll let you know.

Jennifer
0 Kudos
TimP
Honored Contributor III
485 Views

I checked the loop in findSad2_16x(). It is not possible to vectorize it. But the code generated by icl should not run slower, so I have filed an issue report with the compiler team.

32-bit ICL is doing some spills in that loop and is not optimizing out the multiply, which can be corrected by cutting back on source unrolling (see the sketch below). gcc can be tricky, depending on whether you specified a level of unrolling aggressiveness appropriate to your CPU (different for Penryn and Core i7, for example).
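One way to act on that (a sketch using the names from the snippet earlier in the thread, not code from the project):

[cpp]// Sketch: the 8-way hand-unrolled body rolled back into a loop, with
// the row base addresses hoisted so j*MBK_SIZE and j*2*w are computed
// once per row instead of being folded into every element access; the
// compiler can then pick its own unroll factor.
static int32 sad_row(const u_int8 *orig, const u_int8 *ref, int w, int j, int32 sad)
{
    const u_int8 *o = orig + j * MBK_SIZE;
    const u_int8 *r = ref  + j * 2 * w;
    for (int i = 0; i < 8; ++i)
        sad += ABS_DIFF(o[i], r[2 * i]);
    return sad;
}
[/cpp]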
0 Kudos
aazue
New Contributor I
485 Views
Just an update on this issue.

I checked the loop in findSad2_16x(). It is not possible to vectorize it. But the code generated by icl should not run slower, so I have filed an issue report with the compiler team.
When there is any progress, I'll let you know.

Jennifer


Hi,
I don't know if this is related to the low OpenMP performance mentioned before in that specific situation, but I have observed a threading problem with ICC.
I am adding an easy function that shows where the problem is.

Remark: personally I do not have the same problem, since I drive threading manually, old-school (with the pthread library), whether g++ or icc is used.

The function searches a text for relations with an array of words;
there is a loop that lets each word's search run as a separate chunk.
The situation: you want to determine, from a text string, whether what you are searching for relates to a specific sector.
Example array:
{"network","wireless","lan","wan","gateway","mask","cid","datagram","stream","tcp","socket", etc.}
If you add a flag (shared, as an integer), each chunk can observe a halt flag.
You decide that x words (nps in the sample) suffice to determine the sector relation,
much as a Google-like engine determines an activity relation for a deal (onFishOver isMoneyProsper($,$)).

Once the flag value reaches x (nps in the sample), all chunks that have not already found an occurrence can be halted
(those threads are no longer necessary; halting the obsolete ones keeps them from working for the wind).
(The asynchronous-side improvement lies here, in the random probability matrix, not in the sheer number of calls.)
In opposition to g++, which gives varying times as the number of required relations (nps) changes,
if you make the same test with ICC you observe the time unchanged whether you increase or decrease the number of relations.
I think the threading is not working correctly... logically, the time must vary with the required number of relations.

Unless I am mistaken, the function is easy and clear enough to make this evident, I think...



SAMPLE:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <omp.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

////////////////////////////////////////////////////////////////////////////////
// COUNT_ARRAY_OCCURS: COUNTS THE ARRAY'S OCCURRENCES ASYNCHRONOUSLY IN A STRING
////////////////////////////////////////////////////////////////////////////////
int
count_array_occurs (char *a, char **b, int c, int d)
{
  // A IS THE GLOBAL STRING IN WHICH OCCURRENCES MUST BE COUNTED
  // B IS THE ARRAY OF WORDS WHOSE OCCURRENCES IN A MUST BE COUNTED
  // C IS THE SIZE OF THE ARRAY PASSED IN
  // D IS THE NUMBER OF PROBABLE RELATIONS REQUIRED FOR A DEDUCTION
  int la = strlen (a);
  int lc;
  int noc;
  int pos[la];
  int j;
  int x;
  int k;
  int p = 0;
  for (int i = 0; i <= c - 1; i++)
    {
      lc = strlen (b[i]);
      noc = 0;
    }
  //omp_set_nested (c);
#pragma omp parallel shared(a,p) private(j,k,x)
  {
    for (int i = 0; i <= c - 1; i++)
      {
        // WRONG: ADDED JUST FOR CROSS-VERIFICATION WITH KMP_AFFINITY 0
        // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
        // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

        if (p > d)              // HALT FLAG: SATISFACTION POINT REACHED
          {
            i = c - 1;
          }
#pragma omp sections nowait
        {

#pragma omp section
          for (j = 0; j <= la - 1; j++)
            {
              x = 0;
              if (a[j] == b[i][0])
                {
                  for (k = j; k <= j + lc - 1; k++)
                    {
                      if (a[k] == b[i][x] && x <= lc - 1)
                        {
                          x++;
                        }
                      if (x == lc)
                        {
                          noc++;
                          pos[noc] = k - x;
                          p++;

                        }
                    }
                }
            }
        }
      }
  }
  for (int i = 0; i <= c - 1; i++)
    {
      std::cout << noc << " <-OCC-> " << b[i] << std::endl;
      int m = 1;
      while (pos[m] != 0)
        {
          std::cout << pos[m] << " <-POS-> " << b[i] << " <-TO-> " << pos[m] + strlen (b[i]) << std::endl;
          m++;
        }
    }
  return (0);
}

// JUST ADDED AS A TEST TO VERIFY WITHOUT THE LOOP (AS DEFINED)
///////////////////////////////////////////////////////////////////////////////
// BADGERS_LOOP_2_OCCURS: COUNTS 2 OCCURRENCES ASYNCHRONOUSLY IN A STRING
///////////////////////////////////////////////////////////////////////////////
// A IS THE STRING IN WHICH OCCURRENCES MUST BE COUNTED
// B IS THE FIRST WORD WHOSE OCCURRENCES ARE COUNTED IN A
// C IS THE SECOND WORD WHOSE OCCURRENCES ARE COUNTED IN A
int
badgers_loop_2_occurs (char *a, char *b, char *c)
{
  int la = strlen (a);
  int lb = strlen (b);
  int lc = strlen (c);
  int j;
  int x;
  int k;
  int noc = 0;
  int noc1 = 0;
#pragma omp parallel shared(a) private(j,k,x)
  {
#pragma omp sections nowait
    {
#pragma omp section
      for (j = 0; j <= la - 1; j++)
        {
          x = 0;
          if (a[j] == b[0])
            {
              for (k = j; k <= j + lb - 1; k++)
                {
                  if (a[k] == b[x] && x <= lb - 1)
                    {
                      x++;
                    }
                  if (x == lb)
                    {
                      noc++;
                    }
                }
            }
        }
#pragma omp section
      for (j = 0; j <= la - 1; j++)
        {
          x = 0;
          if (a[j] == c[0])
            {
              for (k = j; k <= j + lc - 1; k++)
                {
                  if (a[k] == c[x] && x <= lc - 1)
                    {
                      x++;
                    }
                  if (x == lc)
                    {
                      noc1++;
                    }
                }
            }
        }
    }
  }
  return noc + noc1; // NOC + NOC1 IS THE SUM OF OCCURRENCES
}

int
main (int argc, char *argv[])
{
char testocc[4096];
char *strtab[20] = { "smoker", "system", "Run", "quotes", "not", "a", "sample is :", "on", "the", "server", "to", "re", "ll", "faults", "pre", "ch", "na", " is", ",", "." };
int nps = 100;
strcpy (testocc,
" nString testocc sample is:n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character
(s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2\nString testocc sample is:\n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installe
r.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults)open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quoteswithin quotes.If undefined, it will equal close.Repeat 2\nString testocc sample is:\n (Run the system
preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote openi
ng character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2\nString testocc sample is:\n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method
.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2\nString testocc sample is:\n (Run the system preparation tool on the server to check the client syste
m for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined,
it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.");

std::cout << testocc << std::endl;
std::cout << "\nFunction count_array_occurs\n" << std::endl;
int res = count_array_occurs (testocc, &strtab[0], 20, nps);
std::cout << "\nbadgers_loop_2_occurs (testocc, \"a\", \"system\") result:" << std::endl;
std::cout << badgers_loop_2_occurs (testocc, "a", "system") << std::endl;
}


END SAMPLE

Remark: add the flag -Wno-write-strings with the GNU compiler to disable the string-literal warnings (C++).

Information about the time utility used:
on a uniprocessor, the difference between the real time and the total processor time, that is,
real - (user + sys)
is the sum of all the factors that can delay the program, plus the program's own unattributed costs.
On an SMP, an approximation would be:
real * number_of_processors - (user + sys)


The problem is not that the time is bad in itself (that is just a symptom of wrong working); it is the absence of variability.
I also tested with a random size for the i loop, to see if there was a pthread_key_t (first/last) problem, but the result is the same????

GNU COMPILER nps=100
real 0m0.017s
user 0m0.004s
sys 0m0.000s

GNU COMPILER nps=1000
real 0m0.036s
user 0m0.000s
sys 0m0.000s

ICC COMPILER nps=100
real 0m0.202s
user 0m0.000s
sys 0m0.004s

ICC COMPILER nps=1000
real 0m0.202s
user 0m0.012s
sys 0m0.004s


(library)
ICC build (shared, but the result is the same if static):
linux-vdso.so.1 => (0x00007ffff83fe000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8feff2e000)
libimf.so => /opt/intel/Compiler/11.0/081/lib/intel64/libimf.so (0x00007f8fefbd8000)
libsvml.so => /opt/intel/Compiler/11.0/081/lib/intel64/libsvml.so (0x00007f8ff0193000)
libm.so.6 => /lib/libm.so.6 (0x00007f8fef955000)
libiomp5.so => /opt/intel/Compiler/11.0/081/lib/intel64/libiomp5.so (0x00007f8fef7c5000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8fef4be000)
libintlc.so.5 => /opt/intel/Compiler/11.0/081/lib/intel64/libintlc.so.5 (0x00007f8fef380000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f8fef169000)
libc.so.6 => /lib/libc.so.6 (0x00007f8feee16000)
libdl.so.2 => /lib/libdl.so.2 (0x00007f8feec12000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8ff014a000)

GNU g++ build:
linux-vdso.so.1 => (0x00007fffe93ff000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f50e0fa9000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f50e0ca2000)
libm.so.6 => /lib/libm.so.6 (0x00007f50e0a1f000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f50e0817000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f50e0600000)
libc.so.6 => /lib/libc.so.6 (0x00007f50e02ad000)
/lib64/ld-linux-x86-64.so.2 (0x00007f50e11c5000)
librt.so.1 => /lib/librt.so.1 (0x00007f50e00a4000)

Remark: I used a somewhat older machine with the stock kernel and all original libraries (no rebuilds), a default Debian 5 install.
The GNU compiler is also the distribution's, not the latest snapshot.
(I still need to run the test on a Core i7 or another new 4-core part, and with the latest ICC versions; I just need to find the time...)
Maybe the problem is this machine type???

Target: x86_64-linux-gnu
gcc version 4.3.2 (Debian 4.3.2-1.1)

Machine used:
debian:/# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6403.60
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6398.20
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:



From parallel to parallel...
You know the metaphor about the programmer who absolutely wants to improve even what already works well?
"To prove how big his love is, he kissed her so strongly that he swallowed her eye."


Kind regards

0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views
Joe,

I suggest you change your programming style a little by adding comments to your {}'s so that your scopes are obvious. This way you may eliminate programming errors (or avoid making assumptions you ought not to make).

Example

[cpp]//omp_set_nested (c);
#pragma omp parallel  shared(a,p) private(j,k,x)
{
 for (int i = 0; i <= c - 1; i++)
 {
  // WRONG: ADDED JUST FOR CROSS-VERIFICATION WITH KMP_AFFINITY 0
  // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
  // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

  if (p > d)                // HALT FLAG: SATISFACTION POINT REACHED
  {
   i = c - 1;
  }
  #pragma omp sections nowait
  {
   #pragma omp section
   for (j = 0; j <= la - 1; j++)
   {
    x = 0;
    if (a[j] == b[i][0])
    {
     for (k = j; k <= j + lc - 1; k++)
     {
      if (a[k] == b[i][x] && x <= lc - 1)
      {
       x++;
      }
      if (x == lc)
      {
       noc++;
       pos[noc] = k - x;
       p++;
      } // if (x == lc)
     } // for (k = j; k <= j + lc - 1; k++)
    } // if (a[j] == b[i][0])
   } // for (j = 0; j <= la - 1; j++)
   // end #pragma omp section
  } // end #pragma omp sections nowait
 } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)
{
 std::cout << noc << "   <-OCC->   " << b[i] << std::endl;
 int m = 1;
 while (pos[m] != 0)
 {
  std::cout << pos[m] << "   <-POS->   " << b[i] << "   <-TO->   " << pos[m] + strlen (b[i]) << std::endl;
  m++;
 }
}
return (0);
}

[/cpp]
In the above you can now clearly see that you have

omp parallel
omp sections
omp section ?????? one section ???
end omp sections
end omp parallel

The effect of the above is that only one thread is doing any productive work.
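For comparison, here is a minimal sketch (mine, not Bustaf's intended design) of a structure that puts the whole team to work on the scan, with the shared updates serialized:

[cpp]#include <string.h>
#include <omp.h>

// Sketch: count occurrences of word w in string a, sharing the scan
// across the team with a worksharing for; the critical section protects
// the shared counter and position array (results arrive in no
// particular order).
int count_occurs_omp(const char *a, const char *w, int *pos)
{
    int la = (int)strlen(a), lw = (int)strlen(w), noc = 0;
    #pragma omp parallel for
    for (int j = 0; j <= la - lw; j++)
    {
        if (memcmp(a + j, w, (size_t)lw) == 0)
        {
            #pragma omp critical
            {
                pos[noc] = j;   // unique slot, protected by the critical
                noc++;
            }
        }
    }
    return noc;
}
[/cpp]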

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views

Also, are you missing an omp for on your for(i= loop?

If you intend each thread to run the for(i= loop, then you must resolve the race conditions in

noc++;
pos[noc] = k - x;

where multiple threads have concurrent access to the same array location.

Jim
0 Kudos
aazue
New Contributor I
485 Views

Also, are you missing an omp for on your for(i= loop?

If you intend each thread to run the for(i= loop, then you must resolve the race conditions in

noc++;
pos[noc] = k - x;

where multiple threads have concurrent access to the same array location.

Jim

1] First, I am not Joe.
2] Sorry, but I believe you have included absolutely nothing of substance:
your suggestions are pure literature, completely beside the point.
Please give code in your style, with the same functionality, and with specific results that use time as the reference.
And clean your glasses, to see what you have not seen.
You can play the professor, with your technical Shakespearean style and hidden literature, with others, but not with me.
Trace all the threads and you will probably understand how far off you are...

Bustaf
0 Kudos
jimdempseyatthecove
Honored Contributor III
485 Views
Quoting - bustaf

1] First, I am not Joe.
2] Sorry, but I believe you have included absolutely nothing of substance:
your suggestions are pure literature, completely beside the point.
Please give code in your style, with the same functionality, and with specific results that use time as the reference.
And clean your glasses, to see what you have not seen.
You can play the professor, with your technical Shakespearean style and hidden literature, with others, but not with me.
Trace all the threads and you will probably understand how far off you are...

Bustaf

Bustaf,

Sorry about calling you Joe (joex26 started this thread).

I will add additional comments to the reformatted version of _your_ code:

[cpp]//omp_set_nested (c);
   // ^^ your commenting-out of omp_set_nested is OK
   // vv pragma to begin parallel region
#pragma omp parallel  shared(a,p) private(j,k,x)
{
   // ^^ scoping brace for parallel region
   // -- all threads in the team are running through this region
   // vv each thread executes the following for loop
 for (int i = 0; i <= c - 1; i++)
 {
   // -- each thread arrives here with i=0,1,2,...,c-1
   // -- and arrives here at uncontrollable times
  // WRONG: ADDED JUST FOR CROSS-VERIFICATION WITH KMP_AFFINITY 0
  // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
  // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

  if (p > d)                // HALT FLAG: SATISFACTION POINT REACHED
  {
   i = c - 1;
  }
   // vv pragma to divide up the current team into sections
  #pragma omp sections nowait
  {
   // ^^ brace to begin scope of sections
   // (and implicit first section)
   // -- the first thread of the team reaching sections executes this section
   // *** CAUTION ***
   // *** this sections has nowait AND is enclosed in a loop,
   // *** making it possible that a thread may enter this sections
   // *** on iteration n+1 _prior_ to another team member entering
   // *** sections on iteration n. This violates "all threads must pass
   // *** through sections" (although eventually they will);
   // *** the specification is silent on whether an implementation
   // *** can work around this
   // vv due to the lack of statements between sections { and pragma omp section,
   // vv the following section is the 1st section of the sections
   // vv and therefore redundant
   #pragma omp section
   // vv no brace (ok), therefore the following for statement is section 1
   for (j = 0; j <= la - 1; j++)
   {
    x = 0;
    if (a[j] == b[i][0])
    {
     for (k = j; k <= j + lc - 1; k++)
     {
      if (a[k] == b[i][x] && x <= lc - 1)
      {
       x++;
      }
      if (x == lc)
      {
       noc++;
       pos[noc] = k - x;
       p++;
      } // if (x == lc)
     } // for (k = j; k <= j + lc - 1; k++)
    } // if (a[j] == b[i][0])
   } // for (j = 0; j <= la - 1; j++)
   // end #pragma omp section
  } // end #pragma omp sections nowait
 } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)
[/cpp]

Comments:

The sections with nowait, inside a loop within a parallel region
(and without a barrier), operates under unspecified rules.

If you were to remove nowait (or add a barrier), then only one
thread would perform productive work (as explained earlier).
This would be equivalent to collapsing the sections and section into a single serial block.

If you want each thread to enter the for(j loop, then you must resolve
the possibility that multiple threads may concurrently execute
"noc++" with the same value of i, in which case the result
is not deterministic. A similar (but not quite the same) issue exists with
"pos[noc] = k - x", where multiple threads may execute the statement
at the same time with the same value of i, in which case you would
be using an indeterminate value of noc as a subscript to store
a value of "k-x" which may also differ between threads.

Unless you want gibberish in noc and pos, the above code (as you wrote it)
is senseless.

Jim Dempsey

0 Kudos
aazue
New Contributor I
485 Views
Quoting - jimdempseyatthecove

[...] Unless you want gibberish in noc and pos, the above code (as you wrote it)
is senseless.
Instead of trying to drown the fish with some nonsense literature,
please give code in your style, with the same functionality, and with specific results
that use time as the reference,
for resolving this difference is a matter of reality, not literature.

-rwxr-xr-x 1 daemon staff 82094 Nov 7 06:35 ficat5 ICC COMPILED
real 0m0.202s (slow, as 82094 bytes must be read by the system - 4 times the GNU size)
user 0m0.004s
sys 0m0.000s


-rwxr-xr-x 1 daemon staff 21410 Nov 7 06:37 ficat5 GNU COMPILED
real 0m0.097s
user 0m0.004s
sys 0m0.004s

Sorry, I have other tasks to do and little time left for you.

Bustaf

I forgot...
Just compile exactly what I wrote, without the -parallel flag (so as not to cross it with the vectorizer),
and read your screen:


ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcKZn3QH.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

Also, if you want, add -par-report=3:
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcEbE1oi.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: main
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: count_array_occurs
procedure: __sti__$E
Maybe... did the compiler use ghost threads to do the work, and write fictitious reports
just so that you would be happy???
Or, as this shows, is the compiler a dummy?? The catastrophic result is not a dummy...

I forgot also...

As written in the OpenMP documentation (https://computing.llnl.gov/tutorials/openMP/#Introduction):

OMP_IN_PARALLEL
Purpose:

in C/C++, it will return a non-zero integer if called in parallel, and zero otherwise.



Add an integer declared as ipa,

and after this part of the code:

pos[noc] = k - x;
p++;

add:

ipa = omp_in_parallel();
std::cout << " <-IPA-> " << ipa << std::endl;

Run the program and see whether it prints 0 or 1.
Then also move this check to before #pragma omp sections nowait,
run the program again, and see whether it prints 0 or 1.


I think the problem is that you deliberately played dumb: you understood that I do not want each array element
given to its own machine thread. I wrote "chunks" as parts of the loop, in my language, not as individual threads.
I am not stupid enough to take a route that can only decrease performance, including on the barrier side.
I do not see the interest of using machine threads aligned to the array size when reusing the same threads largely suffices,
nor of using a barrier for the same task with all-separate threads, where you merely extend each sem_wait until the semaphore answers 0,
working for the wind.

You play your literature game against my handicap of poor control of your language,
but I think every serious programmer has understood what I meant,
and several have probably already used omp_in_parallel() to check that what I described is true.
0 Kudos
jimdempseyatthecove
Honored Contributor III
386 Views

Bustaf,

Make these small changes to your code and run it. You may then begin to understand what is happening.

[cpp]#pragma omp parallel           
{
  int iTeamMember = omp_get_thread_num();
  int nTeamMembers = omp_get_num_threads();
  printf("parallel iTeamMember = %d, nTeamMembers = %d\n",iTeamMember,nTeamMembers);
  ...
  for (int i = 0; i <= c - 1; i++)
  {
    printf("for(i= with i=%d, iTeamMember = %d\n",i,iTeamMember);
    ...
    printf("prior to sections iTeamMember = %d\n",iTeamMember);
    #pragma omp sections nowait           
    {
      #pragma omp section
      { // ** add brace to include print in section
        printf("begin section iTeamMember = %d\n",iTeamMember);
        for (j = 0; j <= la - 1; j++)   
        {
          ...
        } // for (j = 0; j <= la - 1; j++)   
        printf("end section iTeamMember = %d\n",iTeamMember);
      } // ** add brace to close section
    } // end #pragma omp sections nowait   
    printf("following sections iTeamMember = %d\n",iTeamMember);
  } // for (int i = 0; i <= c - 1; i++)   
} // end #pragma omp parallel   
[/cpp]

Jim Dempsey
0 Kudos