Joe,
Have you run a profiler to help determine where the problem is?
The problem may not be related to the code you compile. For example, if you are compiling x64 code but using an x32 library for graphics, then each call into the library performs an operation called a "thunk" (to transition and pass arguments from x64 to x32 and back when necessary).
Data and code alignment may also affect performance (and usually does).
Don't blame the compiler when the circumstances are beyond its control.
Jim Dempsey
Hi
Use WMI to post the (public) characteristics of your machine before putting the product down without justification. Evaluating a compiler requires several tests on different, appropriate hardware before any generalised conclusion.
About the GNU compiler: you are better off with a recently upgraded version built for the penguin operating system rather than for Bill's side; otherwise it's like eating soup with a fork.
Some people sell toothbrushes to birds and present the advice as an effective benefit.
Is the Penguin (not an old one) a bird with teeth? Maybe it needs a flag (-bite) to appreciate the performance...
Example compiler invocation with an "optimized" result flag:
C:\Program Files\Bill_Icc /SSE** /bite optimized_pig.cpp
An objective answer?
Kind regards
Thank you for your responses. Please don't take my statements as disrespect toward Intel's compiler.
I just posted real results from my tests, taken as an average over 12 runs for each option set.
I am not using any 32-bit libraries, so thunking should not be the issue here. I will post the WMI details tomorrow when I am back at work.
I did profiling with the AQTime profiler, so I am aware of the hotspots in the software. The code is actually already heavily optimized, so maybe there is not much left for the Intel compiler to do there?
I am developing software based on this:
http://research.nokia.com/research/mobile3D
Hi Joex26
Seriously now, joking aside.
I think you would do better to benchmark ICC 11.x against VC2008 (or the latest version) side by side. You will probably discover approximately the same results if your code was not written specifically to favor ICC.
See if you can change some while loops to for loops, with correct private local pointers, to enable possible vectorization.
Also look at TBB and OpenMP, which can be used beyond loops (for example on the SCTP side, etc.).
If you can also test on a recent part such as a Core i7, Atom, or ULV processor, so much the better.
You are welcome here, free to form your own judgment and to use your favorite compiler.
With or without the ICC compiler, success to all the members of your team.
Computers the size of a phone have real potential for success, now or tomorrow, I hope.
Kind regards
After reflection, I add this to the exchange (with flowers): as a precaution, if I buy a Nokia phone tomorrow, I hope I will not find a penguin inside, optimized to bite my ear.
Hi
I tried OpenMP but it works slower, similarly to automatic parallelization.
Then you need to learn how the source code must be written for it.
About changing while() to for(): with a while loop the compiler has no information about where the iteration space can be cut or divided into chunks (started in pairs, odd/even, or walked from both ends with ++ and --).
About TBB: if you are working in C++ it is a very nice library for reducing and simplifying source code.
Also, OpenMP is not reserved for loops only. Another example: I modified an old socket application to use SCTP, created a group of dummy virtual addresses, divided the work into asynchronous chunks (going out through the default gateway), and observed the result. It was not -4% but a 25-30% improvement.
About phones such as Nokia's and other brands: I seriously think that this type of device, usable now as a computer, is an opportunity to get out of the effects of the financial crisis. (That deserves respect.)
It is a deception on your part not to discern the difference between fun and what is real.
If you have a sample to show publicly, at your level, in a browser against your HTTP server, I have one too; then everyone can evaluate which of the two of us must return to school.
I don't like using this kind of language; I was just forced to by your aggressive answer. (I suppose you are a smoker.)
Kind regards
Joe,
Would it be possible for you to post sample code showing the problem?
Autoparallelization, OpenMP and TBB have different characteristics. I would not place OpenMP and TBB in the same category. While I am not promoting TBB, I suggest you not discount it so quickly as to say it works similarly to OpenMP.
You might try profiling the code to see what is happening. Intel has VTune, but you can also run AMD's CodeAnalyst (CA) using timer-based sampling on Intel processors. CA is a free download and IMHO is simpler to use. Also, Intel has a Parallel Advisor demo that might provide insight into the bottleneck.
From the symptom description I suspect something else at play than compiler optimizations.
RE: thunk
Several months ago I downloaded the Havok Smoke demo (32-bit). My system runs Windows XP x64. While I can compile and run 32-bit applications, the OpenGL display drivers required thunking to transition between x32 and x64. The result was an abysmal frame rate. When obscuring the display window (or minimizing it), performance was restored. There may be an option in the Performance Monitor to count thunks; that would tell you whether this is affecting your performance.
Jim Dempsey
#ifdef VECTORIZATION
int32 findSad2_8x(u_int8 *restrict orig, u_int8* restrict ref, int w, int blkHeight, int32 sad, int32 bestSad)
#else
int32 findSad2_8x(u_int8 *orig, u_int8* ref, int w, int blkHeight, int32 sad, int32 bestSad)
#endif
{
#ifndef VECTORIZATION
int j;
j = 0;
do {
sad += ABS_DIFF(orig[j*MBK_SIZE+0], ref[j*2*w+0]);
sad += ABS_DIFF(orig[j*MBK_SIZE+1], ref[j*2*w+2]);
sad += ABS_DIFF(orig[j*MBK_SIZE+2], ref[j*2*w+4]);
sad += ABS_DIFF(orig[j*MBK_SIZE+3], ref[j*2*w+6]);
sad += ABS_DIFF(orig[j*MBK_SIZE+4], ref[j*2*w+8]);
sad += ABS_DIFF(orig[j*MBK_SIZE+5], ref[j*2*w+10]);
sad += ABS_DIFF(orig[j*MBK_SIZE+6], ref[j*2*w+12]);
sad += ABS_DIFF(orig[j*MBK_SIZE+7], ref[j*2*w+14]);
j++;
} while (sad < bestSad && j < blkHeight);
#endif
#ifdef VECTORIZATION
int j,i;
for(j=0;j<blkHeight;j++) {
    for(i=0;i<8;i++)
        sad += ABS_DIFF(orig[j*MBK_SIZE+i], ref[j*2*w+i*2]);
    if (sad >= bestSad)
        break;
}
#endif
return sad;
}
static const u_int8 absDiff[2*MAX_DIFF+1] = {
255,254,253,252,251,250,249,248,247,246,245,244,243,242,241,240,
......
.....
and so on....
Download CodeAnalyst from http://developer.amd.com/CPU/CODEANALYST/Pages/default.aspx
Use timer based profiling (works on IA32 and EM64T).
From the looks of it, the problem is that the LUT is not vectorizing because it requires a gather operation. Therefore I do not think SSE is, or can be, used effectively with your LUT until later instruction sets (AVX) supporting scatter/gather.
However, consider replacing your LUT with the SSE instruction PSADBW and its intrinsic:
__m128i _mm_sad_epu8(__m128i a, __m128i b)
which computes the absolute differences of the 16 unsigned 8-bit values of a and b.
Then use a horizontal add to get the increment for sad.
Jim Dempsey
I checked the loop in findSad2_16x(). It's not possible to vectorize it. But the code generated by icl should not run slower, so I've filed an issue report with the compiler team.
Jennifer
Hi
I don't know if this is related to the low performance (the OpenMP subject) mentioned before in your specific situation, but I observe a threading problem with ICC.
I've added a simple function to show where the problem is.
Remark: personally I don't have the same problem, because I drive threading manually, old school (with the pthread library), whether g++ or icc is used.
The function searches for the relation of a text to an array of words; a loop allows each word's search to run as a separate chunk.
The situation: you want to determine, from a text string, whether what you are searching for relates to a specific sector.
Example array:
{"network","wireless","lan","wan","gateway","mask","cid","datagram","stream","tcp","socket", etc.}
If you add a flag (shared, as an integer), each chunk can observe a halt flag. You decide that x matched words (nps in the sample) suffice to determine the sector relation, much as a search engine determines an activity relation for a deal.
Once the flag value reaches x (nps in the sample), the chunks that have not yet found an occurrence can be halted (those threads are now unnecessary; they are halted so as not to work for the wind).
The asynchronous-side improvement is here (a random probability matrix), not in the raw number of calls.
In contrast with g++, which gives varying times as the number of call relations (nbs) changes, if you run the test with ICC you observe that the time is unchanged whether you increase or decrease the number of relations.
I think the threading is not working correctly: logically the time must vary with the number of call relations.
Unless I am mistaken, the function is simple and clear enough to make this evident, I think...
SAMPLE:
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <cstdio>
#include <omp.h>
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
////////////////////////////////////////////////////////////////////////////////
//VOID COUNT_ARRAY_OCCURS FOR COUNT ARRAY OCCURRENCES ASYNCHRONY IN AN STRING//
////////////////////////////////////////////////////////////////////////////////
int
count_array_occurs (char *a, char **b, int c, int d)
{
// A IS GLOBAL STRING WHERE MUST COUNTED OCCURRENCE
// B IS ARRAY OF WORDS OCCURRENCES AS MUST COUNTED IN A
// C IS SIZE CALLED OF ARRAY
// D IS NUMBER PROBABILITY RELATION REQUIRED TO AN DEDUCTION
int la = strlen (a);
int lc;
int noc;
int pos[la];
int j;
int x;
int k;
int p = 0;
for (int i = 0; i <= c - 1; i++)
{
lc = strlen (b[i]);
noc = 0;
}
//omp_set_nested (c);
#pragma omp parallel shared(a,p) private(j,k,x)
{
for (int i = 0; i <= c - 1; i++)
{
// WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
// if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
// if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}
if (p > d) //FLAG HALT AS SATISFACT POINT
{
i = c - 1;
}
#pragma omp sections nowait
{
#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == b[i][0])
{
for (k = j; k <= j + lc - 1; k++)
{
if (a[k] == b[i][k - j] && x <= lc - 1)
{
x++;
}
if (x == lc)
{
noc++;
pos[noc] = k - x;
p++;
}
}
}
}
}
}
}
for (int i = 0; i <= c - 1; i++)
{
std::cout << noc << " <-OCC-> " << b[i] << std::endl;
int m = 1;
while (pos[m] != 0)
{
std::cout << pos[m] << " <-POS-> " << b[i] << " <-TO-> " << pos[m] + strlen (b[i]) << std::endl;
m++;
}
}
return (0);
}
// JUST ADDED AS TEST TO VERIFY WITHOUT LOOP (AS DEFINED)
///////////////////////////////////////////////////////////////////////////////
//VOID BADGERS_LOOP_2_OCCURS FOR COUNT 2 OCCURRENCES ASYNCHRONY IN AN STRING//
///////////////////////////////////////////////////////////////////////////////
//A IS STRING WHERE MUST COUNTED OCCURRENCE
//B IS IS FIRST WORD OCCURRENCE COUNTED IN A
//C IS IS SECOND WORD OCCURRENCE COUNTED IN A
int
badgers_loop_2_occurs (char *a, char *b, char *c)
{
int la = strlen (a);
int lb = strlen (b);
int lc = strlen (c);
int j;
int x;
int k;
int noc = 0;
int noc1 = 0;
#pragma omp parallel shared(a) private(j,k,x)
{
#pragma omp sections nowait
{
#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == b[0])
{
for (k = j; k <= j + lb - 1; k++)
{
if (a[k] == b[k - j] && x <= lb - 1)
{
x++;
}
if (x == lb)
{
noc++;
}
}
}
}
#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == c[0])
{
for (k = j; k <= j + lc - 1; k++)
{
if (a[k] == c[k - j] && x <= lc - 1)
{
x++;
}
if (x == lc)
{
noc1++;
}
}
}
}
}
}
return noc + noc1; // NOC + NOC1 IS SUM OF OCCURRENCES
}
int
main (int argc, char *argv[])
{
char testocc[4096];
char *strtab[20] = { "smoker", "system", "Run", "quotes", "not", "a", "sample is :", "on", "the", "server", "to", "re", "ll", "faults", "pre", "ch", "na", " is", ",", "." };
int nps = 100;
strcpy (testocc,
" nString testocc sample is:n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character
(s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2 nString testocc sample is:n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installe
r.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults)open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quoteswithin quotes.If undefined, it will equal close.Repeat 2 nString testocc sample is:n (Run the system
preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote openi
ng character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2nString testocc sample is:n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method
.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2 nString testocc sample is:n (Run the system preparation tool on the server to check the client syste
m for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined,
it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.");
std::cout << testocc << std::endl;
std::cout << "\nFunction count_array_occurs\n" << std::endl;
int res = count_array_occurs (testocc, &strtab[NULL], 20, nps);
std::cout << "\nbadgers_loop_2_occurs (testocc, \"a\", \"system\") result:" << std::endl;
std::cout << badgers_loop_2_occurs (testocc, "a", "system") << std::endl;
}
END SAMPLE
Remark: add the flag -Wno-write-strings with the GNU compiler to disable the string warning (C++).
Information about the time utility used:
On a uniprocessor, the difference between the real time and the total microprocessor time, that is:
real - (user + sys)
is the sum of all of the factors that can delay the program, plus the program's own unattributed costs.
On an SMP, an approximation would be as follows:
real * number_of_processors - (user + sys)
The problem is not that the times themselves are bad (that would just be a wrong result); it is the absence of variability.
I have also tested with a random size for the i loop, to see if there is a pthread_key_t problem (first/last), but the result is the same ????
GNU COMPILER nps=100
real 0m0.017s
user 0m0.004s
sys 0m0.000s
GNU COMPILER nps=1000
real 0m0.036s
user 0m0.000s
sys 0m0.000s
ICC COMPILER nps=100
real 0m0.202s
user 0m0.000s
sys 0m0.004s
ICC COMPILER nps=1000
real 0m0.202s
user 0m0.012s
sys 0m0.004s
(library)
Build (shared) ICC (but same result if static)
linux-vdso.so.1 => (0x00007ffff83fe000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8feff2e000)
libimf.so => /opt/intel/Compiler/11.0/081/lib/intel64/libimf.so (0x00007f8fefbd8000)
libsvml.so => /opt/intel/Compiler/11.0/081/lib/intel64/libsvml.so (0x00007f8ff0193000)
libm.so.6 => /lib/libm.so.6 (0x00007f8fef955000)
libiomp5.so => /opt/intel/Compiler/11.0/081/lib/intel64/libiomp5.so (0x00007f8fef7c5000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8fef4be000)
libintlc.so.5 => /opt/intel/Compiler/11.0/081/lib/intel64/libintlc.so.5 (0x00007f8fef380000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f8fef169000)
libc.so.6 => /lib/libc.so.6 (0x00007f8feee16000)
libdl.so.2 => /lib/libdl.so.2 (0x00007f8feec12000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8ff014a000)
Build Gnu G++
linux-vdso.so.1 => (0x00007fffe93ff000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f50e0fa9000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f50e0ca2000)
libm.so.6 => /lib/libm.so.6 (0x00007f50e0a1f000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f50e0817000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f50e0600000)
libc.so.6 => /lib/libc.so.6 (0x00007f50e02ad000)
/lib64/ld-linux-x86-64.so.2 (0x00007f50e11c5000)
librt.so.1 => /lib/librt.so.1 (0x00007f50e00a4000)
Remark: I used this somewhat older machine with its original kernel and all original libraries (no rebuild), a default Debian 5 install.
Also, the GNU compiler is the distribution's original, not a recent snapshot.
(I still need to test with a Core i7 or another new 4-core machine, and with the latest ICC versions; I just need to find the time...)
Maybe the problem is this machine type?
Target: x86_64-linux-gnu
gcc version 4.3.2 (Debian 4.3.2-1.1)
Machine used:
debian:/# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6403.60
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6398.20
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
Parallel to parallel...
Do you know the metaphor about the programmer who absolutely wants to improve even what is already good?
"To prove how big his love is, he kissed her so strongly that he swallowed her eye."
Kind regards
I suggest you change your programming style a little by adding comments to your closing braces so that your scopes are obvious. This way you may eliminate programming errors (or assumptions you ought not to make).
Example
[cpp]
//omp_set_nested (c);
#pragma omp parallel shared(a,p) private(j,k,x)
{
  for (int i = 0; i <= c - 1; i++)
  {
    // WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
    // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
    // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}
    if (p > d) //FLAG HALT AS SATISFACT POINT
    {
      i = c - 1;
    }
    #pragma omp sections nowait
    {
      #pragma omp section
      for (j = 0; j <= la - 1; j++)
      {
        x = 0;
        if (a[j] == b[i][0])
        {
          for (k = j; k <= j + lc - 1; k++)
          {
            if (a[k] == b[i][k - j] && x <= lc - 1)
            {
              x++;
            }
            if (x == lc)
            {
              noc++;
              pos[noc] = k - x;
              p++;
            } // if (x == lc)
          } // for (k = j; k <= j + lc - 1; k++)
        } // if (a[j] == b[i][0])
      } // for (j = 0; j <= la - 1; j++)
      // end #pragma omp section
    } // end #pragma omp sections nowait
  } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)
{
  std::cout << noc << " <-OCC-> " << b[i] << std::endl;
  int m = 1;
  while (pos[m] != 0)
  {
    std::cout << pos[m] << " <-POS-> " << b[i] << " <-TO-> " << pos[m] + strlen (b[i]) << std::endl;
    m++;
  }
}
return (0);
}
[/cpp]
In the above you can now clearly see that you have:
omp parallel
omp sections
omp section ?????? one section ???
end omp sections
end omp parallel
The effect of the above is that only one thread is doing any productive work.
Jim Dempsey
Also, are you missing an omp for on your for(i= loop?
If you intend each thread to run through the for(i= loop, then you must resolve the race conditions in
noc++;
pos[noc] = k - x;
where multiple threads have concurrent access to the same array location.
Jim
1] First, I am not Joe.
2] Sorry, I believe you have understood absolutely nothing.
Your suggestions read as literature, completely beside the point.
Please give code in your own style, with the same functionality, with specific results using time as the reference.
Also clean your glasses, to see what you have not seen.
You can play the professor, with your Shakespeare-style technicalities and hidden literature, with others, but not with me.
Trace all the threads and you will probably understand how far off you are...
Bustaf
Bustaf,
Sorry about calling you Joe (joex26 started this thread). I will add additional comments to the reformatted version of _your_ code:
[cpp]
//omp_set_nested (c);
// ^^ your comment to turn off nested is OK
// vv pragma to begin parallel region
#pragma omp parallel shared(a,p) private(j,k,x)
{ // ^^ scoping brace for parallel region
  // -- all threads in team are running through this region
  // vv each thread executes following for loop
  for (int i = 0; i <= c - 1; i++)
  { // -- each thread arrives here with i=0,1,2,...,c-1
    // -- and arrives here at uncontrollable times
    // WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
    // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
    // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

    if (p > d) //FLAG HALT AS SATISFACT POINT
    {
      i = c - 1;
    }
    // vv pragma to divide up current team into sections
    #pragma omp sections nowait
    { // ^^ brace to begin scope of sections (and implicit first section)
      // -- first thread of team reaching sections executes this section
      // *** CAUTION ***
      // *** this sections has nowait AND is enclosed in a loop,
      // *** making it possible that a thread may enter this sections
      // *** on iteration n+1 _prior_ to other team members entering
      // *** sections on iteration n. This violates "all threads must pass
      // *** through sections" (although eventually they will);
      // *** the specification is mute on whether an implementation
      // *** can work around this.
      // vv due to lack of statements between sections { and pragma omp section,
      // vv the following section is the 1st section of the sections
      // vv and therefore redundant
      #pragma omp section
      // vv no brace (ok), therefore the following for statement is section 1
      for (j = 0; j <= la - 1; j++)
      {
        x = 0;
        if (a[j] == b[i][0])
        {
          for (k = j; k <= j + lc - 1; k++)
          {
            if (a[k] == b[i][k - j] && x <= lc - 1)
            {
              x++;
            }
            if (x == lc)
            {
              noc++;
              pos[noc] = k - x;
              p++;
            } // if (x == lc)
          } // for (k = j; k <= j + lc - 1; k++)
        } // if (a[j] == b[i][0])
      } // for (j = 0; j <= la - 1; j++)
      // end #pragma omp section
    } // end #pragma omp sections nowait
  } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)
[/cpp]
Comments:
The sections with nowait inside a loop within a parallel region (and without a barrier) is operating under unspecified rules. If you were to remove nowait (or add a barrier), then only one thread would perform productive work (as explained earlier); this would be equivalent to not having the sections and section at all.
If you want each thread to enter the for(j loop, then you must resolve the possibility that multiple threads may concurrently execute "noc++" with the same value of i, in which case the result is indeterminate. A similar (but not quite the same) issue exists with "pos[noc] = k - x", where multiple threads may execute the statement at the same time with the same value of i, in which case you would be using an indeterminate value of noc as a subscript to store a value of "k - x" which may also differ between threads.
Unless you want gibberish in noc and pos, the above code (as you wrote it) is senseless.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
1] First I am not Joe
2]Sorry, I believe that you have absolutely nothing included
As appropriate to your completely suggestions literature as completely outside
Please give the same (your style) code to same functionality with specific results that use Time to reference.
also make your glasses for see as you have not see.
You can make your professor playing with Shakespeare style technical with hidden literature with other but no with me.
Trace all threads and probably you understand how many you are far...
Bustaf
[cpp]Bustaf, Sorry about calling you joe (joex26 started this thread) I will add additional comments to the reformatted version of _your_ code 01.//omp_set_nested (c); // ^^ your comment to turn off nested is OK // vv pragma to begin parallel region 02.#pragma omp parallel shared(a,p) private(j,k,x) 03.{ // ^^ scoping brace for parallel region // -- all threads in team are running through this region // vv each thread executes following for loop 04. for (int i = 0; i <= c - 1; i++) 05. { // -- each thread arrives here with i=0,1,2,...,c-1 // -- and arrive here at uncontrolable times 06. // WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0 07. // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask); 08. // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");} 09. 10. if (p > d) //FLAG HALT AS SATISFACT POINT 11. { 12. i = c - 1; 13. } // vv pragma to dividE up current team into sections 14. #pragma omp sections nowait 15. { // ^^ brace to begin scope of sections // (and implicit first section) // -- first thread of team reaching sections executes this section // *** CAUTION **** // *** this sections has nowait AND is enclosed in a loop // *** making it possible that a thread may enter this sections // *** on iteration n+1 _prior_ to other team member entering // *** sections on iteration n. This violates all threads must pass // *** through sections (although eventually they will) // *** the specification is mute on this as to if an implementation // *** can work around this // vv due to lack of statements between sections { and pragma omp section // vv the following section is the 1st section of the sections // vv and therefor redundant 16. #pragma omp section // vv no brace (ok) therefore the following for statement is section 1 17. for (j = 0; j <= la - 1; j++) 18. { 19. x = 0; 20. if (a== b[0]) 21. { 22. for (k = j; k <= j + lc - 1; k++) 23. { 24. if (a == b && x <= lc - 1) 25. { 26. x++; 27. 
} 28. if (x == lc) 29. { 30. noc++; 31. pos[noc] = k - x; 32. p++; 33. } // if (x == lc) 34. } // for (k = j; k <= j + lc - 1; k++) 35. } // if (a == b[0]) 36. } // for (j = 0; j <= la - 1; j++) 37. // end #pragma omp section 38. } // end #pragma omp sections nowait 39. } // for (int i = 0; i <= c - 1; i++) 40.} // end #pragma omp parallel 41.// begin serial code 42.for (int i = 0; i <= c - 1; i++) Comments The sections with nowait inside a loop within a parallel region (and without barrier) is operating under unspecified rules. If you were to remove nowait (or add barrier) then only one thread would perform productive work (as explained earlier) This would be equivilent to making the sections and section into a single If you want each thread to enter the for(j loop then you must resolve the possibility that multiple threads may concurrently execute "noc++" with the same value for i, and under which case the result is not determinant. A similar (but not quite same) issue exists with "pos[noc] = k - x" where multiple theads execute the statemen at the same time with the same value of i, in which case you would be using an indeterminant value in noc as a subscript to store a value of "k-x" which may also differ between threads. Unless you want jibberish in noc and pos, the above code (as you wrote) is senseless. Jim Dempsey [/cpp]
1] First, I am not Joe.
2] Sorry, but I believe you have included absolutely nothing of substance; your suggestions read as literature that is completely beside the point.
Please give code in your own style, with the same functionality and with specific timed results for reference.
Also clean your glasses, so you can see what you have not seen.
You can play the professor with Shakespeare-style technical writing full of hidden literature with others, but not with me.
Trace all the threads, and you will probably understand how far off you are...
Bustaf
Instead of trying to drown the fish in nonsense literature, please give code in your own style, with the same functionality and with specific timed results for reference. Resolving this difference is a matter of measured reality, not literature.
-rwxr-xr-x 1 daemon staff 82094 Nov 7 06:35 ficat5    (ICC compiled)
real 0m0.202s    (slow, as all 82094 bytes must be read by the system; 4x the GNU size)
user 0m0.004s
sys 0m0.000s
-rwxr-xr-x 1 daemon staff 21410 Nov 7 06:37 ficat5    (GNU compiled)
real 0m0.097s
user 0m0.004s
sys 0m0.004s
Sorry, I have other tasks to do and less time to spend with you.
Bustaf
I forgot...
Just compile exactly what I wrote, without the -parallel flag (so it does not cross with auto-vectorization), and read your screen:
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcKZn3QH.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
You can also add -par-report=3 if you want:
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcEbE1oi.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: main
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: count_array_occurs
procedure: __sti__$E
Maybe the compiler used ghost threads and wrote fictitious remarks just to make you happy??? Or, as this shows, is the compiler a dummy?? The catastrophic result is not a dummy...
As written in the OpenMP documentation (https://computing.llnl.gov/tutorials/openMP/#Introduction):
OMP_IN_PARALLEL
Purpose: in C/C++, it returns a non-zero integer if called from within a parallel region, and zero otherwise.
Declare an integer ipa, and after this part of the code:
pos[noc] = k - x;
p++;
add:
ipa = omp_in_parallel();
std::cout << " <-IPA-> " << ipa << std::endl;
Run the program and see whether it prints 0 or 1.
Then also move this check to before #pragma omp sections nowait, run the program again, and see whether it prints 0 or 1.
I think the problem is that you are deliberately playing dumb, pretending that I want to assign each array element to its own new hardware thread. What I wrote were chunks, parts of the loop in my own terms, not one thread per element. I am not stupid enough to take an approach that can only decrease performance, and the same goes for the barrier approach. I do not see the interest in aligning array sizes to the number of hardware threads; using the same chunks largely suffices. Using a barrier for the same task, with all threads separate, just extends each sem_wait until the semaphore answers 0, for nothing.
You play your literature games against my handicap of poor control of your language, but I think every serious programmer has understood what I meant, and probably several have already used omp_in_parallel() to check that what I stated is true.
Bustaf,
Make these small changes to your code and run it. You may then begin to understand what is happening.
[cpp]
#pragma omp parallel
{
  int iTeamMember = omp_get_thread_num();
  int nTeamMembers = omp_get_num_threads();
  printf("parallel iTeamMember = %d, nTeamMembers = %d\n", iTeamMember, nTeamMembers);
  ...
  for (int i = 0; i <= c - 1; i++)
  {
    printf("for(i= with i=%d, iTeamMember = %d\n", i, iTeamMember);
    ...
    printf("prior to sections iTeamMember = %d\n", iTeamMember);
    #pragma omp sections nowait
    {
      #pragma omp section
      { // ** add brace to include print in section
        printf("begin section iTeamMember = %d\n", iTeamMember);
        for (j = 0; j <= la - 1; j++)
        {
          ...
        } // for (j = 0; j <= la - 1; j++)
        printf("end section iTeamMember = %d\n", iTeamMember);
      } // ** add brace to close section
    } // end #pragma omp sections nowait
    printf("following sections iTeamMember = %d\n", iTeamMember);
  } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
[/cpp]
Jim Dempsey