OpenMP generates large overhead in kernel32.dll (SleepEx)

intelbenz · 12-16-2010

I'm working on an image-processing project using OpenMP. I have some simple code, shown below. The program runs smoothly on my Linux platform with gcc 4.3.3, but it runs incredibly slowly on Windows XP (Visual Studio 2005 with Parallel Studio 2011). Hotspot analysis shows that the bottleneck is SleepEx in kernel32.dll.

Any ideas?

unsigned char   **a_data,
                **b_data,
                **c_data,
                *p,
                *p_a,
                *p_b,
                *p_c;
unsigned long   nr,
                nc;
nr = nc = 64;

a_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    a_data[i] = p + i*nr;
}
b_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    b_data[i] = p + i*nr;
}
c_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++){
    c_data[i] = p + i*nr;
}

for(int i=0; i<nr; i++){
    p_a = a_data[i];
    p_b = b_data[i];
    p_c = c_data[i];
#pragma omp parallel for
    for(int j=0; j<nc; j++)
    {
        p_a[j] = p_b[j] + p_c[j];
    }
}

jimdempseyatthecove · 12-28-2010

Your parallel for loop is too small to do anything useful in parallel. The iteration space is nc = 64, and the work performed per iteration is the addition of two char values.
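
For illustration, the posted loop nest enters a parallel region once per row, 64 times in all, each time to split 64 one-byte additions among the threads. A minimal sketch of one common restructuring is to put the pragma on the row loop, so a single parallel region covers the whole image. The helper name add_images and the flat row-major buffers are assumptions made for this sketch, not part of the original post, and whether this helps at all for a 64x64 image would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): a = b + c for an
   nr x nc image stored as one contiguous row-major buffer per image. */
static void add_images(unsigned char *a, const unsigned char *b,
                       const unsigned char *c, int nr, int nc)
{
    assert(nr > 0 && nc > 0);

    /* One parallel region for the whole image: each thread receives a
       block of complete rows instead of a few bytes of a single row. */
#pragma omp parallel for
    for (int i = 0; i < nr; i++) {
        unsigned char *p_a = a + (size_t)i * (size_t)nc;
        const unsigned char *p_b = b + (size_t)i * (size_t)nc;
        const unsigned char *p_c = c + (size_t)i * (size_t)nc;

        /* The inner loop stays serial; the sum wraps modulo 256,
           exactly as in the posted code. */
        for (int j = 0; j < nc; j++) {
            p_a[j] = (unsigned char)(p_b[j] + p_c[j]);
        }
    }
}
```

Calling add_images(a, b, c, 64, 64) on three 4096-byte buffers performs the same additions as the posted loop nest. Without OpenMP enabled the pragma is ignored and the function simply runs serially.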

If you enclose the posted code in a subroutine and then time many calls to that subroutine, the preponderance of the time will be spent in malloc (which precedes your loop).
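
One way to check this kind of claim is to time the two phases separately over many repetitions. The sketch below is only an assumption about how such a measurement could be set up: the function names seconds_alloc_only and seconds_add_only and the sink variable are hypothetical, and clock() is a coarse timer, so the numbers are indicative at best.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical sink so the compiler cannot discard the timed work. */
static volatile unsigned long sink = 0;

/* Time `calls` malloc/free pairs of n bytes each. */
static double seconds_alloc_only(int calls, size_t n)
{
    assert(calls > 0 && n > 0);
    clock_t t0 = clock();
    for (int k = 0; k < calls; k++) {
        unsigned char *p = (unsigned char *)malloc(n);
        if (p != NULL) {
            p[0] = (unsigned char)k;   /* touch the block */
            sink += p[0];
        }
        free(p);
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Time `calls` repetitions of the posted inner loop over one row of n
   bytes, keeping one parallel region per repetition as in the post. */
static double seconds_add_only(int calls, unsigned char *a,
                               const unsigned char *b,
                               const unsigned char *c, int n)
{
    assert(calls > 0 && n > 0);
    clock_t t0 = clock();
    for (int k = 0; k < calls; k++) {
#pragma omp parallel for
        for (int j = 0; j < n; j++) {
            a[j] = (unsigned char)(b[j] + c[j]);
        }
        sink += a[n - 1];
    }
    /* Note: clock() reports processor time on POSIX systems (summed
       over threads) but elapsed wall-clock time in the Microsoft C
       runtime, so results are not comparable across platforms. */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing seconds_alloc_only(calls, 64*64) with seconds_add_only(calls, a, b, c, 64) for a large number of calls gives a rough idea of where the time goes on a given machine.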

Jim Dempsey