Solved: /Qparallel for light user of parallelism.

kruppa_laszlo · ‎04-09-2010

Dear Forum,

I am just a hobbyist programmer. My platform is a HP xw8400 workstation, with one XEON E5335 2.00GHz quad core processor. I use the Intel C++ Compiler Pro version 11.0.066 through a VS2008 IDE. I develop programs for native 64 bit execution hoping that is better for handling big data volumes fast. I use quite basic level of C++ language: mostly FOR loops, arrays and sorting.

Unfortunately or not, lately, my favorite self-developed application (a sort of mathematical analysis tool) has got too far into heavy processing. One single analysis takes close to a week of nonstop processing to complete.

My feeling is that performance can be improved. I read Intels paper of January 14, 2010, about Automatic Parallelization, and followed the instructions therein. Now the work-intensive part of my program contains no WHILE loops, no BREAKs and no GOTOs. True, some function calls remained to access the qsort() function of C. I wonder if it is too many: there are three sortings in the innermost loop. I have specified the /Qparallel option.

Parallelization and vectorization goes virtually smoothly, as the compiler always reports success - several pages of success-messages. Still, the execution time remains just a little (max. 5%) less than that for the .exe produced by the plain MS compiler. As a relatively long-time user on the Intel C++ compilers, I am quite disappointed, because in the earlier days (for my earlier programs_) I used to have 50% of time reduction. Or even more.

Beside slowness, peak CPU usage doesnt exceed 25%, and also that is concentrated mostly in one core of the quad core Xeon. (One core loaded nearly to 100%, and the remaining three just watching the first almost unloaded.) It doesnt look very much like a workload distributed evenly

I would greatly appreciate some advice from a colleague with deeper knowledge in this area. I am enclosing the tremendous command line records, which have been produced by the IDE as a result of my experimentation/guesswork upon the Property pages.

Thanks!

C/C++, Command line:

/c /O3 /Og /Ob2 /Oi /Ot /GT /Qip /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MD /GS /arch:SSE3 /fp:fast /Yu"StdAfx.h" /Fp"x64\\Release/F-PRINT_mp_AutoTest_5.pch" /Fo"x64\\Release/" /W3 /nologo /Zi /QaxSSE3 /QxSSE3 /Qparallel

Linker, Command Line:

kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"C:\\Cpp_OwnCode\\Megoldasok\\F_Print\\F-PRINT_mp_AutoTest_5\\x64\\Release/F-PRINT_mp_AutoTest_5.exe" /INCREMENTAL:NO /nologo /MANIFEST /MANIFESTFILE:"x64\\Release\\F-PRINT_mp_AutoTest_5.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /TLBID:1 /DEBUG /PDB:"C:\\Cpp_OwnCode\\Megoldasok\\F_Print\\F-PRINT_mp_AutoTest_5\\x64\\Release\\F-PRINT_mp_AutoTest_5.pdb" /SUBSYSTEM:CONSOLE /STACK:100000000,100000000 /OPT:REF /OPT:ICF /DYNAMICBASE /NXCOMPAT /IMPLIB:"C:\\Cpp_OwnCode\\Megoldasok\\F_Print\\F-PRINT_mp_AutoTest_5\\x64\\Release\\F-PRINT_mp_AutoTest_5.lib" /MACHINE:X64

kfsone · ‎04-12-2010

Hi Laszlo,

Those are indeed excellent loops for parallelization, plenty of workload there :)

One possible approach might be as follows.

What I've done is switch from using doubles to using a 64-bit unsigned long. This saves you losing precision and removes the need for int-to-double transformations, replaced with an efficient bit-shift.

[cpp]static const unsigned int ARRAY_SIZE = 1000000;

void KRU_qsort_1m (int Sorr_Bemenoe[ARRAY_SIZE][2])
{
   unsigned long long SorrG_A[ARRAY_SIZE];

   #pragma omp parallel for shared(SorrG_A, Sorr_Bemenoe)
   for (unsigned int f=0; f[1] << 32 // Key shifted to upper long
                + (unsigned long long)Sorr_Bemenoe[0];
   }

   // You will need to modify "compare" to use unsigned long long instead of double
   qsort((void *)&SorrG_A, ARRAY_SIZE, sizeof(unsigned long long), compare);
 
   //     Unpacking:
   #pragma omp parallel for shared(SorrG_A, Sorr_Bemenoe)
   for (unsigned int f=0; f[1] = (int)(SorrG_A  >> 32);              
        Sorr_Bemenoe[0] = (int)(SorrG_A );                    
   }
}
[/cpp]

I might be wrong about the need for the "shared" on the pragma omp.

I removed the for loop that initialized the SorrG_A to zero, since you are going to be overwriting every value anyway.

Now - I'm assuming that you were converting to doubles to get a large enough value to store the two numbers as a reverse-ordered key? That's what I have achieved here.

Also note that I have provided the compiler with a safety hint for optimization, by specifying that "f" is infact unsigned, so the compiler does not have to worry about negative values :)

Additionally, I wrote "++f", which takes getting used to in terms of readability, but it seems to make the Intel Compiler happier.

However -- this code is still likely to be a problem. The first performance hit comes from

 unsigned long long SorrG_A[ARRAY_SIZE];

which allocates space for that array every time you enter the loop onto the stack, which takes time.

In theory, you should be able to do an in-place qsort on the array you are passing to yourself.

Here is a complete example.

1- I created a "struct" called Bemenoe which contains the two ints you are using. It is internally exactly like an array of two ints but it is easier to work with. There is no overhead to using the struct. (I'm guessing that "Sorr" means array?)

2- I changed the compare function - it is probably not the most efficient way to do the compare, but it will be more efficient than doing the sort out-of-place.

3- I do the sort in-place - that is, I don't need to create an extra work space for the sort, which saves a lot of CPU time.

I test compiled this under Linux with:

icpc -O3 -Wall -openmp -o qsorttest qsorttest.cpp

qsorttest.cpp:

[cpp]#include 
#include 

static const unsigned int ARRAY_SIZE = 8 ;

struct Bemenoe
{
    int value ;
    int key ;
} ;

static int compare(const void* const lhs_ptr, const void* const rhs_ptr)
{
    const Bemenoe* const lhs = (Bemenoe*) lhs_ptr ;
    const Bemenoe* const rhs = (Bemenoe*) rhs_ptr ;

    // Left key is lower
    if ( lhs->key < rhs->key )
        return -1 ;

    // keys are equal
    if ( lhs->key == rhs->key )
    {
        // but left key is lower
        if ( lhs->value < rhs->value )
            return -1 ;
        // values are also equal
        if ( lhs->value == rhs->value )
            return 0 ;
    }

    // right side is lower.
    return 1 ;
}

int main(int argc, char* argv[])
{
    Bemenoe Sorr_Bemenoe[ARRAY_SIZE] = { { 1, 3 }, { 2, 3 }, { 2, 2 }, { 1, 5 }, { 8, 1 }, { 4, 3 }, { 7, 7 } , { 1, 1 } } ;

    qsort(Sorr_Bemenoe, ARRAY_SIZE, sizeof(Bemenoe), compare) ;

    for ( unsigned int i = 0 ; i < ARRAY_SIZE ; ++i )
    {
        printf("%u: %d, %dn", i, Sorr_Bemenoe.value, Sorr_Bemenoe.key) ;
    }

    return 0 ;
}[/cpp]

View solution in original post

TimP · ‎04-09-2010

If you are spending most of the time in qsort(), you will need a parallel version of qsort(), or to run simultaneous qsorts in parallel tasks. /Qparallel won't be able to accomplish this.

aazue · ‎04-09-2010

Hi
Unfortunately I have an doubt that as with only the auto improving compiler ,you can resolve
your problem of time.For an possibility to reduce same half time you must rewrite new code.
Several thread parallel have really effects if you can decrease unnecessary with
condition commons . ( One Thread have find and inform other to stop)
If impossible or difficult , with long time that you have showed , require using MPI
merging several computer. is the better way not expensive for reducing long times wait.

Do not think that other have secret miracle , and you, not having.

Find Mpi or Mpich2 on www ,functionality is very easy.

(One single analysis takes close to a week of nonstop processing to complete.)

If me ; my personal technical choice Mpi with no hesitation...
Best regards

jimdempseyatthecove · ‎04-09-2010

You will have to look at your code for opportunities of parallelization. These opportunities need not be loops.

I suggest you experiment with OpenMP. Without a complet description of your program it will be difficult to offer good advice. OpenMP has parallel sections. Should your one large code section have opportunities for different sections of the code to runconcurrently then parallel sections might provide better performance. Parallel sections are relatively easy to incorporate into a program (but this will require some understanding on your part). With a bit of work you might be able to reduce a run time of one wee down to two days. This is certainly worth the investement in time (say while you are waiting a week for results).

Jim

kruppa_laszlo · ‎04-10-2010

Thanks a lot for the responses! I understand the common message: it is not possible to stick to a convenient light status if I want to get over this problem. (Very bad news. Have to learn proper parallelization.) My secret hope was some glaring error would be pointed out in the command lines content, and my code could be made start rocketing at once.

Actually, I am following Jims advice already the program has been running in the background all the time since I started asking for help. Also, in theory, I am quite ready to change the code to improve efficiency. There are two difficulties. One is the central model the logic of which I have to stick to in order to ensure the results will reflect some real behavior of the system being modeled. The other is the simplicity of elements I have been using by intent. Basically I have FOR loops embedded in five layers: not so complex, just big. That is why I liked first the idea of /Qparallel.

Knowing my options, I think, I will certainly look deeper into the science of OpenMP. (My present processing speed will provide enough time for that_) Just two more questions. First Tims mentioning a parallel version of qsort() any idea how/where to get it? Second is about the uneven distribution of the workload among the cores I reported last time. Is it because of my fault (e.g. I never bothered specifying any threads), or possibly by some other reason?

Thanks a lot, again!

Laszlo

TimP · ‎04-10-2010

My knowledge of parallel sort doesn't go beyond what you should have read already on wikipedia. It must be a big academic topic, and there ought to be some information on companion forum sections, e.g. TBB.

aazue · ‎04-10-2010

Hi

I observe with your answer that you not have understand that i have write.
I you want make as same Professional...
1] Observe how working software example Mathlab or other same type. you can discovering that engine database is used for call under backend.
You not having to rewrite as existing probably
largely better that you have. You can create asynchronous try (sort) with the call database
under the backend.(top level) (used as same, already sufficiently difficult.)
About the time for result that you give, can be probably not acceptable in reality for an professional programming.
Request the people working the programming meteorology or other same hardness tasks sectors
if judicious to use OpenMp or Tbb directly without already existed functions used by backend engine database. probably he burst out laughing.
Also if task is very big is imperative to mount clusters with MPI for to increase side hardware help.
About experience....
Only those who leave the water would tell justly you if it are cold ....
Refer documentation Database Postgresql, Oracle or Db2 for learning the potential functionality already exist appropriated
for help in your task.
for the operate sorting asynchronous is dificult to use OpenMp Tbb with API backend
database, Engine have already internal threading very powerful but with prudence and attention control , you can perfectly make.
The backend of Postgresql (libpq,libpq ) have most power full,
(alignment structure fields record as bottom level) is very important for memory control size and your threading potentially added
if you must drive your programming in large volumes database fields.
My new answer for help you , only , you wrote you are not an programming professional.

Best regards.

jimdempseyatthecove · ‎04-10-2010

>>True, some function calls remained to access the qsort() function of C. I wonder if it is too many: there are three sortings in the innermost loop. I have specified the /Qparallel option.

You stated the runtime is very long.

Is the sort time significant as compared to other execution time?

If sort time is significant:
Is the result of the sort required for next calculation?
Or is it for purposes of organizing results datafor reports?

If sort for reports
export the unsorted data from the loop tomemory buffers if possible
or write unsorted data to file.
Then have seperate thread or seperate application sort the results data concurrently with the remainder of your calculations.

Maybe a good first step (baby step) for OpenMP is:

Compile with OpenMP enabled but with no OpenMP statements. Once that is working (libraries link into applicaiotion without errors). Then...

insert:

double T1 = omp_get_wtime();
...
double T2 = omp_get_wtime();
...
etc...

at major way points in your big routine
and insert arount the call to Qsort.

At the bottom of the code section insert

double E1 = T2 - T1;
double E2 = T3 - T2;
...
Then print out, or write out the results.

If the loop frequency is fast then in global scope

double E1 = 0.0;
double E2 = 0.0;
...

And change the prior code that calculated elapse time into E1, E2, ...
to accumulate elapse time into E1, E2, ...

E1 = E1 + T2 - T1;
E2 = E2 + T3 - T2.
...

Then at end of program, or at some timeinterval, display the elapse time then zero out the elapse time.

By placing the time collection statements appropriately, you can determine to some extent where parallelization may occur.

When parallelizing the code, you do not want multiple sections writing to the same variables.

Jim

Jim

kfsone · ‎04-10-2010

Firstly: While you have turned "/Qparallel" on, it will have no effect, since you have not enabled OpenMP (under Code Generation, I think).

I would encourage you to post (at least) some of your code. That much idle time on a long-running algorithm tends to indicate either a heavy IO burden or that the CPU core is spending most of its time waiting for memory. There are usually people here happy to look at someoneelse's code with an eye to optimizing it :)

- Oliver

jimdempseyatthecove · ‎04-11-2010

Oliver,

The idle time, in this case, indicates only 1 of 4 cores is working.
The program should be examined for opportunities of parallelization.
If Laszlo is not comfortable with doing this alone, I suggest he entices a university student with a case of beer (to be dispensed after the programming assistance).

Once the initial issues have been addressed, Laszlo will become familiar with what is required and then could take over the task of further refinements.

Jim Dempsey

TimP · ‎04-11-2010

The OpenMP option is not a prerequisite to enabling /Qparallel. If both are set, OpenMP pragmas take precedence over /Qparallel, both using the same OpenMP library, so they can be made to work together.

kfsone · ‎04-11-2010

Are you sure, Tim? In the past I've found that /Qparallel did nothing until I also enabled OpenMP (otherwise it produced parallel-ready serial code). But that was a few compiler versions ago and it may no-longer be valid (as you state).

TimP · ‎04-11-2010

/Qparallel and /Qopenmp may be interchangeable in the link step; you would need one, but not both, if either were in use in the compile.

kruppa_laszlo · ‎04-12-2010

(This is Oliver: For some reason, it is making me edit your post instead of letting me reply to it!)

aazue · ‎04-12-2010

Dear,Laszlo
I am sorry for the imaged expression that i having used, and as multiple sens can be interpreted.
I consider largely better all programmer that working C/C++ bottom level. (Your case) .
I have used this image,only ,here , in my state ,and probably also to several state, when we
want build an project very perfectly accorded with an large part C language , you have always burst out laughing of
control quality engineering or manager charge , answer same you reinvent the wheel
too expensive for an task specialized as used unique. absolute no relation with
level capacity, probably the inverse.
I sorry if you have understand sens is not that i think sincere.
About God more exactly (May God forgive me !!)
About OpenMp (more exactly, thread bottom level,only for me use is authorize)
As you having engine database and it can help you i hope,also erase my unintentional fault .
Sometime I use (trigger) with dummy (insert) as conditions for create an reference commune point to all threads.
Best regards.

kfsone · ‎04-12-2010

Hi Laszlo,

Those are indeed excellent loops for parallelization, plenty of workload there :)

One possible approach might be as follows.

What I've done is switch from using doubles to using a 64-bit unsigned long. This saves you losing precision and removes the need for int-to-double transformations, replaced with an efficient bit-shift.

[cpp]static const unsigned int ARRAY_SIZE = 1000000;

void KRU_qsort_1m (int Sorr_Bemenoe[ARRAY_SIZE][2])
{
   unsigned long long SorrG_A[ARRAY_SIZE];

   #pragma omp parallel for shared(SorrG_A, Sorr_Bemenoe)
   for (unsigned int f=0; f[1] << 32 // Key shifted to upper long
                + (unsigned long long)Sorr_Bemenoe[0];
   }

   // You will need to modify "compare" to use unsigned long long instead of double
   qsort((void *)&SorrG_A, ARRAY_SIZE, sizeof(unsigned long long), compare);
 
   //     Unpacking:
   #pragma omp parallel for shared(SorrG_A, Sorr_Bemenoe)
   for (unsigned int f=0; f[1] = (int)(SorrG_A  >> 32);              
        Sorr_Bemenoe[0] = (int)(SorrG_A );                    
   }
}
[/cpp]

I might be wrong about the need for the "shared" on the pragma omp.

I removed the for loop that initialized the SorrG_A to zero, since you are going to be overwriting every value anyway.

Now - I'm assuming that you were converting to doubles to get a large enough value to store the two numbers as a reverse-ordered key? That's what I have achieved here.

Also note that I have provided the compiler with a safety hint for optimization, by specifying that "f" is infact unsigned, so the compiler does not have to worry about negative values :)

Additionally, I wrote "++f", which takes getting used to in terms of readability, but it seems to make the Intel Compiler happier.

However -- this code is still likely to be a problem. The first performance hit comes from

 unsigned long long SorrG_A[ARRAY_SIZE];

which allocates space for that array every time you enter the loop onto the stack, which takes time.

In theory, you should be able to do an in-place qsort on the array you are passing to yourself.

Here is a complete example.

1- I created a "struct" called Bemenoe which contains the two ints you are using. It is internally exactly like an array of two ints but it is easier to work with. There is no overhead to using the struct. (I'm guessing that "Sorr" means array?)

2- I changed the compare function - it is probably not the most efficient way to do the compare, but it will be more efficient than doing the sort out-of-place.

3- I do the sort in-place - that is, I don't need to create an extra work space for the sort, which saves a lot of CPU time.

I test compiled this under Linux with:

icpc -O3 -Wall -openmp -o qsorttest qsorttest.cpp

qsorttest.cpp:

[cpp]#include 
#include 

static const unsigned int ARRAY_SIZE = 8 ;

struct Bemenoe
{
    int value ;
    int key ;
} ;

static int compare(const void* const lhs_ptr, const void* const rhs_ptr)
{
    const Bemenoe* const lhs = (Bemenoe*) lhs_ptr ;
    const Bemenoe* const rhs = (Bemenoe*) rhs_ptr ;

    // Left key is lower
    if ( lhs->key < rhs->key )
        return -1 ;

    // keys are equal
    if ( lhs->key == rhs->key )
    {
        // but left key is lower
        if ( lhs->value < rhs->value )
            return -1 ;
        // values are also equal
        if ( lhs->value == rhs->value )
            return 0 ;
    }

    // right side is lower.
    return 1 ;
}

int main(int argc, char* argv[])
{
    Bemenoe Sorr_Bemenoe[ARRAY_SIZE] = { { 1, 3 }, { 2, 3 }, { 2, 2 }, { 1, 5 }, { 8, 1 }, { 4, 3 }, { 7, 7 } , { 1, 1 } } ;

    qsort(Sorr_Bemenoe, ARRAY_SIZE, sizeof(Bemenoe), compare) ;

    for ( unsigned int i = 0 ; i < ARRAY_SIZE ; ++i )
    {
        printf("%u: %d, %dn", i, Sorr_Bemenoe.value, Sorr_Bemenoe.key) ;
    }

    return 0 ;
}[/cpp]

kfsone · ‎04-12-2010

The output that I get is:

[bash]osmith@ubuntu:~$ ./qsorttest
0: 1, 1
1: 8, 1
2: 2, 2
3: 1, 3
4: 2, 3
5: 4, 3
6: 1, 5
7: 7, 7
[/bash]

Which is sorted by the second (key) value before the first (value) value.

kfsone · ‎04-12-2010

Another way to write the compare function, which is far more efficient, is as follows:

[cpp]static int compare(const void* const lhs_ptr, const void* const rhs_ptr)
{
    const Bemenoe* const lhs = (Bemenoe*) lhs_ptr ;
    const Bemenoe* const rhs = (Bemenoe*) rhs_ptr ;

    // Diff will be < 0 if the left-hand key is smaller,
    // 0 if they are the same, and > 0 if the right hand key is smaller.
    const int diff = lhs->key - rhs->key ;

    // If the keys are equal, then the values decide which entry
    // is the smaller entry.
    if ( diff != 0 )
        return diff ;

    return lhs->value - rhs->value ;
}
[/cpp]

kruppa_laszlo · ‎04-13-2010

Hello, Oliver!

(I see you have found my last post so very valuable you decided to keep it just to yourself. I fully understand_)

I really thank you for the trouble you took to correct my code! Actually, I will need now some time to study your solution. Your level of C is a little higher than mine. I must be OK, anyway.

Laszlo

P.S. Sorr stands for sorrend, which means order/rank in Hungarian.

kfsone · ‎04-13-2010

Hehe - I'm really not sure why it insists on editing that reply of yours for me -- it's a shame because I'm sure some of the other readers might have better solutions :/

Please don't be afraid to ask any questions, no matter how "trivial" you might think them :)

The "const" identifiers: For this piece of code they are probably not necessary, they are a "hint" to the compiler (and yourself!).

A const on the left means "contents do not change" while a const on the right means "pointer does not change".

[cpp]char somechars[64] ;
char* p = somechars ;  // P points to somechars.
const char* constp = somechars ;
char* const pconst = somechars ;
const char* const constpconst = somechars ;

// This is allowed
*p = 'a' ;  // Set the first char of somechars to the letter 'a'.
p++ ;       // p now points to the second char of somechars.

*constp = 'a' ; // Not allowed - cannot change the contents of constp.
constp++ ;      // Allowed - constp the pointer is changeable.
// (and now constp points to the scond char of somechars)

*pconst = 'a' ;  // Allowed - we have only restricted the contents of pconst.
pconst++ ;       // Not allowed - we've said that pconst the pointer does not change.

constpconst = 'a' ; // Not allowed
constpconst++ ;     // Also not allowed
[/cpp]

Where "const" comes in useful is high-end optimizations, it helps the compiler make certain more aggressive optimizations (although I suspect in some cases it can work it out for itself)

It is also helpful as a self-guide if you have to track down a value being corrupted or modified. Const tells you "not this one!" :)

The one that puzzles some people is

const unsigned int arraysize;

I do a lot of cross-platform/cross-compiler work, and this makes some compilers very happy. It tells them they can treat the value like a "named literal". You aren't going to be changing it. Less and less compilers require this markup these days, but I find it a good habit to exercise because when I come back to a function, it provides me with one additional piece of information when I start reading.

"static const unsigned int ARRAY_SIZE = 100000"

The const says - I won't be changing this. The static says "only visible in the current compile module". That tells the compiler not to create storage space for the variable, and to use the actual value directly. However, it gives you a name for that particular variable so that you can write clearer code and if you decide to change the size of the arrays,you won't accidentally change some other use of the same numeric value.

"static" infront of a function means "only used inside this file". Again, it helps the compiler with some optimizations. It means that you won't be calling it from another file that the compiler does not know about.

You may already know some of these - but I wanted to clarify no extra hidden meanings you were unfamiliar with :)

kruppa_laszlo · ‎04-14-2010

Hi, Oliver!

The last long run of my program has just finished. Good news, the results show no error in the semantics, i.e. the model did what was expected of it. At the same time (as the results show) the algorithm itself requires slight adjustment. It is quite normal, just - if nothing happens - I will have to wait for another week to see new results.

I am quite definite about next launching it along with my first parallel constructs in the code I take that as simply a must. Mainly thanks to your hand-made examples now I can even use HELP on parallelism more sensibly than before. (Actually thats what I am busy with now: reading HELP.)

In short thanks a lot for everything!

Laszlo