[ICC] what does -g mode do for accessing a structure's member data?

rudaho · ‎07-05-2010

Hello~

I'm using icc to compile a multi-threading code using -openmp with -O1 option. But the behavior turns out to be very strange. The pseudo code for the situation is as below :

1 : typedef struct entry_s
2 : {
3 : double value;
4 : int lock;
5 : } entry;
6 :
7 : entry** elems; // in some place, the elems are allocated and assign the values (lock = 1)
8 :
........ // other code segments

9 : #pragma omp parallel for private(pElement, pCur, idx) schedule(static, 1)
10 : for (idx = 0; idx < Size; ++idx) {
11 : if (0 == idx) {
12 : pElement = elems[idx];
13 : do_something();
14 : pElement->lock = 0;
15 : }
16 : else {
17 : pElement = elems[idx-1];
18 : while (pElement->lock) {
19 : ;
20 : }
21 : pCur = elems[idx];
22 : do_something();
23 : pCur->lock = 0;
24 : }
25 : }

The basic idea is that the for loop is dependent to previous one (in my real application, the pElement is an entry of a matrix and I preform LU factorization, thus the dependency exist between some elements). As the code showed in line 18 to line 20, the while loop should hold(pass) the execution of the thread until the pElement's lock attribute is 0. If the code is compiled with -g mode, the behavior is correct and I got correct results. However, if I compiled with -O1 mode or without any option(-O, -g, etc), the program will exit the while loop, and I tried to print the pElement's lock value after the while loop, it shows the value as being 1. This is very strange since that if the value is 1, it should remain executing the while loop. In the code section for each thread, the lock will only set from 1 to 0, so there is no possibility for the value to reset to 1. Also, if I print the lock value in the while loop, the result turns out to be correct again.

I once think one reason is the compiler may put the pElement->lock into a register and doesn't access the data from memory again. But if it's so, then the behavior should be infinite loop, not exiting the loop scope. If I write a wait function and use volatile for the lock variable (or pElement), it's OK then. But the run time is much longer. So, does anyone know how the compiler and OS do for accessing the member data of a structure? Is there any way to get the correct result without volatile keyword to save performance? Or any idea about why this situation happen? Thanks a lot...

Best Regards

Yi-Ju

Michael_K_Intel2 · ‎07-05-2010

Hi Yi-Ju,

I would assume that the problem comes from the OpenMP memory model (see OpenMP spec 1.4).

As you're accessing a shared variable (through the pElement pointer), your threads might read unexpected values. Your code seems to work without optimizations. In this case the compiler does not hold values in registers and does not reorder instructions. With optimization enabled, the compiler gets "smart" and holds values in registers and does not acces main memory when a variable is read or written. When values are read or written to main memory is determined by the OpenMP memory model.

To make your code work correctly, you would have to place flush directives in your code (see OpenMP spec 1.4.2 and 2.8.6). Flushes tell the compiler that values read before should be re-read from memory and that everything that was modified should be written back. In addition, they limit the compiler's opportunity to re-order the instructions in your code. As OpenMP does not offer atomic updates (yet), you cannot get arround the flush directives.

Having said that, I have to note that it is a really bad idea to implement locking mechanisms in OpenMP with the flush directive. It is extremly complex to get correct code for all possible interleavings of the instruction streams. Hence, I would strongly recommend to replace your self-made locking mechanism with OpenMP locks to be on the safe side. The OpenMP implementation should automatically choose the best lock implementation (spin locks in your case) and provide a means to let you choose the locking implementation.

Cheers,
-michael

PS: You can find the OpenMP specification at http://www.openmp.org/mp-documents/spec30.pdf.

rudaho · ‎07-06-2010

Hi Michael~

Thanks for the reply. You make it clear about the memory access process. However, I didn't implement it as locking mechanism. Sorry that I use the unclear word "lock". In fact, it's like a flag to let the thread know the element is ready to read. Just like the Producer/Consumer example. The flag is used to tell the consumer that the data updated from Producer is ready to read. I guess locking may not work for the case, right? I'll look into the spec more clearly any way. Thanks a lot...

Best Regards

Yi-Ju

Michael_K_Intel2 · ‎07-06-2010

Hi Yi-Ju,

You indeed implemented a lock. Your while loop implements sort of a gate that only lets a thread to proceed execution if a certain condition holds. But I guess that is nitpicking. :-) You could also call it a flag for signalling an event. You could use lock for that as well, if you set the lock to taken and to released after the element has been processed.

The key thing is that some other thread might change the value, right? If that is the case, your code interferes with the OpenMP memory model. If you need to stick to the flagging mechanism, you should add the flush directives to ensure that values are actually read from main memory. Please see the corresponding examples in the OpenMP specification to get an impression how to place them. The rough rule is to place a flush before each variable usage as well as after each modification.

Cheers,
-michael

rudaho · ‎07-06-2010

Hi Michael~

Thanks for the clear explaination. I have read the flush example and implement into my program. It works fine but takes more time. I think this is the tradeoff of flush which I can't avoid. I'll look into the locking mechanism and try to use it then. Thanks

B.R.

Yi-Ju

mecej4 · ‎07-06-2010

You used a binary semaphore, see http://en.wikipedia.org/wiki/Semaphore_(programming) to achieve locking.

Michael_K_Intel2 · ‎07-06-2010

Hi Yi-Ju,

Yes, flush comes with the price of having an overhead that you cannot avoid. If you you shared data that is read and written in different threads, you need to pay the penalty of updating the main memory and probably not having everything in registers.

Maybe it would be worth thinking about a different parallelization scheme that might give a better performance. It could be worth thinking about having the gating flag not associated with each element, but with a block of elements of the matrix. A thread would then start updating as soon as all dependent blocks have been finished.

BTW, you could also implement this with OpenMP tasks and assign a block of the matrix to one task for execution.

Cheers,
-michael

rudaho · ‎07-07-2010

Hi Michael~

My company uses icc version 10.1 and OpenMP tasking is not supported then. And icc11 doesn't work for us for some reasons. But I'll think about new parallelization schemes. Thanks a lot for the suggestions...

BR

Yi-Ju

Michael_K_Intel2 · ‎07-07-2010

Hi Yi-Ju,

If you need to stick to ICC 10.x you could use Intel's proprietary tasking construct called taskq. You should find information on this in the compiler documentation.

Cheers,
-michael