Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Help: a strange problem on an Intel Core Quad CPU

ydq
Beginner

I wrote a multithreaded program for parallel video encoding. A problem appears when I run it on an Intel Core Quad CPU. It is a strange problem, described in detail below:
1. Before writing the multithreaded program, I wrote a serial C program that simulates it. I call this the original C model of the parallel program. Why this step? Because a serial C program is easy to debug, and the child threads can be simulated by functions that run one after another.
2. Step one makes the program functionally correct. Then I used pthreads-win32 to convert the original C model into a real parallel program. The core source code of the multithreaded program is as follows:

void EncodePicture(............)
{
    .....................

    /* Initialize the attribute object and make the threads joinable */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for(num=0;num<4;num++) // create 4 threads with pthread_create
    {
        printf("Creating thread %d\n", num);
        rc = pthread_create(&thread[num], &attr, Encode_I_Slice, (void *) (&PArg[num]));

        if (rc)
        {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    /* Free attribute and wait for the other threads */
    pthread_attr_destroy(&attr);

    for(num=0; num<4; num++)
    {
        rc = pthread_join(thread[num], (void **)&status);
        if (rc)
        {
            printf("ERROR; return code from pthread_join() is %d\n", rc);
            exit(-1);
        }
        printf("Completed join with thread %d status= %d\n",num, status);
    }
}

3. The strange problem:
When I build the multithreaded program on an Intel single-core CPU and run it there, the encoded video stream is correct.
When I copy that .exe file to the Intel Core Quad CPU and run it, the encoded video stream is incorrect.

When I build the multithreaded program on the Intel Core Quad CPU and run it there, the encoded video stream is incorrect.
When I copy that .exe file to the Intel single-core CPU and run it, the encoded video stream is correct.
When I build the original C model of the parallel program on the Intel Core Quad CPU and run it, the encoded video stream is correct.

4. On the Intel Core Quad CPU the multithreaded program always exits in the write_1_bit function because of a realloc error. The write_1_bit function is given below:


void write_1_bit(OutputStream *p,int b)
{
    if(p->iBytePosition == STREAM_BUF_SIZE)
    {
        memcpy(p->pTempStreamBuf+p->nByteInStreamBuf,p->buf,STREAM_BUF_SIZE);
        p->nByteInStreamBuf+=STREAM_BUF_SIZE;
        p->pTempStreamBuf=realloc(p->pTempStreamBuf,p->nByteInStreamBuf+STREAM_BUF_SIZE);
        if(p->pTempStreamBuf==NULL)
        {
            printf ("Fatal: p->pTempStreamBuf realloc error, exit (-1)\n");
            exit (-1);
        }
        p->iBytePosition= 0;
        p->iBitOffset= 0;
    }
    .................................
}

The first parameter (OutputStream *p) passed to write_1_bit is different in each thread.
Every member of the structure that it points to is also different in each thread.
However, the four structure variables that the first parameter points to are global variables defined in another file.


How can I tackle this strange problem on the Intel Core Quad CPU? Please help!

jimdempseyatthecove
Honored Contributor III

ydq,

>> 2.Step one makes the program correct in function.

While your fundamental algorithm may be correct, your program appears to be sensitive to concurrent memory access issues that do not occur on a single processor/core system but do occur on a multi processor/core system.

Because your symptom appears related to the C runtime system function realloc, the first thing to investigate is to verify that you are linking with the multi-threaded version of the C runtime library.

Next, if the C runtime system was not your problem then you may have a statement that is atomic in a single processor/core system but is not atomic on a multi processor/core system. This could be something innocuous such as:
++Count;
n=n+i;

If Count is a shared variable then relative modification of it has to be protected by use of either "#pragma omp atomic" or by use of one of the Interlocked library functions such as InterlockedIncrement or InterlockedAdd.
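
As a rough illustration of the point above, here is a minimal sketch of a shared counter that is safe to increment from several threads. It assumes a C11 compiler with <stdatomic.h> and POSIX threads (pthreads-win32 provides the same thread calls on Windows); the names shared_count, worker and run_counter_demo are hypothetical. atomic_fetch_add plays the role that InterlockedIncrement plays in the Win32 API.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 100000

/* Shared counter: every thread modifies it, so updates must be atomic. */
static atomic_long shared_count;

/* Hypothetical worker: increments the shared counter a fixed number of times. */
static void *worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < INCREMENTS_PER_THREAD; i++) {
        /* Atomic read-modify-write; a plain ++ on an ordinary long
           could lose updates when two cores increment at once. */
        atomic_fetch_add(&shared_count, 1);
    }
    return NULL;
}

/* Runs the workers and returns the final count, or -1 if a thread call fails. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int created = 0, failed = 0, i;

    atomic_store(&shared_count, 0);
    for (i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0) {
            failed = 1;
            break;
        }
        created++;
    }
    /* Join only the threads that were actually started. */
    for (i = 0; i < created; i++)
        if (pthread_join(threads[i], NULL) != 0)
            failed = 1;

    return failed ? -1 : atomic_load(&shared_count);
}
```

With the atomic increment, run_counter_demo() returns NUM_THREADS * INCREMENTS_PER_THREAD on every run; with an ordinary long and ++ the result can come up short on a multi-core machine.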

Also, in the sample code you sent, if p is shared concurrently by multiple threads then I do not see code and/or directives to protect or sequence the realloc, nor protection in your "..." code to properly acquire the next iBytePosition and iBitOffset.

Jim Dempsey

ydq
Beginner

>>On the Intel Core Quad CPU the multithreaded program always exits in the write_1_bit function because of a realloc error.

This problem was caused by linking with the single-threaded version of the C runtime library. Now I use the /MT compile option in VC++ 6.0 to link with the multi-threaded version of the C runtime library, and this problem is fixed.
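
To keep the single-threaded runtime from being selected again by accident, a compile-time guard along these lines could be placed in one source file. This is only a sketch: it relies on the Microsoft compiler defining the macro _MT when /MT, /MTd, /MD or /MDd is used, and it does nothing on other compilers.

```c
/* MSVC defines _MT only when a multi-threaded C runtime is selected
   (/MT, /MTd, /MD or /MDd). Stop the build otherwise. */
#if defined(_MSC_VER) && !defined(_MT)
#error "Single-threaded C runtime selected; rebuild with /MT or /MD"
#endif
```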

>>The strange problem:
When I build the multithreaded program on an Intel single-core CPU and run it there, the encoded video stream is correct.
When I copy that .exe file to the Intel Core Quad CPU and run it, the encoded video stream is incorrect.

When I build the multithreaded program on the Intel Core Quad CPU and run it there, the encoded video stream is incorrect.
When I copy that .exe file to the Intel single-core CPU and run it, the encoded video stream is correct.
When I build the original C model of the parallel program on the Intel Core Quad CPU and run it, the encoded video stream is correct.

This strange problem still exists and has become more severe.

I know that a global variable is a shared variable in a multithreaded application. There are three kinds of global variable in my multithreaded application.

The first kind is initialized in the main thread before any child thread is created (at that moment no child thread exists). The child threads only read this kind of global variable and never modify it, so I do not apply any special handling to it.

The second kind is used in only one child thread and never in the others. I do not apply any special handling to this kind either.

The third kind may be the problem, I guess! This kind is a structure variable that is complicated and has many members. The members can be divided into different parts, and each part is accessed (written and read) only in one specific child thread.
I do not apply any special handling to this kind of global variable either. Maybe this causes a memory coherency problem on the quad-core architecture.
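
One way to make the "each part belongs to exactly one thread" rule explicit is to give every thread its own context structure instead of slicing up one global structure. The following is a minimal sketch, assuming POSIX threads; SliceContext, encode_slice and encode_all_slices are hypothetical names, and the "encoding" is only a placeholder loop. Distinct members written by distinct threads do not race by themselves, but adjacent bit-fields that share a storage unit do, and separate structures rule that case out.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_SLICES  4
#define SLICE_BYTES 16

/* Hypothetical per-thread state: everything one slice encoder writes. */
typedef struct {
    int           slice_index;
    unsigned char out[SLICE_BYTES];
    size_t        bytes_written;
} SliceContext;

/* Placeholder "encoder": fills its own output buffer and touches nothing else. */
static void *encode_slice(void *arg)
{
    SliceContext *ctx = (SliceContext *)arg;
    size_t i;
    for (i = 0; i < SLICE_BYTES; i++)
        ctx->out[i] = (unsigned char)(ctx->slice_index * SLICE_BYTES + (int)i);
    ctx->bytes_written = SLICE_BYTES;
    return NULL;
}

/* Starts one thread per context and joins them all; returns 0 on success. */
int encode_all_slices(SliceContext ctx[NUM_SLICES])
{
    pthread_t threads[NUM_SLICES];
    int created = 0, failed = 0, i;

    for (i = 0; i < NUM_SLICES; i++) {
        ctx[i].slice_index = i;
        ctx[i].bytes_written = 0;
        if (pthread_create(&threads[i], NULL, encode_slice, &ctx[i]) != 0) {
            failed = 1;
            break;
        }
        created++;
    }
    /* After pthread_join returns, the main thread may safely read ctx[i]. */
    for (i = 0; i < created; i++)
        if (pthread_join(threads[i], NULL) != 0)
            failed = 1;

    return failed ? -1 : 0;
}
```

The main thread reads each context only after the corresponding join, so no locking is needed for the per-slice data in this arrangement.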

Thank you for your technical support. More advice is welcome!

jimdempseyatthecove
Honored Contributor III

Try placing a critical section around your bit write function. If that corrects the problem then you have found at least one of the problem areas. If the critical section around the bit write does not fix the problem then move the critical section to encapsulate different sections of code. Once you find the major section of code, move the bounds of the critical section closer together. Eventually you will find the problem (you may have more than one problem).
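
A minimal sketch of such a critical section, assuming POSIX threads (a Win32 CRITICAL_SECTION would serve the same purpose). SharedBits, shared_bits_init and locked_write_1_bit are hypothetical names that stand in for the real output stream and bit writer; the fixed 64-byte buffer is also an assumption made to keep the example short.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared bit buffer; stands in for the real output stream. */
typedef struct {
    unsigned char   bytes[64];
    size_t          bit_count;  /* number of bits written so far */
    pthread_mutex_t lock;       /* guards bytes and bit_count */
} SharedBits;

/* Clears the buffer and creates the mutex. */
void shared_bits_init(SharedBits *s)
{
    memset(s->bytes, 0, sizeof s->bytes);
    s->bit_count = 0;
    pthread_mutex_init(&s->lock, NULL);
}

/* Appends one bit (most significant bit first within each byte).
   Returns 0 on success, -1 if the buffer is full. */
int locked_write_1_bit(SharedBits *s, int b)
{
    int rc = 0;

    pthread_mutex_lock(&s->lock);            /* enter critical section */
    if (s->bit_count >= sizeof s->bytes * 8) {
        rc = -1;
    } else {
        if (b)
            s->bytes[s->bit_count / 8] |=
                (unsigned char)(0x80u >> (s->bit_count % 8));
        s->bit_count++;
    }
    pthread_mutex_unlock(&s->lock);          /* leave critical section */
    return rc;
}
```

The lock makes each bit write indivisible, but it does not decide the order in which different threads append their bits, so this is a diagnostic aid rather than a complete design for a shared stream.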

If your bit write is shared then consider something along the line of:

for(;;)
{
    // make copy of current Next bit position
    bitPosition = bufferContext.bitPosition;
    if(bitPosition >= bufferContext.bitSizeOfBuffer)
    {
        // buffer is full: wait for it to be flushed, then re-read the position
        // (in a do/while, "continue" would jump to the exchange below and
        // could reserve a position past the end of the buffer)
        Sleep(1); // or insert wait for single event
        continue;
    }
    if(_InterlockedCompareExchange( // return current value
        &bufferContext.bitPosition, // of this location
        bitPosition+1, // conditional exchange with this value
        bitPosition) // provided current value matches this
        == bitPosition) // done if our copy matched current
        break;
}
// Here with bitPosition reserved
// insert bit into current buffer
if(bitValue == 0)
{
    // buffer pre-initialized to 0's
} else {
    // Insert 1 at bitPosition
    // (BITS_IN_WORD is the number of bits per buffer word, a power of two)
    IndexWord = bitPosition / BITS_IN_WORD;
    do
    {
        oldBits = bufferContext.buffer[IndexWord];
        newBits = oldBits | (1<<(bitPosition&(BITS_IN_WORD-1)));
    } while (
        _InterlockedCompareExchange( // return current value
            &bufferContext.buffer[IndexWord], // of this location
            newBits, // conditional exchange with this value
            oldBits) // provided current value matches this
        != oldBits); // continue if our copy different from current
}
// Here when word in buffer has output bit
do
{
    oldBitsWritten = bufferContext.bitsWritten;
} while(
    _InterlockedCompareExchange( // return current value
        &bufferContext.bitsWritten, // of this location
        oldBitsWritten+1, // conditional exchange with this value
        oldBitsWritten) // provided current value matches this
    != oldBitsWritten); // continue if our copy different from current
// test our own incremented value so that only one thread sees the buffer as filled
if(oldBitsWritten+1 == bufferContext.bitSizeOfBuffer)
{
    // we filled the buffer
    // write the buffer or realloc the buffer here
    // ...
    // first
    bufferContext.bitsWritten = 0;
    // last
    bufferContext.bitPosition = 0;
}

Then the next improvement technique would be to introduce multi-buffering into the bit write function.

Jim Dempsey
