topic Problem when ported from single-core to multicore in Intel® Moderncode for Parallel Architectures

Problem when ported from single-core to multicore

hyphone — Wed, 26 Jan 2011 05:43:00 GMT

[bash]Hi all,
My program works fine under single-core environment. But if under the following environment:
gcc version 4.1.0 (SUSE Linux)
8 * Intel Xeon CPU E5520 @ 2.27GHz

Occasionly, the latter values are lower than the former values of two sequential TickGet()'s:
(Before return of TickGet(), we recorded values of 'uiIdex', 'lm_uiRollingTick[0]' and 'm_uiRollingTick[1]')
(gdb) call TICK_DebugShow()
uiIdx=1, uiTick[0]=35849, uiTick[1]=35848. -------- last time:     get uiTick[1]=35848
uiIdx=0, uiTick[0]=35849, uiTick[1]=35848. -------- former time: get uiTick[0]=35849
uiIdx=0, uiTick[0]=35848, uiTick[1]=35847.

I heard that Intel CPU is conservative and ordered. So what causes this problem and how?

CODE:
TICK runs a thread to update rolling tick(using TickRolling) at regular intervals.
And it provides a interface TickGet() for other threads to get the current ticks.
We use a read buffer m_uiRollingTick[1] to prevent using lock.

unsigned int m_uiRollingTickHigh[2];
unsigned int m_uiRollingTick[2];
volatile unsigned int m_uiTickIndex;

int TickGet(unsigned int *puiHigh, unsigned int *puiLow)
{
	unsigned int uiIndex;
	
	uiIndex = m_uiTickIndex;
	*puiHigh = m_uiRollingTickHigh[uiIndex];
	*puiLow  = m_uiRollingTick[uiIndex];

	return 0;
}

void TickRolling(unsigned int uiMillSec)
{
	unsigned int uiRollingTickAndLost;
	unsigned int uiLostTicks = uiMillSec/1000;
                                                                 
	m_uiRollingTickHigh[1]  = m_uiRollingTickHigh[0];
	m_uiRollingTick[1]      = m_uiRollingTick[0];
	m_uiTickIndex = 1;

	uiRollingTickAndLost = m_uiRollingTick[0] + uiLostTicks;
	if(m_uiRollingTick[0] > uiRollingTickAndLost)
	{
		m_uiRollingTickHigh[0]++;
	}
	m_uiRollingTick[0] += uiLostTicks;
	m_uiTickIndex = 0;
}

Thanks & Regards

Hyphone
[/bash]

Problem when ported from single-core to multicore

Dmitry_Vyukov — Wed, 26 Jan 2011 07:59:51 GMT

The code is dead wrong, it's difficult to enumerate all the problems.
For example, TickGet() can read high part from one index, and then low part from another index. Or TickRolling() can set m_uiTickIndex to 1 and then update values at index 1. All accesses are not atomic.
It's not only CPU that reorders accesses, it can be done can a compiler as well.
The code is not working on singlecore CPU as well, you were just lucky.

The easiest thing to do is use atomic 64-bit loads and stores. Then you do not need all that code - just atomically store new value, and atomically read current value.

And do read So what is a memory model? And how to cook it?

Problem when ported from single-core to multicore

hyphone — Thu, 27 Jan 2011 02:05:33 GMT

Dmitriy, Thank you for your reply.

Yes, the code may be not so nice, but it works on singlecore.
I checked there is no compiler out-of-order optimizing and I declared m_uiTickIndex as 'volatile'.
Then in my opinion, the Intel CPU will promise the executing order as the program order.

SoTickGet()reads m_uiTickIndex first,and thereis nopartly read.
And TickRolling() updates values at index 1 first, then sets m_uiTickIndex to 1.

PS: I run the program as DEBUG version, and there is no optimizing.
(gdb) disass TickRolling
Dump of assembler code for function TickRolling:
0x08048633 : push %ebp
0x08048634 : mov %esp,%ebp
0x08048636 : sub $0x10,%esp
0x08048639 : movl $0x0,0xfffffff8(%ebp)
0x08048640 : mov 0x8049128,%edx
0x08048646 : mov 0x8(%ebp),%eax
0x08048649 : mov %edx,%ecx
0x0804864b : mov $0x0,%edx
0x08048650 : div %ecx
0x08048652 : mov %eax,0xfffffffc(%ebp)
0x08048655 : mov 0x8049168,%eax
0x0804865a : mov %eax,0x804916c
0x0804865f : mov 0x8049160,%eax
0x08048664 : mov %eax,0x8049164
0x08048669 : movl $0x1,0x8049170
0x08048673 : mov 0x8049160,%eax
0x08048678 : add 0xfffffffc(%ebp),%eax
0x0804867b : mov %eax,0xfffffff8(%ebp)
0x0804867e : mov 0x8049160,%eax
0x08048683 : cmp 0xfffffff8(%ebp),%eax
0x08048686 : jbe 0x8048695
0x08048688 : mov 0x8049168,%eax
0x0804868d : add $0x1,%eax
0x08048690 : mov %eax,0x8049168
0x08048695 : mov 0x8049160,%eax
0x0804869a : add 0xfffffffc(%ebp),%eax
0x0804869d : mov %eax,0x8049160
0x080486a2 : movl $0x0,0x8049170
0x080486ac : leave
0x080486ad : ret
End of assembler dump.

And thanks for your recommendation, Iwill read the article carefully.

Problem when ported from single-core to multicore

Dmitry_Vyukov — Thu, 27 Jan 2011 12:14:15 GMT

> Yes, the code may be not so nice, but it works on singlecore.

It only appears to work most of the time.

> I checked there is no compiler out-of-order optimizing

So you intended to never compile it under release configuration, right?

> and I declared m_uiTickIndex as 'volatile'.

Volatile does not work that way. At least you need to declare ALL participating variables as volatile.

> Then in my opinion, the Intel CPU will promise the executing order as the program order.

Not quite. For example, Dekker's algorithm won't work on IA-32/Intel64 without explicit memory fences.

> And TickRolling() updates values at index 1 first, then sets m_uiTickIndex to 1.

For example, it can break in the following way.
Current time is 1,100 (high, low).
A reader reads it as 1,100.
On the next try, the reader reads high part 1. Then, time is changed to 2,50. Then reader reads low part - 50. So the result is 1,50. The time indeed goes back.

Problem when ported from single-core to multicore

hyphone — Fri, 28 Jan 2011 02:42:52 GMT

I'm afraidI didn't explain my point clear.

First, I think this code is wrong too.
And I see that if current time is 1,0xffffffff (high, low), one reader reads the high part 1 with index 0.
Then, time is changed to 2, 0. Then the reader reads the low part 0.
So the result is 1,0 and time goes back.

But as I ran the DEBUG version, I got a (0,35849) followed by (0,35848).
I just don't know how this problem come out,that is, what is the program executing flow in a global view.

BTW, ifI use spin_lock in TickGet and TickRolling, theproblem is gone.

The do-while loop also does the same work.
Is that a lock-freedomprotection as mentioned in your article?

intVOS_TickGet(unsigned int *puiHigh,unsigned int *puiLow)
{
unsigned intuiIndex;

do {
uiIndex = m_uiTickIndex;
*puiHigh = m_uiRollingTickHigh[uiIndex] ;
*puiLow = m_uiRollingTick[uiIndex];
}while (uiIndex != m_uiTickIndex);

return 0;
}

Problem when ported from single-core to multicore

Dmitry_Vyukov — Fri, 28 Jan 2011 08:37:43 GMT

> But as I ran the DEBUG version, I got a (0,35849) followed by (0,35848).

If you are interested in how exactly it's possible under sequentially consistent memory model, follow me.

Below is a sequence of modifications of the variables during update operation:

[cpp]time    index     tick[0]     tick[1]
0       0         10          5
1       0         10          10
2       1         10          10
3       1         15          10
4       0         15          10[/cpp]

First read operation starts at time=0.
A thread reads index=0 (time=0)
Then time advances to time=3.
Then the thread reads tick[0]=15 (time=3).

Second read operation starts at time=3.
The thread reads index=1 (time=3).
Then the thread reads tick[1]=10 (time=3).

So, indeed, in two consecutive reads under sequentially consistent memory model a thread observes time=15 and then time=10. Welcome to concurrent programming!

Problem when ported from single-core to multicore

Dmitry_Vyukov — Fri, 28 Jan 2011 08:40:16 GMT

By the way, to model such things you may use my Relacy Race Detector. It would show you precise execution history which leads to any possible outcome you are interested in.

Problem when ported from single-core to multicore

hyphone — Sun, 30 Jan 2011 05:46:39 GMT

Thank you for your replies.
I'll try the Relacy Race Detectortool to see the precise execution flow.

Problem when ported from single-core to multicore

Dmitry_Vyukov — Mon, 31 Jan 2011 08:07:30 GMT

As you see it can equally happen on a single-core machine. It does not relate to multicore, it's just a multithreading bug. As I said, the simplest way to fix it is to use 64-bit atomic loads and stores.