topic Re: Memory reactivity on Intel multicore processors in Intel® Moderncode for Parallel Architectures

Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Wed, 04 Jul 2007 10:19:10 GMT

Here is the code:

int buffer[1000];
void write_value(int value)
{
int pos = 0;
while (buffer[pos]) ++pos;
if (pos == 999) return;
buffer[pos] = value;
}

Assume function write_value() is executed by multiple threads on Intel multicore processor.

Question: What would be the rate of value loss depending on (1) thread count, depending on (2) core count, depending on (3) whether L2 cache is shared between cores or not?

Question is about hardware, question is not about possible context switch between buffer cell check and store.

Thank you
Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Wed, 04 Jul 2007 14:34:57 GMT

Question: What would be the rate of value loss depending on (1) thread count, depending on (2) core count, depending on (3) whether L2 cache is shared between cores or not?

And depending on (4) frequency of calling function write_value() by threads? I.e. every thread periodically call function write_value() and then make some other work for X cycles. How value loss frequency depends on X?

It would be great to see tables like this:

core count = 2, shared L2 cache
X = 0 (threads only call write_value() in tight loop), value loss rate = 10%
X = 100, value loss rate = 5%
X = 1000, value loss rate = 1%

core count = 4, two L2 caches, every L2 cache shared between 2 cores
X = 0, value loss rate = 30%
X = 100, value loss rate = 10%
X = 1000, value loss rate = 3%

I don't have access to multicore Intel processors, so I can't test this. Anyway you can provide more information on this, and for wider range of processor architectures, and maybe some other details...

Thank you
Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

jimdempseyatthecove — Thu, 05 Jul 2007 16:42:14 GMT

Dmitriy,

You should not write a shared write routine in the manner you described. Try something along the lines of

volatile long buffer[1000];

void init_buffer()
{
 for(i=0;i<1000;++i) buffer=0;
}

bool write_value(int value)
{
 if(value == 0) return false; ! Invalid input
 int pos;
 for(pos=0;pos<1000;++pos)
 {
 if(!buffer[pos])
        {
     if(InterlockedCompareExchange(&buffer[pos],value,0) == 0)
                return true; ! insert successful
        }
 }
    return false; ! Insert failed

}

Jim Dempsey

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Thu, 05 Jul 2007 17:35:28 GMT

JimDempseyAtTheCove:
Dmitriy,

You should not write a shared write routine in the manner you described.

Please, can you explain in more detail, why I should not write a shared write routine in this manner?
I don't see problems here (except that some values can be lost). Stores are atomic. Other threads eventually will see stores made by other threads.

JimDempseyAtTheCove:
Try something along the lines of

Why?
You propose code that...let me guess...about 100 times slower than my. And loads cache coherency protocol more, so less scalable. Is there earnest reason to use such code?

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Thu, 05 Jul 2007 17:41:44 GMT

I am trying to understand how much cycles it takes to core to flush it's store buffer.

I think (hope) that with 4 cores and frequency of write_value() calls about 1 call per 1000 cycles, value loss rate will be low enough, about <5%.

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

jimdempseyatthecove — Thu, 05 Jul 2007 18:48:20 GMT

Dmitriy,

If your intention is to reliably store data into a shared buffer without loss of data then use the InterlockedCompareExchange or an interlocked increment with static index. In your original code you could potentially loose more data than is stored (depending on the nature and phase of the programs).

However, it appears from your latest commentsthat your interest is in flushing cache.

If flushing or invalidating cache is your interest then I suggest you determing the size of the cache line, then walk the zone of interest in the cach in increments of the cache line. If you use SSE3 instructions (on todays processors) you can write 8 or 16 bytes at a time. If your system has a 16 byte cache line then writing 16 bytes per iteration to aligned data performs the operation using a write verses a read-modify-write as with you long buffer technique.

There are additional instructions available to invalidate and/or flushcache lines without or with writes to memory depending on your interests.

It may help if you provide detailed description of what you want to do as opposed to asking for comments on code, the purpose ofwhich is not disclosed.

Jim Dempsey

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Thu, 05 Jul 2007 19:09:14 GMT

JimDempseyAtTheCove:
It may help if you provide detailed description of what you want to do as opposed to asking for comments on code, the purpose ofwhich is not disclosed.

Yes, of course.
Please, see full code and my intention here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/844d08c4eeb5c5d5

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Thu, 05 Jul 2007 19:20:03 GMT

randomizer:

Yes, of course.
Please, see full code and my intention here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/844d08c4eeb5c5d5

Short conclusion:
I propose algorithm and implementation for multiple-producer/single-consumer queue. Enqueue and dequeue operations don't issue memory barriers and atomic read-modify-write operation at all. Algorithm based on "hazard" storing that I show in first post. But algorithm also includes some additional tricky logic to cope with lost nodes, so no lost nodes in result.

The main question is:
What will be the frequency of node loss in shared buffer under different workloads and on different platforms?

The good news is:
Frequency up to 50% or even up to 66% is OK. Because queue algorithm can successfully restore all other nodes. Producer have to only episodically successfully put nodes to shared buffer.

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Thu, 05 Jul 2007 19:27:32 GMT

JimDempseyAtTheCove:
If your intention is to reliably store data into a shared buffer without loss of data...

No. I it would be enough to only episodically successfully store data to shared buffer.

JimDempseyAtTheCove:
then use the InterlockedCompareExchange or an interlocked increment with static index.

This is too... boring :)

JimDempseyAtTheCove:
In your original code you could potentially loose more data than is stored (depending on the nature and phase of the programs).

If I loose two times more that I store, it is OK.
If more... the question exactly about this - can I loose more?

JimDempseyAtTheCove:
However, it appears from your latest commentsthat your interest is in flushing cache.

No. But as I understand data loss frequency related to store buffer flushing speed.

JimDempseyAtTheCove:
Jim Dempsey

Thank you for interest and for answers.

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

jimdempseyatthecove — Fri, 06 Jul 2007 16:39:26 GMT

Dmitriy,

So your intention is to provide for a Queue with error recovery due to potential adverse interaction (competition) for an emptycell in the queue.

Observe your code with state numbers

1) while (buffer[pos]) ++pos;
2) if (pos == 999) return;
3) buffer[pos] = value;

I have not downloaded your code (not permitted to) so I cannot comment on your implementation specifically. However, I can comment in general based on your implementation of the enqueue routine.

Presumably you will insert something into the queue and then test immediately or shortly thereafter to see if that item remains in the queue. If the item does not remain in the queue you would assume that contention for the cell cause the same cell to be used by multiple threads. That is to say multiple threads issue statement 1) above and observe the same empty cell, then multiple threads issue statements 2 and 3 to update the same cell in the buffer with the last thread to "simultaneously" update being the "winner". Then presumably all threads examine the queue for their value and if not found (winner finds its entry in buffer) re-queue the value.

The problem with this is one of the competing threads that identify [pos] in buffer as being available could stall for a considerable time between states 1) and 3). This could be due to the O/S running something else in lieu of the competing thread. This stall time could be on the order of several milliseconds or even much longer (several seconds). Therefore the stall time will at times exceed the dwell time that you assume is safe for reexamination of buffer for sucessful insertion.

Also, in the code I suggested there is a similar problem. Assume a thread scans for available cell at statement 1). Then, before it inserts data the thread stalls for a long time (O/S context switch). During the stall time for that thread the cell is used by a different thread and then during the same stall time the de-queue process removes the entry and removes all entries and resets the queue to an empty state (all cells 0 now). Now then the stalled thread successfully inserts its value into the middle of the empty queue.

The problem now is the errant entry has empty cells before it. The entry is not lost, but the dequeue process will not fetch the entry for processing until the prior cells get reused (the errant cell will be skipped during the next fill pass).

Jim Dempsey

Re: Memory reactivity on Intel multicore processors

jimdempseyatthecove — Fri, 06 Jul 2007 17:01:27 GMT

Dmitriy,

If you wish to avoid Interlocked instructions then might I suggest an alternative technique that each thread has a private queue for enque to a resourceand that the server thread for the resource dequeues from each of the private queues. The enque process never has contention (except for queue full condition). And the dequeue is only delayed marginaly to find that queues are empty. This eliminates lost entries and the time spent for verifying lost entries at the expense at peeking into potentially empty queues.

On a 4 core system (4 processing theads)each shared resource would have 4 ring buffers (queues).

Jim Dempsey

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Sun, 08 Jul 2007 18:48:25 GMT

JimDempseyAtTheCove:
Dmitriy,
I have not downloaded your code (not permitted to) so I cannot comment on your implementation specifically. However, I can comment in general based on your implementation of the enqueue routine.

I post code to this forum. Please see topic named "MPSC FIFO Queue w/o atomic_rmw/membars".

JimDempseyAtTheCove:
Presumably you will insert something into the queue and then test immediately or shortly thereafter to see if that item remains in the queue...

No. All this is not the case for my algorithm.

Dmitriy V'jukov

Re: Memory reactivity on Intel multicore processors

Dmitry_Vyukov — Sun, 08 Jul 2007 18:59:34 GMT

JimDempseyAtTheCove:
If you wish to avoid Interlocked instructions then might I suggest an alternative technique that each thread has a private queue for enque to a resourceand that the server thread for the resource dequeues from each of the private queues. The enque process never has contention (except for queue full condition). And the dequeue is only delayed marginaly to find that queues are empty. This eliminates lost entries and the time spent for verifying lost entries at the expense at peeking into potentially empty queues.

On a 4 core system (4 processing theads)each shared resource would have 4 ring buffers (queues).

Yes, this is rational suggestion.
If there are few threads, then - yes. It will be very good solution.
But I am targeted and thinking about manycore machines with, for
example, 100 cores (Intel promise 80 cores in 5 years). And you can
have, for example, 2 threads per core. So 200 threads. Every consumer
must pull 200 spsc queues... And to block when all queues are empty,
consumer must check all 200 queues 2 times...
And to determine total node count in N spsc queues, consumer have to
check all N queues too. Node count needed for load-balancing,
statistics, feedback etc...
I am thinking about solution when consumer have, for example, N/10
mpsc queues instead of N spsc queues (N - number of threads). So and
number of queues would be moderate and contention would be moderate
too.

Dmitriy V'jukov