Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

optimal read/write of integer between 2 threads

dthtw
Beginner
348 Views
Hi,

I am currently working on the following problem : how do I write the fastest possible code to have a thread X write an int that is read by a thread Y, while making sure that the read performance in Y is also optimal ?
My constraints are the following :
1- The processor is an Intel i7 2.93 GHz with 4 cores
2- The os is Ubuntu Linux 8.10 / Kernel 2.6.27-11-generic
3- code has to be C (or C++)
4- Performance is far more important than portability (in fact portability for my specific case is 100% useless).
5- by setting appropriate affinities, threads X and Y can be on cores sharing L3 cache (to my understanding L1 and L2 caches are private on i7).
6- on average, there should be 20 times more reads than writes.

Obviously the code should be lock-free.
A simple solution is to have the int variable be volatile, but in this case, we do not profit from the cache memory.

I started looking at different forums, and gathered the following thoughts :
1- use the function dma_cache_inv to invalidate the cache at a location following a write in X. Hence the next time Y reads, it fetches the value from memory
2- there are also some standard Linux functions like flush_cache_all, flush_cache_page
3- I am not an assembly expert, but how difficult would it be to write an __asm__ section in C which invalidates the cache line containing the value which has changed...
4- We do not need mutexes, since the write/read of an integer is atomic. We just need to manage cache memory staleness.

I am just looking for expert advice on how to do this the fastest possible.
Thanks for your replies, and I hope this was not answered before.

Regards,
D.
6 Replies
Dmitry_Vyukov
Valued Contributor I
Just declare your C/C++ int as volatile. C/C++ (as well as Java/C#) volatiles DO NOT affect the cachability of the memory in any way. They affect only compiler optimizations.
This is the fastest and lightest way to synchronize your int.
However, be aware that such synchronization will NOT synchronize any data associated with the int. If there is no associated data, then this is not a problem.

dthtw
Beginner
Quoting - Dmitriy Vyukov
Just declare your C/C++ int as volatile. C/C++ (as well as Java/C#) volatiles DO NOT affect the cachability of the memory in any way. They affect only compiler optimizations.
This is the fastest and lightest way to synchronize your int.
However, be aware that such synchronization will NOT synchronize any data associated with the int. If there is no associated data, then this is not a problem.


Thanks for your quick reply, Dmitriy.

Just to clarify :
1- There is indeed no associated data : it is a pure "value" int, containing meaningful data for my program. It is not a hidden pointer to data.
2- If I get it right : a volatile int will be (1) written to main memory, (2) caches will be flushed, (3) reader cores will update their cache on the next read, (4) on any subsequent read, the value can be taken from the cache -> am I correct ? Hence the cache will effectively be used for reader threads/cores.

jimdempseyatthecove
Honored Contributor III

D.

Your explanation of (and possibly understanding of) the problem is incomplete.

Is the value being written, then changed, then written again, ... is it written to the same location?

volatile int sharedVar = 0;
...
sharedVar = value();
...
sharedVar = value();
...

if so,

Then what happens when the threads reading the variable do not read the variable between two writes?
Do you lose data or does it not matter? (some code requirements can accept loss of readings).

Can the thread writing the variable be throttled (i.e. can it slow down without losing data)? An I/O device might not be able to be throttled without loss of data.

When the writing thread can be throttled, and assuming you have an integer value that can be used to indicate empty location (e.g. 0, 0x80000000, 0xcccccccc, ...) then consider creating a ring buffer initialized to "empty location".
The rules for writing and reading are

Both threads start at beginning of buffer

Reading thread waits for a cell to transition from "empty location" to data (not "empty location"), copies the data to a local var, overwrites the data with the "empty location" value, advances its index into the buffer modulo bufferSize, then uses the data. (insert _mm_pause(); in the wait loop, or insert SwitchToThread();)

Writing thread generates data, loops on while(next location of buffer not equals "empty location") _mm_pause();, writes the data (overwrites "empty location"), advances its index into the buffer modulo bufferSize. Then generates the next data value.

As Dmitriy indicated, if you use volatile for your shared variables, then the compiler will not optimize the access to the variable out of your code.

When your buffer is large enough, the writes will make it out to the caches of the other cores with higher latency than if you insert flushing instructions. When looking for throughput, look at NOT using flushing. When wishing to reduce latency, add flushing (but this will slow down throughput).

The lowest latency would be to schedule the two threads sharing the same L1 cache (if HT is available).
The fastest throughput may be to schedule the two threads sharing the same L2 cache,
or it may be to a thread in another L2, depending on the quantity of data (the other L2 will share via L3 or memory).

Throughput and latency tend to work against each other.

Jim Dempsey.
anthony_williams
Beginner
Quoting - dthtw


Firstly, leave the cache management to the CPU. It's very good at it. That said, pinning your threads to cores so they share a cache might help.

If it's a plain integer you want to pass between threads, just use a "volatile int" in C or C++ code. It should be declared volatile to ensure that the compiler actually emits every read and write, and doesn't cache it in a register or anything.

If all you are doing is plain reads and writes (x= some_expression(), some_local=x) then that should be sufficient, as plain loads and stores to aligned variables (which normal variables are in C or C++ unless you do strange things or fiddle with compiler options) are atomic.

The problem is if you do things like x++ or x+=6. These will NOT be atomic, even for volatile variables. If there is only one thread updating the variable this isn't a problem, since other threads will see either the value before or the value after, and all is well. From your description, it appears this is the case, but it is important to know. If there is more than one thread writing to the variable, this is a problem, as you may get intermingled reads and writes from different threads, thus discarding the write from one of the threads. In this case you will have to use either assembly language or compiler intrinsics such as InterlockedIncrement.
dthtw
Beginner

Quoting - jimdempseyatthecove

Hi Jim,

Thanks for your detailed response.

To be more precise :
1- there is only one writer thread X, which writes a value to the same memory location each time.
2- there is one reader thread Y for now, but maybe several in the future.
3- the writer thread blocks on a socket, listening for data, and writes the value when received.
4- the reader thread does a good deal more things, but uses this value very often and needs the most up-to-date value. My initial guess is that there would be approximately 20 times more reads than writes.
5- it does not matter if 2 different values are written by X, and the first one is missed by Y : at any point in time, Y needs the most up to date value of the int variable.

I like your idea of the ring buffer, but since point 5, I do not think I need it.

I will go for volatile.

Thanks all for your replies.
jimdempseyatthecove
Honored Contributor III

>>5- it does not matter if 2 different values are written by X, and the first one is missed by Y : at any point in time, Y needs the most up to date value of the int variable.

Then a single location will work best.

From a latency perspective

Threads sharing the same L1 (HT companion threads)
Threads sharing the same L2 (some 4-core processors have two L2 caches, 2 cores per L2; on Core i7 each core has a private L2)
Threads sharing the same L3 (provided processor has L3)
Threads sharing the same memory port (available on NUMA systems)
Then NUMA systems may have multiple "hops" to memory.

Usually a write to memory forces cache eviction in the other caches (those not shared by the current thread), but some of the newer techniques may write to the other caches in place of eviction. This may occur for cores within the same die, depending on the design. Cores outside the die may be left with eviction, although I cannot be sure that two dies on one socket would not support passing the data. This requires more interconnects.

If you can program the hardware device to write to a fixed buffer, and have all threads map a memory page to that buffer, then you can reduce the latency since all threads essentially become the listener. E.g. if this is a USB device, all your application threads could monitor the communication buffer directly.

A potential problem with a single listener thread is that it can be scheduled out at an inopportune time. This may or may not be of consequence to your application.

Jim Dempsey
