Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

out-of-order and several volatiles

OTorg
New Contributor III
917 Views
I implement a "single writer - multiple readers" primitive for several variables. In simplified form, it looks like this:
[cpp]class X { protected: volatile DWORD m_vers1; volatile DWORD m_a; volatile DWORD m_b; volatile DWORD m_c; volatile DWORD m_d; volatile DWORD m_vers2; public: X() : m_vers1( 0 ) , m_vers2( 0 ) {} void Write( DWORD a, DWORD b, DWORD c, DWORD d ) { ++m_vers2; m_a = a; m_b = b; m_c = c; m_d = d; ++m_vers1; } void Read( DWORD& ra, DWORD& rb, DWORD& rc, DWORD& rd ) { for ( ;; ) { DWORD vers1 = m_vers1; ra = m_a; rb = m_b; rc = m_c; rd = m_d; DWORD vers2 = m_vers2; if ( vers2 == vers1 ) break; } } };[/cpp]
Write() and Read() are called from different threads. I'm worried about out-of-order execution principe. Volatile keyword ensures that compiler not change the order. And how to say it to processor? Or this can not worry? How such is made inside critical sections?
0 Kudos
19 Replies
jimdempseyatthecove
Honored Contributor III
917 Views
As coded above:

a) When in Read ( vers2 == vers1 ) you will have a consistant load of m_a...m_d
(including the initialuninitialized data)
b) Read may experience drop outs. e.g. Read m_vers1==1, m_verse1==3 (miss sequence 2)
c) Multiple readers may read the same packet

Jim Dempsey
0 Kudos
OTorg
New Contributor III
917 Views
As coded above:

a) When in Read ( vers2 == vers1 ) you will have a consistant load of m_a...m_d
(including the initialuninitialized data)
b) Read may experience drop outs. e.g. Read m_vers1==1, m_verse1==3 (miss sequence 2)
c) Multiple readers may read the same packet

Jim Dempsey

a) In the code above, I omitted parts not related to the subject.First write is always present (initializing). Spinning/sleep is also present in Read cycle. Writes are rare (1 per second), reads arerelatively frequent (tens/hundreds per second).

b), c) It is correct by design.

Maybe you have something to say on the subject?

0 Kudos
jimdempseyatthecove
Honored Contributor III
916 Views
Try something like

[cpp]class X { protected: volatile DWORD m_vers1; volatile DWORD m_a; volatile DWORD m_b; volatile DWORD m_c; volatile DWORD m_d; volatile DWORD m_vers2; public: X() : m_vers1( 0 ) , m_vers2( 0 ) {} void Write( DWORD a, DWORD b, DWORD c, DWORD d ) { while( m_vers1 != m_vers2 ) continue; m_a = a; m_b = b; m_c = c; m_d = d; ++m_vers1; } void Read( DWORD& ra, DWORD& rb, DWORD& rc, DWORD& rd ) { for ( ;; ) { while( m_vers1 == m_vers2 ) continue; DWORD vers1 = m_vers1; ra = m_a; rb = m_b; rc = m_c; rd = m_d; DWORD vers2 = vers1 - 1; // one behind // swap only if verse2 one behind verse1 if ( CAS(&m_vers2, verse1, &verse2 ) ) break; } } };[/cpp]
Jim Dempsey
0 Kudos
SergeyKostrov
Valued Contributor II
916 Views
I would better usea critical section because they will guarantee a right execution order andhighly optimized
onmany platforms.

Best regards,
Sergey
0 Kudos
OTorg
New Contributor III
916 Views
Try something like
[cpp] void Write( DWORD a, DWORD b, DWORD c, DWORD d ) { while( m_vers1 != m_vers2 ) continue; m_a = a; m_b = b; m_c = c; m_d = d; ++m_vers1; } void Read( DWORD& ra, DWORD& rb, DWORD& rc, DWORD& rd ) { for ( ;; ) { while( m_vers1 == m_vers2 ) continue; DWORD vers1 = m_vers1; ra = m_a; rb = m_b; rc = m_c; rd = m_d; DWORD vers2 = vers1 - 1; // one behind // swap only if verse2 one behind verse1 if ( CAS(&m_vers2, verse1, &verse2 ) ) break; } } };[/cpp]
Jim Dempsey

I don't need such behaviour: "one published message should be captured by a one consumer". I need one-to-many scheme: message written can be consumed by 0, 1, or many readers. The same goes for the next message, etc.
And your answer does not apply to the topic of discussion. I am interested in the following:
[cpp]volatile DWORD m_y; volatile DWORD m_z; DWORD y = m_y; DWORD z = m_z; [/cpp]
Whether the CPU will execute instructions in a following order always: first read m_y, then m_z? The same applies to the writes. I did test the validity of this code, and got the correct results. But I just could not get into the wrong situation. So I'm interested in: is it correct conceptually? This applies to the present and future processors.

0 Kudos
OTorg
New Contributor III
916 Views
I would better usea critical section because they will guarantee a right execution order andhighly optimized
onmany platforms.

Best regards,
Sergey

My variables are located in a shared memory. Writer and readers run in different processes.Critical sections are not an interprocess primitives. Second, critical sections are not effective in a single-writer-multiple-readers tasks.

So, I had a choice: implement swmr-interprocess critical section, or think up something simpler for this particular (and similar) tasks...

0 Kudos
SergeyKostrov
Valued Contributor II
917 Views
Quoting dj_alek
...My variables are located in a shared memory. Writer and readers run in different processes...


Your initial statement was "...fromdifferent threads...". It wasn't clear that aWriter and Readers are working
in different processes. Thanks for clarification.

0 Kudos
OTorg
New Contributor III
917 Views
Quoting dj_alek
...My variables are located in a shared memory. Writer and readers run in different processes...


Your initial statement was "...fromdifferent threads...". It wasn't clear that aWriter and Readers are working
in different processes. Thanks for clarification.

Sorry, when I wrote the post, then focused on the out-of-order execution, not the nuances of use. Code above is a sample only and is intended for better understanding. I am interested in the out-of-order executionprinciple itself.

0 Kudos
jimdempseyatthecove
Honored Contributor III
917 Views
The problem with your original code was:

a) The writer could write multiple times before a reader could read. Eeven though you have multiple readers and the chance is low for this to happenit will still happen. Possible causes for missed reads (or bunged up data) are: Process time for all readers exceed production time forwriter. Reader threads get preempted or suspended (e.g.page file expansion, I/O block time, etc...).This causes missed packets.

b) The writer could begin a second write while the reader is in the process of reading the prior write. This cause bunged up data packets.

c) Multiple threads could have read the same or bunged up packet.

The code example I provided earlier avoids these issues.

Using a critical section is heavier weight. However, a critical section does contain a SpinWait, which in this case does you no good (you have a polling structure).

To incorporate thread suspension, you will need to use a condition variable in Linux or Event in Windows. This will add overhead in the writer and latency/overhead in the reader. Since you state your write rate is at relativelylong intervals, it would seem like use of condition variable or event should be investigated. Note, if you are using a threading toolkit, it may have alternate means to trigger (enqueue) a reader task.

Jim Dempsey
0 Kudos
OTorg
New Contributor III
917 Views
The problem with your original code was:

a) The writer could write multiple times before a reader could read. Eeven though you have multiple readers and the chance is low for this to happenit will still happen. Possible causes for missed reads (or bunged up data) are: Process time for all readers exceed production time forwriter. Reader threads get preempted or suspended (e.g.page file expansion, I/O block time, etc...).This causes missed packets.

b) The writer could begin a second write while the reader is in the process of reading the prior write. This cause bunged up data packets.

c) Multiple threads could have read the same or bunged up packet.

The code example I provided earlier avoids these issues.

Using a critical section is heavier weight. However, a critical section does contain a SpinWait, which in this case does you no good (you have a polling structure).

To incorporate thread suspension, you will need to use a condition variable in Linux or Event in Windows. This will add overhead in the writer and latency/overhead in the reader. Since you state your write rate is at relativelylong intervals, it would seem like use of condition variable or event should be investigated. Note, if you are using a threading toolkit, it may have alternate means to trigger (enqueue) a reader task.

Jim Dempsey

I understand that under the concept of "bunged up" you mean lost (no one read) messages, not the a-b-c-d data inconsistency. So:

a) It is correct by design.

b) It is correct by design.

c) It is correct by design.

Do you even read my last response to you?I am interested in aout-of-order processorexecution principle, not in particular data exchange model.

0 Kudos
jimdempseyatthecove
Honored Contributor III
917 Views
Bunged up means a reader reads something like:

a(at t=1), b(at t=1), c(at t=2), d(at t=2)

The above would occue if thet=2 write occured while reader waspart way through the read of t=1 packet.

The IA32 and Intel64 architecture, upto but not necessarily including the Transactional Memory Extensions (22mm Haswell feature) is designed to preserve write ordering (although adjacent writes to same cache line may be write combined). Should your a and b reside in one cache line and, c and d reside in a different (next) cache line then you should observe a&b updated first, c&d updated second.

With Haswell TSX/HLE, all the updates within the "protected" section will appear to the other threadsto have been updated simulteaneously (e.g. memory bus lock, multiple writes, memory bus unlock).
0 Kudos
jimdempseyatthecove
Honored Contributor III
917 Views
>> How such is made inside critical sections?

[cpp]class X { protected: volatile DWORD m_vers1; volatile DWORD m_a; volatile DWORD m_b; volatile DWORD m_c; volatile DWORD m_d; volatile DWORD m_vers2; Lock_t ReadLock; // Lock_t provided in toolkit X() : m_vers1( 0 ) , m_vers2( 0 ) { } void Write( DWORD a, DWORD b, DWORD c, DWORD d ) { // assure reader has finished reading prior write while( m_vers1 != m_vers2 ) continue; m_a = a; m_b = b; m_c = c; m_d = d; ++m_vers1; // now one ahead of m_verse2 } void Read( DWORD& ra, DWORD& rb, DWORD& rc, DWORD& rd ) { for ( ;; ) { { // Scope for LockLock LockLock lock(ReadLock); // returns from ctor when we have lock // i.e. entered critical section // test for writer next write if( m_vers1 != m_vers2 ) { // have new data ra = m_a; rb = m_b; rc = m_c; rd = m_d; ++m_verse2; // tell writer we have packet return; // dtor of LockLock lock(ReadLock) releases lock } // if( m_vers1 != m_vers2 ) } // End scope for LockLock, releases lock } // for(;;) } // Read };[/cpp]
Jim Dempsey
0 Kudos
jimdempseyatthecove
Honored Contributor III
917 Views
The method using the Lock_t (generally supplied by threading toolkit, such as Boost, TBB, Cilk Plus, ...) requires you add the lock variable to the protecte object and may add additional computational overhead (dependent on implimentation). The earlier post with the CAS, using Windows InterlockedCompareExchange or GCC Linux __sync_bool_compare_and_swap functions, will produce inline code and require no space in your structure. The underlaying LockLock will use these functions plus have some anciliary code for the spin wait loop, which is redundant in your implimentation since your for(;;) is spinning without wait.

For additional information look at:

http://software.intel.com/en-us/articles/choosing-between-synchronization-primitives/

Jim Dempsey
0 Kudos
OTorg
New Contributor III
917 Views
I found rules of the game!
Intel Architectures Software Developer's Manual, Volume 3,Chapter multiple-processor management, memory ordering:
1.Neither Loads Nor Stores Are Reordered with Like Operations
2.Stores Are Not Reordered With Earlier Loads
3.Loads May Be Reordered with Earlier Stores to Different Locations
4. ... etc ...
Thanks to everyone who took part in the discussion!
0 Kudos
OTorg
New Contributor III
917 Views
Bunged up means a reader reads something like:

a(at t=1), b(at t=1), c(at t=2), d(at t=2)

The above would occue if thet=2 write occured while reader waspart way through the read of t=1 packet.

To protect from the situation you described,m_vers1/m_vers2 variables are used. Ifa(at t=1), b(at t=1), c(at t=2), d(at t=2), then vers2 will not be equal to vers1. Therefore, Reader will do one more (or several) iteration of the loop, till the data consistency: a(at t=N), b(at t=N), c(at t=N), d(at t=N)and, of course,vers1(at t=N), vers2(at t=N).


0 Kudos
OTorg
New Contributor III
917 Views
The method using the Lock_t (generally supplied by threading toolkit, such as Boost, TBB, Cilk Plus, ...) requires you add the lock variable to the protecte object and may add additional computational overhead (dependent on implimentation). The earlier post with the CAS, using Windows InterlockedCompareExchange or GCC Linux __sync_bool_compare_and_swap functions, will produce inline code and require no space in your structure. The underlaying LockLock will use these functions plus have some anciliary code for the spin wait loop, which is redundant in your implimentation since your for(;;) is spinning without wait.

Yes, I want to avoid the use of synchronization primitives. For this task, I think they are redundant.

By the way, I learned that the instructions prefixed with lock (which are used within a critical section), are also amemorybarrier (Intel Architectures Software Developer's Manual, Volume 3). And thus guarantee the correct order of execution.

0 Kudos
OTorg
New Contributor III
917 Views
Quoting dj_alek
The method using the Lock_t (generally supplied by threading toolkit, such as Boost, TBB, Cilk Plus, ...) requires you add the lock variable to the protecte object and may add additional computational overhead (dependent on implimentation). The earlier post with the CAS, using Windows InterlockedCompareExchange or GCC Linux __sync_bool_compare_and_swap functions, will produce inline code and require no space in your structure. The underlaying LockLock will use these functions plus have some anciliary code for the spin wait loop, which is redundant in your implimentation since your for(;;) is spinning without wait.

Yes, I want to avoid the use of synchronization primitives. For this task, I think they are redundant.

By the way, I learned that the instructions prefixed with lock (which are used within a critical section), are also amemorybarrier (Intel Architectures Software Developer's Manual, Volume 3). And thus guarantee the correct order of execution.

Thinking further, I came to conclusion: in order to insure against the misunderstanding of my intentions by the compiler, codecan be modified like this:

[cpp]class X { protected: volatile DWORD m_vers1; volatile DWORD m_a; volatile DWORD m_b; volatile DWORD m_c; volatile DWORD m_d; volatile DWORD m_vers2; public: X() : m_vers1( 0 ) , m_vers2( 0 ) {} void Write( DWORD a, DWORD b, DWORD c, DWORD d ) { _InterlockedIncrement( reinterpret_cast< volatile LONG* >( &m_vers2 ) ); m_a = a; m_b = b; m_c = c; m_d = d; _InterlockedIncrement( reinterpret_cast< volatile LONG* >( &m_vers1 ) ); } void Read( DWORD& ra, DWORD& rb, DWORD& rc, DWORD& rd ) { for ( ;; ) { DWORD vers1 = (DWORD)_InterlockedCompareExchange( reinterpret_cast< volatile LONG* >( &m_vers1 ), 0, 0 ); ra = m_a; rb = m_b; rc = m_c; rd = m_d; DWORD vers2 = (DWORD)_InterlockedCompareExchange( reinterpret_cast< volatile LONG* >( &m_vers2 ), 0, 0 ); if ( vers2 == vers1 ) break; } } };[/cpp] But it adds the processor bus locks and thus reduces productivity...
0 Kudos
jimdempseyatthecove
Honored Contributor III
917 Views
With Single Producer you do not need the interlocked in the Write function.
If you wish, you can insert a _mm_sfence() following the ++ in the Write function. The affects this may have on your operation are: Write will be a tad slower, Read will (may) observe the ++ earlier. In the scheme of things, I think you are looking for throughput not latency between write and read.

The CAS (InterlockedCompareExchange) need only be performed once in your Read. Capture the before value of m_verse1, read the data, CAS the m_vers2 with verse1-1 to verse1. if true then return, else loop.
[cpp]for ( ;; ){ DWORD vers1 = m_vers1; ra = m_a; rb = m_b; rc = m_c; rd = m_d; if(_InterlockedCompareExchange( reinterpret_cast< volatile LONG* >( &m_vers2 ), // loc verse1, // xchg verse1-1) // cmp == verse1-1) // == cmp val break; } [/cpp]Here you could optionally insert an _mm_rfence() in front of the capture of verse1.

Jim Dempsey
0 Kudos
OTorg
New Contributor III
917 Views
In the scheme of things, I think you are looking for throughput not latency between write and read.

In this particular task either bandwidth or latency are not the main criteria.I'm concerned about the correctness of the implementation (data consistency) and the absence of synchronization primitives.

I think the topic is closed.Thanks to all!

0 Kudos
Reply