Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Lock-Free using cmpxchg8b...

Intel_C_Intel
Employee
6,531 Views
Your threading tips are sound. The thread aware memory pools are nice. =)


Here are some of my advanced threading tips: =)


You should spice up your threading site by adding a:

" Rock-Solid Lock-Free Algo's /w cmpxchg8b & cmp8xchg16b Tutorials " additon to your site.


You Intel guys really need to promote the cmpxchg8b and the cmp8xchg16b ( for Merced I think ) opcodes. When you code lock-free algo's using these, they make advanced threading a work of art. They are not just for creating locks you know. =)


I have implemented a 100% ABA proof lock-free IBM FreeList with your cmpxchg8b opcode:

FreeList.c


Joe Seigh and I have developed A 100% ABA proof " real " lock-free atomic pointer and I did a lock-free fifo queue with it:

AtomicPtrABATest.zip

The AtomicPtr combines 100% ABA proof lock-free algos with the ability for a thread to reference a shared pointer without having to already own a marshaled copy!


Also, look at my advanced fifo queue code:

Queue.c

This code is very " thread " friendly. Put this code up against a " normal " fifo shared queue, and it will blow it away.

You should add some similar code examples on the Intel threading sites, or just add a link to my page. ;)


What do you threading masters think about it?

;)

Thank you.
0 Kudos
30 Replies
ClayB
New Contributor I
2,492 Views
> cmp8xchg16 is documented in the Itanium architecture
> software developers manual. It's basically a compare
> 8 bytes, swap 16 bytes.

Thanks, Joe. This is what I thought it was from the name. I guess this corresponds to what the french paper defines as CAS2. I still can't see how this would work, though, as intended. The idea about avoiding the ABA problem is to compare (and exchange) both the pointer involved and the count of operations. If the pointer has ended up with the same value even after other threads have updated the structure, the count of operations will have changed to inform the original thread that it must recheck the pointer and counter values.

Maged Michael's paper from PODC 2002 points out that DCAS (his name for CAS2) needs to make the comparison of both values before instigating the exchange. This was as I thought, but did not seem to be laid out in the french paper. Unless, I thought, the memory layout was working with the CAS2 operation. That is, given a single memory address that was actually going to refer to a double word location. So, you would need to place the pointer and the reference count into adjoining memory locations, but could use separate (8-byte?) values for comparison and exchange. That made some sense, but I would think it would be better designed to have the same three arguments as the simple CAS, but double the size of the arguments which would require the user to place values in the hi and lo parts of the double area. Still, since this was only an idea in the french paper to solve an algorithmic problem, anything would be feasible.

Unfortunately, the cmp8xchg16b looks to only compare the first 8 bytes, but exchange 16 bytes based on that comparison. This seems to be the same as my first interpretation on the CAS2 primitive. If so, does this not suffer from the same drawbacks and not give the direct solution to the ABA problem that was given in the french paper? Since Michael states that the DCAS is not a primitive implemented on any architecture (and I don't think the cmp8xchg16b does a proper job of it), does he have the more realizable ABA avoiding algorithm?

I guess I too wonder why there was no cmpxchg16b implemented on Itanium. This would seem to have solved all sorts of extra problems that the 8 byte compare can't handle easily. I suppose we'll have to poll someone involved with Itanium development to find the answer.

-- clay


0 Kudos
jseigh2
Beginner
2,492 Views
DCAS is not the same as CAS2. DCAS works on 2 discontiguous words in memory. DCAS is the name this instruction has on the Motorola 68020 and 68030. CAS2 usually refers to a contiguous doubleword (not to be confused with DWORD) compare and swap. If the word size is 64 bits, then CAS would work on 64 bit operands and CAS2 would work on 128 bit operands. The mnemonics are not standard in this area, so it is easy to get confused when reading any papers on this topic.

The Itanium cmp8xchg16 will work for lock-free stacks if you increment the change count on both the push and pop operations, not on just one like you could do with CAS2. But it does break some algorithms that do require a compare on both words.

I just noticed Microsoft's latest documentation on their Interlocked Singly Lists. It says

Singly linked lists (SLists) are straightforward to implement and use in
32-bit code. However, it is challenging to implement them in 64-bit code
because the amount of data exchangeable by the native interlocked
exchange primitives is not double the address size, as it is in 32-bit code.


There's no problem with this on the Itanium, but I don't think Microsoft is too happy with a certain other 64 bit architecture that they've agreed to port to. :) It's not so much a case of lock-free being lock free, although that's nice. It's a question of async safety, meaning you can use it in interrupt handlers/callbacks. If you use locks to do an SLIST implementation, it's no longer async safe and code that was previously depending on this is now broken. If Microsoft had used non invasive list links (SLIST_ENTRY) they could work around it, but they didn't.

Good luck on getting answers from the architects. They usually don't like publishing the architecture rationales. All the ones I've participated in, they've always stripped that stuff out. I'd end up thinking that if I hadn't been at those meetings then I would have no idea how to use that particular feature.

Joe Seigh
0 Kudos
Intel_C_Intel
Employee
2,492 Views
> The Itanium cmp8xchg16 will work for lock-free stacks
> if you increment the change count on both the push
> and pop operations, not on just one like you could do
> with CAS2. But it does break some algorithms that do
> require a compare on both words.

My new lock free system will work with a normal cmpxchg or ll/sc on a 32 or 64 bit system. It also will work great with cmp8xchg16.

The system provides nodes that can be used in most lock-free algos. I have a lifo & fifo on my site, and I plan to add some more.

Also?

If you use my new algo with a QWORD CAS, you can compare and swap two nodes at once. That will make it easier to create lock-free double-linked lists!

;)

What do ya think of that?

> However, it is challenging to
> o implement them in 64-bit code
> because the amount of data exchangeable by the
> e native interlocked

We must disassemble the SList API on a 64-bit AMD, then on an Itanium II. That would be interesting indeed!


=)
0 Kudos
ravenous
Beginner
2,492 Views
Hello lockfree, i tried accessing your codes to test out your lockfree algorithms but apparently the links are down. Is it possible to update the links. Tks.
0 Kudos
Intel_C_Intel
Employee
2,492 Views
> Hello lockfree, i tried accessing your codes to test
> out your lockfree algorithms but apparently the links
> are down. Is it possible to update the links. Tks.

Here is the home page:

AppCore

All of it is in plain Win32 C.

The algo's currently work, but I have updated some of them. I will post the updates soon...


Also, I am going to start coding some lock-free pthread compatible objects...

Here is my lock-free semaphore, that uses pthreads:

lfsema.cpp

lfsema.h

atomic.h


Here is a pthread test program for the semaphore:

main.cpp

thread.cpp

thread.h

queue.h

mutex.h


Enjoy!

;)


Please comment on them if you want...
0 Kudos
ymchew
Beginner
2,492 Views
Dear Lockfree,

Hi, I am relatively new to this research area. As i am involving in a lock-free techniques related research project recently, I thus start to learn regarding this.

I would like to ask if anyone of the users here, especially Mr. Lockfree could share the idea on how to implement a lock-free linked-list using CAS. I am facing difficulty in finding related reference material as this area is still quite new. I would like to enquire the independency of CAS on different processor platform as well.

I would be appreciate if any one of you could provide some materials to enhance my understanding regarding this lock-free techniques.

Thanks in advance.

Best Regards,
0 Kudos
Intel_C_Intel
Employee
2,492 Views
> I would like to ask if anyone of the users here,
> , especially Mr. Lockfree could share the idea on how
> to implement a lock-free linked-list using CAS.

Welcome to state-of-the-art contention handling!

;)



Here are some links that I posted to c.p.t, where I am known as SenderX:

2 Lock-Free Linked Lists


> I am
> facing difficulty in finding related reference
> material as this area is still quite new. I would
> like to enquire the independency of CAS on different
> processor platform as well.

Any processor that has a CAS or LL/SC that operates on memory sizes >= sizeof( void* ) can work for lock-free linked-lists. The IBM SMR algo makes this statement true. I am currently working on a Win32/PThread SMR algo.

I will probably post it to c.p.t.



You should think about using a memory reclamation schema with your lock-free lists:

Here is a link to my personal lock-free proxy garbage collector and stack, that can be iterated in a 100% lock-free manor!:

Lock-Free Garbage Collector & Stack


Here is a link to another VERY useful lock-free garbage collector:

IBM SMR GC Algo

This algo. will elimate the ABA problem, and allow for normal CAS or LL/SC.

It also clearly explains why a lock-free garbage collector is needed for "Dyanamic/Non-Static" lock-free algos.



I am working an a very portable, 100% lock-free, FAST, linked-list that allows for a Push, Pop, and Iteration to execute in parallel!

I might post it here...

=P


P.S.

Lock-free algos, are being used more and more. Java and the linux kernel are a few. Win32's SList API is also lock-free.

Read here how a lock-free algo. vastly increased the Java Dictonary peformance and the linux kernel:

Java Lock-Free

Linux Lock-Free


Any questions on lock-free algos are on topic here, and are encouraged...

Enjoy!

Message Edited by intel.software.network.support on 12-09-2005 02:21 PM

0 Kudos
ymchew
Beginner
2,492 Views
Dear Lockfree,

I am appreciate your reply as it further enchance my understanding in regarding area. However, in my very first stage of my research project, I am asked to implement a lockfree linked list on a 6-CPU based server, powered at Pentium Pro 200MHz, in Linux OS environment, preferably.

In the CAS code that you posted, I realized that it might be suitable for 64-bit processor eg. Itanium or etc. I am wondering if the CAS32 code availale for reference.

The other question is once I have the CAS32 code, how could I link it in my C coding environment of the linked-list.

Please enlighten me. Thank you very much.

Regards,
0 Kudos
danmay
Beginner
2,492 Views

Your links below to source code are dead. Do you have updated links or

can you e-mail me the code ?

0 Kudos
Chris_M__Thomasson
New Contributor I
2,492 Views
0 Kudos
Reply