Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Porting to a new platform

pvonkaenel
New Contributor III
Hi,

I thought I saw documentation outlining the steps to follow to port TBB to a new architecture, but I cannot find it now. Is there such a beast, and if not, do you have any recommendations on how to start a port? The architecture runs Linux, so I hope it is not too difficult.

Thanks,
Peter
14 Replies
RafSchietekat
Valued Contributor III
build/index.html#port

Any details?
pvonkaenel
New Contributor III
Quoting - Raf Schietekat
build/index.html#port

Any details?

Thanks for the pointer. I would like to try to get TBB working on a 64-core Tilera processor which runs a modified version of Linux.

Peter

RafSchietekat
Valued Contributor III
I've always felt that number of cores should be some even power of 2. :-) Would registering provide me access to instruction set documentation for atomic operations?
pvonkaenel
New Contributor III
Quoting - Raf Schietekat
I've always felt that number of cores should be some even power of 2. :-) Would registering provide me access to instruction set documentation for atomic operations?

I'm not sure - I don't know the details of the licensing arrangement ... which is unfortunate since I have questions on how to implement the 4- and 8-byte compare-and-exchange operations required for TBB. By the way, do you know why the 8-byte version is required? That's my current sticking point.

Peter
RafSchietekat
Valued Contributor III
Strange to hide the documentation behind a captcha and a registration form... that doesn't work.

You can always start out with locked implementations, and for 8 bytes you may never need anything better, but if you don't find direct support for 4 bytes I wouldn't expect really dazzling performance.
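
As a rough illustration of the locked approach, here is a minimal C++ sketch of a compare-and-swap emulated with one central mutex. The names `sketch::locked_cas` and `sketch::global_cas_mutex` are made up for this example and are not TBB's actual machine-layer interface; `std::mutex` merely stands in for whatever lock the platform provides.

```cpp
#include <cstdint>
#include <mutex>

namespace sketch {

// Hypothetical single central lock guarding every emulated atomic operation.
inline std::mutex& global_cas_mutex() {
    static std::mutex m;
    return m;
}

// Compare-and-swap emulated with a lock.
// Returns the value observed at *ptr; the store took place
// if and only if the returned value equals `comparand`.
template <typename T>
T locked_cas(volatile T* ptr, T value, T comparand) {
    std::lock_guard<std::mutex> guard(global_cas_mutex());
    T observed = *ptr;
    if (observed == comparand) {
        *ptr = value;
    }
    return observed;
}

} // namespace sketch
```

This is only atomic with respect to other accesses that take the same mutex. On a 32-bit platform, plain 8-byte loads and stores would probably have to go through the lock as well, since they may not be performed in a single memory access.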
Alexey-Kukanov
Employee
Quoting - pvonkaenel
I'm not sure - I don't know the details of the licensing arrangement ... which is unfortunate since I have questions on how to implement the 4- and 8-byte compare-and-exchange operations required for TBB. By the way, do you know why the 8-byte version is required? That's my current sticking point.

Peter

On 32-bit platforms, 8-byte CAS is necessary for, e.g., atomic. I doubt, though, that it is used inside TBB. On 64-bit platforms, 8-byte CAS is a must-have primitive required for atomic operations with pointers and pointer-size integers.
RafSchietekat
Valued Contributor III
"On 64 bit platforms, 8-byte CAS is a must-have primitive required for atomic operations with pointers and pointer-size integers."
If there's nothing better than locks, but they work, perhaps have a look at what I did for the port to PA-RISC, with a bank of central mutexes from which one is picked based on the location of the atomic. That should greatly relieve the level of contention compared to having a single central mutex. It's a workable solution (a CAS hardware primitive isn't strictly required even if highly advisable), but to make it run more smoothly you should probably also provide direct support for single and double bytes, which now ride piggy-back on top of 4-byte CAS, at some cost. Later you can also think of directly implementing other instructions than just CAS. But start small, with 1 mutex.
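
To make the "piggy-back" remark concrete, here is a hedged sketch of how a 1-byte CAS can be layered on top of a 4-byte CAS. It is not the code from TBB or from the PA-RISC port; `sketch::cas4` and `sketch::cas1` are hypothetical names, the GCC `__sync_val_compare_and_swap` built-in stands in for the platform's 4-byte primitive, and byte order is detected with GCC's predefined `__BYTE_ORDER__` macro (an assumption about the compiler).

```cpp
#include <cstdint>

namespace sketch {

// Assumed 4-byte primitive: returns the value previously stored at *p.
inline std::uint32_t cas4(volatile std::uint32_t* p,
                          std::uint32_t value, std::uint32_t comparand) {
    return __sync_val_compare_and_swap(p, comparand, value);
}

// 1-byte CAS built on cas4: operate on the aligned 32-bit word that
// contains the byte, and retry if a neighbouring byte changes meanwhile.
inline std::uint8_t cas1(volatile std::uint8_t* p,
                         std::uint8_t value, std::uint8_t comparand) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    volatile std::uint32_t* word =
        reinterpret_cast<volatile std::uint32_t*>(addr & ~std::uintptr_t(3));
    unsigned byte_index = static_cast<unsigned>(addr & 3);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    byte_index = 3 - byte_index;   // lowest address holds the most significant byte
#endif
    const unsigned shift = 8 * byte_index;
    const std::uint32_t mask = std::uint32_t(0xFF) << shift;
    for (;;) {
        const std::uint32_t old_word = *word;
        const std::uint8_t old_byte =
            static_cast<std::uint8_t>((old_word & mask) >> shift);
        if (old_byte != comparand) {
            return old_byte;       // comparison failed, nothing stored
        }
        const std::uint32_t new_word =
            (old_word & ~mask) | (std::uint32_t(value) << shift);
        if (cas4(word, new_word, old_word) == old_word) {
            return old_byte;       // stored successfully
        }
        // The word changed under us (possibly only a neighbour): try again.
    }
}

} // namespace sketch
```

The extra masking and the retries caused by unrelated neighbouring bytes are the "some cost" mentioned above, which is why direct support for 1 and 2 bytes is worth considering.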
pvonkaenel
New Contributor III
First of all, thanks for all the useful feedback.
This is a 32-bit platform, so maybe in my first try I can just avoid atomic - it's good to hear it's not used internally. They do have several atomic functions in a file called atomic.h (is this a standard Linux thing?) but they all come down to the single atomic instruction available on the processor. Does this imply that any atomic operation I want to perform is going to be expensive? I vaguely remember from a class in school how "fancy" atomic operations can all be built from tas, so I guess I'll have to implement it that way.

Where can I find the work you did on the port to PA-RISC? I found your reworking of the atomic code in the contributor section of the tbb site, but nothing about PA-RISC. I like the idea of the bank of central mutexes - I'd like to give that a try.
RafSchietekat
Valued Contributor III
"They do have several atomic functions in a file called atomic.h (is this a standard linux thing?) but they all come down to the single atomic instruction available on the processor."
Details?

"Does this imply that any atomic operation I want to perform is going to be expensive? I vaguely remember how "fancy" atomic operations can all be built from tas from a class in school, so I guess I'll have to implement it that way."
Without details...

"Where can I find the work you did on the port to PA-RISC? I found your reworking of the atomic code in the contributor section of the tbb site, but nothing about PA-RISC."
Look here instead (include/tbb/machine/gcc_hppa.h).
pvonkaenel
New Contributor III
Quoting - Raf Schietekat
"They do have several atomic functions in a file called atomic.h (is this a standard linux thing?) but they all come down to the single atomic instruction available on the processor."
Details?

"Does this imply that any atomic operation I want to perform is going to be expensive? I vaguely remember how "fancy" atomic operations can all be built from tas from a class in school, so I guess I'll have to implement it that way."
Without details...

"Where can I find the work you did on the port to PA-RISC? I found your reworking of the atomic code in the contributor section of the tbb site, but nothing about PA-RISC."
Look here instead (include/tbb/machine/gcc_hppa.h).

OK, here goes. The only in-silicon atomic instruction available is tas, but there are several atomic functions available in a header called atomic.h (part of libc). The documentation says that these routines are implemented as fast-path calls to linux emulation routines. I'm not sure what that means. There is a 32-bit compare and exchange which I'm using for the TBB compare and exchange, but I will also need to implement a 64-bit version eventually.

Is atomic.h a standard Linux header, or is this provided as a special service since only 1 atomic instruction exists?
pvonkaenel
New Contributor III
Quoting - Raf Schietekat
"Where can I find the work you did on the port to PA-RISC? I found your reworking of the atomic code in the contributor section of the tbb site, but nothing about PA-RISC."
Look here instead (include/tbb/machine/gcc_hppa.h).

Wow, that's quite a thread! I'm still trying to figure it out, and am also going over your centrallocker code. I hope to have more information on this topic shortly.

RafSchietekat
Valued Contributor III
"The only in-silicon atomic instruction available is tas"
But what does it do? Fetch-and-store? What sizes? Fetch-boolean-and-store-true? Something in-between?

"There is a 32-bit compare and exchange which I'm using for the TBB compare and exchange, but I will also need to implement a 64-bit version eventually."
The better the performance you want, the more work you may have to do yourself. Maybe even the provided emulation routines are suboptimal, and TBB's emulation certainly is if you don't have real CAS. You'd actually be meta-emulating all those other instructions, which can get quite embarrassing with regard to, e.g., fetch-and-store, or locking. :-)

"Is atomic.h a standard linux header or is this provided as a special service since only 1 atomic instruction exists?"
I would guess that it's not standard. g++ documents a number of built-in atomic instructions, maybe you could try those (if they've been implemented here).
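
For reference, a small sketch of what trying the g++ built-ins could look like. The wrapper names `sketch::cmpswp4` and `sketch::fetchadd4` are hypothetical, not TBB's; only `__sync_val_compare_and_swap` is the documented GCC built-in. Whether the Tilera toolchain implements it natively is an open question: where an operation is not supported on the target, GCC falls back to calling an external function, which may then fail to link.

```cpp
#include <cstdint>

namespace sketch {

// 4-byte compare-and-swap via the GCC built-in.
// Returns the value that was stored at ptr before the operation.
inline std::int32_t cmpswp4(volatile void* ptr,
                            std::int32_t value, std::int32_t comparand) {
    return __sync_val_compare_and_swap(
        static_cast<volatile std::int32_t*>(ptr), comparand, value);
}

// Fetch-and-add derived from CAS: the kind of operation that ends up
// emulated on top of CAS when there is no direct implementation.
inline std::int32_t fetchadd4(volatile void* ptr, std::int32_t addend) {
    volatile std::int32_t* p = static_cast<volatile std::int32_t*>(ptr);
    std::int32_t old_value;
    do {
        old_value = *p;
    } while (cmpswp4(p, old_value + addend, old_value) != old_value);
    return old_value;
}

} // namespace sketch
```

If CAS itself is emulated by the platform, a loop like `fetchadd4` is an emulation stacked on an emulation, which is the meta-emulation concern raised above.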

"Wow, that's quite a thread!"
I hope you were scanning from the end. :-) You'll find the latest version on 2009-04-02, but I've summarised the essential points below.

"I'm still trying to figure it out, and am also going over your centrallocker code. I hope to have more information on this topic shortly."
Basically it's just an array of locks, preferably one per cache line, that should get initialised properly, and then you take the address of the atomic, divide by number of bytes per cache line, take the result modulo the number of locks, and that's the one you use. In that file I used 16 bytes each time (the size of a PA-RISC semaphore), but maybe that should be a bigger value (unless a cache line is also 16 bytes, but I don't remember now).
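
The lock selection just described could be sketched roughly as follows. This is not the code from gcc_hppa.h; the names, the 64-byte cache line, the bank size of 64 and the use of `std::mutex` are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>

namespace sketch {

constexpr std::size_t cache_line_bytes = 64;  // assumed cache line size
constexpr std::size_t lock_count = 64;        // assumed number of locks in the bank

// One lock per cache line, so the locks themselves do not share lines.
struct alignas(cache_line_bytes) padded_mutex {
    std::mutex m;
};

inline padded_mutex* lock_bank() {
    static padded_mutex bank[lock_count];     // initialised on first use
    return bank;
}

// Address of the atomic -> cache line number -> index into the bank.
inline std::size_t lock_index(const volatile void* addr) {
    return (reinterpret_cast<std::uintptr_t>(addr) / cache_line_bytes) % lock_count;
}

// CAS emulated under the lock that "owns" the atomic's cache line.
template <typename T>
T banked_cas(volatile T* ptr, T value, T comparand) {
    std::lock_guard<std::mutex> guard(lock_bank()[lock_index(ptr)].m);
    T observed = *ptr;
    if (observed == comparand) {
        *ptr = value;
    }
    return observed;
}

} // namespace sketch
```

Atomics in the same cache line always map to the same lock, while atomics in different lines usually map to different ones, which is where the reduction in contention compared with a single central mutex comes from.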
pvonkaenel
New Contributor III
Quoting - Raf Schietekat
"The only in-silicon atomic instruction available is tas"
But what does it do? Fetch-and-store? What sizes? Fetch-boolean-and-store-true? Something in-between?

It loads a 32-bit word from memory into the destination register and atomically writes the value 1 into the memory address.
RafSchietekat
Valued Contributor III
Quoting - pvonkaenel
It loads a 32-bit word from memory into the destination register and atomically writes the value 1 into the memory address.
That's weird... either 0 (PA-RISC) or -1 would seem easier to understand. Maybe it trades ease of use for ease of initialisation?

Anyway, this means you might want to try the idea with the scattered locks, using either inline assembler or just the provided lock function(s?), and compare it with the provided atomic emulation.
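
A lock built directly on such a tas might look like the sketch below. The real instruction would be reached through inline assembler or the vendor's routines; here the GCC built-in `__sync_lock_test_and_set` with the constant 1 stands in for it (GCC documents that some targets only support storing 1 through this built-in, which matches the behaviour described). The names `sketch::tas` and `sketch::tas_lock` are hypothetical.

```cpp
#include <cstdint>

namespace sketch {

// Stand-in for the hardware tas as described in this thread:
// return the old 32-bit word and atomically store 1.
inline std::uint32_t tas(volatile std::uint32_t* word) {
    return __sync_lock_test_and_set(word, 1u);
}

// Spin lock over one aligned 32-bit word: 0 means free, 1 means held,
// so zero-initialised memory starts out unlocked.
class tas_lock {
    volatile std::uint32_t state_;
public:
    tas_lock() : state_(0) {}

    bool try_lock() {
        return tas(&state_) == 0;   // we own the lock only if it was free
    }

    void lock() {
        while (!try_lock()) {
            // Spin on plain reads until the lock looks free,
            // then retry the (more expensive) tas.
            while (state_ != 0) {}
        }
    }

    void unlock() {
        __sync_lock_release(&state_);   // stores 0 with release semantics
    }
};

} // namespace sketch
```

An array of such locks could then serve as the scattered lock bank, and timing it against the provided atomic emulation would show which of the two is worth keeping.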

In this situation, it also makes sense to look into changing a small TBB lock from 1 byte to 4 bytes (aligned), and monitor what impact that may have on memory usage, although I'd be willing to pay quite a lot to avoid having to jump through all those hoops for using single-byte locks on this platform (on PA-RISC, with 16-byte semaphore alignment, the cost would probably be prohibitive, so I didn't try there).