The CHANGES file mentions fixes for ARM in TBB 4.2 Update 2, but has this been thoroughly tested?
Methinks there should be an ISB instruction in __TBB_control_consistency_helper(), which now has only a compiler fence. Without this instruction, a control dependency is probably not sufficient to order a load after the load on which the conditional branch depends.
I have not tested this myself, but I have come across a similar situation with Power Architecture, and the helper was introduced there to issue an ISYNC instruction that made a decided difference.
Could somebody verify or explain?
According to changes Initial arm port was released in Intel TBB 4.1 Update 3. And in Intel TBB 4.2 Update 2 there are fixes for that port. So I assume that guys checked it on their side. I'll send e-mail to the contributor with the link to this thread.
OpenCV library has used tbb on arm for a few years on android (gcc generic) and the latest news is Windows RT in July.
It's easy enough to find multiple sources stating that a control dependency does not order two loads, so even if it happens to work on the devices that were used for testing, perhaps it could still fail on others with more adventurous reordering.
Thanks for taking a look at the ARM code and for bringing this point up.
Having had a think, yes I agree with you that the current logic for __TBB_control_consistency_helper() is insufficient for ARM (as memory ordering is not necessarily dictated by control dependencies). We will require a dmb ish (an isb would typically be used when we affect the actual instruction stream). So the following code change should take place:
The patches for ARM support for TBB had been tested, but as you alluded to before, errant behaviour just didn't turn up during our testing.
I have submitted a patch to add a barrier to __TBB_control_consistency_helper.
Why use a cannon to shoot a mosquito? Of course ISB is required for self-modifying code, but that's not its only use. Just have a look here, at "4.3 Control-isb/isync Dependencies" on page 15.
Also see mac_ppc.h for load with acquire: it looks weird, but in heavily reused code weird and cheap is better than simple and costly, and a quick search shows that this is possible and considered appropriate for ARM as well (I would think all the more so since ARM lacks the equivalent of LWSYNC). This might not matter much on a simple device (I don't really know), but let's not make that an assumption.
I'm also not a big fan of relying on the compiler to issue loads and stores. Maybe I'm just paranoid, but it's not that much trouble to issue the proper assembler instructions for all operations, even if for nothing more than peace of mind. The paramount example is 64-bit loads and stores in a 32-bit environment: if I remember correctly, by default TBB delegates to CAS there, not the most economical of implementations...
That brings me to another problem: I don't see the use for these 64-bit operations without also having __TBB_64BIT_ATOMICS defined?
I may have access to this kind of hardware soon, and I'll certainly come back to this if and when I do.