Software Archive
Read-only legacy content

Fastest possible barrier

Florian_R_
Beginner
1,054 Views

For the last few weeks I have been experimenting with barriers on the Xeon Phi. Right now it seems to me that the barrier already included in OpenMP yields the best performance. I can beat its cycles/barrier by implementing a tournament barrier (judging purely from the scaling behavior, I suspect the OMP barrier in the icc implementation uses a tournament barrier), but I cannot beat it by much.

My issue is that the performance of barriers on the MIC seems unsatisfactory. The barrier cost scales with the number of threads (logically enough), but a rough figure for the thread count we usually use is 10k cycles per barrier. This is too much.

I also implemented a centralized barrier, which naturally performs better with a low number of threads (say, fewer than 16). The big problem with the centralized barrier, however, is that its scaling behavior is much worse. The problem arises in the last stage of the barrier: the release of the spin-lock, on which all threads but the last to arrive are waiting, takes too long to propagate to all the threads.

To boil the whole post down to two questions:

1. Does anyone know a better (faster, more scalable) way to construct barriers on the Xeon Phi? (For my centralized barrier, and for synchronization in general, I use the lock cmpxchg instruction.)

2. Are there any other assembly instructions or (globally accessible) registers that might help with barriers (probably undocumented, since I did not find any)?

Thanks a lot!

Florian

11 Replies

James_C_Intel2
Employee

First, a philosophical question: How are you measuring barrier performance?

If you’re measuring the rate at which you can issue barriers, something like this

[cpp]
tnow = time();
for (int i=0; i<10000; i++)
    barrier();
tend = time();
[/cpp]

I don’t think that is the best way, since

  1. It’s measuring the case when all the threads are already temporally aligned
  2. It doesn’t measure the real barrier overhead

 

My preference is strongly for measuring the time from the last arrival to the last departure. (“Last In – Last Out” or LILO). That does measure the real barrier overhead, and should be done with some jitter introduced into the thread arrival times, to reflect real barrier usage.

 

To implement a fast barrier on KNC you need to exploit the memory hierarchy. So use a simple barrier for the (up to) four threads on each core that are sharing the L1 cache. Then use a different barrier between the cores. If you are really only interested in a barrier (not a barrier with reduction, or a fork/join), then a dissemination barrier may be a good choice. However if you need reduction, or are really interested in fork/join, then a tree barrier is likely to be better.

 

  2. Are there any (probably not documented, since I did not find any) other assembly instructions or (globally accessible) registers that might help with the barriers topic?

There are no secret instructions or barrier hardware (or, if there are, they’re so secret no-one told me about them either :-)). However, the NGO stores allow you to have many more outstanding stores than you otherwise could, which makes them useful for broadcasting, so they can be used effectively in the release phase of a (fairly flat) tree barrier.

Florian_R_
Beginner

I am already measuring it as you suggest, so the measurement is not my problem. The problem is the performance of the barrier (as stated). There is a rumor about an undocumented "lightweight" barrier that is significantly faster than the barrier used in OpenMP.

Speaking of using 4 threads per core: we want to use at most 3 threads per core, since this gives us the best performance. Since a thread can only issue every other cycle we need at least 2 threads, but 4 threads are already too many for our needs. An important remark: I experimented with various affinity models (set by hand, or using scatter, balanced, or compact) and in my tests it did not seem to matter much. This also matches tests from other people in our group.

Thanks for your answer, I will try NGO stores then!

 

James_C_Intel2
Employee

Even with three threads/core you will still gain by using a simple on-core barrier, then a different barrier between the cores. (Of course, this assumes that your threads are affinitized to at least the core level.)

If you do use NGO stores, don't forget that you need a logical mfence ("lock; addl $0,(%rsp)") after them to ensure that no other memory accesses in this thread climb above them...

Evgueni_P_Intel
Employee

Florian,

As to your question 1, synchronization using lock cmpxchg is expected to be slow -- you may want to search the Web for so-called lock-free synchronization algorithms. With those, the fastest tree barriers (mentioned by James) spend ~2000 cycles LILO on 61 cores.

As to your question 2, the answer is "no, there are none."

Thanks,
Evgueni.

Florian_R_
Beginner

Hi Evgueni,

thanks for your reply. I agree with you that synchronization is always slow; however, centralized barriers require synchronization. I also (as mentioned) implemented a tournament barrier - such a barrier does not require any sync. A tree barrier, however, requires sync in the form of a locked increment. Moreover, I think the Phi is missing a lock addl instruction, so how to achieve this? Atomic variables are also not available at the hardware level, so there will always be synchronization, at least for reading them out (which has to happen at some point).

Can you post some code for an implementation of such a barrier that achieves 2000 cycles on 61 cores? Right now I cannot believe this until I see such an implementation.

Thanks for your help!

Florian

James_C_Intel2
Employee

"a tree barrier requires a sync. in form of a locked increment." Sorry, but no, it doesn't.

In the check-in phase, give each "slave" a separate "I'm here" flag (ideally each in a different cache line), and have the master loop over them until they are all set. There is no need for any atomic operation. Then in the check-out phase have the slaves spin waiting for their unique flag to be set (this is where using NGO stores in the master helps).

"I think that the Phi is missing a lock addl instruction so how to archieve this?"  Why do you think it is missing that instruction? It is not. MIC has all the normal X86 locked operations, which certainly include lock; addl.

Evgueni_P_Intel
Employee
Please try something like the code below -- it achieves ~2000 LILO on 61 cores after some tuning. It assumes that mem = malloc(team*64).

[cpp]
static void barrier(void *mem, long long int me, long long int team) {
    if (team == 1) return;
    volatile long long int *counter = mem;
    long long int my_count = counter[me*8], next = 1;
    while (next < team-me) {
        if ((me & next) != 0) break;
        while (counter[(me+next)*8] <= my_count) _mm_pause();
        next *= 2;
    }
    counter[me*8] += 1; // I am done
    while (counter[0] <= my_count) _mm_pause();
}
[/cpp]
Florian_R_
Beginner

Hi,

first of all, thanks for the code. Unfortunately it does not work. _mm_pause() seems not to be supported on the MIC: "error #13393: Opcode unsupported on target architecture: pause". Compiling for the Xeon works fine (I compile for the MIC with -mmic; I am using the latest icc version). If I just remove the pause and use plain busy waiting I get a deadlock. Turning off optimizations does not change anything. So are you sure this worked on the MIC?

Regarding the sync. lock: if you set an "I'm here" flag (or whatever you want to call it) then this is (to my knowledge) a tournament barrier. A tree barrier does not define a winner or loser and hence has to be synced in those areas. Of course this is all just arguing about words, but as I already stated: I have already implemented a tournament barrier - it works and does not require any sync. operations, but its performance is not much better than the OpenMP barrier.

Coming back to the lock-prefixed instructions: I went over the instruction handbook of the Xeon Phi and did not see such an instruction. Trying it out also yields an "Exit reason 4 - Illegal instruction" error. On the Xeon it works as expected.

Evgueni_P_Intel
Employee

Yes, it is safe to remove _mm_pause().
Note that mem needs to be zeroed before the first use.

James_C_Intel2
Employee

I understand the difference between a tournament and a tree barrier. However, you are confusing an implementation detail of your preferred tree barrier (how the master determines that all its children have arrived, which you are assuming has to be done with an atomic increment of a single location) with the higher-level properties that make it a tree barrier (which threads communicate with which other threads in each phase). My assertion is that I can implement a tree barrier without atomics, which is patently true (since I have an existence proof :-)). The trick is to ensure that there is only a single writer for any memory location at any point in time; then no atomics are required. So, instead of the master (at each level) running code like this

[cpp]

void waitForChildren(int nChildren, int volatile * syncCounter) { while (nChildren != *syncCounter); }

[/cpp]

you give each child its own location, and have the master wait for them like this

[cpp]

void waitForChildren(int nChildren, int volatile * syncCounter) {
    int present;
    do {
        present = 0;
        for (int i=0; i<nChildren; i++)
            present += (syncCounter[i] != 0);
    } while (present != nChildren);
}
[/cpp]

Now there is no need for any atomic updates, but this is still definitively a component of a tree barrier.

I have no idea what you are doing wrong to fail to execute a lock; addl. If you look at the manual, in the tables on pp. 657-658, you will see that both lock and add are supported. (And I have used this instruction!)

Florian_R_
Beginner

Alright. So I tried around with different implementations this week.

First of all: you are right. I do not know why the lock addl (or lock incl, etc.) did not work before on the Phi, but I re-checked and it works now. So thanks for pointing that out.

Second: I do not want to argue about words any more - the sync-or-not topic is not interesting for me. What is interesting is how to achieve 2000 cycles LILO on 61 cores (1 thread per core, since multiple threads on one core can be handled much more efficiently by a centralized barrier).

I tried the code above and managed to get it to work; however, it is (at best) at the OpenMP level, so my own (tournament) barrier is already better. The key question is: how to get it that fast? Obviously (just from the fact that the code contains _mm_pause calls) the code was not written for the Phi - was it?

I would really appreciate a little bit more information since barriers are quite important for us.

Thanks! 
