<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Can you attach your test in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</link>
    <description>&lt;P&gt;Can you attach your test program? (include any environment variables relating to KMP_... and OMP_.)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sat, 29 Mar 2014 13:13:07 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2014-03-29T13:13:07Z</dc:date>
    <item>
      <title>Achieving peak on Xeon Phi</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961651#M21941</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am on a Core i7 quad-core machine with an ASUS P9X79 WS motherboard and a Xeon Phi 3120A card installed.&lt;/P&gt;

&lt;P&gt;The operating system is RHEL 6.4, with MPSS 3.1 for the Phi and Parallel Studio 2013 SP1 installed.&lt;/P&gt;

&lt;P&gt;Just for detail, the Phi card has 57 cores, with a double-precision peak of about 1003 GFlops.&lt;/P&gt;

&lt;P&gt;I am seeing some performance issues that I don't understand.&lt;/P&gt;

&lt;P&gt;When I time MKL's parallel DGEMM on the Phi card, it gets about 300 GFlops, which is roughly 30% of peak.&lt;/P&gt;

&lt;P&gt;Note that I am doing native execution.&lt;/P&gt;

&lt;P&gt;Now this performance is not matching what is posted here &lt;A href="http://software.intel.com/en-us/intel-mkl/" target="_blank"&gt;http://software.intel.com/en-us/intel-mkl/&lt;/A&gt; (achieving about 80% of peak).&lt;/P&gt;

&lt;P&gt;So, my first question is: is this difference solely because I am using a low-end Phi card, so there are limitations?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;After seeing this, I wrote a test program that tries to achieve peak with assembly language.&lt;/P&gt;

&lt;P&gt;The function is simple. It runs a loop that iterates 25,000,000 times, and in each iteration I do 30 independent FMA instructions, unrolled 8 times. So the total flop count in each iteration is 30 x 8 x 2 x 8 = 3840 (30 FMAs, 8-way unroll, 2 flops per FMA, 8 double-precision lanes). Note that this means I am doing 25,000,000 x 3840 floating point operations without accessing any memory.&lt;/P&gt;

&lt;P&gt;Now if I run this code serially, I get 8.74 Gflops which is basically the serial peak (8.8 Gflops).&lt;/P&gt;

&lt;P&gt;If I run this code in parallel with 2 threads on 1 core, I get 17.4 Gflops which is basically the peak for 1 core (17.6 Gflops).&lt;/P&gt;

&lt;P&gt;Now the problem is, if I run the same code in parallel with 2 threads per core on 56 cores (112 threads), I only get 89% of peak.&lt;/P&gt;

&lt;P&gt;But if I run it with 4 threads per core i.e. a total of 224 threads, I get 99% of peak, which is what I expect.&lt;/P&gt;

&lt;P&gt;So, my second question is, even when I have no memory access at all, why did I need 4 threads to achieve peak?&lt;/P&gt;

&lt;P&gt;Is there any other latency that we don't know about that gets hidden by 4 threads per core?&lt;/P&gt;

&lt;P&gt;Can someone please clarify?&lt;/P&gt;

&lt;P&gt;Sorry for the long post, and thank you for reading.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2014 22:15:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961651#M21941</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-26T22:15:55Z</dc:date>
    </item>
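The flop counting and peak figures quoted in the opening post can be sanity-checked with a little arithmetic. A minimal sketch, assuming the 3120A's published 1.1 GHz clock, 8 double-precision lanes per 512-bit vector, and an FMA counting as 2 flops:

```python
# Sanity check of the numbers quoted above (assumptions: 1.1 GHz clock,
# 8 DP lanes per 512-bit vector, FMA = 2 flops, 57 cores on a 3120A).
flops_per_iter = 30 * 8 * 2 * 8       # 30 FMAs x 8-way unroll x 2 flops x 8 lanes
total_flops = 25_000_000 * flops_per_iter

ghz = 1.1
core_peak = ghz * 8 * 2               # one full-width FMA per cycle: 17.6 GFlops/core
thread_peak = core_peak / 2           # one thread issues every other cycle: 8.8 GFlops
card_peak = 57 * core_peak            # about 1003 GFlops double precision

print(flops_per_iter, total_flops, thread_peak, card_peak)
```

These reproduce the 3840 flops per iteration, the 8.8/17.6 GFlops serial and per-core peaks, and the roughly 1003 GFlops card peak stated in the post.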
    <item>
      <title>What are the matrix sizes you</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961652#M21942</link>
      <description>&lt;P&gt;What are the matrix sizes you used in&amp;nbsp;&lt;SPAN style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 18px;"&gt;DGEMM? MKL in PHi gets modest performance for small and skinny matrices.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 18px;"&gt;Regarding your second question, I think Phi needs to run 4 threads per core to compensate the in-order instruction execution fashion.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 01:22:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961652#M21942</guid>
      <dc:creator>JS1</dc:creator>
      <dc:date>2014-03-27T01:22:00Z</dc:date>
    </item>
    <item>
      <title>Thank you for replying.</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961653#M21943</link>
      <description>&lt;P&gt;Thank you for replying.&lt;/P&gt;

&lt;P&gt;Sorry, I forgot to mention that. I ran a square DGEMM with N = 10,000.&lt;/P&gt;

&lt;P&gt;For the second question, I think they mention that you need at least 2 threads per core because 1 thread can issue an instruction every other cycle.&lt;/P&gt;

&lt;P&gt;Since I have 30 consecutive FMA instructions that are independent, shouldn't it be enough to saturate the FPU pipe?&lt;/P&gt;

&lt;P&gt;And since we have enough independent instructions, in-order execution shouldn't be an issue. Should it?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 01:38:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961653#M21943</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-27T01:38:45Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;Since I have 30 consecutive</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961654#M21944</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;Since I have 30 consecutive FMA instructions that are independent, shouldn't it be enough to saturate the FPU pipe?&lt;/P&gt;

&lt;P&gt;Yes, but you also have the store (to register) portion of time for the intrinsic function. With two threads the FPU might not be fully utilized. What does 3 threads per core yield?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 12:39:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961654#M21944</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-27T12:39:29Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961655#M21945</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Please note that I am not using intrinsics. I wrote a function purely in assembly. I am calling this function from a C code where I am timing it.&lt;/P&gt;

&lt;P&gt;As for the pipeline, I verified that FMA instructions have a latency of 4 cycles.&lt;/P&gt;

&lt;P&gt;So each thread only needs 2 independent FMA instructions in flight; with 2 threads that makes 4 independent instructions, enough to fully utilize the FPU.&lt;/P&gt;

&lt;P&gt;Also note that, if I weren't fully utilizing the FPU pipe, I wouldn't get the peak for 2 threads with just 1 core.&lt;/P&gt;

&lt;P&gt;My original question is: why can we get peak with 2 threads when using 1 core, but not when using all 56 cores?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;As for your question, right now with the same code I am getting 65% of peak with 2 threads/core, 84% with 3 threads/core, and 99% with 4 threads/core.&lt;/P&gt;

&lt;P&gt;This is my other concern: sometimes I see inconsistent performance on the Phi.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 21:35:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961655#M21945</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-27T21:35:00Z</dc:date>
    </item>
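The latency-hiding argument in this exchange can be made concrete with a small calculation. A sketch, taking the 4-cycle FMA latency and the issue-every-other-cycle rule from the posts above as given:

```python
# How many independent FMAs keep one core's FPU pipe full, given the
# figures discussed above (4-cycle FMA latency, issue every other cycle).
fma_latency_cycles = 4        # cycles before an FMA result can be reused
issue_interval = 2            # one thread can issue only every other cycle

# In-flight FMAs one thread must sustain to cover the latency:
per_thread_in_flight = fma_latency_cycles // issue_interval   # 2

# With 2 threads per core, 2 x 2 = 4 independent FMAs are in flight,
# exactly covering the 4-cycle latency:
threads_per_core = 2
assert threads_per_core * per_thread_in_flight == fma_latency_cycles
```

So on paper 2 threads per core with 2 independent FMAs each should suffice, which is exactly why the 89% result on 56 cores is surprising.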
    <item>
      <title>Hi Rakib,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961656#M21946</link>
      <description>&lt;P&gt;Hi Rakib,&lt;/P&gt;

&lt;P&gt;You said that you were getting 100% of peak on 1 core using 2 threads, but that you drop to 89% when running 2 threads on each of 56 cores -- can you run a scaling test to see how many cores you have to use before you start seeing this problem?&lt;/P&gt;

&lt;P&gt;In the 112 thread case, you might be seeing some interference from the operating system.&amp;nbsp; I have seen the OS run stuff on cores where I am trying to do compute work, even when I leave other cores open.&amp;nbsp;&amp;nbsp; It is possible that using all four threads on a core makes it less likely that the OS will schedule on that core -- all the logical processors are busy.&amp;nbsp; But if you only run two threads on a core the OS might think "hey, there are two logical processors here that I can use to run this service process!".&lt;/P&gt;

&lt;P&gt;To track this down, I would time each thread independently and see if the slow-down comes from a small number of cores.&amp;nbsp; VTune might be able to identify instances where cores run something other than your executable.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2014 22:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961656#M21946</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-03-27T22:12:00Z</dc:date>
    </item>
    <item>
      <title>Hi John,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961657#M21947</link>
      <description>&lt;P&gt;Hi John,&lt;/P&gt;

&lt;P&gt;Thank you for your reply. That's exactly what I tried next: a scaling test. I didn't put the results here because they are not consistent. But here they are:&lt;/P&gt;

&lt;P&gt;With 2 threads per core, the following is the performance as I keep increasing the number of cores. Ellipses are used where the values are not changing much.&lt;/P&gt;

&lt;P&gt;Number of Cores&amp;nbsp;&amp;nbsp; &amp;nbsp;Peak Performance&lt;BR /&gt;
	1&amp;nbsp;&amp;nbsp; &amp;nbsp;99.38%&lt;BR /&gt;
	2&amp;nbsp;&amp;nbsp; &amp;nbsp;99.37%&lt;BR /&gt;
	3&amp;nbsp;&amp;nbsp; &amp;nbsp;99.35%&lt;BR /&gt;
	4&amp;nbsp;&amp;nbsp; &amp;nbsp;99.34%&lt;BR /&gt;
	5&amp;nbsp;&amp;nbsp; &amp;nbsp;99.31%&lt;BR /&gt;
	6&amp;nbsp;&amp;nbsp; &amp;nbsp;66.23%&lt;BR /&gt;
	...&lt;BR /&gt;
	14&amp;nbsp;&amp;nbsp; &amp;nbsp;66.17%&lt;BR /&gt;
	15&amp;nbsp;&amp;nbsp; &amp;nbsp;82.41%&lt;BR /&gt;
	16&amp;nbsp;&amp;nbsp; &amp;nbsp;81.92%&lt;BR /&gt;
	17&amp;nbsp;&amp;nbsp; &amp;nbsp;81.49%&lt;BR /&gt;
	18&amp;nbsp;&amp;nbsp; &amp;nbsp;99.13%&lt;BR /&gt;
	19&amp;nbsp;&amp;nbsp; &amp;nbsp;81.08%&lt;BR /&gt;
	20&amp;nbsp;&amp;nbsp; &amp;nbsp;80.67%&lt;BR /&gt;
	21&amp;nbsp;&amp;nbsp; &amp;nbsp;91.08%&lt;BR /&gt;
	22&amp;nbsp;&amp;nbsp; &amp;nbsp;80.51%&lt;BR /&gt;
	...&lt;BR /&gt;
	26&amp;nbsp;&amp;nbsp; &amp;nbsp;78.45%&lt;BR /&gt;
	27&amp;nbsp;&amp;nbsp; &amp;nbsp;98.98%&lt;BR /&gt;
	28&amp;nbsp;&amp;nbsp; &amp;nbsp;98.96%&lt;BR /&gt;
	29&amp;nbsp;&amp;nbsp; &amp;nbsp;98.92%&lt;BR /&gt;
	30&amp;nbsp;&amp;nbsp; &amp;nbsp;81.10%&lt;BR /&gt;
	...&lt;BR /&gt;
	55&amp;nbsp;&amp;nbsp; &amp;nbsp;82.54%&lt;BR /&gt;
	56&amp;nbsp;&amp;nbsp; &amp;nbsp;65.86%&lt;/P&gt;

&lt;P&gt;This is not repeatable. It varies randomly. But using 56 threads always gets poor performance.&lt;/P&gt;

&lt;P&gt;I will also put the per-thread timing in the next post.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Mar 2014 21:41:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961657#M21947</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-28T21:41:15Z</dc:date>
    </item>
    <item>
      <title>Can you attach your test</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</link>
      <description>&lt;P&gt;Can you attach your test program? (include any environment variables relating to KMP_... and OMP_.)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 29 Mar 2014 13:13:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961658#M21948</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-29T13:13:07Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt; It is possible that using</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961659#M21949</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;It is possible that using all four threads on a core makes it less likely that the OS will schedule on that core -- all the logical processors are busy.&lt;/P&gt;

&lt;P&gt;A test of this hypothesis would be to schedule all four threads per core, and then run four tests:&lt;/P&gt;

&lt;P&gt;1 thread per core FMA, 3 threads per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	2&amp;nbsp;threads per core FMA,&amp;nbsp;2 threads per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	3&amp;nbsp;threads per core FMA,&amp;nbsp;1 thread per core loop of _mm_delay32(1000000)&lt;BR /&gt;
	4&amp;nbsp;threads per core FMA,&amp;nbsp;0 threads per core loop of _mm_delay32(1000000)&lt;/P&gt;

&lt;P&gt;The O/S will view the delaying threads as compute bound. The delay loop will terminate upon seeing any of the compute threads complete (e.g. setting volatile flag).&lt;/P&gt;

&lt;P&gt;The permutations could be made in one run of the application, whereby an outer control loop permutes flags, one per logical processor, each flag indicating whether that thread runs the FMA kernel or the delay loop.&lt;/P&gt;

&lt;P&gt;Note, the _mm_delay32 will reduce power consumption for the core and, more importantly, nearly eliminate that thread's scheduling time slot within its core.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 29 Mar 2014 16:30:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961659#M21949</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-29T16:30:01Z</dc:date>
    </item>
    <item>
      <title>Hi Jim,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961660#M21950</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;Please note that I am not using OpenMP. I am using pthreads and setting the affinity myself.&lt;/P&gt;

&lt;P&gt;Nice suggestion for the tests. I was trying to spawn the dummy threads and then just wait on a barrier. It didn't help.&lt;/P&gt;

&lt;P&gt;But with your suggestion of busy-waiting, it actually helps. I was able to achieve 98% of peak with 2 active threads and 2 dummy threads per core.&lt;/P&gt;

&lt;P&gt;So, to summarize the answer to my second question: even if 2 threads/core is best for an application in terms of cache, you still need to spawn 4 threads/core, with 2 on each core busy-waiting.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;For the first question, are you seeing similar performance (30% of peak) from MKL DGEMM on Phi?&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2014 19:07:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961660#M21950</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-03-31T19:07:44Z</dc:date>
    </item>
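The "2 compute threads + 2 busy-waiting placeholder threads per core" scheme that worked here can be sketched in Python for brevity. This is Linux-only; `core_cpus` and its simple core-to-CPU numbering are hypothetical, since on a real MIC card the logical-CPU layout (where CPU 0 traditionally belongs to the last core) should be checked against the OS:

```python
import os
import threading

def core_cpus(core):
    # Hypothetical mapping: 4 consecutive logical CPUs per physical core.
    return [core * 4 + t for t in range(4)]

done = threading.Event()

def compute(cpu):
    os.sched_setaffinity(0, {cpu})   # pin the calling thread (Linux)
    # ... run the FMA kernel here ...
    done.set()                       # signal placeholders to exit

def placeholder(cpu):
    os.sched_setaffinity(0, {cpu})
    while not done.is_set():         # busy-wait: the OS sees this CPU as busy
        pass

def spawn_for_core(core):
    # Per core: first 2 CPUs compute, last 2 run the busy-wait placeholder.
    cpus = core_cpus(core)
    threads = [threading.Thread(target=compute, args=(c,)) for c in cpus[:2]]
    threads += [threading.Thread(target=placeholder, args=(c,)) for c in cpus[2:]]
    return threads
```

In a real run the kernel would stay in C/pthreads as in the original test; Python's GIL makes it unsuitable for the compute itself, so this sketch only shows the pinning and role-splitting logic.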
    <item>
      <title>I do not typically use MKL in</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961661#M21951</link>
      <description>&lt;P&gt;I do not typically use MKL in my applications. What I do know is:&lt;/P&gt;

&lt;P&gt;When your program is single-threaded, use (link)&amp;nbsp;the multi-threaded MKL&lt;BR /&gt;
	When your program is multi-threaded, use (link) the single-threaded MKL&lt;/P&gt;

&lt;P&gt;Running a multi-threaded program with the multi-threaded MKL oversubscribes threads (each has its own thread pool).&lt;BR /&gt;
	If you do need this combination, e.g. MKL only called from outside the parallel regions of a multi-threaded application, then you may find you need to set KMP_BLOCKTIME=0.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2014 12:20:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961661#M21951</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-01T12:20:17Z</dc:date>
    </item>
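The linking advice above is usually paired with runtime controls. A hedged sketch of the two environment variables Jim mentions, which MKL and the Intel OpenMP runtime read before the first threaded call:

```python
import os

# Multi-threaded caller: restrict MKL to one thread per call site so each
# application thread does not spin up its own MKL worker pool.
os.environ["MKL_NUM_THREADS"] = "1"

# Threaded MKL called only outside your own parallel regions: make the
# OpenMP workers sleep immediately after each call instead of spin-waiting.
os.environ["KMP_BLOCKTIME"] = "0"
```

Setting these in a launcher script (or before the first MKL call in-process) avoids the oversubscription described in the post.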
    <item>
      <title>Hi Jim,</title>
      <link>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961662#M21952</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;Yes. I am using a single-threaded program when timing MKL's parallel DGEMM performance.&lt;/P&gt;

&lt;P&gt;I am just curious about the mismatch with Intel's reported performance.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2014 14:08:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Achieving-peak-on-Xeon-Phi/m-p/961662#M21952</guid>
      <dc:creator>Rakib</dc:creator>
      <dc:date>2014-04-01T14:08:17Z</dc:date>
    </item>
  </channel>
</rss>

