
Peter_B_9 · ‎09-27-2013

According to http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf the Phi 5110P is capable of a theoretical double precision performance of 1011 GFlop/s, and can achieve a practical 833 GFlop/s with DGEMM. The slides indicate that this was measured with 7680x7680 matrices.

Using the methodology described at http://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function I've attempted to duplicate these results using MKL 2013_sp1.0.080. My standalone (i.e. not off-load) test program is unable to achieve more than 527 GFlop/s, approximately half of the theoretical maximum and only 63% of what Intel advertises.
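For reference, the arithmetic behind these figures can be sketched as follows. A DGEMM on m×k and k×n matrices is conventionally counted as 2·m·n·k floating-point operations, so GFlop/s follows directly from the elapsed time. The hardware figures used for the peak (60 cores, 1.053 GHz, 8 double-precision lanes per vector unit, 2 flops per fused multiply-add) are assumptions about the 5110P that are not stated in this thread, and all function and variable names are illustrative only.

```python
def dgemm_gflops(m, n, k, seconds):
    """Throughput of one C = alpha*A*B + beta*C call, counted as 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9

# Assumed 5110P figures (not taken from the thread).
CORES = 60          # physical cores
CLOCK_GHZ = 1.053   # core clock
DP_LANES = 8        # doubles per 512-bit vector
FLOPS_PER_FMA = 2   # one multiply plus one add

peak_gflops = CORES * CLOCK_GHZ * DP_LANES * FLOPS_PER_FMA

measured = 527.0    # GFlop/s reported in the post
advertised = 833.0  # GFlop/s from the performance brief

print(f"theoretical peak      : {peak_gflops:.0f} GFlop/s")
print(f"measured / peak       : {measured / peak_gflops:.0%}")
print(f"measured / advertised : {measured / advertised:.0%}")

# Time one 7680x7680 DGEMM would need at the advertised rate.
n = 7680
seconds_needed = 2.0 * n**3 / (advertised * 1e9)
print(f"time per call at 833 GFlop/s: {seconds_needed:.2f} s")
```

Under these assumptions the peak comes out at the 1011 GFlop/s quoted above, and a single 7680x7680 call at the advertised rate takes a little over one second, so timing loops should average several calls.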

I've tried using huge pages, but that reduced throughput by about 1%.

What can I do to achieve better throughput with DGEMM? Does Intel have a sample program which demonstrates the claimed 833 GFlop/s?

I've attached the program I used to measure the performance.

(I would have posted this on the Premier Support forum, but our support account is still not working!)

Evgueni_P_Intel · ‎09-27-2013

Dear Peter B.

Please post the thread affinity settings that you use.

This information will help us to answer your question.

Thank you.

Evgueni.

Roman_D_Intel1 · ‎09-27-2013

Hello Peter,

I second Evgueni. Could you please look at the KB article http://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor ? One more thing that may be important is padding the leading dimension of matrix C so that LDC * sizeof(element of C) is not a multiple of 4096. Using huge pages should not be important in the latest releases of MPSS, since transparent huge pages are enabled in the kernel by default.
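The padding rule can be sketched as a small helper. This is only an illustration of the condition Roman describes, not the code he later attached: the function name is hypothetical, and padding in steps of 8 elements (64 bytes for doubles) is an assumption chosen to keep rows cache-line aligned.

```python
def padded_leading_dim(n, elem_bytes=8, stride_to_avoid=4096, pad_elems=8):
    """Smallest ld >= n (in steps of pad_elems) with ld*elem_bytes not a multiple of stride_to_avoid."""
    if (pad_elems * elem_bytes) % stride_to_avoid == 0:
        # Padding by a multiple of the stride could never leave the bad alignment.
        raise ValueError("pad_elems * elem_bytes must not be a multiple of stride_to_avoid")
    ld = n
    while (ld * elem_bytes) % stride_to_avoid == 0:
        ld += pad_elems
    return ld

for n in (7680, 1000):
    ld = padded_leading_dim(n)
    print(f"n = {n}: leading dimension {ld} ({ld * 8} bytes per column)")
```

For the 7680x7680 double-precision case, 7680 * 8 = 61440 bytes is exactly 15 * 4096, so the benchmark size falls squarely into the case being warned about. The matrix would then be allocated with the padded leading dimension, and that value passed as LDC to DGEMM.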

Bernard · ‎09-28-2013

Do you have at least two threads per core, and how many threads were created?

Peter_B_9 · ‎09-30-2013

Thanks for the feedback. As you can see from the source code, I use the default affinity settings and test at 15, 30, 60, 120 and 240 threads.

Changing KMP_AFFINITY to balanced significantly improved performance. At 120 threads, throughput increased from 523 GFlop/s to 645 GFlop/s. At 240 threads it increased from 314 GFlop/s to 825 GFlop/s.
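To illustrate why placement matters at 120 threads, here is a simplified model of how a balanced policy spreads threads over cores, with a compact policy shown only for contrast. This is not the OpenMP runtime's actual algorithm. It assumes 60 cores with 4 hardware threads each, and the thread does not say what placement the default settings produced.

```python
def balanced_placement(num_threads, num_cores):
    """Spread threads evenly over all cores, keeping consecutive thread IDs on the same core."""
    base, extra = divmod(num_threads, num_cores)
    placement = []
    for core in range(num_cores):
        count = base + (1 if core < extra else 0)  # first `extra` cores take one more
        placement.extend([core] * count)
    return placement  # placement[t] is the core of thread t

def compact_placement(num_threads, num_cores, hw_threads_per_core=4):
    """Fill every hardware thread of a core before moving on to the next core."""
    if num_threads > num_cores * hw_threads_per_core:
        raise ValueError("more threads than hardware threads")
    return [t // hw_threads_per_core for t in range(num_threads)]

for threads in (120, 240):
    bal = balanced_placement(threads, 60)
    com = compact_placement(threads, 60)
    print(f"{threads} threads: balanced uses {len(set(bal))} cores, "
          f"compact uses {len(set(com))} cores")
```

In this model, 120 threads occupy all 60 cores under balanced placement but only 30 under compact placement. At 240 threads both policies fill every core, so the gain reported there would have to come from pinning itself rather than from core coverage.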

Changing the leading dimension to avoid being a multiple of 4096 seems to have a small negative performance impact.

Sumedh_N_Intel · ‎09-30-2013

Hi Peter,

Could you tell me more about your issue with the Premier Support? What do you mean by the "support account is not working"? Is anyone helping you with this?

Peter_B_9 · ‎09-30-2013

Hi Sumedh,

I first tried to use our premier support account in August, but the website was down for maintenance for a week. The next time I tried, the website presented an invalid security certificate. Following that, it simply refused our credentials. We contacted Intel, but the problem was not resolved before our contract expired, without us ever having successfully used premier support. A representative from our purchasing department is negotiating a new support agreement with Intel, but given the unreliability of the site, I'm questioning its value.

I have found these forums to be a useful source of information, and one of your engineers, Zhang Zhang, also contacted me directly by e-mail and was very helpful.

Roman_D_Intel1 · ‎09-30-2013

Peter B. wrote:
Changing the leading dimension to avoid being a multiple of 4096 seems to have a small negative performance impact.

That's unexpected. I could get +2-5% performance by padding the leading dimension in the original benchmark. I've attached the modified code.

Bernard · ‎09-30-2013

Can you enable large page support (2 MB) in order to minimize pressure on the DTLB, and rerun your test?
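Before rerunning, it may be worth checking whether transparent huge pages are already active. The sketch below reads the standard Linux sysfs file and reports the selected mode. It assumes the kernel in question exposes /sys/kernel/mm/transparent_hugepage/enabled and that Python is available there; where it is not, the same file can simply be read with cat. The function names are illustrative.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def parse_thp_mode(text):
    """Return the bracketed token from a line such as '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no mode marked as selected

def current_thp_mode(path=THP_ENABLED):
    """Read the active THP mode, or None if the file is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None

print("transparent huge pages:", current_thp_mode())
```

A result of "always" would mean large allocations are already eligible for 2 MB pages without explicit huge page setup.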

Sumedh_N_Intel · ‎10-01-2013

Hi Peter,

We are sorry you experienced issues accessing Intel Premier support. The problems stemmed from a timing issue, as this reporting system was being moved to a new technology right when you had applied for an evaluation copy of the products. I believe there are communications happening with your group, outside of this forum, to resolve this situation – please confirm if you are not seeing progress on that front.

We are happy to hear that you’ve found our forum helpful. We believe each system has its place. For questions, usage, and dialog with Intel on programming, optimization and setup, the forum is the right place. For actual bugs or feature requests concerning the Intel® MPSS or other products, Intel Premier support is best: it directly reaches the developers who build and support the individual products, and it is the best process for notifying requestors of the status of a fix or enhancement.

Peter_B_9 · ‎10-01-2013

Roman: Measuring more carefully, I do see about a 1% improvement with the extra padding in the leading dimension.

Bernard: As noted, I tried using huge pages but saw a small degradation. Roman notes that transparent huge pages are enabled in the kernel by default.

(Reposting due to forum failure)

Bernard · ‎10-01-2013

Hi Peter

I do not know how much improvement in raw processing speed you will get with large pages, but it is worth trying.

Peter_B_9 · ‎10-02-2013

I tried to post an update yesterday but the forums were down again (some error about a spam filter).

Roman: I measured with the change to prevent 4096 alignment more carefully and I get about a 1% improvement. Combining this with KMP_AFFINITY=balanced I can achieve (and sometimes slightly exceed) the advertised throughput.

Bernard: As noted in the initial post, huge pages made no measurable improvement.

Sumedh: We have purchased a new support contract. I'll test the premier support forum soon.

Sumedh_N_Intel · ‎10-09-2013

Hi,

I just wanted to let you know that a small set of benchmarks, comprising GEMM, Linpack, Stream and SHOC, is provided along with the MPSS. You could use these benchmarks to reproduce performance numbers on your systems. The benchmarks are equipped with scripts that set up the environment, run the benchmark and report performance numbers.

Please note that installing these benchmarks along with the MPSS is optional, so you may need to install them separately if they were not installed with the MPSS.

For the MPSS Gold release Update 3, the benchmarks can be found in /opt/intel/mic/perf/. The sources can be found in /opt/intel/mic/perf/src, whereas the scripts can be found in /opt/intel/mic/perf/micp/micp/scripts.

How to demonstrate advertised DGEMM 833 GFlop/s performance?