Hi Charles Congdon,
I am using the Phi coprocessor for a project. When I run this function on the CPU (24 cores) it takes about 8.5 msec, but it takes about 700 msec on the Phi. To figure this out, I print the time at the beginning of the OpenMP part of the offload code:
#pragma omp parallel for
for( size_t y = 0; y < 192; y++)
{
    tmp[omp_get_thread_num()] = When();
    // ... rest of the offloaded work ...
}

__attribute__((target(mic))) double When()
{
#ifndef _WIN32
    static struct timeval tp;
    gettimeofday(&tp, NULL);
    double t  = (double)tp.tv_sec;
    double t1 = (double)tp.tv_usec;
    return (t + t1 * 1e-6);
#else
    clock_t start = clock();
    double duration = (double)start / CLOCKS_PER_SEC;
    return duration;
#endif
}
The timing results are:
id = 1, time = 1408126111.008337
id = 2, time = 1408126111.008337
id = 3, time = 1408126111.000000
id = 4, time = 1408126111.007646
id = 5, time = 1408126111.000000
id = 6, time = 1408126111.000000
id = 7, time = 1408126111.008337
id = 8, time = 1408126111.007646
id = 9, time = 1408126119.337340
id = 10, time = 1408126111.000007
id = 11, time = 1408126111.000000
id = 12, time = 1408126119.337340
id = 13, time = 1408126111.000007
id = 14, time = 1408126111.000007
id = 15, time = 1408126111.000000
id = 16, time = 1408126119.337340
id = 17, time = 1408126111.008337
id = 18, time = 1408126111.000007
id = 19, time = 1408126111.000007
id = 20, time = 1408126111.008337
id = 21, time = 1408126111.008337
id = 22, time = 1408126111.008337
id = 23, time = 1408126111.015979
id = 24, time = 1408126111.000015
id = 25, time = 1408126111.007645
id = 26, time = 1408126111.015983
id = 27, time = 1408126111.000000
id = 28, time = 1408126119.337340
id = 29, time = 1408126111.000000
id = 30, time = 1408126111.007644
id = 31, time = 1408126111.008337
id = 32, time = 1408126111.015964
id = 33, time = 1408126111.000007
id = 34, time = 1408126111.000007
id = 35, time = 1408126111.000000
id = 36, time = 1408126111.000007
id = 37, time = 1408126111.000000
id = 38, time = 1408126111.000000
id = 39, time = 1408126111.000000
id = 40, time = 1408126111.007639
id = 41, time = 1408126111.000000
id = 42, time = 1408126111.000000
id = 43, time = 1408126111.008337
id = 44, time = 1408126111.007643
id = 45, time = 1408126111.007644
id = 46, time = 1408126111.007644
id = 47, time = 1408126111.015980
id = 48, time = 1408126111.008337
id = 49, time = 1408126111.000000
id = 50, time = 1408126111.015983
id = 51, time = 1408126111.000007
id = 52, time = 1408126119.337340
id = 53, time = 1408126111.007640
id = 54, time = 1408126111.000000
id = 55, time = 1408126111.000000
id = 56, time = 1408126111.000000
id = 57, time = 1408126111.008337
id = 58, time = 1408126111.000007
id = 59, time = 1408126111.008337
id = 60, time = 1408126111.015960
id = 61, time = 1408126119.337340
id = 62, time = 1408126111.008337
id = 63, time = 1408126111.000000
id = 64, time = 1408126111.007646
id = 65, time = 1408126111.000000
id = 66, time = 1408126111.000000
id = 67, time = 1408126111.008337
id = 68, time = 1408126111.007646
id = 69, time = 1408126119.337340
id = 70, time = 1408126111.000007
id = 71, time = 1408126111.000000
id = 72, time = 1408126111.008337
id = 73, time = 1408126111.000007
id = 74, time = 1408126111.000000
id = 75, time = 1408126111.008337
id = 76, time = 1408126111.008337
id = 77, time = 1408126111.008337
id = 78, time = 1408126111.000007
id = 79, time = 1408126111.000007
id = 80, time = 1408126119.337340
id = 81, time = 1408126119.337340
id = 82, time = 1408126111.008337
id = 83, time = 1408126111.015979
id = 84, time = 1408126111.015964
id = 85, time = 1408126111.007645
id = 86, time = 1408126111.015983
id = 87, time = 1408126111.000000
id = 88, time = 1408126111.008337
id = 89, time = 1408126111.000007
id = 90, time = 1408126111.000000
id = 91, time = 1408126111.008337
id = 92, time = 1408126111.000015
id = 93, time = 1408126111.007646
id = 94, time = 1408126111.000007
id = 95, time = 1408126111.000000
id = 96, time = 1408126111.007646
id = 97, time = 1408126111.000000
id = 98, time = 1408126111.000000
id = 99, time = 1408126111.000000
id = 100, time = 1408126111.007639
id = 101, time = 1408126111.000000
id = 102, time = 1408126111.000000
id = 103, time = 1408126111.008337
id = 104, time = 1408126111.007643
id = 105, time = 1408126111.007644
id = 106, time = 1408126111.007644
id = 107, time = 1408126111.015980
id = 108, time = 1408126119.337340
id = 109, time = 1408126111.000000
id = 110, time = 1408126111.015983
id = 111, time = 1408126111.000007
id = 112, time = 1408126111.008337
id = 113, time = 1408126111.007640
id = 114, time = 1408126111.000000
id = 115, time = 1408126111.000000
id = 116, time = 1408126111.000000
id = 117, time = 1408126111.008337
id = 118, time = 1408126111.000007
id = 119, time = 1408126111.008337
id = 120, time = 1408126119.337340
id = 121, time = 1408126119.337340
id = 122, time = 1408126111.008337
id = 123, time = 1408126111.000000
id = 124, time = 1408126111.007646
id = 125, time = 1408126111.000000
id = 126, time = 1408126111.000000
id = 127, time = 1408126111.008337
id = 128, time = 1408126111.007646
id = 129, time = 1408126119.337340
id = 130, time = 1408126111.000007
id = 131, time = 1408126111.000000
id = 132, time = 1408126111.008337
id = 133, time = 1408126111.000007
id = 134, time = 1408126111.000007
id = 135, time = 1408126111.008337
id = 136, time = 1408126111.008337
id = 137, time = 1408126119.337340
id = 138, time = 1408126111.007646
id = 139, time = 1408126111.007646
id = 140, time = 1408126111.008337
id = 141, time = 1408126111.008337
id = 142, time = 1408126111.008337
id = 143, time = 1408126111.015979
id = 144, time = 1408126111.000015
id = 145, time = 1408126111.007645
id = 146, time = 1408126111.015983
id = 147, time = 1408126111.000000
id = 148, time = 1408126111.008337
id = 149, time = 1408126111.000000
id = 150, time = 1408126111.000000
id = 151, time = 1408126111.000000
id = 152, time = 1408126111.000015
id = 153, time = 1408126111.000007
id = 154, time = 1408126111.000007
id = 155, time = 1408126111.000000
id = 156, time = 1408126111.000007
id = 157, time = 1408126111.000000
id = 158, time = 1408126111.000007
id = 159, time = 1408126111.000000
id = 160, time = 1408126111.007639
id = 161, time = 1408126111.000000
id = 162, time = 1408126111.000000
id = 163, time = 1408126111.000000
id = 164, time = 1408126111.007643
id = 165, time = 1408126111.007644
id = 166, time = 1408126111.007644
id = 167, time = 1408126111.015980
id = 168, time = 1408126111.008337
id = 169, time = 1408126111.000007
id = 170, time = 1408126111.015983
id = 171, time = 1408126111.000000
id = 172, time = 1408126111.008337
id = 173, time = 1408126111.007640
id = 174, time = 1408126111.000000
id = 175, time = 1408126111.000000
id = 176, time = 1408126111.000000
id = 177, time = 1408126111.008337
id = 178, time = 1408126111.000007
id = 179, time = 1408126111.008337
id = 180, time = 1408126111.000007
id = 181, time = 1408126111.008337
id = 182, time = 1408126111.008337
id = 183, time = 1408126111.000000
id = 184, time = 1408126111.007646
id = 185, time = 1408126111.000000
id = 186, time = 1408126111.000000
id = 187, time = 1408126111.008337
id = 188, time = 1408126111.007646
id = 189, time = 1408126119.337340
id = 190, time = 1408126111.000007
id = 191, time = 1408126111.000000
You can see that at ids 9, 28, 61, 69, 121, ... the time is 1408126119.337340. Obviously, that is wrong. Could you tell me what happened with these timestamps? If possible, could you also let me know why the function is so much slower on the Phi?
My email is Xin.Chen@hermes-microvision.com. I really need your help!
Xin
Please show us more code if you would be so kind.
Where did you place your #pragma offload statement? If it is inside the #pragma omp parallel for loop, which your use of __attribute__((target(mic))) within the loop would suggest, then you have as many threads as host cores, each doing a separate offload every loop iteration. That would result in highly unpredictable behavior and timings.
Try to issue your offloads from a single host thread, and then use parallelism inside that offload to take full advantage of the coprocessor. If you are doing more work inside your loop than just timing it, you could also have a load imbalance (some threads doing more work than others). If you only have the timer in there, consider the precision of gettimeofday(), and whether it requires exclusive access to any system resources to get its result. Also, since not all threads run at the same time, and OpenMP does not start all threads instantly, the timestamps you query can be expected to differ.
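For illustration, here is a minimal sketch of that structure, with a single offload issued from one host thread and all of the parallelism inside it (the array name data, the size 240, and the doubling "work" are placeholders, not taken from your code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 240;                          /* placeholder problem size */
    double *data = (double *)malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) data[i] = (double)i;

    /* One offload from one host thread; the OpenMP loop inside it
       provides all the parallelism on the coprocessor. */
    #pragma offload target(mic) inout(data:length(N) alloc_if(1) free_if(1))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] *= 2.0;                     /* placeholder per-element work */
    }

    printf("data[1] = %f\n", data[1]);
    free(data);
    return 0;
}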
OpenMP/offload experts, feel free to weigh in.
I have some questions as well.
- I assume the code you gave us above has the actual workload code removed. Am I correct? If it isn't revealing too much, can you tell us the name of the workload?
- Have you factored in the offload setup and data-movement time of your offload statement? Thread setup and data movement take a non-trivial amount of time, so your offloaded workload has to do enough computation to make the offload worthwhile.
- Does your workload exploit the 512-bit vector unit? Each of the cores is an enhanced Pentium-generation core, meaning that to get good performance you need to exploit the vector unit as well as all the cores.
- Are you exploiting data persistence on the card to minimize data transfer times?
- What type of data locality do you have? If you have a lot of data locality between adjacent threads, you might try a "compact" affinity. If not, maybe "scatter".
Regards
--
Taylor
Hi Charles and Taylor,
Thank you very much for your ideas and help. Actually, this code is a simplified version of a parallel Phi computation that I adapted from a Phi test in the SDK. Since the number of cores is 240, the size of the array is also 240. Based on your comments, I have improved our case. You can find the two versions, old and current, below:
Before:
double t0 = When();
#pragma offload target(mic) inout(pi:length(240) alloc_if(1) free_if(1))
{
    printf("Get max number = %d\n", omp_get_max_threads());
    fflush(0);
    double t0 = When();
    #pragma omp parallel for
    for (int n = 0; n < 240; n++)
    {
        float tmp = pi[n];
        for (int i = 0; i < count; i++)
        {
            float t = (float)((i + 0.5f) / count);
            pi[n] += 4.0f / (1.0f + t * t);
        }
        pi[n] /= count;
    }
}
double t1 = When();
printf("Timer = %f\n", t1 - t0);
fflush(0);
Current:
double t0 = When();
#pragma offload target(mic) inout(pi:length(240) alloc_if(1) free_if(1))
{
    #pragma omp parallel for
    for (int n = 0; n < 240; n++)
    {
        int count = 10000;
        float tmp = pi[n];
        for (int i = 0; i < count; i++)
        {
            float t = (float)((i + 0.5f) / count);
            tmp += 4.0f / (1.0f + t * t);
        }
        tmp /= count;
        pi[n] = tmp;
    }
}
double t1 = When();
printf("Timer = %f\n", t1 - t0);
fflush(0);
Compared to the old one, the current version uses a local variable, tmp. Although it is lower than what we expected from the core count, it gives reasonable results. I conducted more experiments and only got about a 2x speedup (240 cores vs. 16 cores). Hence, I want to understand the memory model of the Phi so that I can design cache-friendly code. Could you help me again?
To start with, you will want to put an offload before the timed one you show that runs a little OpenMP dummy loop to start the OpenMP runtime. As it is, your code is timing the time it takes OpenMP to start up as well as your compute (on both the host and the coprocessor), and the startup may cost as much as the compute. Getting OpenMP running outside the timed offload statement will let you time just the actual computation.
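As a minimal sketch, a warm-up along these lines (placed before the timed offload) would start the coprocessor-side OpenMP runtime outside the measurement:

/* Untimed warm-up offload: the empty parallel region forces the offload
   runtime and the coprocessor-side OpenMP thread pool to start up here,
   so the timed offload that follows measures only the real computation. */
#pragma offload target(mic)
{
    #pragma omp parallel
    {
        /* intentionally empty */
    }
}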
Next, it is a good idea of yours to use a local variable (tmp) rather than updating pi[n] directly inside the inner loop.
Note that the performance numbers you may have seen are based on highly tuned SGEMM code and theoretical multiply/add rates. Your mileage may vary, and there are a number of other forum posts discussing how to get the best performance out of such code.
Concerning the architecture of the coprocessor, https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner is a good place to start. There's probably more on https://software.intel.com/en-us/mic-developer as well, an area you have clearly already discovered. Simple points to remember: (1) the memory used by each of the up to 4 hardware threads per core should ideally fit into that core's cache, (2) try to avoid referring to memory that sits in another core's cache, and (3) remember that there are at most 16 memory channels and a limit of 32 prefetches per core, so try to keep requests to main memory coming from only a few threads at a time.
Hi Charles,
Thanks a lot. I went to the MKL folder and found dgemm_example.c and the related files. Is that the right one?
I will spend time studying it. If I have any questions, I will contact you.
Hi Xin,
Your Xeon Phi has 60 cores, and each core has 4 hardware threads. Each core has an L1 and L2 cache that is shared amongst the hardware threads of that core. The efficiency of a core does not scale linearly with the number of hardware threads used within it. Due to the in-order architecture of the core, 2 threads per core will almost always run better than 1 thread per core, and 3 or 4 threads per core will usually run better than 2 threads per core (but not with linear scaling). Whether 3 or 4 threads per core runs better depends on the design of your program and the problem at hand.
Think of the scaling curve of each core as having the profile of a shark fin. Therefore, any scaling test you set up on Xeon Phi should be a 2D experiment, ScalingFor(nCores, nThreadsPerCore), as opposed to the traditional 1D ScalingFor(nCores) or 1D ScalingFor(nLogicalProcessors).
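As a rough sketch of such a 2D sweep (run_kernel() is a hypothetical stand-in for the code being measured, and it assumes thread placement is pinned externally, e.g. with an affinity/placement setting, so that "cores x threads per core" really lands on that many cores):

#include <stdio.h>
#include <omp.h>

void run_kernel(void);                            /* hypothetical workload under test */

void scaling_sweep(int max_cores)
{
    for (int cores = 1; cores <= max_cores; cores *= 2) {
        for (int tpc = 1; tpc <= 4; tpc++) {      /* 1..4 hardware threads per core */
            omp_set_num_threads(cores * tpc);
            double t0 = omp_get_wtime();
            run_kernel();
            double t1 = omp_get_wtime();
            printf("cores=%d tpc=%d time=%f s\n", cores, tpc, t1 - t0);
        }
    }
}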
Jim Dempsey
Hi Jim,
Thank you for your answer. I will redesign the evaluation function.
Xin
Hi Charles,
I cannot find the SGEMM source code. Could you tell me where it is?
Xin
The source code for SGEMM as modified for MKL isn't available (not even to most Intel employees), if that's what you're asking. The original source, e.g. from netlib.org, is equivalent for the single-threaded case. MIC benchmarks rely on the optimizations incorporated in MKL.
Greetings:
If you look at our code recipes pages (https://software.intel.com/en-us/articles/code-recipes-for-intelr-xeon-phitm-coprocessor), you should find an entry for "GEMM, STREAM, Linpack" that might help. I found similar instructions from Sumedh in another post:
I just wanted to let you know that a small set of benchmarks comprising GEMM, Linpack, Stream, and SHOC is provided along with the MPSS... The benchmarks are equipped with scripts that set up the environment, run the benchmark, and report performance numbers.
Please note that installing these benchmarks along with the MPSS is optional, so you may need to install them if they were not installed with the MPSS.
For the MPSS Gold release Update 3, the benchmarks can be found in /opt/intel/mic/perf/. The sources are in /opt/intel/mic/perf/src, whereas the scripts are in /opt/intel/mic/perf/micp/micp/scripts.
Charles
Hi Charles,
Nice to hear back from you so soon! Actually, we are designing a stream-like heterogeneous system with the Phi instead of using the MKL library directly. As a result, we really need to study DGEMM source code as a reference and deeply understand the memory hierarchy of the Phi. I would appreciate it if you could share part of the code, or give me some hints on how to improve the following code (A = B*0.5 + C*0.5):
#pragma offload target(mic) out(outImage:length(width*height) alloc_if(1) free_if(1)) in(img1,img2:length(width*height) alloc_if(1) free_if(1))
{
    //int a = 0;
    //omp_set_num_threads(192);
    const size_t iCPUNum = omp_get_max_threads();
    //printf("Get number = %d\n", iCPUNum);
    //fflush(0);
    const size_t ySegment = height / iCPUNum;
    #pragma omp parallel for
    for (size_t n = 0; n < iCPUNum; n++)
    {
        const size_t starty = n * ySegment;
        size_t endy = starty + ySegment;
        if (n == (iCPUNum - 1)) endy = height;
        unsigned char tmpArray1[width];
        unsigned char tmpArray2[width];
        unsigned char tmpArrayout[width];
        for (size_t y = starty; y < endy; y++)
        {
            memcpy(tmpArray1, &img1[y*width], width*sizeof(char));
            memcpy(tmpArray2, &img2[y*width], width*sizeof(char));
            for (size_t nn = 0; nn < LOOPNUM; nn++)
            {
                for (size_t x = 0; x < width; x++)
                {
                    tmpArrayout[x] = tmpArray1[x]*0.5f + tmpArray2[x]*0.5f;
                }
            }
            memcpy(&outImage[y*width], tmpArrayout, width*sizeof(char));
        }
    } // end of n < iCPUNum
} // end of pragma
size_t is a particularly awkward data type for the Intel(R) Xeon Phi(TM) to use as a for-loop index. At a minimum, you would want to check that you get a satisfactory vectorization report.
Hi Tim,
Thank you for your comments and reminders. Actually, I have seen the report:
imageAdd.cpp(168): (col. 9) remark: LOOP WAS VECTORIZED
imageAdd.cpp(168): (col. 9) remark: *MIC* LOOP WAS VECTORIZED
Of course, I will continue to check all the parts that could negatively affect performance. However, I think I should focus on the Phi's memory. I conducted some simple tests and concluded that the reason my program cannot reach the theoretical value is the cache-miss penalty.
That is why I am trying to deeply understand the Phi memory hierarchy and to learn from a good sample example from Intel.
Why all the memcpy's?
It would be better to ensure the img1, img2, and outImage buffers are cache-line aligned (and to use the restrict qualifier).
Then use #pragma omp parallel for simd.
That said, your "code" is a synthetic load you wish to apply for testing. I'd pick a more realistic synthetic kernel that models your expected use.
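A hedged sketch of what that might look like without the memcpy's (function and buffer-allocation names are illustrative; it assumes a compiler with OpenMP 4.0 "parallel for simd" support and C99 restrict):

#include <stdlib.h>

/* Blend two images directly from cache-line-aligned, non-aliasing buffers. */
void blend(unsigned char * restrict outImage,
           const unsigned char * restrict img1,
           const unsigned char * restrict img2,
           int n)                               /* n = width * height */
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        outImage[i] = (unsigned char)(img1[i] / 2 + img2[i] / 2);
}

/* 64-byte alignment matches the coprocessor cache line. */
unsigned char *alloc_image(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, bytes) != 0) return NULL;
    return (unsigned char *)p;
}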
Jim Dempsey
Also, for your synthetic load, force your tmpArrays to be cache aligned and then use #pragma simd on the inner loop.
I am sure you are aware that your innermost statement wouldn't make practical sense in a character manipulation program.
It would be much better to use:
tmpArrayout[x] = (tmpArray1[x] + tmpArray2[x]) / 2;
or
tmpArrayout[x] = (tmpArray1[x] >> 1) + (tmpArray2[x] >> 1);
depending on what the compiler and/or instruction set is able to do.
This would avoid converting between char and float (and messing up the vector widths).
Jim Dempsey
As Jim suggested, the up-conversion to float limits the vector width to 16 and incurs conversion overheads, whereas your goal would be to use the maximum width available with small integer operations.
In principle, a compiler should be able to change /2 to a shift. You might consider making those 2 and 1 constants explicitly smaller, e.g. (short int)2, in case Jim's suggestion otherwise forces everything to 32-bit width.
Unnecessary use of temporary copies might be the biggest contributor to your caching problems.
Make that (unsigned char)2, since the arrays are unsigned char.
If UINT8 is not supported, then consider "casting" to UINT32:
out = ((in1 >> 1) & 0x7F7F7F7F) + ((in2 >> 1) & 0x7F7F7F7F) + ((in1 & in2) & 0x01010101);
That should vectorize and handle 64 unsigned chars at a time (assuming you cast the array properly).
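Applied across a whole buffer, that idea might look like the sketch below (an assumption-laden illustration: the byte count is a multiple of 4, the buffers are aligned so they can be treated as uint32_t words, and the names are made up):

#include <stdint.h>

/* Average two images 4 packed bytes at a time: per byte this is
   (a>>1) + (b>>1) + (a&b&1); the masks keep shifted bits from
   crossing byte boundaries. */
void average_packed(uint32_t * restrict out,
                    const uint32_t * restrict in1,
                    const uint32_t * restrict in2,
                    int nwords)                  /* nwords = total bytes / 4 */
{
    #pragma omp parallel for simd
    for (int i = 0; i < nwords; i++)
        out[i] = ((in1[i] >> 1) & 0x7F7F7F7Fu)
               + ((in2[i] >> 1) & 0x7F7F7F7Fu)
               + ((in1[i] & in2[i]) & 0x01010101u);
}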
Jim Dempsey
The basic problem is that while the GEMM algorithm is easy to write, writing a DGEMM that squeezes every ounce of performance out of the available hardware is hard, as is measuring best-case performance. And as you might expect, this is all hardware-specific. In general, it is a waste of your valuable time to write things like DGEMM or any other non-trivial numerical kernel or solver for which library implementations are available, as there are a lot of subtleties involved with both numerical correctness/stability and performance that the library developers have already solved for you. Since you have the Intel(R) C++ Compiler you also have Intel(R) MKL, so why not use it?
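For reference, a minimal MKL call for a double-precision matrix multiply might look like the sketch below (sizes and fill values are purely illustrative):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const int n = 512;                           /* illustrative matrix size */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);                 /* 2.0 * n = 1024.0 for these inputs */
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}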
That said, we think that anyone doing serious scientific computing work can benefit greatly by having a good conceptual understanding of the main considerations involved. Here are some references to get you started:
- For a very good and human-readable explanation of the issues involved in writing good matrix-matrix multiplication routines, we highly recommend chapter 1, "Matrix Multiplication Problems", from the classic "Matrix Computations" text by Gene H. Golub and Charles F. Van Loan. Googling reveals several links to the full text of the third edition (there is now a 4th edition), e.g. http://web.mit.edu/ehliu/Public/sclark/Golub%20G.H.,%20Van%20Loan%20C.F.-%20Matrix%20Computations.pdf
- Here is a public paper which exposes many of the details of how DGEMM can be implemented on Xeon Phi: http://pcl.intel-research.net/publications/ipdps13_linpack.pdf
- See around page 50 of Drepper's classic "What Every Programmer Should Know About Memory": http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf. Old but effective. You won't match the MKL team, but it gives enough hints to do far better than the simple transpose method. Add AVX2 or IMCI intrinsics instead of SSE and you'll probably do well.
- By no means official, but the paper noted below describes a very good way to implement DGEMM for multithreaded architectures. It is completely open source:
T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. "Anatomy of High-Performance Many-Threaded Matrix Multiplication." Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, Arizona, May 2014. http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
Also known as FLAME Working Note #71: "Opportunities for Parallelism in Matrix Multiplication", The University of Texas at Austin, Department of Computer Science, Technical Report TR-13-20, 2013. http://www.cs.utexas.edu/users/flame/pubs/FLAWN71.pdf
Home page: https://code.google.com/p/blis/
GitHub source: https://github.com/flame/blis
Hope this helps.
Another reference:
You can still download GotoBLAS2 from TACC. It was faster than MKL at one point in time. This version does not include AVX, AVX2, or MIC support. https://www.tacc.utexas.edu/tacc-projects/gotoblas2
GotoBLAS2 lives on as OpenBLAS (http://www.openblas.net/), which supports AVX and FMA instructions (https://github.com/xianyi/OpenBLAS/blob/develop/README.md).
