Hi Charles Congdon,
I am using the Phi coprocessor for a project. When I run this function on the CPU (24 cores) it takes about 8.5 msec, but it takes about 700 msec on the Phi. To figure this out, I print the time at the beginning of the OpenMP part of the offload code:
#pragma omp parallel for
for( size_t y = 0; y < 192; y++)
{
    tmp[omp_get_thread_num()] = When();
    // ... rest of the offloaded work ...
}

__attribute__((target(mic))) double When()
{
#ifndef _WIN32
    static struct timeval tp;
    gettimeofday(&tp, NULL);
    double t  = (double)tp.tv_sec;
    double t1 = (double)tp.tv_usec;
    return (t + t1 * 1e-6);
#else
    clock_t start = clock();
    double duration = (double)start / CLOCKS_PER_SEC;
    return duration;
#endif
}
The timing results are:
id = 1, time = 1408126111.008337
id = 2, time = 1408126111.008337
id = 3, time = 1408126111.000000
id = 4, time = 1408126111.007646
id = 5, time = 1408126111.000000
id = 6, time = 1408126111.000000
id = 7, time = 1408126111.008337
id = 8, time = 1408126111.007646
id = 9, time = 1408126119.337340
id = 10, time = 1408126111.000007
id = 11, time = 1408126111.000000
id = 12, time = 1408126119.337340
id = 13, time = 1408126111.000007
id = 14, time = 1408126111.000007
id = 15, time = 1408126111.000000
id = 16, time = 1408126119.337340
id = 17, time = 1408126111.008337
id = 18, time = 1408126111.000007
id = 19, time = 1408126111.000007
id = 20, time = 1408126111.008337
id = 21, time = 1408126111.008337
id = 22, time = 1408126111.008337
id = 23, time = 1408126111.015979
id = 24, time = 1408126111.000015
id = 25, time = 1408126111.007645
id = 26, time = 1408126111.015983
id = 27, time = 1408126111.000000
id = 28, time = 1408126119.337340
id = 29, time = 1408126111.000000
id = 30, time = 1408126111.007644
id = 31, time = 1408126111.008337
id = 32, time = 1408126111.015964
id = 33, time = 1408126111.000007
id = 34, time = 1408126111.000007
id = 35, time = 1408126111.000000
id = 36, time = 1408126111.000007
id = 37, time = 1408126111.000000
id = 38, time = 1408126111.000000
id = 39, time = 1408126111.000000
id = 40, time = 1408126111.007639
id = 41, time = 1408126111.000000
id = 42, time = 1408126111.000000
id = 43, time = 1408126111.008337
id = 44, time = 1408126111.007643
id = 45, time = 1408126111.007644
id = 46, time = 1408126111.007644
id = 47, time = 1408126111.015980
id = 48, time = 1408126111.008337
id = 49, time = 1408126111.000000
id = 50, time = 1408126111.015983
id = 51, time = 1408126111.000007
id = 52, time = 1408126119.337340
id = 53, time = 1408126111.007640
id = 54, time = 1408126111.000000
id = 55, time = 1408126111.000000
id = 56, time = 1408126111.000000
id = 57, time = 1408126111.008337
id = 58, time = 1408126111.000007
id = 59, time = 1408126111.008337
id = 60, time = 1408126111.015960
id = 61, time = 1408126119.337340
id = 62, time = 1408126111.008337
id = 63, time = 1408126111.000000
id = 64, time = 1408126111.007646
id = 65, time = 1408126111.000000
id = 66, time = 1408126111.000000
id = 67, time = 1408126111.008337
id = 68, time = 1408126111.007646
id = 69, time = 1408126119.337340
id = 70, time = 1408126111.000007
id = 71, time = 1408126111.000000
id = 72, time = 1408126111.008337
id = 73, time = 1408126111.000007
id = 74, time = 1408126111.000000
id = 75, time = 1408126111.008337
id = 76, time = 1408126111.008337
id = 77, time = 1408126111.008337
id = 78, time = 1408126111.000007
id = 79, time = 1408126111.000007
id = 80, time = 1408126119.337340
id = 81, time = 1408126119.337340
id = 82, time = 1408126111.008337
id = 83, time = 1408126111.015979
id = 84, time = 1408126111.015964
id = 85, time = 1408126111.007645
id = 86, time = 1408126111.015983
id = 87, time = 1408126111.000000
id = 88, time = 1408126111.008337
id = 89, time = 1408126111.000007
id = 90, time = 1408126111.000000
id = 91, time = 1408126111.008337
id = 92, time = 1408126111.000015
id = 93, time = 1408126111.007646
id = 94, time = 1408126111.000007
id = 95, time = 1408126111.000000
id = 96, time = 1408126111.007646
id = 97, time = 1408126111.000000
id = 98, time = 1408126111.000000
id = 99, time = 1408126111.000000
id = 100, time = 1408126111.007639
id = 101, time = 1408126111.000000
id = 102, time = 1408126111.000000
id = 103, time = 1408126111.008337
id = 104, time = 1408126111.007643
id = 105, time = 1408126111.007644
id = 106, time = 1408126111.007644
id = 107, time = 1408126111.015980
id = 108, time = 1408126119.337340
id = 109, time = 1408126111.000000
id = 110, time = 1408126111.015983
id = 111, time = 1408126111.000007
id = 112, time = 1408126111.008337
id = 113, time = 1408126111.007640
id = 114, time = 1408126111.000000
id = 115, time = 1408126111.000000
id = 116, time = 1408126111.000000
id = 117, time = 1408126111.008337
id = 118, time = 1408126111.000007
id = 119, time = 1408126111.008337
id = 120, time = 1408126119.337340
id = 121, time = 1408126119.337340
id = 122, time = 1408126111.008337
id = 123, time = 1408126111.000000
id = 124, time = 1408126111.007646
id = 125, time = 1408126111.000000
id = 126, time = 1408126111.000000
id = 127, time = 1408126111.008337
id = 128, time = 1408126111.007646
id = 129, time = 1408126119.337340
id = 130, time = 1408126111.000007
id = 131, time = 1408126111.000000
id = 132, time = 1408126111.008337
id = 133, time = 1408126111.000007
id = 134, time = 1408126111.000007
id = 135, time = 1408126111.008337
id = 136, time = 1408126111.008337
id = 137, time = 1408126119.337340
id = 138, time = 1408126111.007646
id = 139, time = 1408126111.007646
id = 140, time = 1408126111.008337
id = 141, time = 1408126111.008337
id = 142, time = 1408126111.008337
id = 143, time = 1408126111.015979
id = 144, time = 1408126111.000015
id = 145, time = 1408126111.007645
id = 146, time = 1408126111.015983
id = 147, time = 1408126111.000000
id = 148, time = 1408126111.008337
id = 149, time = 1408126111.000000
id = 150, time = 1408126111.000000
id = 151, time = 1408126111.000000
id = 152, time = 1408126111.000015
id = 153, time = 1408126111.000007
id = 154, time = 1408126111.000007
id = 155, time = 1408126111.000000
id = 156, time = 1408126111.000007
id = 157, time = 1408126111.000000
id = 158, time = 1408126111.000007
id = 159, time = 1408126111.000000
id = 160, time = 1408126111.007639
id = 161, time = 1408126111.000000
id = 162, time = 1408126111.000000
id = 163, time = 1408126111.000000
id = 164, time = 1408126111.007643
id = 165, time = 1408126111.007644
id = 166, time = 1408126111.007644
id = 167, time = 1408126111.015980
id = 168, time = 1408126111.008337
id = 169, time = 1408126111.000007
id = 170, time = 1408126111.015983
id = 171, time = 1408126111.000000
id = 172, time = 1408126111.008337
id = 173, time = 1408126111.007640
id = 174, time = 1408126111.000000
id = 175, time = 1408126111.000000
id = 176, time = 1408126111.000000
id = 177, time = 1408126111.008337
id = 178, time = 1408126111.000007
id = 179, time = 1408126111.008337
id = 180, time = 1408126111.000007
id = 181, time = 1408126111.008337
id = 182, time = 1408126111.008337
id = 183, time = 1408126111.000000
id = 184, time = 1408126111.007646
id = 185, time = 1408126111.000000
id = 186, time = 1408126111.000000
id = 187, time = 1408126111.008337
id = 188, time = 1408126111.007646
id = 189, time = 1408126119.337340
id = 190, time = 1408126111.000007
id = 191, time = 1408126111.000000
You can see that at ids 9, 28, 61, 69, 121, ... the time is 1408126119.337340. Obviously, that is wrong. Could you tell me what happened with these timestamps? If possible, could you also let me know why the function is so much slower on the Phi?
My email is Xin.Chen@hermes-microvision.com. I really need your help!
Xin
Please show us more code if you would be so kind.
Where did you place your #pragma offload statement? If it is inside the #pragma omp parallel for loop, which your use of __attribute__((target(mic))) within the loop would suggest, then you have as many threads as host cores, each doing a separate offload every loop iteration. That would result in highly unpredictable behavior and timings.
Try to issue your offloads from a single host thread, and then use parallelism inside that offload to take full advantage of the coprocessor. If you are doing more work inside your loop than just timing it, you could also have a load imbalance (some threads doing more work than others). If you only have the timer in there, consider the precision of gettimeofday(), and whether it requires exclusive access to any system resources to get its result. Also, since not all threads run at the same time, and OpenMP does not start all threads instantly, the timestamps you query can be expected to differ.
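For illustration, here is a minimal sketch of that structure, with a single offload issued from one host thread and all of the parallelism inside it (the array name data, the size 240, and the doubling "work" are placeholders, not taken from your code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 240;                          /* placeholder problem size */
    double *data = (double *)malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) data[i] = (double)i;

    /* One offload from one host thread; the OpenMP loop inside it
       provides all the parallelism on the coprocessor. */
    #pragma offload target(mic) inout(data:length(N) alloc_if(1) free_if(1))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] *= 2.0;                     /* placeholder per-element work */
    }

    printf("data[1] = %f\n", data[1]);
    free(data);
    return 0;
}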
OpenMP/offload experts, feel free to weigh in.
I have some questions as well.
- I assume the code you gave us above has the actual workload code removed. Am I correct? If it isn't revealing too much, can you tell us the name of the workload?
- Have you factored in the offload setup and data-movement time of your offload statement? Thread setup and data movement take a non-trivial amount of time, so your offloaded workload has to do enough computation to make the offload worthwhile.
- Does your workload exploit the 512-bit vector unit? Each of the cores is an enhanced Pentium-generation core, meaning that to get good performance you need to exploit the vector unit as well as all the cores.
- Are you exploiting data persistence on the card to minimize data transfer times?
- What type of data locality do you have? If you have a lot of data locality between adjacent threads, you might try a "compact" affinity. If not, maybe "scatter".
Regards
--
Taylor
Hi Charles and Taylor,
Thank you very much for your ideas and help. Actually, this code is a simplified version of a parallel Phi computation that I adapted from a Phi test in the SDK. Since the number of cores is 240, the size of the array is also 240. Based on your comments, I have improved our case. You can find the two versions, old and current, below:
Before:
double t0 = When();
#pragma offload target(mic) inout(pi:length(240) alloc_if(1) free_if(1))
{
    printf("Get max number = %d\n", omp_get_max_threads());
    fflush(0);
    double t0 = When();
    #pragma omp parallel for
    for (int n = 0; n < 240; n++)
    {
        float tmp = pi[n];
        for (int i = 0; i < count; i++)
        {
            float t = (float)((i + 0.5f) / count);
            pi[n] += 4.0f / (1.0f + t * t);
        }
        pi[n] /= count;
    }
}
double t1 = When();
printf("Timer = %f\n", t1 - t0);
fflush(0);
Current:
double t0 = When();
#pragma offload target(mic) inout(pi:length(240) alloc_if(1) free_if(1))
{
    #pragma omp parallel for
    for (int n = 0; n < 240; n++)
    {
        int count = 10000;
        float tmp = pi[n];
        for (int i = 0; i < count; i++)
        {
            float t = (float)((i + 0.5f) / count);
            tmp += 4.0f / (1.0f + t * t);
        }
        tmp /= count;
        pi[n] = tmp;
    }
}
double t1 = When();
printf("Timer = %f\n", t1 - t0);
fflush(0);
Compared to the old one, the current version uses a local variable, tmp. Although it is lower than what we expected from the core count, it gives reasonable results. I conducted more experiments and only got about a 2x speedup (240 cores vs. 16 cores). Hence, I want to understand the memory model of the Phi so that I can design cache-friendly code. Could you help me again?
To start with, you will want to put an offload before the timed one you show that runs a little OpenMP dummy loop to start the OpenMP runtime. As it is, your code is timing the time it takes OpenMP to start up as well as your compute (on both the host and the coprocessor), and the startup may cost as much as the compute. Getting OpenMP running outside the timed offload statement will let you time just the actual computation.
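As a minimal sketch, a warm-up along these lines (placed before the timed offload) would start the coprocessor-side OpenMP runtime outside the measurement:

/* Untimed warm-up offload: the empty parallel region forces the offload
   runtime and the coprocessor-side OpenMP thread pool to start up here,
   so the timed offload that follows measures only the real computation. */
#pragma offload target(mic)
{
    #pragma omp parallel
    {
        /* intentionally empty */
    }
}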
Next, it is a good idea of yours to use a local variable (tmp) rather than updating pi[n] directly inside the inner loop.
Note that the performance numbers you may have seen are based on highly tuned SGEMM code and theoretical multiply/add rates. Your mileage may vary, and there are a number of other forum posts discussing how to get the best performance out of such code.
Concerning the architecture of the coprocessor, https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner is a good place to start. There's probably more on https://software.intel.com/en-us/mic-developer as well, an area you have clearly already discovered. Simple points to remember: (1) the memory used by each of the up to 4 hardware threads per core should ideally fit into that core's cache, (2) try to avoid referring to memory that sits in another core's cache, and (3) remember that there are at most 16 memory channels and a limit of 32 prefetches per core, so try to keep requests to main memory coming from only a few threads at a time.
Hi Charles,
Thanks a lot. I went to the MKL folder and found dgemm_example.c and the related files. Is that the right one?
I will spend time studying it. If I have any questions, I will contact you.
Hi Xin,
Your Xeon Phi has 60 cores, and each core has 4 hardware threads. Each core has an L1 and L2 cache that is shared amongst the hardware threads of that core. The efficiency of a core does not scale linearly with the number of hardware threads used within it. Due to the in-order architecture of the core, 2 threads per core will almost always run better than 1 thread per core, and 3 or 4 threads per core will usually run better than 2 threads per core (but not with linear scaling). Whether 3 or 4 threads per core runs better depends on the design of your program and the problem at hand.
Think of the scaling curve of each core as having the profile of a shark fin. Therefore, any scaling test you set up on Xeon Phi should be a 2D experiment, ScalingFor(nCores, nThreadsPerCore), as opposed to the traditional 1D ScalingFor(nCores) or 1D ScalingFor(nLogicalProcessors).
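As a rough sketch of such a 2D sweep (run_kernel() is a hypothetical stand-in for the code being measured, and it assumes thread placement is pinned externally, e.g. with an affinity/placement setting, so that "cores x threads per core" really lands on that many cores):

#include <stdio.h>
#include <omp.h>

void run_kernel(void);                            /* hypothetical workload under test */

void scaling_sweep(int max_cores)
{
    for (int cores = 1; cores <= max_cores; cores *= 2) {
        for (int tpc = 1; tpc <= 4; tpc++) {      /* 1..4 hardware threads per core */
            omp_set_num_threads(cores * tpc);
            double t0 = omp_get_wtime();
            run_kernel();
            double t1 = omp_get_wtime();
            printf("cores=%d tpc=%d time=%f s\n", cores, tpc, t1 - t0);
        }
    }
}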
Jim Dempsey
Hi Jim,
Thank you for your answer. I will redesign the evaluation function.
Xin
Hi Charles,
I cannot find the SGEMM source code. Could you tell me where it is?
Xin
The source code for SGEMM as modified for MKL isn't available (not even to most Intel employees), if that's what you're asking. The original source, e.g. from netlib.org, is equivalent for the single-threaded case. MIC benchmarks rely on the optimizations incorporated in MKL.
Greetings:
If you look at our code recipes pages (https://software.intel.com/en-us/articles/code-recipes-for-intelr-xeon-phitm-coprocessor), you should find an entry for "GEMM, STREAM, Linpack" that might help. I found similar instructions from Sumedh in another post:
I just wanted to let you know that a small set of benchmarks comprising GEMM, Linpack, Stream, and SHOC is provided along with the MPSS... The benchmarks are equipped with scripts that set up the environment, run the benchmark, and report performance numbers.
Please note that installing these benchmarks along with the MPSS is optional, so you may need to install them if they were not installed with the MPSS.
For the MPSS Gold release Update 3, the benchmarks can be found in /opt/intel/mic/perf/. The sources are in /opt/intel/mic/perf/src, whereas the scripts are in /opt/intel/mic/perf/micp/micp/scripts.
Charles
Hi Charles,
Nice to hear back from you so soon! Actually, we are designing a stream-like heterogeneous system with the Phi instead of using the MKL library directly. As a result, we really need to study DGEMM source code as a reference and deeply understand the memory hierarchy of the Phi. I would appreciate it if you could share part of the code, or give me some hints on how to improve the following code (A = B*0.5 + C*0.5):
#pragma offload target(mic) out(outImage:length(width*height) alloc_if(1) free_if(1)) in(img1,img2:length(width*height) alloc_if(1) free_if(1))
{
    //int a = 0;
    //omp_set_num_threads(192);
    const size_t iCPUNum = omp_get_max_threads();
    //printf("Get number = %d\n", iCPUNum);
    //fflush(0);
    const size_t ySegment = height / iCPUNum;
    #pragma omp parallel for
    for (size_t n = 0; n < iCPUNum; n++)
    {
        const size_t starty = n * ySegment;
        size_t endy = starty + ySegment;
        if (n == (iCPUNum - 1)) endy = height;
        unsigned char tmpArray1[width];
        unsigned char tmpArray2[width];
        unsigned char tmpArrayout[width];
        for (size_t y = starty; y < endy; y++)
        {
            memcpy(tmpArray1, &img1[y*width], width*sizeof(char));
            memcpy(tmpArray2, &img2[y*width], width*sizeof(char));
            for (size_t nn = 0; nn < LOOPNUM; nn++)
            {
                for (size_t x = 0; x < width; x++)
                {
                    tmpArrayout[x] = tmpArray1[x]*0.5f + tmpArray2[x]*0.5f;
                }
            }
            memcpy(&outImage[y*width], tmpArrayout, width*sizeof(char));
        }
    } // end of n < iCPUNum
} // end of pragma
size_t is a particularly awkward data type for the Intel(R) Xeon Phi(TM) to use as a for-loop index. At a minimum, you would want to check that you get a satisfactory vectorization report.
Hi Tim,
Thank you for your comments and reminders. Actually, I have seen the report:
imageAdd.cpp(168): (col. 9) remark: LOOP WAS VECTORIZED
imageAdd.cpp(168): (col. 9) remark: *MIC* LOOP WAS VECTORIZED
Of course, I will continue to check all the parts that could negatively affect performance. However, I think I should focus on the Phi's memory. I conducted some simple tests and concluded that the reason my program cannot reach the theoretical value is the cache-miss penalty.
That is why I am trying to deeply understand the Phi memory hierarchy and to learn from a good sample example from Intel.
Why all the memcpy's?
It would be better to ensure the img1, img2, and outImage buffers are cache-line aligned (and to use the restrict qualifier).
Then use #pragma omp parallel for simd.
That said, your "code" is a synthetic load you wish to apply for testing. I'd pick a more realistic synthetic kernel that models your expected use.
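A hedged sketch of what that might look like without the memcpy's (function and buffer-allocation names are illustrative; it assumes a compiler with OpenMP 4.0 "parallel for simd" support and C99 restrict):

#include <stdlib.h>

/* Blend two images directly from cache-line-aligned, non-aliasing buffers. */
void blend(unsigned char * restrict outImage,
           const unsigned char * restrict img1,
           const unsigned char * restrict img2,
           int n)                               /* n = width * height */
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        outImage[i] = (unsigned char)(img1[i] / 2 + img2[i] / 2);
}

/* 64-byte alignment matches the coprocessor cache line. */
unsigned char *alloc_image(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, bytes) != 0) return NULL;
    return (unsigned char *)p;
}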
Jim Dempsey
Also, for your synthetic load, force your tmpArrays to be cache aligned and then use #pragma simd on the inner loop.
I am sure you are aware that your innermost statement wouldn't make practical sense in a character manipulation program.
It would be much better to use:
tmpArrayout[x] = (tmpArray1[x] + tmpArray2[x]) / 2;
or
tmpArrayout[x] = (tmpArray1[x] >> 1) + (tmpArray2[x] >> 1);
depending on what the compiler and/or instruction set is able to do.
This would avoid converting between char and float (and messing up the vector widths).
Jim Dempsey
As Jim suggested, the up-conversion to float limits the vector width to 16 and incurs conversion overheads, whereas your goal would be to use the maximum width available with small integer operations.
In principle, a compiler should be able to change /2 to a shift. You might consider making those 2 and 1 constants explicitly smaller, e.g. (short int)2, in case Jim's suggestion otherwise forces everything to 32-bit width.
Unnecessary use of temporary copies might be the biggest contributor to your caching problems.
Make that (unsigned char)2, since the arrays are unsigned char.
If UINT8 is not supported, then consider "casting" to UINT32:
out = ((in1 >> 1) & 0x7F7F7F7F) + ((in2 >> 1) & 0x7F7F7F7F) + ((in1 & in2) & 0x01010101);
That should vectorize and handle 64 unsigned chars at a time (assuming you cast the array properly).
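Applied across a whole buffer, that idea might look like the sketch below (an assumption-laden illustration: the byte count is a multiple of 4, the buffers are aligned so they can be treated as uint32_t words, and the names are made up):

#include <stdint.h>

/* Average two images 4 packed bytes at a time: per byte this is
   (a>>1) + (b>>1) + (a&b&1); the masks keep shifted bits from
   crossing byte boundaries. */
void average_packed(uint32_t * restrict out,
                    const uint32_t * restrict in1,
                    const uint32_t * restrict in2,
                    int nwords)                  /* nwords = total bytes / 4 */
{
    #pragma omp parallel for simd
    for (int i = 0; i < nwords; i++)
        out[i] = ((in1[i] >> 1) & 0x7F7F7F7Fu)
               + ((in2[i] >> 1) & 0x7F7F7F7Fu)
               + ((in1[i] & in2[i]) & 0x01010101u);
}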
Jim Dempsey
The basic problem is that while the GEMM algorithm is easy to write, writing a DGEMM that squeezes every ounce of performance out of the available hardware is hard, as is measuring best-case performance. And as you might expect, this is all hardware-specific. In general, it is a waste of your valuable time to write things like DGEMM or any other non-trivial numerical kernel or solver for which library implementations are available, as there are a lot of subtleties involved with both numerical correctness/stability and performance that the library developers have already solved for you. Since you have the Intel(R) C++ Compiler you also have Intel(R) MKL, so why not use it?
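For reference, a minimal MKL call for a double-precision matrix multiply might look like the sketch below (sizes and fill values are purely illustrative):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const int n = 512;                           /* illustrative matrix size */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);                 /* 2.0 * n = 1024.0 for these inputs */
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}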
That said, we think that anyone doing serious scientific computing work can benefit greatly by having a good conceptual understanding of the main considerations involved. Here are some references to get you started:
- For a very good and human-readable explanation of the issues involved in writing good matrix-matrix multiplication routines, we highly recommend chapter 1, "Matrix Multiplication Problems", from the classic "Matrix Computations" text by Gene H. Golub and Charles F. Van Loan. Googling reveals several links to the full text of the third edition (there is now a 4th edition), e.g. http://web.mit.edu/ehliu/Public/sclark/Golub%20G.H.,%20Van%20Loan%20C.F.-%20Matrix%20Computations.pdf
- Here is a public paper which exposes many of the details of how DGEMM can be implemented on Xeon Phi: http://pcl.intel-research.net/publications/ipdps13_linpack.pdf
- See around page 50 of Drepper's classic "What Every Programmer Should Know About Memory": http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf. Old but effective. You won't match the MKL team, but it gives enough hints to do far better than the simple transpose method. Add AVX2 or IMCI intrinsics instead of SSE and you'll probably do well.
- By no means official, but the paper noted below describes a very good way to implement DGEMM for multithreaded architectures. It is completely open source:
T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. "Anatomy of High-Performance Many-Threaded Matrix Multiplication." Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, Arizona, May 2014. http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
Also known as FLAME Working Note #71: "Opportunities for Parallelism in Matrix Multiplication", The University of Texas at Austin, Department of Computer Science, Technical Report TR-13-20, 2013. http://www.cs.utexas.edu/users/flame/pubs/FLAWN71.pdf
Home page: https://code.google.com/p/blis/
GitHub source: https://github.com/flame/blis
Hope this helps.
Another reference:
You can still download GotoBLAS2 from TACC. It was faster than MKL at one point in time. This version does not include AVX, AVX2, or MIC support. https://www.tacc.utexas.edu/tacc-projects/gotoblas2
GotoBLAS2 lives on as OpenBLAS (http://www.openblas.net/), which supports AVX and FMA instructions (https://github.com/xianyi/OpenBLAS/blob/develop/README.md).
