Software Archive
Read-only legacy content

compiler bug with #pragma pack?

Emo_T_
Beginner
1,029 Views
I am getting a segmentation fault in the following program, compiled with icc -mmic -O0 test.c and executed natively on Xeon Phi. If the #pragma is removed it runs fine. It appears that the compiler (14.0.1) obeys the #pragma but forgets about it later...

#pragma pack(1)
struct test {
    char c;
    double d;
};

int main(void)
{
    struct test t;
    t.d = 0;    // segmentation fault here
    return 0;
}
0 Kudos
11 Replies
TimP
Honored Contributor III
1,029 Views

You're forcing a mis-aligned access which the hardware can't accommodate.   It would have to generate code to copy the double as a byte string to aligned storage.
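For illustration (a sketch only, not verified against this particular compiler version), the usual way to make such a packed field safe is to go through memcpy instead of storing through the mis-aligned lvalue; the compiler is then free to emit byte-wise or scalar code rather than assuming natural alignment:

#include <string.h>   /* memcpy */

#pragma pack(1)
struct test {
    char c;
    double d;    /* mis-aligned member because of the packing */
};

int main(void)
{
    struct test t;
    double val = 0;
    memcpy(&t.d, &val, sizeof val);   /* safe regardless of alignment */
    return 0;
}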

0 Kudos
Emo_T_
Beginner
1,029 Views
Thanks Tim! Indeed the segfault is caused by a vector instruction:

=> 0x00000000004004a3 <+31>: vpackstorelpd %zmm0,(%rax){%k1}

This instruction is supposed to be "unaligned", but apparently that means 512-bit unaligned yet 64-bit aligned. Ideally the compiler would be modified so as to avoid alignment assumptions that cannot be verified... In the meantime, is there any way to force the compiler to play it safe? I am trying to port a large code base which uses struct packing to speed up serialization, and it would be nice to be able to just run the code (as promised:) and worry about optimization later. Btw, here is a simpler test that causes the same error:

int main(void)
{
    char text[9];
    *(double*)(text + 1) = 0;    // mis-aligned store, same segfault
    return 0;
}
0 Kudos
James_C_Intel2
Employee
1,029 Views

"I am trying to port a large code base which uses struct packing to speed up serialization, and it would be nice to be able to just run the code (as promised:) and worry about optimization later."

This frightens me somewhat. If the major optimisation in this code is struct packing to speed up serialization, it seems very unlikely to be a good candidate for execution on Xeon Phi, since it has clearly been optimised to improve I/O performance, while the Xeon Phi works best for codes that are highly CPU-bound.

Of course, we all remember Ken Batcher's definition of a supercomputer: "A machine for turning a compute-bound problem into an I/O-bound one", but if you're starting with an I/O-bound problem you may achieve more by spending your money on a solid-state disk rather than a Xeon Phi.

http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html?iid=subhdr+products_flash 

0 Kudos
Emo_T_
Beginner
1,029 Views
Hi James, I guess my brief description of the code was misleading. It is a physics engine, very much CPU-bound. It evaluates the dynamics of a robot many times with different states and controls (in different threads running independently of each other), so as to figure out the best way to control the robot. The only I/O on MIC is loading the binary model file once - which is very fast and is not included in the timing tests. Structure packing did not affect overall CPU time, and so the structures were packed to make the model files smaller and more sensible, and also to speed up occasional serialization over sockets (which is not done when running on MIC). These are structures-of-dynamically-allocated-arrays actually.

Anyway, it turned out that modifying the code to run on MIC was easy once I knew what the problem was. The initial timing tests however are not as good as I had hoped. Compared to the host i7-3930K (slightly overclocked), the Xeon Phi is 18 times slower when running one model evaluation, and 1.2 times slower when running 224 model evaluations (again, one evaluation = one thread). MKL slows things down a bit on both the CPU and MIC -- perhaps because there is a large number of linear algebra operations on small vectors and matrices, and so inlining of native functions is more beneficial than using more optimized but not inlined library functions. I am using OpenMP for multi-threading, and compiling with -O2.

I am not quite sure what to try next, but here are some preliminary thoughts:

-- It is possible that the small cache-per-core on MIC is a problem; each evaluation/thread needs to access considerably more data than the available cache (most of it is self-generated, but still, it has to be stored somewhere). If that is the problem, it doesn't seem to have a solution.

-- Perhaps the vector registers are not utilized well, but that would also apply to the CPU and account for at most a factor of 2 (256-bit vs 512-bit), while the performance with 224 threads seems to be off by a larger factor. The fact that MKL does not help means that better vectorization will involve significant code changes to align the data on 512-bit boundaries... and it is not even clear if that makes sense given the large number of operations on small vectors (3D or 4D) mentioned earlier.

-- If I ever have the time I am considering an OpenCL implementation, using doubleN instead of double, so that each thread simulates N copies of the model at all times. Things get tricky because of collision detection, and the fact that different model states have different numbers of active contacts... but it may be possible to add fake contacts for padding and proceed in pure SIMD mode after the collision detection phase.

If you have any suggestions I would love to hear them. In particular, is there anything I should look for in VTune (not that I know how to use it on MIC:)? I mostly work with Visual Studio, but decided to install CentOS because the Windows version of MPSS still has glitches - which disappeared once I moved to Linux. In my brief experimentation with VTune on the CPU, I got a CPI of 0.6. Is this by any chance an indication that my code is particularly well suited for CPUs?

Yours, Emo

PS: The above comments are not meant to be negative. It is great that Intel is pushing the Xeon Phi, and I am sure I will find other uses for it even if this particular application turns out not to be a good candidate.
0 Kudos
TimP
Honored Contributor III
1,029 Views

Cache capacity is a frequent reason for MIC performance peaking at 2N-2 threads (or even N-1) on N cores. You would need to pay attention to affinity so as to spread the threads evenly across cores when not using 4 threads per core.
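For example (a sketch; the thread counts and file name are hypothetical), with the Intel OpenMP runtime the placement is normally controlled from the environment, e.g. KMP_AFFINITY=balanced or scatter together with OMP_NUM_THREADS set to 2-3 threads per core, and it is easy to print where the threads actually land:

/* affinity_check.c -- build natively, e.g. icc -mmic -openmp affinity_check.c,
   then run on the card as: KMP_AFFINITY=balanced OMP_NUM_THREADS=112 ./a.out */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu (glibc, Linux) */
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    printf("thread %3d on logical cpu %3d\n",
           omp_get_thread_num(), sched_getcpu());
    return 0;
}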

Very low CPI on Xeon, such as you quote, raises the suspicion that you don't execute many simd parallel instructions, or that you execute a lot of OpenMP spin waits.  I'd be a little surprised that mis-alignment didn't increase CPI.  I don't know of anyone attempting to run vectorized code on host CPU with mis-aligned data.  As you say, there's a difference between unaligned (data aligned according to data type but not according to simd width) and mis-aligned (not aligned according to data type).  If the compiler chose not to vectorize on account of visible mis-alignment, you would see it in the vec-report.

I can't guess what you mean by MKL slowing down your application. There are many ways to use MKL. The current MIC MKL ?gemm isn't optimized for minimum dimensions less than 32, and even at that size it's difficult to match host performance.   ifort MATMUL should do a better job at -O3 than MKL for cases which aren't large enough to benefit from invoking additional threads inside matrix multiplication.  Note that ifort -O3 on host implies -opt-matmul (which you must turn off to avoid using MKL) but currently there is no -opt-matmul for MIC.  But I'm wasting words since you didn't say if you are using matrix multiplication.
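For concreteness, if it is matrix multiplication, the comparison would be roughly between the two variants below (a sketch with a hypothetical 50x50 size; the naive loop is the kind of code the compiler can inline and vectorize in place, while the library call pays per-call overhead that only amortizes at larger sizes):

#include <mkl.h>   /* cblas_dgemm */

enum { DIM = 50 };   /* hypothetical small dimension from the discussion */

/* Hand-written triple loop; with -O2/-O3 the compiler can inline and vectorize this. */
static void gemm_naive(const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++) {
            double s = 0;
            for (k = 0; k < DIM; k++)
                s += A[i*DIM + k] * B[k*DIM + j];
            C[i*DIM + j] = s;
        }
}

/* Equivalent MKL call (sequential MKL assumed). */
static void gemm_mkl(const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                DIM, DIM, DIM, 1.0, A, DIM, B, DIM, 0.0, C, DIM);
}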

For optimization of vectorization of short aligned data, where you want the compiler to be permitted to access (but discard) data beyond the end of the loop, the compiler offers -opt-assume-safe-padding.  It's important to use aligned data if you want to see full advantage of MIC with loop lengths less than 2000 or so.
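As a sketch of the aligned-data part (names and sizes are made up), 64-byte aligned allocation plus the icc-specific __assume_aligned hint lets the compiler generate aligned 512-bit accesses even for short loops:

/* compile with e.g. icc -mmic -std=c99 -O2 aligned_sketch.c -- a sketch */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>   /* posix_memalign, free */

void scale(double *restrict a, const double *restrict b, double s, int n)
{
    __assume_aligned(a, 64);   /* icc-specific: promise 64-byte alignment */
    __assume_aligned(b, 64);
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}

int main(void)
{
    int n = 50;   /* short vectors, as in the discussion */
    double *a, *b;
    if (posix_memalign((void **)&a, 64, n * sizeof(double))) return 1;
    if (posix_memalign((void **)&b, 64, n * sizeof(double))) return 1;
    for (int i = 0; i < n; i++) b[i] = (double)i;
    scale(a, b, 2.0, n);
    free(a);
    free(b);
    return 0;
}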

0 Kudos
Emo_T_
Beginner
1,029 Views

Ok, I figured out how to use VTune. Here is a comparison of results, same code as described above, 224 OpenMP threads running in parallel, no synchronization among them. Not sure if it explains why the Xeon Phi is slower than the i7-3930K... It is notable that the i7 ends up executing far fewer instructions at a much better CPI... is this normal? Is there a way to get VTune to estimate vectorization for the CPU?

0 Kudos
Emo_T_
Beginner
1,029 Views
Hi Tim, I tried smaller numbers of threads but that only increases the advantage of the i7; for example with 96 threads I get 26 sec runtime on MIC vs 12.4 sec on the i7. If cache is the problem, one would expect using single precision instead of double precision to improve things. I reconfigured the code to use single precision everywhere, and it indeed helped: with 224 threads the runtime on MIC went from 36.8 sec down to 28.8 sec. However it also helped on the i7: from 28.8 sec down to 23.5 sec, so the relative speed remained the same (a factor of 1.2 in favor of the i7). VTune shows that almost all the time is spent in my code, with OpenMP and everything else being negligible.

Re MKL, what I meant (cryptically) was that I have native BLAS functions that can be replaced with the corresponding MKL functions with a configuration flag. I am using mkl:sequential (since threads are used for parallel evaluation). The vector/matrix dimensions are around 50, and sometimes less, so it is not surprising that MKL does not help. Overall, the code solves relatively small problems many times - each thread evaluates and integrates the robot dynamics for 10,000 time steps in my timing tests. One would think that would be a good candidate for a many-core processor...

So, unless you see something else in the VTune results I just posted, the conclusion is that I need alignment - which is easier said than done here, because this is not a typical HPC application where each vector is manipulated as a unit. Instead the elements of the vectors (e.g. the vector of joint velocities) sometimes need to be manipulated in a complicated pattern determined by the underlying kinematic tree - so aligning the first element does not mean that all subsequent operations will be aligned.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,029 Views

Without seeing the code it is hard to offer suggestions about what is going on and how to improve performance. From the rough description of your problem, it sounds like you may be able to benefit from lifting the vectorization to a higher loop level. Roughly speaking, assume you have N joints, each with 50 dimensions. Lifting here means taking the vectorization out of the within-a-joint operations (e.g. {x,y,z} = {x,y,z} + {dx,dy,dz}*dt) such that you instead populate the vector with one dimension from a collection of joints:

x[0:vw] = x[0:vw] + dx[0:vw]*dt;
y[0:vw] = y[0:vw] + dy[0:vw]*dt;
z[0:vw] = z[0:vw] + dz[0:vw]*dt;

Where vw is the vector width and the above is in C++ CEAN notation.

Assuming you have more than two joints, lifting the vectorization may yield faster code. This is the classic AOS versus SOA. For Xeon Phi, when applicable, SOA may be best.
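To make the AOS/SOA contrast concrete, here is the same update written both ways (a sketch with made-up names, vectorizing across simulations rather than within one joint):

#define NSIM 224   /* hypothetical number of simultaneous simulations */

/* AOS: one struct per simulation. The x/y/z of one simulation are adjacent,
   but the x's of different simulations are strided apart, so the loop below
   vectorizes poorly. */
struct JointAOS { double x, y, z, dx, dy, dz; };
struct JointAOS joint_aos[NSIM];

void step_aos(double dt)
{
    for (int n = 0; n < NSIM; n++) {
        joint_aos[n].x += joint_aos[n].dx * dt;
        joint_aos[n].y += joint_aos[n].dy * dt;
        joint_aos[n].z += joint_aos[n].dz * dt;
    }
}

/* SOA: one array per component. The x's of all simulations are contiguous,
   so each statement is a unit-stride loop the compiler can vectorize. */
struct JointSOA { double x[NSIM], y[NSIM], z[NSIM], dx[NSIM], dy[NSIM], dz[NSIM]; };
struct JointSOA joint_soa;

void step_soa(double dt)
{
    for (int n = 0; n < NSIM; n++) {
        joint_soa.x[n] += joint_soa.dx[n] * dt;
        joint_soa.y[n] += joint_soa.dy[n] * dt;
        joint_soa.z[n] += joint_soa.dz[n] * dt;
    }
}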

Jim Dempsey
 

0 Kudos
Emo_T_
Beginner
1,029 Views
Thanks Jim, I think your "lifting the vectorization" suggestion is related to the OpenCL option I mentioned earlier. Currently I have something like this:

struct Model {
    double joint_velocity[NJOINT];
    ...
} model[NSIMULATION];

#pragma omp parallel for
for( n=0; n<NSIMULATION; n++ )
    ...

while the OpenCL idea is to pack several simulations per struct, e.g. double8 fields in a packed_model[NSIMULATION/8], so that each thread advances 8 copies of the model in SIMD fashion.

For videos of what the engine is used for, see http://homes.cs.washington.edu/~todorov/ They show complex movements synthesized with optimal control methods, which in turn rely on brute-force physics simulation, using the same engine I am now trying to port to MIC and perhaps GPU.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,029 Views

Unless you mean speed, velocity typically has an X, Y and Z component. However, if this is a hinged system with one degree of freedom, then velocity could have one dimension. Similar thing with the other components.

As TimP pointed out, Xeon Phi (and other CPUs as well) will exhibit best performance with vectors .AND. when the vectors are adjacent. Thus you would want to unpack your current packed_model[NSIMULATION/8]; into:

double8 packed_model_joint[NSIMULATION/8];
double8 packed_model_other_thingie[NSIMULATION/8];
etc., or more usefully:
double packed_model_joint[NSIMULATION];
double packed_model_other_thingie[NSIMULATION];
 

The latter is preferred since the compiler optimizations may deal better with doubles than with your double8s.

This all depends on how large NSIMULATION is.

Jim Dempsey

 

0 Kudos
Emo_T_
Beginner
1,029 Views
Hi Jim, NSIMULATION is in the range 100 to 1000. The distinction between scalar speed and vector velocity is common in 3D. However, once you move to general N-dimensional configuration manifolds (where N could be anything, including 1) people tend to call everything "velocity". In my timing tests I am using a humanoid model with N = 30 (what I called NJOINT earlier).

To make things more concrete, I posted a code sample on my website: http://homes.cs.washington.edu/~todorov/files/sample.c It shows two large data structures (one with constant model parameters, the other with dynamic variables that are recomputed at each simulation step) as well as one of the *many* functions operating on these structures.

So, given that all relevant quantities are N-dimensional vectors where N can be different for each (set of) quantities, your proposal would correspond to:

struct PackedModel {
    doubleX vector1[NSIMULATION/X][N1];
    doubleX vector2[NSIMULATION/X][N2];
    ...
} packed_model;

The second part of your proposal corresponds to setting X=1. In contrast, my idea for OpenCL porting is to use:

struct PackedModel {
    doubleX vector1[N1];
    doubleX vector2[N2];
    ...
} packed_model[NSIMULATION/X];

Once the code is written with X left as a configuration option, one can quickly test all values of X allowed by OpenCL and find the optimal setting on a given device/compiler. The more important design decision is where [NSIMULATION/X] appears. Your proposal is unfortunately a non-starter for me because it would mean replacing every piece of code of the form some_operation(quantity1, quantity2, ...) with a loop over simulations. This would require massive rewriting.

In principle though, it would be interesting to know which approach is better. Of course it makes sense to operate on nearby data to maximize the benefits of cache. However, it is not clear to me what the best way to achieve this goal is here. If this were a simple computation processing each element of each vector independently, your approach would be better. But the computation is very much coupled, meaning that different elements of different vectors need to be accessed together... I guess what I need is lots of compute cores with lots of cache-per-core... good thing we still have Moore's law :)

Anyway, I already started porting the code to OpenCL. Will let you know if I get it to work.
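For what it's worth, here is a minimal plain-C sketch of leaving X as a build-time option (hypothetical names and sizes; OpenCL's doubleN is mocked up with a small struct):

/* Hypothetical sketch: X packed simulations per struct, chosen at build time,
   e.g. icc -DX=8 ... ; X=1 reduces to the current one-simulation-per-struct layout. */
#ifndef X
#define X 8
#endif

#define NJOINT      30
#define NSIMULATION 224

typedef struct { double v[X]; } doubleX;   /* stand-in for OpenCL's doubleN */

struct PackedModel {
    doubleX joint_velocity[NJOINT];
    doubleX joint_acceleration[NJOINT];
    /* ... other per-model quantities ... */
};

struct PackedModel packed_model[NSIMULATION / X];

/* Every operation then advances X simulations at once, e.g. an Euler step: */
void integrate(doubleX *vel, const doubleX *acc, double dt, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < X; k++)
            vel[i].v[k] += acc[i].v[k] * dt;
}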
0 Kudos
Reply