You're forcing a mis-aligned access that the hardware can't accommodate directly; the compiler would have to generate code that copies the double, as a byte string, into aligned storage.
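A minimal sketch of that workaround (the names here are hypothetical): copy the packed bytes into a properly aligned double before using it, which is essentially the code the compiler would have to generate for you.

#include <cstring>

// Hypothetical packed record: "v" sits at byte offset 1, so reading it
// directly through a double* is a mis-aligned access.
#pragma pack(push, 1)
struct PackedRecord {
    char tag;
    double v;
};
#pragma pack(pop)

double read_value(const PackedRecord *r) {
    double tmp;
    // memcpy moves the bytes into aligned storage; the compiler lowers
    // this to the cheapest safe load sequence for the target.
    std::memcpy(&tmp, &r->v, sizeof tmp);
    return tmp;
}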
"I am trying to port a large code base which uses struct packing to speed up serialization, and it would be nice to be able to just run the code (as promised:) and worry about optimization later."
This frightens me somewhat. If the major optimisation in this code is struct packing to speed up serialization, it seems very unlikely to be a good candidate for execution on Xeon Phi, since it has clearly been optimised to improve I/O performance, while the Xeon Phi works best for codes that are highly CPU-bound.
Of course, we all remember Ken Batcher's definition of a supercomputer: "A machine for turning a compute-bound problem into an I/O-bound one." But if you're starting with an I/O-bound problem, you may achieve more by spending your money on a solid-state disk rather than a Xeon Phi.
Cache capacity is a frequent reason for MIC performance peaking at 2N-2 threads (or even N-1) on N cores. You would need to pay attention to affinity so as to spread the threads evenly across the cores when not using 4 threads per core.
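As a hedged illustration (Linux and the Intel OpenMP runtime assumed), you can check where threads actually land by printing each thread's CPU, then run under a setting such as KMP_AFFINITY=balanced or KMP_AFFINITY=scatter to spread them evenly:

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux-specific

int main() {
    // Each OpenMP thread reports the logical CPU it is running on, so you
    // can verify that the affinity setting distributes threads across cores.
    #pragma omp parallel
    printf("thread %d -> cpu %d\n", omp_get_thread_num(), sched_getcpu());
    return 0;
}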
A very low CPI on the Xeon, such as you quote, raises the suspicion that you aren't executing many SIMD parallel instructions, or that you're executing a lot of OpenMP spin waits. I'd be a little surprised if mis-alignment didn't increase CPI. I don't know of anyone attempting to run vectorized code on the host CPU with mis-aligned data. As you say, there's a difference between unaligned (data aligned according to its data type but not according to the SIMD width) and mis-aligned (not aligned according to the data type). If the compiler chose not to vectorize on account of visible mis-alignment, you would see it in the vec-report.
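To make that distinction concrete, a small sketch (sizes and names are illustrative):

#include <cstdlib>

int main() {
    const int n = 1000;

    // "Unaligned": aligned for the double type (8 bytes) but not
    // necessarily to the 64-byte MIC SIMD width.
    double *unaligned = (double *)malloc(n * sizeof(double));

    // Aligned: 64-byte alignment matching the 512-bit vector registers.
    double *aligned = NULL;
    if (posix_memalign((void **)&aligned, 64, n * sizeof(double)))
        return 1;

    // "Mis-aligned": not aligned even for the element type; this is the
    // problem case discussed above.
    char *raw = (char *)malloc(n * sizeof(double) + 1);
    double *misaligned = (double *)(raw + 1);

    (void)unaligned; (void)misaligned;
    free(raw); free(unaligned); free(aligned);
    return 0;
}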
I can't guess what you mean by MKL slowing down your application; there are many ways to use MKL. The current MIC MKL ?gemm isn't optimized for minimum dimensions less than 32, and even at that size it's difficult to match host performance. ifort MATMUL should do a better job at -O3 than MKL for cases that aren't large enough to benefit from invoking additional threads inside the matrix multiplication. Note that ifort -O3 on the host implies -opt-matmul (which you must turn off to avoid using MKL), but currently there is no -opt-matmul for MIC. But I may be wasting words, since you didn't say whether you are using matrix multiplication.
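For reference, since there are many ways to use MKL: a plain CBLAS dgemm call with small, hypothetical dimensions in the range discussed; at sizes below roughly 32 the MIC gemm is not tuned, and the host (or ifort MATMUL) tends to win.

#include <mkl.h>

void small_gemm(const double *A, const double *B, double *C) {
    const int m = 24, n = 24, k = 24;   // below the ~32 threshold noted above
    // C = 1.0*A*B + 0.0*C, row-major. Sizes this small do not amortize
    // the cost of spawning extra threads inside the multiply.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}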
To optimize vectorization of short, aligned data, where you want the compiler to be permitted to access (but discard) data beyond the end of the loop, the compiler offers -opt-assume-safe-padding. It's important to use aligned data if you want to see the full advantage of MIC with loop lengths of less than 2000 or so.
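A sketch of the allocation pattern that -opt-assume-safe-padding rewards (the padding amount here is illustrative): align each array to 64 bytes and pad it by a full vector, so reads past the loop end stay inside the allocation.

#include <malloc.h>   // _mm_malloc/_mm_free with the Intel compiler

void fill(int n) {
    // 64-byte alignment plus one extra vector's worth (8 doubles) of
    // padding, so the compiler may safely load/store past index n-1.
    double *x = (double *)_mm_malloc((n + 8) * sizeof(double), 64);

    #pragma vector aligned
    for (int i = 0; i < n; ++i)
        x[i] = 2.0 * i;

    _mm_free(x);
}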
OK, I figured out how to use VTune. Here is a comparison of results: the same code as described above, 224 OpenMP threads running in parallel, no synchronization among them. I'm not sure it explains why the Xeon Phi is slower than the i7-3930K... It is notable that the i7 ends up executing far fewer instructions at a much better CPI; is this normal? Is there a way to get VTune to estimate vectorization for the CPU?
Without seeing the code it is hard to offer suggestions about what is going on and how to improve performance. From the rough description of your problem, it sounds like you may benefit from lifting the vectorization to a higher loop level. Roughly speaking: assume you have N joints, each with 50 dimensions. Lifting here means moving vectorization from within a single joint (e.g. {x,y,z} = {x,y,z} + {dx,dy,dz}*dt) to populating each vector with one dimension from a collection of joints:
x[0:vw] = x[0:vw] + dx[0:vw]*dt;
y[0:vw] = y[0:vw] + dy[0:vw]*dt;
z[0:vw] = z[0:vw] + dz[0:vw]*dt;
where vw is the vector width; the above is written in C/C++ CEAN (array) notation.
Assuming you have more than two joints, lifting the vectorization may yield faster code. This is the classic AoS versus SoA question, and for Xeon Phi, when applicable, SoA may be best (see the sketch after this post).
Jim Dempsey
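A minimal sketch of the lifting described above, with hypothetical names: per-component (SoA) arrays let each vector instruction update the same component of many joints at once.

// SoA layout: one contiguous array per component, so x[i] += dx[i]*dt
// vectorizes across joints instead of within a single 3-element joint.
struct JointsSoA {
    double *x, *y, *z;      // positions
    double *dx, *dy, *dz;   // velocities
};

void integrate(JointsSoA &j, int n, double dt) {
    // Each statement becomes a full-width SIMD update across joints.
    #pragma simd
    for (int i = 0; i < n; ++i) {
        j.x[i] += j.dx[i] * dt;
        j.y[i] += j.dy[i] * dt;
        j.z[i] += j.dz[i] * dt;
    }
}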
Unless you mean speed, velocity typically has X, Y, and Z components. However, if this is a hinged system with one degree of freedom, then velocity could have a single dimension. The same goes for the other components.
As TimP pointed out, Xeon Phi (and other CPUs as well) will exhibit the best performance with vectors .AND. when the vector data are adjacent. Thus you would want to unpack your current packed_model[NSIMULATION/8] into:
double8 packed_model_joint[NSIMULATION/8];
double8 packed_model_other_thingie[NSIMULATION/8];
etc., or, more usefully:
double packed_model_joint[NSIMULATION];
double packed_model_other_thingie[NSIMULATION];
The latter is preferred, since the compiler optimizations may deal better with plain doubles than with your double8's.
This all depends on how large NSIMULATION is.
Jim Dempsey
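As a hypothetical sketch of the unpacking (the layout of double8 is assumed here: eight doubles per simulation instance, with field 0 the joint value; the exact indexing depends on how the real code packs it):

enum { NSIMULATION = 1024 };            // illustrative size

// Assumed packed record: eight fields per simulation instance.
struct double8 { double v[8]; };

double8 packed_model[NSIMULATION];               // AoS: each field strides by 8
double  packed_model_joint[NSIMULATION];         // SoA: unit-stride per field
double  packed_model_other_thingie[NSIMULATION];

void unpack() {
    // One pass moves each field into its own contiguous array, giving the
    // unit-stride access the vectorizer wants.
    for (int i = 0; i < NSIMULATION; ++i) {
        packed_model_joint[i]         = packed_model[i].v[0];
        packed_model_other_thingie[i] = packed_model[i].v[1];
    }
}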