Software Archive
Read-only legacy content

Efficiently Use KNC Instructions on Unaligned Data

Hao_L_
Beginner
914 Views

MIC requires strict 64-byte data alignment to utilize the VPU, but why? I found that SPARC also has such a requirement, while other multi-core CPUs can handle unaligned data.

Since MIC can automatically vectorize a for loop over data (with compiler optimization), what happens if the data is unaligned in this case? Will the auto-vectorization still work? If yes, how?

MODIFIED CONTENT:

I would like to clarify my problem here.

I am trying to implement a sort-merge join on MIC using SIMD. When I implement the merge phase, I find that some of the sorted data are unaligned, so I am not sure whether there is any way to merge unaligned data with SIMD instructions.

I wrote a bitonic sort, which has been reported to perform well with SIMD on CPUs; this is why I am trying to implement the merge with intrinsics in a consistent way. So far, though, the sort phase is only 1.3x faster than the scalar sort. My current code does not use prefetching or huge pages, but this is still slow compared with the theoretical performance. I may try some optimizations and report back in the forum later.
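For reference, the scalar form of the bitonic merge in question (a sketch of the standard network, not my actual SIMD code): each stage compare-exchanges elements n/2 apart, which is what maps onto SIMD min/max operations.

```c
#include <stddef.h>

/* Bitonic merge of a bitonic sequence of length n (n a power of two):
 * compare-exchange elements n/2 apart, then recurse on both halves.
 * To merge two ascending runs, reverse the second run first so the
 * concatenation is bitonic. */
void bitonic_merge(int *a, size_t n)
{
    if (n < 2)
        return;
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        if (a[i] > a[i + half]) {
            int t = a[i];
            a[i] = a[i + half];
            a[i + half] = t;
        }
    }
    bitonic_merge(a, half);
    bitonic_merge(a + half, half);
}
```

Each inner compare-exchange stage is data-independent across lanes, which is why it vectorizes well.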

0 Kudos
12 Replies
TimP
Honored Contributor III
This is too broad a topic for a full answer here. The difference between the host and future MIC should narrow with the adoption of AVX-512. You can find published examples where the host requires alignment for vectorization.
jimdempseyatthecove
Honored Contributor III

>>Since MIC can automatically vectorize a for loop over data (with compiler optimization), what happens if the data is unaligned in this case? Will the auto-vectorization still work? If yes, how?

In general terms, once the compiler has determined that a given loop is a candidate for vectorization (i.e., that there is a payback for vectorizing a loop over arrays of unknown alignment), it inserts loop preamble code that tests the array alignment at run time. When alignment is not met, the preamble further analyzes the arrays to see whether they can be brought into alignment by executing a few iterations with scalar instructions. If they cannot, the entire loop is performed with scalar instructions. If alignment can be attained, the first few iterations are performed with scalar instructions until alignment is reached (the peel loop), the bulk of the remaining iterations are performed with vector instructions, and any leftover iterations are finished with scalar instructions (the remainder loop).

The code to work around misaligned data is NOT insignificant. It therefore behooves you to align your data; the little effort you put into your program to ensure alignment is well worth it.

For CPU designs that permit unaligned vector accesses, the unaligned loads and stores take longer to execute. Even on those architectures, it may therefore be beneficial for the compiler to perform the work described in the previous paragraph. Again: it behooves you to align your data.
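The peel/main/remainder structure described above can be sketched in plain C (a conceptual illustration, not actual compiler output; the 64-byte figure matches the KNC vector width):

```c
#include <stdint.h>
#include <stddef.h>

/* Sum an array of floats the way a vectorizing compiler would:
 * scalar "peel" iterations until the pointer is 64-byte aligned,
 * then a main loop over full 16-float (64-byte) chunks,
 * then a scalar "remainder" loop for the leftovers. */
float sum(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i = 0;

    /* Peel loop: scalar until 'a + i' reaches a 64-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) % 64) != 0)
        s += a[i++];

    /* Main loop: the compiler would emit aligned vector loads here. */
    for (; i + 16 <= n; i += 16)
        for (size_t j = 0; j < 16; ++j)
            s += a[i + j];

    /* Remainder loop: scalar cleanup. */
    for (; i < n; ++i)
        s += a[i];

    return s;
}
```

Note how much extra code surrounds the "real" loop body; that overhead is exactly why aligning the data up front pays off.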

Jim Dempsey

McCalpinJohn
Honored Contributor III

Xeon Phi does not require strict alignment to use the VPU!

It requires strict alignment to use memory operands with arithmetic instructions -- just like SSE requires.

Xeon Phi uses a pair of instructions (VLOADUNPACKLO + VLOADUNPACKHI) to load unaligned data into a register, while early SSE implementations could use a single instruction (typically MOVUPS).  BUT, the MOVUPS instruction ran at 1/2 the speed of the aligned MOVAPS instruction on Pentium II, Pentium III, Pentium M, and Core 2 (both Merom and Wolfdale).   It was only with Nehalem/Westmere that the unaligned loads became as fast as the aligned loads in the common cases, and only with Sandy Bridge/Ivy Bridge that the unaligned loads became as fast as the aligned loads in all cases (except page crossings, if I recall correctly).

The ability to use arbitrarily aligned memory operands with AVX instructions is very nice, but I shudder to think at how many transistors and how many person-years of engineering went into making that possible.  The core used in Xeon Phi needed to be cheaper, smaller, and use less power than any recent Intel processor implementation, so compromises had to be made.
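To make the two-instruction idea concrete, here is a portable C emulation of the net effect (an illustration only, not the KNC intrinsics themselves): the "lo" step covers the bytes up to the next 64-byte boundary, and the "hi" step fills the rest from the following cache line.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Emulate an unaligned 64-byte vector load the way KNC splits it:
 * one access covering the bytes up to the next 64-byte boundary
 * (the VLOADUNPACKLO part) and one covering the remainder from the
 * next cache line (the VLOADUNPACKHI part). */
void load_unaligned_64(void *dst, const void *src)
{
    const uint8_t *s = (const uint8_t *)src;
    uint8_t *d = (uint8_t *)dst;
    size_t lo = 64 - ((uintptr_t)s % 64); /* bytes left in first line */

    memcpy(d, s, lo);                     /* "loadunpacklo" portion   */
    if (lo < 64)
        memcpy(d + lo, s + lo, 64 - lo);  /* "loadunpackhi" portion   */
}
```

When the source happens to be aligned, the first access already covers all 64 bytes and the second does nothing, which mirrors the hardware's cheap aligned case.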

TimP
Honored Contributor III
As John points out, you are using the VPU even if you are not vectorizing successfully, so I didn't try to answer that part literally. Alignment adjustments may be taken care of by the compiler, at the cost of losing some of the advantage of aligned vectorization.
Hao_L_
Beginner

Hi John,

Thank you for your reply! Yes, what I want to ask is why MIC does not support SSE-style instructions for loading unaligned data. I have not found any instruction to load unaligned data; should I write a function manually? If yes, how? For example, I have an array of 64-bit integers, say array[10000], and the array is aligned on 64 bytes. Is it possible to load the array starting from array[2]? Since the first 8 elements are not in one cache line, the AVX-style _mm_load_epi64 does not seem to apply, and in practice it causes a segmentation fault.

 

John D. McCalpin wrote:

Xeon Phi does not require strict alignment to use the VPU!

It requires strict alignment to use memory operands with arithmetic instructions -- just like SSE requires.

Xeon Phi uses a pair of instructions (VLOADUNPACKLO + VLOADUNPACKHI) to load unaligned data into a register, while early SSE implementations could use a single instruction (typically MOVUPS).  BUT, the MOVUPS instruction ran at 1/2 the speed of the aligned MOVAPS instruction on Pentium II, Pentium III, Pentium M, and Core 2 (both Merom and Wolfdale).   It was only with Nehalem/Westmere that the unaligned loads became as fast as the aligned loads in the common cases, and only with Sandy Bridge/Ivy Bridge that the unaligned loads became as fast as the aligned loads in all cases (except page crossings, if I recall correctly).

The ability to use arbitrarily aligned memory operands with AVX instructions is very nice, but I shudder to think at how many transistors and how many person-years of engineering went into making that possible.  The core used in Xeon Phi needed to be cheaper, smaller, and use less power than any recent Intel processor implementation, so compromises had to be made.

jimdempseyatthecove
Honored Contributor III

Have you considered using: __m512i _mm512_i32gather_epi32 (__m512i vindex, void const* base_addr, int scale) to collect your 64-bit integers (crossing cache lines)?
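A scalar model of the gather's semantics (an illustration of the per-lane loop only, not how the hardware implements it): each of the 16 lanes fetches one 32-bit element at base + vindex[lane] * scale.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Scalar model of _mm512_i32gather_epi32's semantics: for each of the
 * 16 lanes, fetch a 32-bit element at base + vindex[lane] * scale.
 * The hardware performs roughly one fetch per distinct cache line
 * touched, which is why scattered indices make gather slow. */
void gather_epi32(int32_t dst[16], const int32_t vindex[16],
                  const void *base, int scale)
{
    const uint8_t *b = (const uint8_t *)base;
    for (int lane = 0; lane < 16; ++lane) {
        int32_t v;
        memcpy(&v, b + (ptrdiff_t)vindex[lane] * scale, sizeof v);
        dst[lane] = v;
    }
}
```

If all 16 lanes fall within one or two cache lines, the internal loop is short; with 16 distinct lines it degenerates to 16 serialized accesses.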

Jim Dempsey

McCalpinJohn
Honored Contributor III

Xeon Phi does not support any SSE instructions -- it has its own vector instruction set.

The Xeon Phi ISA reference manual describes the VLOADUNPACK* instructions that are used to load unaligned data.  The description of the instructions is more than a little confusing, so the best way to understand them is to see how the compiler generates the instructions and then compare that to the discussion in the ISA reference manual.   Any code that vectorizes but does not have perfect alignment will use these instructions, so it should be easy to generate a simple test and review the assembly listing.

The ISA reference manual refers to intrinsics to generate these instructions and the compiler documentation lists them as well.   You just need to search the compiler documentation (e.g., https://software.intel.com/en-us/compiler_15.0_ug_c) for "unpack" and then look for the intrinsics that apply to Xeon Phi.  
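For example, a loop like the following (a generic sketch; whether it vectorizes depends on your compiler and flags) can be compiled with `icc -mmic -S` and the assembly listing searched for VLOADUNPACKLO/VLOADUNPACKHI:

```c
/* Source and destination have alignment the compiler cannot prove,
 * so a KNC build that vectorizes this loop should emit the unaligned
 * VLOADUNPACK* load sequence; inspect the -S listing to confirm. */
void scale(double *restrict dst, const double *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0 * src[i];
}
```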

Hao_L_
Beginner

Oh yes, that reminds me. I asked a question about gather before, and John said the gather instruction uses an internal loop to load data into the register. At that time all my data were in different cache lines, so using gather was no better than a scalar for loop. In this case, if the eight elements are in only two cache lines, it seems I can benefit from the gather instruction, right?

 

jimdempseyatthecove wrote:

Have you considered using: __m512i _mm512_i32gather_epi32 (__m512i vindex, void const* base_addr, int scale) to collect your 64-bit integers (crossing cache lines)?

Jim Dempsey

Hao_L_
Beginner

Thank you John, the description of loadunpack* is really confusing. I have read the doc and will try some sample code. As Jim mentioned, if I only load data from two cache lines at a time, it seems the gather instruction also applies. In this case we only need to load two cache lines; will the internal loop degrade the load performance?

John D. McCalpin wrote:

Xeon Phi does not support any SSE instructions -- it has its own vector instruction set.

The Xeon Phi ISA reference manual describes the VLOADUNPACK* instructions that are used to load unaligned data.  The description of the instructions is more than a little confusing, so the best way to understand them is to see how the compiler generates the instructions and then compare that to the discussion in the ISA reference manual.   Any code that vectorizes but does not have perfect alignment will use these instructions, so it should be easy to generate a simple test and review the assembly listing.

The ISA reference manual refers to intrinsics to generate these instructions and the compiler documentation lists them as well.   You just need to search the compiler documentation (e.g., https://software.intel.com/en-us/compiler_15.0_ug_c) for "unpack" and then look for the intrinsics that apply to Xeon Phi.  

Hao_L_
Beginner

I tested the gather instruction, and it seems even slower than the scalar version... It does not seem to bring any benefit; the internal loop is still the bottleneck.

 

Hao L. wrote:

Oh yes, that reminds me. I asked a question about gather before, and John said the gather instruction uses an internal loop to load data into the register. At that time all my data were in different cache lines, so using gather was no better than a scalar for loop. In this case, if the eight elements are in only two cache lines, it seems I can benefit from the gather instruction, right?
TimP
Honored Contributor III

When you started this thread, it wasn't clear that you were referring to the difficulty of using intrinsics on unaligned data. When compiling C code, the compiler takes care of using different instructions for aligned and unaligned array sections. It's still not clear why you prefer intrinsics. No doubt, you would need to reconsider your optimization when the architecture changes.
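For example, a plain C loop like the following needs no intrinsics at all; the compiler generates peel/remainder code or unaligned load sequences for the misaligned section by itself (a sketch; exact code generation depends on your compiler and flags):

```c
#include <stddef.h>

/* The compiler chooses aligned vs. unaligned vector instructions per
 * array section; b + 3 below is deliberately misaligned relative to
 * any 64-byte boundary, and no intrinsics are required. */
void add_offset(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] += b[i + 3];
}
```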

 

 

 

 

Hao_L_
Beginner

I modified this thread; I hope it is clearer now.

I have previously written some code samples using OpenMP, leaving the optimization work to the compiler via vector pragmas, but the performance was not very good, and it was hard for me to know what happened in each step. I looked at some implementations on CPUs, some of which use SSE or AVX instructions, so I began to do the per-step optimization manually. So far I have found that KNC supports fewer instructions, and no document suggests that icc on MIC provides better auto-optimization than on a general CPU, so I think the tools and instructions on MIC are still limited. This is the main reason I decided to optimize each step myself with intrinsics.

Intrinsics are tedious and add a heavy extra workload, but I do not know whether there is a more elegant way to do optimizations tailored to my own needs.

Thank you!

Tim Prince wrote:

When you started this thread, it wasn't clear that you were referring to the difficulty of using intrinsics on unaligned data. When compiling C code, the compiler takes care of using different instructions for aligned and unaligned array sections. It's still not clear why you prefer intrinsics. No doubt, you would need to reconsider your optimization when the architecture changes.