
Jun_Hyun_S_ · ‎09-18-2014

Hi, I've been working with Xeon Phi to get optimal performance out of a simple, offloaded lookup -> return function.
The line below is the one causing the problem:

__m512i vec = _mm512_i32extgather_epi32 (v_index, p_lookup_table, _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);

The table was allocated with _mm_malloc with 4 KB alignment for DMA, and each entry is a 16-bit unsigned integer, hence the up-conversion to a 32-bit int and the 2-byte scale.

However, the compiler apparently rejects it as an invalid up-conversion argument to the intrinsic.
I believe I have included the right header (immintrin.h) and used indices that would not cause segfaults (although I don't think that's what this is about).

Any thoughts or opinions will be greatly appreciated.

Also, I'm accessing those lookup tables from all possible threads. Is there any way I can replicate the tables once per memory controller (8 in total) and bind each copy to its own controller to maximize read throughput? (Somewhat like how you run applications with numactl or code with <numa.h> included)

Kevin_D_Intel · ‎09-22-2014

I am checking with the developers about this.

Kevin_D_Intel · ‎09-23-2014

Development did not have insight into the question regarding the table replication. Regarding the compilation error, they believe the code in question participated in the host compilation in some manner and that the suggested resolution involves guarding the intrinsic call with #ifdef __MIC__ or __KNC__. They wrote:

The code above is absolutely correct (it would be better to use _MM_SCALE_2 instead of sizeof(uint16_t), but it does not matter in this case because _MM_SCALE_2 = 2).

And this code should be compiled for KNC without errors (I have just verified that the compiler generates correct code for this line when compiling for KNC.)

However, I would like to note that the non-default up-conversion and NT-hint parameters to this intrinsic are available only for KNC - when using this intrinsic for non-KNC code, only default values are allowed, which are _MM_UPCONV_EPI32_NONE and _MM_HINT_NONE.

So I guess what happened is that this line was compiled for the host, which caused the compilation error. The suggestion is to guard this code (which I believe is part of an offloaded code) with #ifdef __MIC__ or #ifdef __KNC__.
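A minimal sketch of the suggested guard, for readers who want to see its shape. The helper name lookup16, the scalar fallback, and the 64-byte alignment of idx and out on the coprocessor path are assumptions made for illustration and are not from the original code. The gather call itself follows the line from the question, with _MM_SCALE_2 substituted as suggested above.

```c
#include <stdint.h>
#ifdef __MIC__
#include <immintrin.h>  /* KNC intrinsics; only pulled in by the coprocessor pass */
#endif

/* Hypothetical helper: fetch 16 uint16_t table entries and widen them to 32 bits.
 * Assumption: on the coprocessor path, idx and out are 64-byte aligned. */
void lookup16(const int32_t *idx, const uint16_t *table, int32_t *out)
{
#ifdef __MIC__
    /* Coprocessor pass: one gather with uint16 -> int32 up-conversion. */
    __m512i v_index = _mm512_load_epi32(idx);
    __m512i vec = _mm512_i32extgather_epi32(v_index, table,
                                            _MM_UPCONV_EPI32_UINT16,
                                            _MM_SCALE_2, _MM_HINT_NT);
    _mm512_store_epi32(out, vec);
#else
    /* Host pass: the KNC-only arguments never reach the host compiler. */
    for (int i = 0; i < 16; ++i)
        out[i] = (int32_t)table[idx[i]];
#endif
}
```

Because the guard sits inside an ordinary function, the host compilation only ever sees the scalar loop, which should avoid the invalid up-conversion error described above.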

Jun_Hyun_S_ · ‎09-28-2014

Thank you for your comprehensive answer, Kevin.

It was indeed the missing #ifdef __MIC__ block.
I didn't realize that could be it, because I assumed the guard was implied by the preceding #pragma offload target(mic:0).
But it makes sense that you still have to guard the block explicitly: the offload may not trigger in the absence of a MIC device, in which case the offload block has to execute on the host processor, which currently does not support 512-bit vector intrinsics.

Also, for anyone who runs into the same problem: you need to guard the code with #ifdef __MIC__ ... #endif, but NOT directly after the offload pragma.
One of Intel's Phi development resources has an article explaining this. I would attach the link but could not find it.

Kevin_D_Intel · ‎09-29-2014

That is correct. The offload code construct is compiled for both the host and the coprocessor for the reason you indicate (absence of the coprocessor). As noted in the list of restrictions on offloaded code (item #2) and in the guidance on writing target-specific code, you may not use #ifdef __MIC__ inside the scope of the #pragma offload.
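One way to respect that restriction is to keep the offload block free of target-specific preprocessor checks and move them into a function that the block calls. The sketch below assumes the Intel compiler's __INTEL_OFFLOAD macro and target(mic) attribute; the names OFFLOAD_ATTR, running_on_coprocessor, and probe_offload are hypothetical and used only for illustration.

```c
#include <stdint.h>

#ifdef __INTEL_OFFLOAD
#define OFFLOAD_ATTR __attribute__((target(mic)))  /* also build for the coprocessor */
#else
#define OFFLOAD_ATTR  /* no offload support: plain host function */
#endif

/* Hypothetical helper: the __MIC__ check lives here, outside the pragma's scope. */
OFFLOAD_ATTR int running_on_coprocessor(void)
{
#ifdef __MIC__
    return 1;  /* compiled by the coprocessor pass */
#else
    return 0;  /* compiled by the host pass (also the fallback path) */
#endif
}

/* The offload block contains no #ifdef __MIC__, so the host and coprocessor
 * passes both see the same statement and the same variables. */
int probe_offload(void)
{
    int on_mic = 0;
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) out(on_mic)
#endif
    {
        on_mic = running_on_coprocessor();
    }
    return on_mic;
}
```

probe_offload() is expected to return 1 only when the block actually ran on the coprocessor. When no device is present, or the compiler has no offload support, the same block runs on the host and takes the #else branch.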

calling _mm512_i32extgather_epi32 emits invalid upconv argument error