Hi, I've been working with Xeon Phi to get optimal performance out of a simple, offloaded lookup -> return function.
Below is the line that's causing the problem:
__m512i vec = _mm512_i32extgather_epi32 (v_index, p_lookup_table, _MM_UPCONV_EPI32_UINT16, sizeof(uint16_t), _MM_HINT_NT);
The table has been allocated with _mm_malloc with 4KB alignment for DMA, and each entry is a 16-bit unsigned integer, hence the up-conversion to a 32-bit int (and the 2-byte scale).
However, the compiler apparently does not accept this as a valid up-conversion argument to the intrinsic.
I believe I have used the right header (immintrin.h) and indices that would not generate segfaults (although I don't think that's what this is about).
Any thoughts or opinions will be greatly appreciated.
Also, I'm accessing those lookup tables from all possible threads. Is there any way I can replicate the tables once per memory controller (8 in total) and bind each copy to its own controller to maximize read throughput? (Somewhat like running applications with numactl, or code that includes <numa.h>.)
I am checking with the developers about this.
The code above is correct. (It is better to use _MM_SCALE_2 instead of sizeof(uint16_t), but it does not matter in this case because _MM_SCALE_2 = 2.)
This code should compile for KNC without errors. (I have just verified that the compiler generates correct code for this line when compiling for KNC.)
However, note that the non-default up-conversion and NT-hint parameters to this intrinsic are available only on KNC. When this intrinsic is used in non-KNC code, only the default values are allowed, which are _MM_UPCONV_EPI32_NONE and _MM_HINT_NONE.
So my guess is that this line was also compiled for the host, which caused the compilation error. My suggestion is to guard this code (which I believe is part of an offloaded region) with #ifdef __MIC__ or #ifdef __KNC__.
Thank you for your comprehensive answer, Kevin.
It indeed was the missing #ifdef __MIC__ block.
I didn't realize that could be the cause, because I assumed the guard was implied by the preceding #pragma offload target(mic:0).
But it makes sense that you still have to guard the block explicitly: the offload may not trigger in the absence of a MIC device, in which case the offload block has to execute on the host processor, which currently does not support the 512-bit vector intrinsics.
Also, for anyone who runs into the same problem: the #ifdef __MIC__ ... #endif guard must NOT be placed right after the offload pragma.
There was an article explaining this somewhere in Intel's Phi development resources.
I would attach the link but could not find it.
That is correct. The offload code construct is compiled for both the host and the coprocessor, for the reason you indicate (possible absence of the coprocessor). As noted in the list of restrictions on offloaded code (item #2) and in the documentation on writing target-specific code, you may not use #ifdef __MIC__ inside the scope of the #pragma offload.
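For later readers, here is a minimal sketch of one way to arrange the guard so it stays outside the scope of the pragma: the target-specific code moves into a function that the offload block calls. The names lookup16, run_lookup and OFFLOAD_TARGET are hypothetical and not from the original post, and the staging through an aligned temporary is only an assumption made so the sketch does not depend on how the buffers are aligned on the coprocessor. The coprocessor branch is written from the intrinsic discussed above and has not been run here; the host branch is plain C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __MIC__
#include <immintrin.h>
#endif

/* Hypothetical macro: mark the helper for coprocessor compilation only when
   the Intel offload compiler is in use; other compilers see nothing. */
#ifdef __INTEL_OFFLOAD
#define OFFLOAD_TARGET __attribute__((target(mic)))
#else
#define OFFLOAD_TARGET
#endif

/* Hypothetical helper: out[i] = table[idx[i]] for 16 lanes.
   The #ifdef __MIC__ guard lives here, not directly under the pragma. */
OFFLOAD_TARGET
void lookup16(const uint16_t *table, const int32_t *idx, int32_t *out)
{
    int i;
#ifdef __MIC__
    /* Coprocessor build: 16-bit entries gathered and up-converted to 32 bits.
       The 64-byte aligned temporary is an assumption made for this sketch. */
    __attribute__((aligned(64))) int32_t tmp[16];
    __m512i v_index, vec;
    for (i = 0; i < 16; ++i)
        tmp[i] = idx[i];
    v_index = _mm512_load_epi32(tmp);
    vec = _mm512_i32extgather_epi32(v_index, table,
                                    _MM_UPCONV_EPI32_UINT16,
                                    _MM_SCALE_2, _MM_HINT_NT);
    _mm512_store_epi32(tmp, vec);
    for (i = 0; i < 16; ++i)
        out[i] = tmp[i];
#else
    /* Host build, also used when the offload falls back to the host. */
    for (i = 0; i < 16; ++i)
        out[i] = (int32_t)table[idx[i]];
#endif
}

/* Hypothetical driver: the offload block contains only a function call. */
void run_lookup(const uint16_t *table, int table_len,
                const int32_t *idx, int32_t *out)
{
    assert(table != NULL && idx != NULL && out != NULL && table_len > 0);
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) in(table : length(table_len)) \
                              in(idx : length(16)) out(out : length(16))
#endif
    {
        lookup16(table, idx, out);
    }
}
```

With this layout the host compilation never sees the KNC-only up-conversion and hint arguments, and the same call produces the same values on either side.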
