I have already answered this

Vincent_L_ · ‎12-11-2014

Hi, I recently ran into a weird problem.

It happens in an offload region like the one below:

double *values=(double *)_mm_malloc(100*sizeof(double),64);

#pragma offload target(mic:0) in(a:length(100))
{
...
int idx=3;
_mm512_mask_load_epi64(.. , writemask , &(values[idx]));
}

It compiles fine but fails at run time with the error "process on the device 0 was terminated by signal 11 (SIGSEGV)". I found that when idx is a multiple of 8 there is no such error. The Intel documentation says that the third argument of the _mm512_mask_load_epi64 function should be a 64-byte-aligned address. But I have already aligned values with _mm_malloc, so it should be 64-byte aligned no matter what idx is.

James_C_Intel2 · ‎12-12-2014

I have already answered this once in your other thread, but let's try again...

You point out that the documentation states that the third argument to _mm512_mask_load_epi64 must be 64-byte aligned, then show us your code, which looks like this:

_mm512_mask_load_epi64(.. , writemask , &(values[idx]));

Let me ask you some questions:

What is the third argument to the intrinsic? Is it "values", or "&(values[idx])" ?
If "values" is 64 byte aligned, what offset from 64 byte alignment does the third argument have when idx==3?
Is that offset zero?
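
For readers who want to check the arithmetic behind these questions, here is a minimal sketch in plain C. It is not code from the thread; the helper names offset_from_64 and element_offset_from_64 are hypothetical, and it only assumes that a double occupies 8 bytes and that the base pointer is 64-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte distance of p past the previous 64-byte
 * boundary. A result of 0 means p is 64-byte aligned. */
static size_t offset_from_64(const void *p)
{
    return (size_t)((uintptr_t)p % 64);
}

/* Hypothetical helper: offset of &base[idx] from 64-byte alignment,
 * assuming base itself is 64-byte aligned. */
static size_t element_offset_from_64(const double *base, size_t idx)
{
    assert(offset_from_64(base) == 0);  /* precondition: base is aligned */
    return offset_from_64(&base[idx]);
}
```

With 8-byte doubles, idx==3 puts the address 3*8 = 24 bytes past a 64-byte boundary, so the offset is not zero. Only indices that are multiples of 8 (8*8 = 64 bytes) land back on a boundary, which matches the observation in the first post.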

Vincent_L_ · ‎12-12-2014

Thanks a lot. I have figured out what I misunderstood (I had mistaken bits for bytes :)). Your information is very helpful.

But now I am facing a new question: I want to load elements 3 through 10 of the values array into a __m512d vector using the _mm512_mask_load_pd function. How can I avoid the alignment problem?

James_C_Intel2 · ‎12-16-2014

"But now I am facing a new question: I want to load elements 3 through 10 of the values array into a __m512d vector using the _mm512_mask_load_pd function. How can I avoid the alignment problem?"

I think you are asking the wrong question, because that question has no answer (assuming that you really mean the more general issue of loading a chunk at any arbitrary offset, not just the case where idx==3, which you could solve by suitably mis-aligning the array).

I think the question you're trying to ask is "How can I efficiently load masked mis-aligned 64 bit integer values into a vector register?" (Which doesn't assert beforehand an impossible condition of using an instruction that can't do the job!)

Unfortunately I'm not a vector instruction set expert, but if you ask that question you're more likely to get a useful answer!

p.s. Looking at the code the compiler generates for this operation is probably a good way to start to answer it, since the compiler embodies a lot of knowledge and expertise.
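
One portable workaround, offered only as a sketch and not as the method anyone in this thread recommends, is to copy the wanted elements into a 64-byte-aligned staging buffer first and then run the aligned load on that buffer. The helper name masked_stage_load and the LANES constant are hypothetical. The code below is plain scalar C that mimics the write-mask behaviour, and it is likely slower than whatever unaligned-load sequence the compiler or the instruction set offers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* number of doubles in one 512-bit vector */

/*
 * Hypothetical helper: copy up to 8 doubles starting at src (which may be
 * at any 8-byte-aligned address) into staging, which must be 64-byte
 * aligned. Lanes whose mask bit is 0 keep their previous value in staging
 * and are never read from src, mimicking a write-masked load.
 */
static void masked_stage_load(double *staging, const double *src, unsigned mask)
{
    assert(((uintptr_t)staging % 64) == 0);  /* staging must be aligned */
    for (int lane = 0; lane < LANES; ++lane) {
        if (mask & (1u << lane)) {
            staging[lane] = src[lane];
        }
    }
}
```

After a call such as masked_stage_load(staging, &values[3], 0xFF), staging holds values[3] through values[10] at a 64-byte-aligned address, so it satisfies the alignment requirement that &values[3] does not.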