Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7939 Discussions

How to specify __m512 to co-reside with __m256 or __m128

jimdempseyatthecove
Honored Contributor III
1,369 Views

While working with the Intrinsics Guide for AVX-512, one of the intrinsics I want to use is

__m512d _mm512_broadcastsd_pd(__m128d a)

However, I'd like to avoid using a mov to move data from register to register; I'd rather use a cast.

The __m128 registers are co-resident with the low 8 __m512 registers. So, what I am asking for is

__m512d foo compilerDirective(use a register in range of 0:7);

Or one to specifically specify the zmm register to use (in range of 0:7). (I'd rather not regress to assembler)

There would be a similar issue with specifying zmm to overlay with ymm.

Jim Dempsey

0 Kudos
8 Replies
Melanie_B_Intel
Employee

In the Intel Intrinsics Guide I see this; would it fit your needs? There is another intrinsic that zero-fills instead of leaving the upper bits undefined. I'm not a codegen expert, but I'd predict that the compiler would optimize away a register-to-register move if one were generated in the intermediate stages of compilation.

__m512d _mm512_castpd128_pd512 (__m128d a)
#include "immintrin.h"
CPUID Flags: AVX512F

Description

Cast vector of type __m128d to type __m512d; the upper 384 bits of the result are undefined. This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.
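A minimal sketch of how the cast intrinsics might be combined with the broadcast, assuming GCC or Clang on x86 (the target attribute and __builtin_cpu_supports are compiler-specific, and the names broadcast_low_lane and broadcast_low_avx512 are hypothetical). It uses _mm512_castpd512_pd128, the counterpart of the cast quoted above in the opposite direction, to hand the low 128 bits of a __m512d to _mm512_broadcastsd_pd. Whether a register-to-register move is actually emitted is up to the compiler's register allocator, so the generated assembly should be inspected rather than assumed.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX-512F for this one function only.
__attribute__((target("avx512f")))
static void broadcast_low_avx512(const double* in8, double* out8) {
    __m512d v  = _mm512_loadu_pd(in8);       // eight doubles in one 512-bit vector
    __m128d lo = _mm512_castpd512_pd128(v);  // reinterpret the low 128 bits (cast only)
    __m512d b  = _mm512_broadcastsd_pd(lo);  // copy lane 0 into all eight lanes
    _mm512_storeu_pd(out8, b);
}
#endif

// Hypothetical wrapper: uses the AVX-512 path when the CPU supports it,
// otherwise a scalar loop that produces the same result.
void broadcast_low_lane(const double* in8, double* out8) {
    assert(in8 != nullptr && out8 != nullptr);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f")) {
        broadcast_low_avx512(in8, out8);
        return;
    }
#endif
    for (int i = 0; i < 8; ++i) out8[i] = in8[0];  // scalar fallback
}
```

Calling broadcast_low_lane on an array of eight doubles fills the output with eight copies of the first element, on either path.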
jimdempseyatthecove
Honored Contributor III

Melanie,

Thanks for replying, however, I do not think you quite understand the nuance of the issue.

When declaring a SIMD variable

__m512d X;

The compiler is at liberty to assign the variable to any one of the 32 zmm registers.

__m128d Y;

The compiler has a restriction to assign the variable to any one of the 8 xmm registers
*** which share the low 128 bits of the low 8 zmm registers

__m256d Z;

The compiler has a restriction to assign the variable to any one of the 16 ymm registers
*** which share the low 256 bits of the low 16 zmm registers

All of this is quite clear to understand.

Now then, look at the intrinsic:

__m512d _mm512_broadcastsd_pd(__m128d a)

As documented in the Intrinsics Guide and in the Intel 64 and IA-32 architecture manual, the argument must be an xmm register. This restriction may be due to a limited number of bits available in the instruction encoding (not explained).

What I want to do, is use a __m512d register that is forced (coerced) into residing in zmm0 through zmm7, and thus have the low 128 bits appear in the respective xmm register (xmm0 through xmm7).

By having a directive to instruct the compiler to choose a zmm register overlaying with xmm, as well as ymm, and choosing a ymm register that overlays the xmm registers, then I can avoid burning a register and requiring an instruction to move the value between the registers.

All of the above is moot should the instruction decode of __m512d _mm512_broadcastsd_pd(__m128d a) accept ymm8:ymm15 as well as zmm8:zmm31. This is not documented as being supported.

I hope this explains the situation.

I think that, by declaring the __m512 variables first, the compiler might by chance be more likely to assign them from the lowest 8 zmm registers, but there is no assurance that it will.

Jim Dempsey

Melanie_B_Intel
Employee

If you see a problem with inefficient code generation, I think the best way to get traction on the issue is to provide a test case demonstrating the problem. It can then be reported and corrected within the compiler without adding a directive. The compiler performs multiple passes over the program, so it should be able to determine an optimal register assignment. Meanwhile, I discussed the issue with a codegen expert, who provided the information below; I hope it helps.

I’m not sure I understood Jim’s concerns correctly.

Jim wrote:

>> __m128d Y;
>> The compiler has a restriction to assign the variable to any one of the 8 xmm registers
>> *** which share the low 128 bits of the low 8 zmm registers

This may not be true, if I understand it correctly.

First of all, xmm8 – xmm31 also share the low 128 bits of the corresponding zmm8 – zmm31 registers.

Secondly, the compiler is free to allocate a 128-bit variable to xmm8 – xmm31 as soon as it knows that those registers are available.

The compiler may know that the xmm8 – xmm31 registers are available from a target option (-xcommon-avx512, for example). Even without a target architecture option, the compiler can assume that, since AVX-512 intrinsics are used, it is free to use AVX-512-specific features (the upper registers, for example).

This detailed explanation supports your general assertion:

  • the compiler would optimize away a register to register move, if one was generated in the intermediate stages of compilation.

which is true.

I’m not sure whether the example below demonstrates the feature Jim is looking for, but it may serve as a test bed for him:

#include <immintrin.h>

__m512d foo(__m128d a) {
    return _mm512_broadcastsd_pd(a);
}

Results:

vbroadcastsd %xmm0, %zmm0
jimdempseyatthecove
Honored Contributor III

To answer the question another way, is the following valid?

     vbroadcastsd %xmm31, %zmm0

It is unclear from the architecture manual whether the bit field in the vbroadcast that selects the xmm register permits register numbers higher than 7. IF the instruction's bit field for the xmm register is restricted to 3 bits, then I have a legitimate need for a means of specifying that a selected __m512d variable is allocated within the range of 0:7.

Note, if the vbroadcast does permit the use of all 32 registers, then the architecture manual could just as well have stated xmm/ymm/zmm were all valid for specifying the scalar to be used in the broadcast.

My intended use is to load 8 doubles into a __m512d variable (compatible for use with cast to __m128d) and then into the broadcast intrinsic to distribute the low double. Then sometime shortly thereafter use a mm512 pack instruction to shift the upper 7 doubles lower one cell such that the now lowest double can be used in the next broadcast, then the next, ...

IF the vbroadcast is limited to xmm0:xmm7, then I cannot do this should the compiler select a zmm register above zmm7.
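The access pattern described above could be sketched as follows. This is an illustration under assumptions, not the poster's actual code: it assumes GCC or Clang on x86, it uses _mm512_alignr_epi64 (valignq) as one possible way to shift the lanes down by one element, and the function names are hypothetical. Register assignment is left entirely to the compiler.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX-512F for this one function only.
__attribute__((target("avx512f")))
static void broadcast_each_avx512(const double* in8, double* out64) {
    __m512d v = _mm512_loadu_pd(in8);
    const __m512i zero = _mm512_setzero_si512();
    for (int k = 0; k < 8; ++k) {
        // Broadcast the current low double to all eight lanes.
        __m512d b = _mm512_broadcastsd_pd(_mm512_castpd512_pd128(v));
        _mm512_storeu_pd(out64 + 8 * k, b);
        // Shift lanes down by one: lane i takes lane i+1, lane 7 becomes 0.
        v = _mm512_castsi512_pd(
                _mm512_alignr_epi64(zero, _mm512_castpd_si512(v), 1));
    }
}
#endif

// Hypothetical wrapper: row k of the 8x8 output holds eight copies of in8[k].
void broadcast_each_lane(const double* in8, double* out64) {
    assert(in8 != nullptr && out64 != nullptr);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f")) {
        broadcast_each_avx512(in8, out64);
        return;
    }
#endif
    for (int k = 0; k < 8; ++k)          // scalar fallback, same result
        for (int j = 0; j < 8; ++j)
            out64[8 * k + j] = in8[k];
}
```

An alternative that avoids the shift altogether would be a lane permute such as _mm512_permutexvar_pd with a constant index vector; which form is cheaper depends on the target and is worth measuring.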

Jim Dempsey

Anoop_M_Intel
Employee

Hi Jim,

>>To answer the question another way, is the following valid?
>>vbroadcastsd %xmm31, %zmm0

The answer is yes. Below is a simple example demonstrating the same:

#include <immintrin.h>

extern "C" {
__m512d foo(__m128d a) {
    return _mm512_broadcastsd_pd(a);
}
}

#pragma linkage myconv (parameters(xmm12), result(zmm12))
#pragma use_linkage myconv(foo)

The above code generates the following instruction:

vbroadcastsd %xmm12, %zmm12
jimdempseyatthecove
Honored Contributor III

#pragma linkage is not documented? (compiler 17.0 update 1)
#pragma use_linkage is not documented? (compiler 17.0 update 1)

Can you post a link to the documentation? Barring that, can you post the description for these (and related) pragmas?

Jim Dempsey

SergeyKostrov
Valued Contributor II

>>...
>>#pragma linkage myconv (parameters(xmm12), result(zmm12))
>>#pragma use_linkage myconv(foo)
>>...

It looks like a Watcom-compiler-style pragma-based function declaration. (Nice to see that!)

jimdempseyatthecove
Honored Contributor III

Anoop,

How would you use linkage inside a large function?

void LargeFunction()
{
   ... statements
   for (int i = 0; i < N; ++i)
   {
      __m512d x; // place x in traditional xmm register (0:7)
      ...