Using SSE Intrinsics/movdqa in a linux driver?

pierre_n_1 · 08-18-2016

Hi,

I'm trying to use SSE intrinsics in the Linux kernel, following a previous post in this forum: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/543853

I've included x86intrin.h as described there, and I call kernel_fpu_begin before using my intrinsics. However, I get a General Protection Fault (0) when the movdqa instruction runs.

Basically, what my C code is doing is:

const u8 *someFunction(...) {
   const __m128i var = _mm_setzero_si128();
   const __m128i var2 = _mm_set1_epi8(0xf);
.....
   __m128i var3 =  _mm_loadu_si128(some_pointer);
....
}

The corresponding faulting instructions from the panic report are:

All code
========
   0:   00 48 c7                add    %cl,-0x39(%rax)
   3:   c1                      (bad)
   4:   f0 fe                   lock (bad)
   6:   f4                      hlt
   7:   81 ba 11 06 00 00 eb    cmpl   $0x2e66a6eb,0x611(%rdx)
   e:   a6 66 2e
  11:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  18:   00
  19:   0f 1f 00                nopl   (%rax)
  1c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  21:   55                      push   %rbp
  22:   48 8d 2c 24             lea    (%rsp),%rbp
  26:   48 8d 64 24 e0          lea    -0x20(%rsp),%rsp
  2b:*  66 0f 7f 45 f0          movdqa %xmm0,-0x10(%rbp)                <-- trapping instruction
  30:   48 85 ff                test   %rdi,%rdi
  33:   66 0f 7f 4d e0          movdqa %xmm1,-0x20(%rbp)
  38:   0f 84 f0 01 00 00       je     0x22e
  3e:   48                      rex.W
  3f:   85                      .byte 0x85

Code starting with the faulting instruction
===========================================
   0:   66 0f 7f 45 f0          movdqa %xmm0,-0x10(%rbp)
   5:   48 85 ff                test   %rdi,%rdi
   8:   66 0f 7f 4d e0          movdqa %xmm1,-0x20(%rbp)
   d:   0f 84 f0 01 00 00       je     0x203
  13:   48                      rex.W
  14:   85                      .byte 0x85

It seems that the address I give to movdqa is not aligned, but I don't really know how to check that.

According to the panic report, it happens right before I call _mm_setzero_si128. To make my code work, I had to add -mpreferred-stack-boundary=4 when compiling the unit containing the SSE instructions. I also tried -mstackrealign in case it was my stack that was not aligned, but it had no effect. So my compile command line is basically:

gcc ... (default kernel for Atom CPU) -fno-strict-aliasing -fno-common -mpreferred-stack-boundary=3 -march=atom -mtune=atom -m64 -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls (added compiling arguments) -mpreferred-stack-boundary=4 -mstackrealign
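
For reference, GCC's -mpreferred-stack-boundary=N takes an exponent: the stack is kept aligned to 2^N bytes, so the kernel default of 3 means 8 bytes and 4 means 16 bytes. A tiny sketch of that arithmetic (the helper name stack_boundary_bytes is made up for illustration):

```c
#include <stddef.h>

/* Hypothetical helper: byte alignment implied by -mpreferred-stack-boundary=n. */
static size_t stack_boundary_bytes(unsigned n)
{
    return (size_t)1 << n;   /* 2^n bytes */
}
```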

Has anyone had a similar problem, or does anyone have an idea for debugging this further? At the very least, how can I check whether the addresses given to movdqa are aligned?

Thanks!

Richard_Nutman · 08-22-2016

Hi,

Are you sure you're using _mm_loadu_si128? That should compile to movdqu, which shouldn't fault on unaligned addresses.

Is it possible the compiler somehow thinks the address is aligned, so it's using movdqa when the address really isn't aligned?
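
A minimal user-space sketch of the difference, assuming an x86-64 build with SSE2: _mm_loadu_si128 accepts any address, whereas _mm_load_si128 (movdqa) requires a 16-byte aligned one. The function name and the byte-summing logic are only there to give the load something to do.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Sum 16 bytes starting at p; p may have any alignment. */
static unsigned sum16_unaligned(const uint8_t *p)
{
    /* Unaligned load: does not fault on a misaligned address. */
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    /* SAD against zero sums each 8-byte half into its 64-bit lane. */
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    /* Add the low lane and the high lane. */
    return (unsigned)_mm_cvtsi128_si32(s) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
}
```

Swapping the load for _mm_load_si128 and passing a deliberately misaligned pointer should reproduce the kind of fault seen in the panic report.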

If you're loading values off the stack, you ideally want 16-byte alignment for SSE, not 4.

You can test for alignment as follows:

uintptr_t align = (uintptr_t)some_ptr;
align &= (16-1);        // where 16 byte alignment is required.
if(align)
   log("address is not aligned to 16 bytes");
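
A self-contained variant of the same check, written as a helper that also compiles in user space; in a driver the result would be reported with printk. The helper name is_aligned is illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if p is aligned to `align` bytes; align must be a power of two. */
static bool is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Calling is_aligned(ptr, 16) just before the load or store in question shows whether the address is what the instruction expects.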

jimdempseyatthecove · 08-23-2016

All code
========
... 
  1c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   Function starts on next instruction
  21:   55                      push   %rbp               save (push) outer scope base pointer
  22:   48 8d 2c 24             lea    (%rsp),%rbp        set base pointer to new stack frame
  26:   48 8d 64 24 e0          lea    -0x20(%rsp),%rsp   reserve 32 bytes on stack for local variables
                                                          *** note stack is not aligned here
              trapping instruction
  2b:*  66 0f 7f 45 f0          movdqa %xmm0,-0x10(%rbp)  aligned move of 16 bytes starting at 16 byte
                                                          below unaligned base pointer
  30:   48 85 ff                test   %rdi,%rdi
  33:   66 0f 7f 4d e0          movdqa %xmm1,-0x20(%rbp)  *** this will trap as well
...

The problem is that the local variables used for aligned SSE accesses were not declared as aligned. If this is an assembler routine you wrote yourself, then the alignment is your responsibility. Most C/C++ compilers have a means of declaring aligned data structures. The __m128i (and __m256i) types should be aligned automatically; you may be casting an array to one of them, which will not align the data location.
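
As a rough illustration of declaring aligned storage with GCC or Clang (the function and buffer names are made up, and this is a user-space sketch rather than driver code): the aligned attribute requests 16-byte alignment for a raw byte array, which a plain byte array does not guarantee.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Copy 16 bytes through a local buffer declared with 16-byte alignment. */
static void copy16_via_aligned_local(const uint8_t *src, uint8_t *dst)
{
    uint8_t tmp[16] __attribute__((aligned(16)));   /* GCC/Clang attribute */

    /* Unaligned load from the caller's pointer, aligned store into tmp. */
    _mm_store_si128((__m128i *)tmp, _mm_loadu_si128((const __m128i *)src));
    /* The aligned load is safe because tmp was declared aligned. */
    _mm_storeu_si128((__m128i *)dst, _mm_load_si128((const __m128i *)tmp));
}
```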

Self alignment may look like this:

All code
========
... 
  1c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   Function starts on next instruction
  21:   55                      push   %rbp               save (push) outer scope base pointer
                                                          %rbp is now a free register
                                                          *** do not perform the following in the stack register
                                lea    -0x1F(%rsp),%rbp   set base pointer to stack pointer -31
                                and    $-0x20,%rbp        base pointer points to desired stack (32 byte aligned)
                                                          note (%rbp) change in next instruction
                                lea    -0x20(%rbp),%rsp   reserve 32 bytes on stack for local variables
...

Note, do not attempt to perform the alignment directly in the stack pointer. Should an external interrupt interrupt your thread before you complete the alignment, your thread will crash.

The above is a sketch. I suggest you write a sample function with two aligned __m128i (or __m256i) variables as the first two arguments, build it, and then look at the code the compiler generates for stack frame alignment.
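
One possible test function along those lines, assuming GCC on x86-64 with SSE2 (so it uses __m128i and two locals rather than arguments; all names are made up). The volatile qualifiers are only there to force both vectors onto the stack, so that compiling with gcc -O2 -S shows how the prologue aligns the frame.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Returns the OR of the low 4 address bits of two stack-resident vectors;
 * 0 means both locals ended up 16-byte aligned. */
static unsigned local_vector_misalignment(void)
{
    volatile __m128i a = _mm_setzero_si128();
    volatile __m128i b = _mm_set1_epi8(0x0f);

    return (unsigned)(((uintptr_t)&a | (uintptr_t)&b) & 15u);
}
```

Building the same file once with the kernel's flags and once with the added flags, then comparing the two prologues, should show whether the realignment is actually being emitted.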

Jim Dempsey