Hi,
I'm trying to use SSE intrinsics in the Linux kernel, following a previous post in this forum: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/543853
I've included x86intrin.h as described there, and I call kernel_fpu_begin before calling my intrinsics. However, I get a General Protection Fault(0) when the movdqa instruction executes.
Basically, what my C code is doing is:
const u8 *someFunction(...)
{
    const __m128i var = _mm_setzero_si128();
    const __m128i var2 = _mm_set1_epi8(0xf);
    .....
    __m128i var3 = _mm_loadu_si128(some_pointer);
    ....
}
And the corresponding faulty ASM instructions given are:
All code
========
   0:   00 48 c7                add    %cl,-0x39(%rax)
   3:   c1                      (bad)
   4:   f0 fe                   lock (bad)
   6:   f4                      hlt
   7:   81 ba 11 06 00 00 eb    cmpl   $0x2e66a6eb,0x611(%rdx)
   e:   a6 66 2e
  11:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  18:   00
  19:   0f 1f 00                nopl   (%rax)
  1c:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  21:   55                      push   %rbp
  22:   48 8d 2c 24             lea    (%rsp),%rbp
  26:   48 8d 64 24 e0          lea    -0x20(%rsp),%rsp
  2b:*  66 0f 7f 45 f0          movdqa %xmm0,-0x10(%rbp)    <-- trapping instruction
  30:   48 85 ff                test   %rdi,%rdi
  33:   66 0f 7f 4d e0          movdqa %xmm1,-0x20(%rbp)
  38:   0f 84 f0 01 00 00       je     0x22e
  3e:   48                      rex.W
  3f:   85                      .byte 0x85

Code starting with the faulting instruction
===========================================
   0:   66 0f 7f 45 f0          movdqa %xmm0,-0x10(%rbp)
   5:   48 85 ff                test   %rdi,%rdi
   8:   66 0f 7f 4d e0          movdqa %xmm1,-0x20(%rbp)
   d:   0f 84 f0 01 00 00       je     0x203
  13:   48                      rex.W
  14:   85                      .byte 0x85
It seems that the address I give to movdqa is not aligned, but I don't know how to check that.
According to the panic report, the fault happens right before I call _mm_setzero_si128. To make my code work, I had to add -mpreferred-stack-boundary=4 when compiling the unit containing the SSE instructions. I also tried -mstackrealign in case it was my stack that was not aligned, but it had no effect. So my compile command line is basically:
gcc ...
(default kernel for Atom CPU) -fno-strict-aliasing -fno-common -mpreferred-stack-boundary=3 -march=atom -mtune=atom -m64 -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls
(added compiling arguments) -mpreferred-stack-boundary=4 -mstackrealign
Has anyone had a similar problem, or does anyone have an idea for debugging this further? At the very least, how can I check whether the address given to movdqa is aligned?
Thanks!
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Hi,
Are you sure you're using _mm_loadu_si128? That should compile to movdqu, which does not fault on unaligned addresses.
Is it possible the compiler somehow thinks the address is aligned, so it emits movdqa when the address really isn't aligned?
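If you're loading values off the stack, you ideally want 16-byte alignment for SSE. (GCC's -mpreferred-stack-boundary=N sets the boundary to 2^N bytes, so a value of 4 does request 16 bytes, while a value of 3 gives only 8.)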
You can test for alignment as follows:
uintptr_t align = (uintptr_t)some_ptr;
align &= (16 - 1); // where 16-byte alignment is required
if (align)
    log("address is not aligned to 16 bytes");
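For reference, the same test can be wrapped in a small helper. This is only a minimal userspace sketch: the name is_aligned is hypothetical, and in kernel code you would report the result with printk (the kernel also provides an IS_ALIGNED macro for this kind of check).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if ptr is a multiple of `alignment` bytes.
 * `alignment` must be a power of two (16 for the memory operand of movdqa). */
static bool is_aligned(const void *ptr, size_t alignment)
{
    /* The mask trick below only works for powers of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

Calling is_aligned(some_ptr, 16) just before the load tells you whether an aligned instruction would be safe for that address.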
All code
========
...
  1c:   0f 1f 44 00 00      nopl   0x0(%rax,%rax,1)
                            ; Function starts on next instruction
  21:   55                  push   %rbp                ; save (push) outer scope base pointer
  22:   48 8d 2c 24         lea    (%rsp),%rbp         ; set base pointer to new stack frame
  26:   48 8d 64 24 e0      lea    -0x20(%rsp),%rsp    ; reserve 32 bytes on stack for local variables
                            ; *** note stack is not aligned here
                            ; trapping instruction
  2b:*  66 0f 7f 45 f0      movdqa %xmm0,-0x10(%rbp)   ; aligned move of 16 bytes starting at 16 bytes below unaligned base pointer
  30:   48 85 ff            test   %rdi,%rdi
  33:   66 0f 7f 4d e0      movdqa %xmm1,-0x20(%rbp)   ; *** this will trap as well
...
The problem is that you did not declare the local variables used for aligned SSE/AVX accesses as being aligned. If this is an assembler routine you constructed, then the alignment is your responsibility. Most C/C++ compilers have a means to declare aligned data structures. The __m128i and __m256 types should be aligned; you may be casting an array to one of these types, which will not align the data location.
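As a rough illustration of declaring aligned data, here is a minimal userspace sketch using C11's _Alignas (GCC and Clang also accept __attribute__((aligned(16)))). The struct and function names are hypothetical and are not taken from the original code. A plain byte array carries no such guarantee, which is why casting one to a vector type can fault on aligned accesses.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block: _Alignas (C11) asks the compiler to place
 * every instance on a 16-byte boundary. */
struct block16 {
    _Alignas(16) unsigned char bytes[16];
};

_Static_assert(_Alignof(struct block16) == 16, "block16 must be 16-byte aligned");

/* Offset of p from the previous 16-byte boundary (0 means aligned). */
static unsigned misalignment16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Declares an aligned local variable and reports its misalignment. */
static unsigned aligned_local_misalignment(void)
{
    struct block16 blk;
    memset(blk.bytes, 0, sizeof blk.bytes);
    return misalignment16(&blk);
}
```

For a local variable, the compiler has to arrange the stack frame so the request is honored, which is the frame setup worth inspecting in the generated assembly.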
Self alignment may look like this:
All code
========
...
  1c:   0f 1f 44 00 00      nopl   0x0(%rax,%rax,1)
                            ; Function starts on next instruction
  21:   55                  push   %rbp                ; save (push) outer scope base pointer
                            ; %rbp is now a free register
                            ; *** do not perform the following in the stack register
                            lea    -0x1F(%rsp),%rbp    ; set base pointer to stack pointer - 31
                            and    $-0x20,%rbp         ; base pointer points to desired stack (32-byte aligned)
                            ; note (%rbp) change in next instruction
                            lea    -0x20(%rbp),%rsp    ; reserve 32 bytes on stack for local variables
...
Note: do not attempt to perform the alignment directly in the stack pointer. Should an external interrupt arrive before you complete the alignment, your thread will crash.
The above is a sketch. I suggest you write a sample function with two aligned __m256 variables as the first two arguments. Build the function, then see how the compiler generates the code for stack frame alignment.
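A sample function of that kind might look like the sketch below. It is an assumption-laden illustration, not the original poster's code: it uses the SSE2 __m128i type from the question rather than 256-bit AVX types, keeps the vectors as locals, and the name sample_mask16 is hypothetical. Compiling it to assembly (for example with gcc -O0 -S, where locals are typically spilled to the stack) shows which loads and stores the compiler treats as aligned and how it sets up the frame.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128, ... */
#include <stdint.h>

/* Hypothetical sample: keeps the low nibble of each of 16 bytes.
 * src and dst may have any alignment because the unaligned load/store
 * intrinsics are used; only the compiler-managed locals need alignment. */
void sample_mask16(uint8_t *dst, const uint8_t *src)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i data = _mm_loadu_si128((const __m128i *)src); /* unaligned load */

    data = _mm_and_si128(data, mask);
    _mm_storeu_si128((__m128i *)dst, data);               /* unaligned store */
}
```

Note the explicit cast to const __m128i * in the load; the intrinsic takes a vector pointer, not a byte pointer.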
Jim Dempsey
