My setup: Linux 3.13 kernel, gcc 4.8.2. Ubuntu on core i7
How to compile avx intrinsics in linux device driver? Any exact gcc compiler flags (makefile) and what header files to include in c source?
In order to use x86 intrinsics, you should include <x86intrin.h>
However, it uses mm_malloc.h which, in turn uses stdlib.h
Also, Linux explicitly disables all SIMD extensions in order to force
compiler not to use [XYZ]MM registers and you should use them really
So you need two things:
Here is an example which compiles for me for arbitrary driver (Makefile wasn't changed):
$ diff -p linux-3.15.Os/drivers/parport/ieee1284.c linux-3.15.O2/drivers/parport/ieee1284.c *** linux-3.15.Os/drivers/parport/ieee1284.c 2014-06-20 14:26:09.581496982 +0400 --- linux-3.15.O2/drivers/parport/ieee1284.c 2015-03-26 18:39:44.310801033 +0400 *************** *** 25,30 **** --- 25,39 ---- #include <linux/timer.h> #include <linux/sched.h> + #include <asm/i387.h> + + #pragma GCC push_options + #pragma GCC target ("mmx", "avx") + #define _MM_MALLOC_H_INCLUDED + # include <x86intrin.h> + #undef _MM_MALLOC_H_INCLUDED + #pragma GCC pop_options + #undef DEBUG /* undef me for production */ #ifdef CONFIG_LP_CONSOLE *************** int parport_poll_peripheral(struct parpo *** 166,171 **** --- 175,181 ---- * 10ms, waking up if an interrupt occurs. */ + #pragma GCC target ("mmx", "avx") int parport_wait_peripheral(struct parport *port, unsigned char mask, unsigned char result) *************** int parport_wait_peripheral(struct parpo *** 175,180 **** --- 185,194 ---- unsigned long deadline; unsigned char status; + kernel_fpu_begin (); + _mm256_zeroupper (); + kernel_fpu_end (); + usec = port->physport->spintime; /* usecs of fast polling */ if (!port->physport->cad->timeout) /* A zero timeout is "special": busy wait for the
And generates reasonable code:
0000000000000290 <parport_wait_peripheral>: 290: e8 00 00 00 00 callq 295 <parport_wait_peripheral+0x5> 295: 55 push %rbp 296: 48 89 e5 mov %rsp,%rbp 299: 41 57 push %r15 29b: 41 89 d7 mov %edx,%r15d 29e: 41 56 push %r14 2a0: 41 89 f6 mov %esi,%r14d 2a3: 41 55 push %r13 2a5: 41 89 d5 mov %edx,%r13d 2a8: 41 54 push %r12 2aa: 41 89 f4 mov %esi,%r12d 2ad: 53 push %rbx 2ae: 48 89 fb mov %rdi,%rbx 2b1: 48 83 ec 18 sub $0x18,%rsp 2b5: e8 00 00 00 00 callq 2ba <parport_wait_peripheral+0x2a> 2ba: 84 c0 test %al,%al 2bc: 0f 84 68 01 00 00 je 42a <parport_wait_peripheral+0x19a> 2c2: e8 00 00 00 00 callq 2c7 <parport_wait_peripheral+0x37> 2c7: c5 f8 77 vzeroupper ### WORKS ### 2ca: e8 00 00 00 00 callq 2cf <parport_wait_peripheral+0x3f>
However, this patch looks suspicious to me since it actually enables compiler to generate
AVX/SSE/MMX code in whole <parport_wait_peripheral> and this code will fail w/o
proper FPU guards.
BTW, I see no reason, why not to cover "mm_malloc.h" inclusion which stdlib check
in GCC source tree - will try to investigate.
inline assembler seems normal in Linux device driver. I found the following code: http://lxr.free-electrons.com/source/lib/raid6/avx2.c
There is a makefile in the raid6 directory, too. As I see they do not add any special compile flags as inline assembly should be clear enough for compiler when it comes to instructions.
When it comes to linux drivers, it's always a good idea, to look for existing drivers that do similar things or use similiar functions. The kernel_fpu_begin(); and kernel_fpu_end(); are quite important too because of the state. As far as I know basically kernel does not handle FPU state etc as most of the time FPU computing is not used, so you need to use this functions if you do.