Showing results for 
Search instead for 
Did you mean: 

How to compile avx intrinsics in linux device driver?

My setup: Linux 3.13 kernel, gcc 4.8.2. Ubuntu on core i7

How to compile avx intrinsics in linux device driver? Any exact gcc compiler flags (makefile) and what header files to include in c source?


0 Kudos
3 Replies


In order to use x86 intrinsics, you should include <x86intrin.h>

However, it uses mm_malloc.h which, in turn uses stdlib.h

Also, Linux explicitly disables all SIMD extensions in order to force

compiler not to use [XYZ]MM registers and you should use them really


So you need two things:

  1. Include intrinsics header w/o stdlib stuff. You may cheat preprocessor by defining mm_malloc guard before "x86intrin.h" inclusion.
  2. Minimize AVX-enabled area. You are able to enable some extensions only for given routine using "GCC target ()" pragma.

Here is an example which compiles for me for arbitrary driver (Makefile wasn't changed):

$ diff -p linux-3.15.Os/drivers/parport/ieee1284.c linux-3.15.O2/drivers/parport/ieee1284.c
*** linux-3.15.Os/drivers/parport/ieee1284.c    2014-06-20 14:26:09.581496982 +0400
--- linux-3.15.O2/drivers/parport/ieee1284.c    2015-03-26 18:39:44.310801033 +0400
*** 25,30 ****
--- 25,39 ----
  #include <linux/timer.h>
  #include <linux/sched.h>

+ #include <asm/i387.h>
+ #pragma GCC push_options
+ #pragma GCC target ("mmx", "avx")
+ #  include <x86intrin.h>
+ #pragma GCC pop_options
  #undef DEBUG /* undef me for production */

*************** int parport_poll_peripheral(struct parpo
*** 166,171 ****
--- 175,181 ----
   *    10ms, waking up if an interrupt occurs.

+ #pragma GCC target ("mmx", "avx")
  int parport_wait_peripheral(struct parport *port,
                            unsigned char mask,
                            unsigned char result)
*************** int parport_wait_peripheral(struct parpo
*** 175,180 ****
--- 185,194 ----
        unsigned long deadline;
        unsigned char status;

+       kernel_fpu_begin ();
+       _mm256_zeroupper ();
+       kernel_fpu_end ();
        usec = port->physport->spintime; /* usecs of fast polling */
        if (!port->physport->cad->timeout)
                /* A zero timeout is "special": busy wait for the


And generates reasonable code:

0000000000000290 <parport_wait_peripheral>:
 290:   e8 00 00 00 00          callq  295 <parport_wait_peripheral+0x5>
 295:   55                      push   %rbp
 296:   48 89 e5                mov    %rsp,%rbp
 299:   41 57                   push   %r15
 29b:   41 89 d7                mov    %edx,%r15d
 29e:   41 56                   push   %r14
 2a0:   41 89 f6                mov    %esi,%r14d
 2a3:   41 55                   push   %r13
 2a5:   41 89 d5                mov    %edx,%r13d
 2a8:   41 54                   push   %r12
 2aa:   41 89 f4                mov    %esi,%r12d
 2ad:   53                      push   %rbx
 2ae:   48 89 fb                mov    %rdi,%rbx
 2b1:   48 83 ec 18             sub    $0x18,%rsp
 2b5:   e8 00 00 00 00          callq  2ba <parport_wait_peripheral+0x2a>
 2ba:   84 c0                   test   %al,%al
 2bc:   0f 84 68 01 00 00       je     42a <parport_wait_peripheral+0x19a>
 2c2:   e8 00 00 00 00          callq  2c7 <parport_wait_peripheral+0x37>
 2c7:   c5 f8 77                vzeroupper ### WORKS ###
 2ca:   e8 00 00 00 00          callq  2cf <parport_wait_peripheral+0x3f>

However, this patch looks suspicious to me since it actually enables compiler to generate

AVX/SSE/MMX code in whole <parport_wait_peripheral> and this code will fail w/o

proper FPU guards.

BTW, I see no reason, why not to cover "mm_malloc.h" inclusion which stdlib check

in GCC source tree - will try to investigate.

Black Belt

It may be easier to use inline assembler in a Linux device driver than to use the AVX intrinsics.   This will eliminate the header file problems.



inline assembler seems normal in Linux device driver. I found the following code:

There is a makefile in the raid6 directory, too. As I see they do not add any special compile flags as inline assembly should be clear enough for compiler when it comes to instructions.

When it comes to linux drivers, it's always a good idea, to look for existing drivers that do similar things or use similiar functions. The kernel_fpu_begin(); and kernel_fpu_end(); are quite important too because of the state. As far as I know basically kernel does not handle FPU state etc as most of the time FPU computing is not used, so you need to use this functions if you do.