Intel® ISA Extensions

Disable SSE* instructions

Hsunwei_H_
Beginner
2,351 Views

Hello,

I am trying to prevent GCC from generating SSE* related instructions. However, SSE uops are still observed using Oprofile.

I used the following GCC flags to do so:  -march=i386 -mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4.1 -mno-sse4.2 -mfpmath=387

Oprofile outputs:

        Event                       Count                  % time counted
        FP_COMP_OPS_EXE:0x1         3,554,989,165,876      26.67
        FP_COMP_OPS_EXE:0x2         5,571                  26.67
        FP_COMP_OPS_EXE:0x4         0                      26.67
        FP_COMP_OPS_EXE:0x8         18,729,332             26.67
        FP_COMP_OPS_EXE:0x10        0                      26.67
        FP_COMP_OPS_EXE:0x20        0                      26.67
        FP_COMP_OPS_EXE:0x40        0                      26.67
        FP_COMP_OPS_EXE:0x80        0                      26.68
        SIMD_INT_128                0                      26.67
        SIMD_INT_64                 0                      26.67
        SSEX_UOPS_RETIRED:0x1       56,507,076             26.67
        SSEX_UOPS_RETIRED:0x2       783,193                26.67
        SSEX_UOPS_RETIRED:0x4       0                      26.67
        SSEX_UOPS_RETIRED:0x8       47,643                 26.67
        SSEX_UOPS_RETIRED:0x10      39,842,775             26.67

The following outputs are from the execution using compilation flags: -march=native  (corei7)

        Event                       Count                  % time counted
        FP_COMP_OPS_EXE:0x1         226,437,482,607        26.67
        FP_COMP_OPS_EXE:0x2         4,922                  26.67
        FP_COMP_OPS_EXE:0x4         3,319,227,676,538      26.67
        FP_COMP_OPS_EXE:0x8         20,148,824             26.67
        FP_COMP_OPS_EXE:0x10        0                      26.67
        FP_COMP_OPS_EXE:0x20        3,318,136,953,656      26.67
        FP_COMP_OPS_EXE:0x40        0                      26.67
        FP_COMP_OPS_EXE:0x80        3,319,097,939,337      26.67
        SIMD_INT_128                0                      26.67
        SIMD_INT_64                 0                      26.67
        SSEX_UOPS_RETIRED:0x1       63,739,766             26.67
        SSEX_UOPS_RETIRED:0x2       1,042,903              26.67
        SSEX_UOPS_RETIRED:0x4       836,680,511            26.67
        SSEX_UOPS_RETIRED:0x8       7,691,378,757,211      26.67
        SSEX_UOPS_RETIRED:0x10      50,823,945             26.67

There is indeed some difference. However, I would like to prevent SSE from being used at all.

Is this possible? How can I do that?

Thanks!

6 Replies
Bernard
Valued Contributor I

I do not think you can disable SSEn instructions at the hardware level. I suppose you mean disabling the emission of SSEn code at the compiler level.

You may also be counting uops generated by other code. I would advise double-checking your results with VTune, unless Oprofile can track the instruction pointer.

Hsunwei_H_
Beginner

Thanks for the quick reply.

So it is possible that other non-SSE instructions can generate SSE-uops in hardware. Is there any documentation regarding what instructions will behave this way?

I was trying to assess the performance improvement provided by SSEn instructions. Is there any publication documenting this?

Thanks!

Vladimir_Sedach
New Contributor I

The compiler does not use the x87 FPU for float and double arithmetic, at least in x64 mode.
Take a look at http://en.wikipedia.org/wiki/X86-64
and find 'x87'.
You can easily check this by analyzing the assembly code generated for simple scalar operations.
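
A rough way to automate that check is to scan the assembly text for XMM register operands. The sketch below is only an illustration: it assumes the input is assembly text such as that produced by `gcc -S` or `objdump -d`, the function name `lines_using_xmm` and the sample lines are made up, and it only looks for `xmm` register names, so SSE instructions without an XMM operand (prefetches, fences) would be missed.

```python
import re

# Matches xmm0..xmm15, with or without the AT&T '%' prefix in front.
XMM_OPERAND = re.compile(r"\bxmm(?:1[0-5]|[0-9])\b", re.IGNORECASE)


def lines_using_xmm(asm_text):
    """Return (line_number, line) pairs whose operands mention an XMM register."""
    hits = []
    for number, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing '#' comments (GNU as, x86)
        if XMM_OPERAND.search(code):
            hits.append((number, line.strip()))
    return hits


# Hypothetical sample: two x87 instructions followed by two scalar SSE2 ones.
sample = """\
    fldl    8(%esp)
    faddl   16(%esp)
    movsd   %xmm0, -8(%rbp)
    addsd   -16(%rbp), %xmm0
"""

for number, line in lines_using_xmm(sample):
    print(number, line)
```

An empty result for the functions of interest would suggest the compiler flags took effect for that translation unit; it says nothing about code pulled in from libraries.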
 

MarkC_Intel
Moderator

Consider if your app is linking against any runtime libraries. The standard system runtime libraries will use SSE* instructions.

Bernard
Valued Contributor I

Vladimir is right. By reading the x86-64 ABI you can see that SSE instructions are generated for floating-point code.

>>>So it is possible that other non-SSE instructions can generate SSE-uops in hardware>>>

I do not think so. It could be other run-time libraries that take a different execution path.

McCalpinJohn
Honored Contributor III

I was able to get gcc to generate x87 code using cygwin under Windows 7.  The OS is 64-bit, but the cygwin/gcc compiler was probably generating a 32-bit executable.  The last time I looked at the assembly output (yesterday) it also included SSE instructions, and I have not tried to generate code without them.

With the Intel C compiler (v13 at least -- not sure about other versions), the option "-ffreestanding" will eliminate some calls to external libraries (such as the replacement of array copy operations with calls to an Intel-optimized memcpy function, which is a very reasonable place to find non-computational SSE instructions), but there is lots of external code over which you will have no control (e.g., printf).  I don't know if gcc does any similar idiom substitutions, but if you are doing whole-program monitoring in user+kernel space I would not be at all surprised to see some SSE instructions used for zeroing pages and such.

Some recent Linux kernels set the CR4.PCE processor configuration bit, which allows you to execute the RDPMC instruction in user mode.  If you do this before and after the specific code sections that you are interested in, you can avoid calls to external libraries (but depending on your environment, you may get counts from other processes if they are run on the core you are measuring during the measurement interval).   Using this approach on Sandy Bridge cores, I routinely see zero counts for the FP_COMP_OPS_EXE sub-events on codes compiled with AVX and zero for the SIMD_FP_256 (AVX) sub-events on codes compiled with SSE, but both of these events are limited to the computational operations, and I was not trying to run x87 code as a comparison.

The zero counts for all of the FP_COMP_OPS_EXE SSE sub-events (except for the SIMD integer sub-event) suggest that the options used have successfully suppressed SSE code generation for all the arithmetic portions of the code. The non-zero counts for SSEX_UOPS_RETIRED may be due to an erratum present in both the Nehalem and Westmere Core i7 processors. For Nehalem (Xeon 5500), erratum AAK8 says that event C7h (SSEX_UOPS_RETIRED) can count other types of retired instructions and therefore give higher than expected values. For Westmere (Xeon 5600), erratum BD4 says the same thing. So the SSEX_UOPS_RETIRED counts in the first case (with the x87 target) may indicate that SSE instructions are being executed, but they may also be spurious counts due to this erratum.
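
The reading of the FP_COMP_OPS_EXE counts can be reproduced mechanically. The sketch below copies the counts from the first table in the thread (the -march=i386 run) and splits the unit masks into zero and non-zero groups. The dictionary name is made up for illustration, and no meaning is assigned to the individual masks here beyond what the thread itself states.

```python
# FP_COMP_OPS_EXE counts from the -march=i386 run, keyed by unit mask.
fp_comp_ops_exe = {
    0x01: 3_554_989_165_876,
    0x02: 5_571,
    0x04: 0,
    0x08: 18_729_332,
    0x10: 0,
    0x20: 0,
    0x40: 0,
    0x80: 0,
}

# Split the unit masks by whether any events were counted.
nonzero = [hex(mask) for mask, count in fp_comp_ops_exe.items() if count]
zero = [hex(mask) for mask, count in fp_comp_ops_exe.items() if not count]

print("nonzero umasks:", nonzero)
print("zero umasks:   ", zero)
```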

When considering the impact of these SSE instructions, note that in the first set of results (with the x87 target), the count for the largest of the SSEX_UOPS_RETIRED sub-events is about 1/63,000th of the x87 operation count (56 million / 3.5 trillion). These counts may be due to the overcounting erratum or they may be due to actual SSE code being generated, but it is hard to imagine that one SSE instruction for every 63,000 x87 instructions is going to cause a detectable performance difference. I noticed that the largest count of SSEX_UOPS_RETIRED was for Umask 0x01, which is packed single-precision instructions, while both cases showed exactly zero packed single computational instructions. I have noticed that compilers almost always generate packed single SSE instructions for copying packed double data (I think the instruction has one less prefix byte than the packed double version?), so this pattern is consistent with the SSE instructions being used for data movement and not for computation.
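
The ratio quoted above can be checked directly from the first table. This is a small worked calculation, with variable names chosen only for illustration; the counts are the ones reported for the -march=i386 run.

```python
# x87 operation count: FP_COMP_OPS_EXE:0x1 in the -march=i386 run.
x87_ops = 3_554_989_165_876

# SSEX_UOPS_RETIRED counts from the same run, keyed by unit mask.
sse_uops = {
    0x01: 56_507_076,
    0x02: 783_193,
    0x04: 0,
    0x08: 47_643,
    0x10: 39_842_775,
}

# Find the largest sub-event and compare it with the x87 count.
largest_mask, largest_count = max(sse_uops.items(), key=lambda item: item[1])
ratio = x87_ops / largest_count

print(f"largest SSEX_UOPS_RETIRED umask: {largest_mask:#x}")
print(f"x87 operations per counted SSE uop: {ratio:,.0f}")
```

The result is on the order of 63,000 x87 operations per counted SSE uop, in line with the estimate above.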
