SSE/AVX on multicore

magicfoot · ‎07-13-2011

My SSE code on a single Nehalem core shows a 100% speedup for my program. The speedup contributed by SSE decreases radically as one uses more cores on one CPU, i.e. when every rank of an MPI program runs the SSE code on the cores of one chip. If I run the same program on several Nehalems joined by InfiniBand and use only 1 core per CPU (x4 or x8), I can once again show that SSE provides a good speedup.

I believe that the SSE slowdown on a multicore CPU, compared with using cores from multiple CPUs, is due to saturation of the memory bandwidth. How can I prove or disprove this? Is there a simple way to detect when the chip's memory bandwidth saturates, for example by detecting VM page faults or something similar?

I would imagine that this effect is even more severe with AVX, as AVX will demand even more memory bandwidth. I am still stuck with the AVX emulator but will get round to testing it in practice when I figure out how to fit 1155 pins into a 1156 socket.

SHIH_K_Intel · ‎07-14-2011

As noted there may be a memory bandwidth problem in your current implementation. If the access pattern is inefficient relative to the cache hierarchy, the problem can manifest across any vector length or data element granularity.

With a cache line being 64 bytes long, capacity issues are strongly influenced by your access pattern. It is possible even for 4-byte-granular or 8-byte-granular access patterns to saturate the memory system. Generally, bandwidth saturation manifests in a non-linear manner, but not as a step function.

If you got a 2x speedup going from 8-byte-granular to 16-byte-granular access, you may want to double-check your hypothesis of saturating memory bandwidth.

If you got a 2x speedup going from 4-byte-granular to 16-byte-granular patterns, both your scalar and SSE implementations may already be at different spots on the saturation curve.
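One way to see where a run sits on the saturation curve is to time a simple streaming kernel while varying the number of processes per chip. The following is only a minimal sketch in the style of the STREAM triad, not the poster's code: `triad` and `triad_bandwidth_gbs` are hypothetical names, the arrays are assumed to be much larger than the last-level cache when used for a real measurement, and `clock()` is assumed to be a good enough timer for a single-threaded process.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad: reads b and c, writes a (counted as 24 bytes per element). */
static void triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

/* Hypothetical helper: time `reps` triad passes over arrays of n doubles and
 * return the achieved bandwidth in GB/s. Returns 0.0 if n is 0, allocation
 * fails, or the timer is too coarse to measure the run. */
static double triad_bandwidth_gbs(size_t n, int reps)
{
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double gbs = 0.0;

    if (n > 0 && a && b && c) {
        for (size_t i = 0; i < n; ++i) {
            a[i] = 0.0;
            b[i] = 1.0;
            c[i] = 2.0;
        }

        clock_t t0 = clock();
        for (int r = 0; r < reps; ++r)
            triad(a, b, c, 3.0, n);
        clock_t t1 = clock();

        volatile double sink = a[n / 2];  /* keep the stores observable */
        (void)sink;

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            gbs = 3.0 * sizeof(double) * (double)n * reps / secs / 1e9;
    }

    free(a);
    free(b);
    free(c);
    return gbs;
}
```

Running one copy per MPI rank and comparing the per-rank figure for 1, 2, 4, ... ranks on the same chip should show whether the per-core bandwidth falls off gradually, which would be consistent with saturation, or stays roughly flat.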

The first thing to consider is to change your access patterns to move away from the rising saturation curve. It may be possible to re-order your loop sequencing or re-arrange your data layout to feed the out-of-order core from cache instead of from memory, with the aid of hardware prefetchers.
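As an illustration of what re-ordering loops can mean, here is a small sketch under assumed conditions: a row-major matrix of doubles, with hypothetical function names `sum_strided` and `sum_contiguous`. Both compute the same sum, but the first touches a new cache line on almost every access once a row is 64 bytes or longer, while the second consumes each 64-byte line fully before moving on, which is the pattern the hardware prefetchers handle well.

```c
#include <stddef.h>

/* Inner loop walks down a column of a row-major matrix: the stride is
 * cols * sizeof(double) bytes, so consecutive accesses land on different
 * cache lines whenever cols >= 8. */
double sum_strided(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}

/* Same sum with the loops interchanged: the inner loop is unit-stride, so
 * all 8 doubles of each cache line are used before the next line is fetched. */
double sum_contiguous(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}
```

The two versions add the elements in a different order, so results can differ in the last bits for general floating-point data; the point of the sketch is the memory traffic, not the arithmetic.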

BTW, personally, I wouldn't skimp on a new $60-70 motherboard and risk ruining a good CPU :)

bradjordan111 · ‎09-09-2011

I have seen several posts where SSE support is mentioned with respect to AMD/Intel chips when using explicit OpenCL vectors. Speedups within my own code seem to agree.

Ex)
double2 d2 = (double2) (1.2, 3.4);
double2 d3 = 3.14*d2;

What I have read indicates that this code will be compiled into a single SSE instruction. I would assume that higher multiples of 128 bits would be broken up into multiple passes (i.e. double4, double8, float16, etc.).
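For comparison, this is roughly what the double2 multiply looks like when written by hand in C with SSE2 intrinsics. It is only a sketch: `scale_double2` is a hypothetical name, the scalar fallback is there so the code still builds where `__SSE2__` is not defined, and what a given OpenCL compiler actually emits is implementation-dependent.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Multiply both lanes of a 2-element double vector by the scalar k. */
void scale_double2(const double in[2], double k, double out[2])
{
#if defined(__SSE2__)
    __m128d v = _mm_loadu_pd(in);          /* load two doubles (128 bits)  */
    __m128d s = _mm_set1_pd(k);            /* broadcast k into both lanes  */
    _mm_storeu_pd(out, _mm_mul_pd(s, v));  /* one packed multiply (mulpd)  */
#else
    /* Scalar fallback for targets without SSE2. */
    out[0] = k * in[0];
    out[1] = k * in[1];
#endif
}
```

The multiply itself is a single packed instruction; the broadcast of the scalar and the load and store are separate operations around it.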

SSE, however, is limited to 128 bits (float4/double2 in one clock cycle), while AVX can handle up to 256 bits (float8/double4 in one clock cycle). I have looked around the internet and seen that AMD is going to be adding support for Intel's AVX instructions to their CPUs. (I don't know if this has already been completed.) When this happens, when will OpenCL add support for AVX-accelerated vector instructions?

Also, since OpenCL has already been ported to run on Intel CPUs, does the Stream SDK 2.3 support AVX-accelerated vector instructions on Intel (Sandy Bridge) processors? If not (given that Intel's Sandy Bridge CPU was released relatively recently), is this something which is scheduled to be added in the next SDK release, or something which will only be added when AMD chips support it?

thank you,
Brad
