
Use of vector instructions

John_Campbell
New Contributor II

I have been attempting to investigate the benefit of AVX instructions on my i5 notebook, in comparison with my Xeon desktop, which does not support AVX instructions. The results I have obtained indicate that I am not realising the benefit of a 256-bit AVX processor.

My approach, which may be flawed, has been to generate different .obj files that utilise different instruction sets. I have written basic DO loops for a dot product ( s = s + a(i)*b(i) ) and for a vector addition ( b(i) = b(i) + const * a(i) ), and enclosed them in procedures with names identifying the compiler option, eg Subroutine Vec_Add_SSE ( b, a, const, n ).
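In outline, each routine is just a plain DO loop of the form below; only the compile options differ between the library files:

subroutine Vec_Add_AVX ( b, a, const, n )
   integer, intent(in)    :: n
   real(8), intent(in)    :: a(n), const
   real(8), intent(inout) :: b(n)
   integer :: i
   do i = 1, n                  ! vectorisation of this loop is left to the compiler option
      b(i) = b(i) + const * a(i)
   end do
end subroutine Vec_Add_AVX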
I have then compiled each of these files with appropriate compilation options, as:

echo =========================================================== >> ftn90.tce
now >>ftn90.tce
del *.obj
del *.mod
rem
rem build
ifort /c /O2 /Tf quick_test.f90 /free /QxHost         >> ftn90.tce  2>&1
ifort /c /O2 /Tf AVX_lib.f90    /free /QxAVX          >> ftn90.tce  2>&1
ifort /c /O2 /Tf DO_lib.f90     /free /Qvec-          >> ftn90.tce  2>&1
ifort /c /O2 /Tf F90_lib.f90    /free /QxHost         >> ftn90.tce  2>&1
ifort /c /O2 /Tf SSE_lib.f90    /free /QxSSE2         >> ftn90.tce  2>&1
ifort /c /O2 /Tf SSE4_lib.f90   /free /QxSSE4.2       >> ftn90.tce  2>&1
ifort /c /O2 /Tf clock_qp.f90   /free /QxHost         >> ftn90.tce  2>&1

del quick_test.map
del quick_test.exe
ifort *.obj /exe:quick_test.exe /map:quick_test.map   >> ftn90.tce  2>&1

quick_test                                            >> ftn90.tce  2>&1

notepad ftn90.tce

The test is based on performing multiple calls to dot_product or vector addition, as occur in linear equation solution. I use two 2D arrays of varying size, for n = 100, 2100, 500 (i.e. n = 100 to 2100 in steps of 500). As the arrays are 2D and the sizes are a multiple of 4, there should not be any alignment problems (should there?).

When I run this test I do get a difference between /Qvec- and the other options, but little difference between the SSE2 and AVX options.
The AVX option provides about a 40% reduction in run time over /Qvec-, nothing like the 4x improvement I was hoping for.

Is this the performance improvement I should expect from an i5 notebook processor, or am I not activating the AVX instructions with this test approach?
Does the approach of mixing the compilation options /QxSSE2, /QxSSE4.2 and /QxAVX produce the result I am trying to achieve?

Attached is the build and test for both the Xeon and i5 machines.

John

The following is a trace of the run on the i5.


 Vec_SUM   600    5.49 mb
set A using Random     0.0077
set b : dot_product     0.1126   5.398E+07 /QxHost dot product intrinsic in 2 loops
set b : Vec_Sum_SSE2    0.1197   0.000E+00 /QxSSE2
set b : Vec_Sum_SSE4    0.1195   0.000E+00 /QxSSE4.2
set b : Vec_Sum_AVX     0.1428   9.915E-09 /QxAVX
set b : Vec_Sum_F90     0.1257   0.000E+00 using array syntax and /QxHost
set b : Vec_Sum_DO      0.2460   3.095E-08 using /Qvec-
 =========== End ==============
 
 Vec_ADD   600    8.24 mb
set A    0.0051
set b : vector loop     0.1293   5.408E+07 using 3 loops and /QxHost
set b : Vec_Sub_SSE     0.1287   0.000E+00 /QxSSE2
set b : Vec_Sub_SSE4    0.1302   0.000E+00 /QxSSE4.2
set b : Vec_Sub_AVX     0.1124   0.000E+00 /QxAVX
set b : Vec_ADD_AVX     0.1118   0.000E+00 /QxAVX
set b : Vec_Sub_F90     0.1295   0.000E+00 using array syntax and /QxHost
set b : Vec_Sub_DO      0.1997   0.000E+00 /Qvec-
 =========== End ==============
 
 Vec_SUM  2100   67.29 mb
set A using Random     0.0554
set b : dot_product     5.5604   2.316E+09 /QxHost dot product intrinsic in 2 loops
set b : Vec_Sum_SSE2    5.8971   0.000E+00 /QxSSE2
set b : Vec_Sum_SSE4    5.8415   0.000E+00 /QxSSE4.2
set b : Vec_Sum_AVX     5.7882   6.878E-07 /QxAVX
set b : Vec_Sum_F90     5.8383   0.000E+00 using array syntax and /QxHost
set b : Vec_Sum_DO     10.6352   2.271E-06 using /Qvec-
 =========== End ==============
 
 Vec_ADD  2100  100.94 mb
set A    0.0542
set b : vector loop     6.8352   2.316E+09 using 3 loops and /QxHost
set b : Vec_Sub_SSE     6.8283   0.000E+00 /QxSSE2
set b : Vec_Sub_SSE4    6.8301   0.000E+00 /QxSSE4.2
set b : Vec_Sub_AVX     6.0705   0.000E+00 /QxAVX
set b : Vec_ADD_AVX     6.0874   0.000E+00 /QxAVX
set b : Vec_Sub_F90     6.8151   0.000E+00 using array syntax and /QxHost
set b : Vec_Sub_DO      9.2732   0.000E+00 /Qvec-
 =========== End ==============
 
I am using the Intel(R) 64 compiler, Version 12.1.5.344, on Windows 7.
ifort is installed on a Xeon W3520 (which is what /QxHost targets there).

 

TimP
Honored Contributor III

Running this on my "refurbished" i5-4200U laptop, it looks like the 2100 cases may be memory-bandwidth limited (or at least limited by the bandwidth of L2 and L3 cache access), so I don't see much difference among the instruction sets.

SSE4.1 appears to be faster than SSE4.2 in some cases.

I think your comparisons may be more interesting with /align:array32byte unless there is a reason this option can't be used.  Remember that the default alignment is 16 bytes, and even SSE may see a significant advantage from 32-byte alignment.

If you are interested in exploring the potential of AVX, besides setting data alignments, you will need alignment assertions where the alignments aren't visible to the compiler.  These have the greatest effect on the shorter loops which fit in cache.  I see 20% speedups from that alone in some of your cases.

ifort doesn't perform the detailed optimizations which would be needed for maximum in-cache performance of /QxHost on my i7-4 laptop.  Laptop targets realistically won't benefit from such special treatment.  How many laptops are sold on Fortran performance?

!dir$ vector aligned is the longest standing ifort loop alignment assertion.  Current compilers should offer alternatives including !$omp simd aligned([operand list]) which will be recognized by other new compilers.
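For example, applied to the vector-add loop from the original post (only safe if a and b really are 32-byte aligned, e.g. via /align:array32byte; otherwise the assertion can produce failures):

!dir$ vector aligned
   do i = 1, n
      b(i) = b(i) + const * a(i)
   end do

The portable form would be !$omp simd aligned(a,b:32) immediately before the loop.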

Even magic compilation options don't change the basic design of the AVX hardware, in which only 128 bits can be stored per instruction cycle, while both SSE and AVX can load 2 separate 64- or 128-bit operands per cycle.  So there will be many situations where SIMD will be limited to a vector speedup of 2x even with cache locality.

 Vec_SUM   600    5.49 mb
set A using Random     0.0098
set b : dot_product     0.1239   5.398E+07 /QxHost dot product intrinsic in 2 loops
set b : Vec_Sum_SSE2    0.1154   8.675E-09 /QxSSE2
set b : Vec_Sum_SSE4    0.1161   8.675E-09 /QxSSE4.2
set b : Vec_Sum_AVX     0.1398   9.357E-09 /QxAVX
set b : Vec_Sum_F90     0.0944   0.000E+00 using array syntax and /QxHost
set b : Vec_Sum_DO      0.1985   8.892E-09 using /Qvec-
 =========== End ==============

 Vec_ADD   600    8.24 mb
set A    0.0056
set b : vector loop     0.1550   5.408E+07 using 3 loops and /QxHost
set b : Vec_Sub_SSE     0.1245   2.909E-09 /QxSSE2
set b : Vec_Sub_SSE4    0.1572   2.909E-09 /QxSSE4.2
set b : Vec_Sub_AVX     0.1124   2.909E-09 /QxAVX
set b : Vec_ADD_AVX     0.1151   2.909E-09 /QxAVX
set b : Vec_Sub_F90     0.0940   0.000E+00 using array syntax and /QxHost
set b : Vec_Sub_DO      0.2154   2.909E-09 /Qvec-
 =========== End ==============

 

John_Campbell
New Contributor II

Tim,

Thanks very much for your comments. You have addressed a few issues that I should consider more.

From my position as a Fortran programmer, I want to know what I can achieve using /QxAVX and from this latest test it is not very much.

Your results show the best is achieved by using array syntax and /QxHost, which is better than 2x the /Qvec- performance. All selections of instruction sets appear to perform slightly worse, although that may be due to using a DO loop rather than array syntax.

The reason for this test example is I have been trying to see what performance improvements I can achieve in direct equation solution of large sets of linear equations.

OpenMP appears to hit memory bandwidth limits with the Gauss algorithm, while the Crout algorithm is not easily adapted to OpenMP.

That makes vector instructions appear to be an easier alternative, but my tests to date have not achieved better than the 2x which is already available from SSE2.

I would have thought that the two calculations of Vec_Sum and Vec_Add would be most suited to the AVX instruction set. I retained DO loop syntax as I thought the compiler could optimise this for the selected instruction set.

It is interesting you are again identifying memory accessing speeds as a limitation, which was a significant limitation for OpenMP.

I shall look more closely at smaller tests that fit in cache, and look to control alignment, to see if I can achieve the 4x performance target. I might try to compile the main quick_test.f90 using /QxAVX. (Should /QxAVX imply 32-byte alignment?)

Your comment “How many laptops are sold on Fortran performance?” is worth noting. While this would be limited to us few Fortran users, my choice of notebook and ifort compiler was certainly made with the aim of testing what I could achieve with the new features. Unfortunately the compiler and hardware I am using are not the latest, which is a consequence of expenditure cut-backs over the last few years. Keeping up with the latest hardware is always a budget challenge (and difficult to justify if the results don't follow).

Thanks again for your comments and I will certainly try to investigate the issues you have highlighted.

John

TimP
Honored Contributor III

I can think of just a few typical situations where array assignments pose difficulty for auto-vectorization:

a) cases of possible data overlap where ifort may unnecessarily use a temporary array: these have decreased over the years

b) multiple assignments where fusion is needed to optimize memory access: ifort does better than competing compilers, particularly when helped by alignment assertions.  I still prefer to write DO loops which don't depend on compiler fusion.

c) OpenMP still doesn't work well with array assignments in the parallelized loop.  omp parallel workshare was intended for that purpose but has fallen short.

OpenMP isn't an alternative to vectorization.  In fact, OpenMP 4.0 has new features designed to help with vectorization, both with and without combined threaded parallelism.  Increasing parallel capability in recent CPU architectures has increased the importance of combining several levels of parallelism.  The slogan of 2 decades ago, "concurrent outer, vector inner" has come back as parallel outer, vector inner for nested loops, often with another outer level of parallelism such as MPI.

It's difficult to get the best out of many standard algorithms which depend on effective combined vectorization and threaded parallelism.  Thus the existence for decades of performance libraries such as Intel MKL.

32-byte data alignments are required for full performance of AVX-256.  Compilers vary in their ability to choose AVX-128 for cases where that may be the better way to deal with lesser alignments.  AVX compile options don't imply 32-byte alignment; that's the reason for ifort's introduction of /align:array32byte and alignment assertions.  Perhaps surprisingly, 32-byte alignments were already important for the original Core i7 before AVX became available.
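For example, appended to one of the build lines from the original post:

ifort /c /O2 /Tf AVX_lib.f90 /free /QxAVX /align:array32byte >> ftn90.tce 2>&1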

John_Campbell
New Contributor II

Tim,

When was /align:array32byte introduced? I assume this applies to ALLOCATE'd arrays.

Can you test that this is achieved by calculating mod ( loc (array), 32 ) and testing if it equals 0 (or 1?)

To overcome this problem, I could allocate a larger vector, do the MOD test, and transfer to an appropriately aligned address.
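Something like the following is what I have in mind (a sketch only; loc is an ifort extension, and the names are illustrative):

real(8), allocatable, target :: buffer(:)
real(8), pointer             :: b(:)
integer(8) :: addr, skip
allocate ( buffer(n+4) )               ! 4 spare real(8) elements = 32 bytes of slack
addr = loc ( buffer(1) )
skip = mod ( 32_8 - mod ( addr, 32_8 ), 32_8 ) / 8   ! whole elements to the next 32-byte boundary
b => buffer(1+skip : n+skip)           ! b(1:n) now starts on a 32-byte boundary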

John

John_Campbell
New Contributor II

I applied a change to fix the alignment problem. The resulting array memory addresses are 32 byte aligned.

The resulting run times did not show a significant change, although there was a small improvement on average.

John

TimP
Honored Contributor III

John Campbell wrote:

Tim,

When was /align:array32byte introduced? I assume this applies to ALLOCATE'd arrays.

Can you test that this is achieved by calculating mod ( loc (array), 32 ) and testing if it equals 0 (or 1?)

To overcome this problem, I could allocate a larger vector, do the MOD test, and transfer to an appropriately aligned address.

John

ifort had /align:array32byte by version 13.0.   Earlier versions had local align directives.

It doesn't (yet) work for labeled COMMON.  It should work for allocate.

Tests with loc (or the standard-compliant c_loc) and mod would work to verify alignment.  If you didn't trust the compiler to optimize this case of mod, you could instead mask the low bits of the integer address with iand ( addr, 31 ).
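A self-contained sketch of such a check using standard features (the transfer of a c_ptr to an integer is the usual idiom, though strictly implementation-dependent):

program check_align
   use iso_c_binding
   implicit none
   real(8), allocatable, target :: a(:)
   integer(c_intptr_t) :: addr
   allocate ( a(1000) )
   addr = transfer ( c_loc ( a(1) ), addr )        ! address of a(1) as an integer
   print *, '32-byte aligned: ', iand ( addr, 31_c_intptr_t ) == 0
end program check_align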

When I began working at Intel, explicitly allocating space and adjusting to an even address was advocated routinely.  The message got through eventually that less cumbersome methods are needed.

Since you mentioned OpenMP, in a parallel loop, several steps would be needed to get benefit of alignment.  Applying an alignment assertion may cause the code to fail if not all steps are implemented.  

a) the entire array is aligned

b) the loop count is a multiple of the number of threads times 32 bytes' worth of elements (64 bytes to cover future AVX-512), so that each thread's chunk starts aligned

So it may frequently be impractical to align parallel vectorized loops.  As a result, some OpenMP chunks may be aligned better than others.  

The current compilers have an additional directive, !dir$ vector unaligned, which requests generation of code that takes care of unaligned data without using a scalar peeling adjustment.   This might even out the variation in time taken by OpenMP chunks of varying alignment.

John_Campbell
New Contributor II

Tim,

Could you test the revised quick_test.f90 from #6 and see if it improves performance on the i7-4 laptop?
I don't get any significant improvement when this alignment is applied; there is no indication that 256-bit instructions are being utilised.
That's assuming I have the correct approach to confirming and adjusting alignment.

John

TimP
Honored Contributor III

I have seen the tactic of 32-byte alignment of arrays passed to VC++ /arch:AVX (VS2012) produce significant performance improvement (for cases where a loop stores into several arrays), but in your case most of the improvement (25% in a few cases) required that I also add the !dir$ vector aligned directives in the test loops.

The alignment improvements aren't necessarily reproducible, as you have about 50% probability of good alignment even if you allow it to default to 16-byte alignment.  This is sometimes a source of performance changes between compiler versions or as an accidental side effect of your own source code changes elsewhere.  Those can't be blamed on or credited to the compiler or hardware.

John_Campbell
New Contributor II

Tim,

I have further tested changes to my example code to adjust the alignment, but I do not get any significant improvement in the AVX performance I am achieving. I can confirm AVX instructions are present, as the resulting .exe will not run on my Xeon processor.

I am still unable to achieve any benefit of using /QxAVX in comparison to /QxSSE2.

At the moment I can't put forward a recommendation to buy AVX-capable PCs if I can't get AVX to work.

John

John_Campbell
New Contributor II

I have continued to investigate why I don’t appear to get AVX instruction performance improvements and also OpenMP performance improvements.

Considering Jim’s and Tim’s comments about a memory access bottleneck, I have investigated the impact of memory footprint on vector and OpenMP performance improvements, with some success.

I have taken a recent example I have posted of OpenMP usage and run it for:

  • Different vector instruction compilation options, and
  • Varying memory usage footprints.

I have also carried out these tests on two processors that I have available:

  • Intel® Xeon® W3520 @ 2.67 GHz with 8 MB cache, 12.0 GB memory and 120 GB SSD
  • Intel® Core™ i5-2540M @ 2.60 GHz with 3 MB cache, 8.0 GB memory and 128 GB SSD

For those who know processors, both of these are cheap and have relatively modest performance for their class, so I am investigating the performance improvements that can be achieved on low-spec Intel processors.

Apart from processor class (which determines the available instruction set) and processor clock rate, the other important influences on performance are:

  • Processor cache size (8 MB and 3 MB)
  • Memory access rate (1066 MHz and 1333 MHz)

I presume cache size is defined by the processor chip, while memory access rate is defined by the PC manufacturer?

Unfortunately, I am not lucky enough to test a range of these options, but perhaps others can.

 

Compiler Options

I have investigated compiler options for vector instructions and for OpenMP calculation; six option combinations have been used (each vector option, with and without OpenMP). For vector instructions I have used:

/O2 /QxHost  (best vector instructions available: AVX on the i5, SSE4.2 on the Xeon)

/O2 /QxSSE2  (limit vector instructions to SSE2)

/O2 /Qvec-   (no vector instructions)

These have been combined with /Qopenmp to identify the combined performance improvement that could be possible.

Memory Options

A range of memory footprints from 0.24 MB up to 10 GB have been tested, although the performance levels out once the memory usage footprint significantly exceeds the cache capacity.

For subsequent tests, the array dimension N is increased by 25% for each successive test:

     x  = x * 1.25

     n(i) = nint (x/4.) * 4   ! adjust for 32 byte boundary

     call matrix_multiply ( n(i), times(:,i), cycles(i) )

 

The sample program I am using calculates a matrix multiplication in real(8), [C] = [C] + [A]'.[B], i.e. C(i,j) = C(i,j) + the sum over k of A(k,i)*B(k,j).

The advantage of this computation is that OpenMP can be applied at the outer loop, providing maximum efficiency for potential multi-processing. When run, it always achieves about 99% CPU in Task Manager.

For small matrix sizes, the matrix multiply computation is cycled, although the OpenMP loop is inside the cycle loop. This appears to be working, with a target elapsed time of at least 5 seconds (10 billion operations) being achieved.
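In outline the kernel looks like this (a simplified sketch; the names differ from the attached code):

subroutine matrix_multiply_core ( c, a, b, n )
   integer, intent(in)    :: n
   real(8), intent(in)    :: a(n,n), b(n,n)
   real(8), intent(inout) :: c(n,n)
   integer :: i, j
!$omp parallel do private(i)          ! parallelise the outer (column) loop
   do j = 1, n
      do i = 1, n
         c(i,j) = c(i,j) + dot_product ( a(:,i), b(:,j) )
      end do
   end do
!$omp end parallel do
end subroutine matrix_multiply_core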

Idealised Results

From an idealised estimate of performance improvement:

SSE2 should provide a 2x improvement over /Qvec-.

AVX should provide a 4x improvement (assuming 256-bit operation; the matrix size has been made a multiple of 4 so that dot_product calls start on 32-byte alignment).

OpenMP should provide a 4x improvement for 4 CPUs.

This potentially implies up to 16x for /QxAVX /Qopenmp, although this never occurs!

Actual Results

Results have been assessed based on run time (QueryPerformanceCounter).

Performance has also been calculated as Mflops (millions of floating-point operations per second), where I have defined a single floating-point operation as “s = s + A(k,i)*B(k,j)” (one multiplication), although this could be counted as 2 operations, as there is now little difference in cost between multiplication and addition.
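On that definition, for an n x n multiply repeated over a number of cycles (names as in the sizing snippet above), the reported figure is effectively:

   Mflops = cycles * real(n)**3 / ( run_time * 1.0e6 )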

Performance has also been reported as a ratio relative to /O2 /Qvec- at the same memory size. This gives a relative performance factor for the vector or OpenMP improvement.

The results show that some performance improvement is being achieved by vector or OpenMP computation, but nowhere near as good as the ideal case.

While OpenMP always shows a 4x increase in CPU usage, the run-time improvement is typically much less. This is best assessed by comparing the run-time performance of OpenMP against the single-CPU performance.

The biggest single influence on the achieved performance improvement is the memory footprint. For the hardware I am using there is little (if any!) improvement from AVX instructions once the calculation no longer fits in cache. My previous testing with large memory footprints appeared to show that I was not getting the AVX instructions to work.

I would have to ask: does AVX work for non-cached computation? I have not seen it do so with my hardware. Also, if AVX instructions only pay off from the cache, what is all the talk about alignment? I do not understand the relationship between memory alignment and cache alignment of vectors.

Any AVX benefit for non-cached calculations can be masked by the memory access speed. I need to test on hardware with faster memory access.

Reporting

I am preparing some tables and charts of the performance improvements I have identified.

The vertical axis is Mflops or the performance ratio.

The horizontal axis is memory footprint, plotted on a log scale; this makes the impact of cache, and the lack of AVX benefit for large-memory runs, easy to see.

Summary

  • The memory access bottleneck is apparent, but I don’t know how to overcome it.
  • From the i5 performance, AVX performance does not appear to be realised.
  • The influence of cache size and memory access rates can be seen in the Mflop charts below.

These tests probably show that notionally better processors are only any good if the main performance bottleneck is the one those processors improve. At the stage I have tested, I appear to get minimal benefit from AVX because of the memory access rate limit.

I would welcome any comments on these results, and hope people can run the tests on alternative hardware configurations, noting the main hardware features identified above, including cache size and memory access speed.

( see attached document for charts and test suite )

John

( charts in the attached document: i5 Mflop results, Xeon Mflop results, i5 performance improvement, Xeon performance improvement )

TimP
Honored Contributor III

If you were trying to maximize performance of matrix multiplication, I suppose you would use some of the options built into ifort and other compilers, including /align:array32byte, the MKL library, or MATMUL optimization with /O3 or /Qopt-matmul.
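A sketch of a build line combining those options (not verified against your compiler version):

ifort /O3 /Qopt-matmul /align:array32byte quick_test.f90

or, alternatively, write the kernel with the MATMUL intrinsic (or call MKL's dgemm directly) and let the compiler substitute the tuned library version.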

It looks as if you have already applied most reasonable optimization short of using the special matrix multiplication facilities.

Matrix multiplication offers more opportunities than many applications to improve performance by cache blocking, which is about the only likely way to economize on memory bandwidth.  As I think you understand, when you set up your test to be limited by memory bandwidth it doesn't make much difference how much instruction level parallelism you engage.
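As an illustration of the blocking idea, a textbook sketch of C = C + A.B (not tuned; nb would be chosen so the blocks fit in cache):

subroutine matmul_blocked ( c, a, b, n, nb )
   integer, intent(in)    :: n, nb
   real(8), intent(in)    :: a(n,n), b(n,n)
   real(8), intent(inout) :: c(n,n)
   integer :: i, j, k, jj, kk
   do jj = 1, n, nb                    ! block over columns of C and B
      do kk = 1, n, nb                 ! block over the summation index
         do j = jj, min ( jj+nb-1, n )
            do k = kk, min ( kk+nb-1, n )
               do i = 1, n             ! innermost unit-stride loop vectorises
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
               end do
            end do
         end do
      end do
   end do
end subroutine matmul_blocked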

The newer "Haswell" CPUs should exhibit better L2 cache performance, but taking advantage of the higher peak flops/clock rate depends on the new optimizations built into MKL.  It's difficult for a compiler to use the new instructions with consistent effectiveness; for example gfortran and ifort excel on different benchmarks or different approaches to the same benchmark.

TimP
Honored Contributor III

Dual CPU platforms support a significantly higher memory bandwidth usage in NUMA mode by balancing memory access between CPUs with multiple threads (but not likely to double the bandwidth achieved by a single thread).

I don't know the current state of the market, but a single socket workstation labeled Xeon by one OEM is not necessarily different from one labeled corei7 by another.

The first Core i7 to include AVX was the Core i7-2 ("Sandy Bridge").  While the dual-CPU Xeon server platforms lagged the single-CPU Core i7 in the introduction of newer CPU generations, AVX servers have been on the market for 2 years, and the Haswell Core i7-4 servers are close to introduction.  These improve performance of AVX applications with a medium cache footprint (offering fairly complete support for 256-bit data movement in L1 and L2), but obviously don't eliminate the role of cache in taking advantage of AVX.

Memory performance continually increases.  DDR3-1867 on the Core i7-3 Xeon server showed an improvement over the DDR3-1333 which was in use for some time on Xeon, not to mention the DDR3-1066 typically used with the first Core i7.  Again, on real applications, you won't see performance fully proportional to memory speed.

If your applications are all performance limited by memory bandwidth, you would choose the least expensive models which offer full memory speed, as additional cores, or higher CPU clock speed multipliers, wouldn't pay off.

John_Campbell
New Contributor II

Tim,

Thanks for your comments.

I have chosen matrix multiplication as it is the least complicated calculation I can find that works with OpenMP, and Dot_Product will be one of the basic calculations that I continue to use. I have assumed in the tests that Dot_Product responds to the /QxHost, /QxSSE2 and /Qvec- compiler requests, rather than substituting a separately optimised calculation.

What I am finding is that AVX instructions are not providing any significant benefit on the i5, because either the Dot_Product intrinsic does not use them with /QxHost on the i5, or I am always hitting the memory access speed limits. For a long time I have assumed that AVX might double the performance of SSE2, but I have not identified any significant performance improvement. AVX looks more like marketing than substance.
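One check I could make is the compiler's vectorisation report (assuming I have the right option for this compiler version):

ifort /c /O2 /QxHost /Qvec-report2 quick_test.f90

which reports which loops were vectorised (though not which instruction set they use).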
Caching has some benefit, but this runs out very quickly as the problem size increases. I don't know of any way to guide which arrays should be cached, although I doubt I'd get that right!

What I am concluding is that the memory access rate is the most significant influence on performance. Might it be an improvement if memory access could be parallelised, especially for reads, when doing OpenMP?

At present the organisation I work for has simplified its PC purchases and buys only Xeon workstations. I have been suggesting that the Core i7 would be better as it has AVX. I don't have much to prove my case at the moment.

The aim of this study has been to investigate the benefit of vector and parallel computation for low-end PCs. i7 chips notionally offer 8x from the CPUs (threads) and 4x from AVX, which could offer a low-cost, higher-performance solution, but again they are limited by memory access rates. I have not been very successful so far on the Core i5, as the best I am getting is 2x for the problem size I am targeting.

It would be good if others could run the test example on different hardware (especially noting the memory access speed) to confirm whether this remains the bottleneck.
(An updated zip file of a simplified batch-file test is attached, if others could run it to demonstrate performance on alternative hardware configurations. The file openmp_test.log summarises the performance; please return it.)

Again Tim, thanks for your comments, and I'd love to know where I am going wrong.

John
