I've noticed that the VML functions provide a way to control accuracy - High Accuracy (HA), Low Accuracy (LA), and Enhanced Performance (EP) - via vmlSetMode (http://www.intel.com/software/products/mkl/data/vml/vmldata.htm). I was wondering whether other library functions support this. In particular, I am trying to speed up dense matrix multiplication and was wondering whether that is possible via vmlSetMode or a similar function.
Thanks,
Nick
3 Replies
You have a limited number of options for trading accuracy for performance in matrix multiplication. If your application promotes the data from Fortran single/C float to double for the matrix multiplication, you could remove that promotion and use the single-precision version, e.g. SGEMM rather than DGEMM.
In marginal cases, you could substitute a version of SGEMM that uses extra-precision dot-product accumulation for each result; in many cases that may be as time-consuming as DGEMM, so it's generally a do-it-yourself undertaking.
You would also want to check that your threading and affinity arrangement is close to optimal for your problem size, and use source-inlined versions (e.g. Fortran MATMUL) when your case is too small to benefit from MKL. MKL is specifically optimized for various CPU models, so you may want to check how that is working out.
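To make the trade-off concrete, here is a small numpy sketch (not MKL code, just an illustration of the numeric effect): it compares plain single-precision accumulation, the extra-precision-accumulation variant (accumulate in double, round the final result back to single), and a double-precision reference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
a = rng.random((n, n), dtype=np.float32)
b = rng.random((n, n), dtype=np.float32)

# Plain single-precision product (what SGEMM computes).
c_single = a @ b

# Extra-precision accumulation: accumulate in double,
# round only the final result back to single.
c_extra = (a.astype(np.float64) @ b.astype(np.float64)).astype(np.float32)

# Double-precision reference (what DGEMM would give).
c_ref = a.astype(np.float64) @ b.astype(np.float64)

err_single = np.max(np.abs(c_single - c_ref))
err_extra = np.max(np.abs(c_extra - c_ref))
# Extra-precision accumulation is much closer to the reference: its only
# error is the final rounding, not n rounded additions per element.
```

Of course, the promotion to double here is exactly the cost tim18 describes, which is why a hand-rolled extra-precision SGEMM may not end up faster than DGEMM.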
Quoting - tim18
You have a limited number of options for trading accuracy for performance in matrix multiplication. If your application promotes the data from Fortran single/C float to double for the matrix multiplication, you could remove that promotion and use the single-precision version, e.g. SGEMM rather than DGEMM.
In marginal cases, you could substitute a version of SGEMM that uses extra-precision dot-product accumulation for each result; in many cases that may be as time-consuming as DGEMM, so it's generally a do-it-yourself undertaking.
You would also want to check that your threading and affinity arrangement is close to optimal for your problem size, and use source-inlined versions (e.g. Fortran MATMUL) when your case is too small to benefit from MKL. MKL is specifically optimized for various CPU models, so you may want to check how that is working out.
Thanks for this answer - I am currently using single-precision values and SGEMM throughout. Is it possible to do matrix multiplication with fixed-point numbers, and would it increase speed at all? Also, is it possible to obtain source code for SGEMM to try to make small changes (see below)?
The matrix multiplication I have is A: kxN, B: kxM, compute C = A' * B, where each column of A and B represents a unit-length vector in the positive orthant. So C is an N x M matrix where C(i,j) is the cosine of the angle between A(:,i) and B(:,j), which lies between 0 and 1. I was thinking of representing the numbers as 16-bit unsigned integers interpreted as integer * 2^-15, which would give sufficient precision for my problem. For multiplications between elements I would multiply the two 16-bit integers and keep only the top 16 bits (e.g. using MMX's pmulhw), and addition would just be parallel saturating addition (e.g. MMX's paddusw).
My question is: if I can represent numbers with only two bytes and do integer arithmetic as above, would there be a significant speed improvement? It seems the smaller size should help, since more elements fit in cache, and parallel integer multiply/add should be quicker than parallel single-precision multiply/add. Would any of this make a significant difference (speculatively), or is it not worth the effort? I would consider anything less than a 2x speed improvement not worth the effort (unless relatively simple).
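For what it's worth, the proposed Q15 scheme can be emulated in numpy to check the precision it would deliver before writing any intrinsics. This is only a sketch; `to_q15` and `q15_matmul_at_b` are made-up helper names, and the high-16-bits multiply models pmulhw (for unsigned data the matching instruction would actually be pmulhuw).

```python
import numpy as np

def to_q15(x):
    # Values in [0, 1] stored as uint16 with scale 2^-15 (Q15).
    return np.clip(np.round(x * (1 << 15)), 0, 1 << 15).astype(np.uint16)

def q15_matmul_at_b(a_q, b_q):
    # 16x16 -> 32-bit multiply, keep the high 16 bits of the product
    # (the pmulhw idea): Q15 * Q15 = Q30, and >> 16 leaves Q14.
    prod_hi = ((a_q.astype(np.uint32)[:, :, None] *
                b_q.astype(np.uint32)[:, None, :]) >> 16).astype(np.uint16)
    # Unit-length columns in the positive orthant keep every dot product
    # <= 1, i.e. <= 2^14 in Q14, so 16-bit accumulation cannot overflow
    # here (paddusw would saturate if it did).
    acc = prod_hi.sum(axis=0, dtype=np.uint16)
    return acc.astype(np.float64) / (1 << 14)  # convert back from Q14

rng = np.random.default_rng(1)
k, n, m = 64, 8, 8
a = rng.random((k, n)); a /= np.linalg.norm(a, axis=0)  # unit columns
b = rng.random((k, m)); b /= np.linalg.norm(b, axis=0)

c_fixed = q15_matmul_at_b(to_q15(a), to_q15(b))
c_float = a.T @ b
# Each truncated product loses up to 2^-14, so the error grows with k.
err = np.max(np.abs(c_fixed - c_float))
```

This only answers the precision half of the question; the speed half would still need a real SIMD implementation and a benchmark against SGEMM.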
Thanks,
Nick
A web search should turn up open-source code for SGEMM (for example, the Netlib reference BLAS).
Fixed point is unlikely to gain performance on any computer of the last 15 years.
You may gain performance if you have a specialized usage of SGEMM, such as one of the dimensions being too small for full performance of MKL, and you can optimize your SGEMM appropriately. However, in such a case, it would be simpler to use Fortran MATMUL, if you have fixed dimensions for which the compiler could optimize in-lined code.
MKL makes full use of the parallel SIMD instructions, when both dimensions of the matrix are suitable, so you don't have much chance to improve on it with your own compilation.
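As an aside on the C = A' * B case: if you stay with SGEMM, the transpose can be passed to the routine rather than materializing A' yourself. A minimal sketch via SciPy's BLAS wrapper (this binds to whatever BLAS SciPy was built against, which may or may not be MKL):

```python
import numpy as np
from scipy.linalg.blas import sgemm

rng = np.random.default_rng(0)
k, n, m = 64, 8, 8
# Fortran (column-major) order avoids internal copies in the wrapper.
a = np.asfortranarray(rng.random((k, n), dtype=np.float32))
b = np.asfortranarray(rng.random((k, m), dtype=np.float32))

# C = A' * B: let the BLAS apply the transpose via trans_a instead of
# building a.T as an explicit copy.
c = sgemm(1.0, a, b, trans_a=True)
```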