Parallel computing & array multiplication problem, any library?

zlzlzlz7 · ‎03-06-2012

Hi there, I'm a newbie in parallel computing, and met a project that requires using C/C++ to rewrite a fortran program and finally runs on a multi-core Intel CPU + GPU server node or even a cluster.

The program involves much matrix/multidimensional array multiplication. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C - it needs to be enhanced to have functions like fortran or matlab.

So I read a bit about CUDA, OpenCL about GPU and OpenMP, MPI about parallel machines. There are only low level APIs, and don't have the functions above. If I make some libraries for C myself, I'm afraid they won't be reliable or efficient enough. So I wonder if there're any opensource libraries. I saw nvidia CUDA topics recommended a library called arrayfire, but is commercial. And I tried gsl, and also heard about mtl, octave, I'd like to ask which is the best option? or Intel mkl?

Thanks!

TimP · ‎03-06-2012

In Intel Fortran, the option /Qopt-matmul should make an automatic substitution of the MKL BLAS ?gemm function where appropriate to implement Fortran MATMUL. In C or C++, you can call ?gemm directly as F77 function or indirectly by cblas interfaces. There's not much point (at least not from a performance point of view) in packaged functions for matrix addition/subtraction. If the problem is suited for threaded parallel, either OpenMP or Cilk+ can do the job easily. Unless your problem is so big that it's advantageous to spread across a cluster, MPI doesn't fit well with basic matrix algegra.
For the operations you are talking about, Matlab is simply another user interface for MKL library (when run on the appropriate platforms).
For CUDA, a usual tactic for matrix multiply is to call into the cudablas library in similar fashion. To some extent, this avoids spending excessive effort on non-portable code.
Other libraries you mention evidently have devoted users, but not simply on account of basic matrix algebra.

SergeyKostrov · ‎03-06-2012

Quoting zlzlzlz7

...The program involves much matrix/multidimensional array multiplication. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C...

It depends on an algorithm(s) you're going to select.Classic algorithms formatrix multiplication, addition,
etc are very simple when implementedinC or C++. Performance of these algorithms goes up as soon
asloops are unrolled, orSSE is used, or calculations done in parallel, or a pirority of the process boosted to
a realtime. Once again, speaking about implementation complexity, they are simple.

Everything changes when a size ofa matrix isgreater than1024x1024. In that case Strassen-like
algorithms must be used and performance gains are significant becausetime complexity of the
Strassen-like algorithmsis better than a Calssic one. They multiply matrices significantly faster but it
comes at a price, that is, implementation complexity of the Strassen-like algorithmsin C or C++is higher.

Hereis someinformation regarding Time Complexity for Matrix Multiplication Algorithms:

Virginia Vassilevska Williams...O( n^2.3727 )
Coppersmith-Winograd..............O( n^2.3760 )
Strassen........................................O( n^2.8070 )
Strassen-Winograd.....................O( n^2.8070 )
Classic...........................................O( n^3.0000 )

Let me know if you would like to see performance numbers, Classic vs. Strassen ( Recursive Heap-Based Complete).

Best regards,
Sergey

zlzlzlz7 · ‎03-07-2012

to TimP (Intel),

Thank you very much for the reply!

I'm sure Intel MKL is a powerful tool, and with blas as a base, the performance must be very good. But as you said, there're user interfaces. I noticed that many math libraries requires user to define a matrix struct or something. My project is on physics, and I'm not professional, so I really don't wish to write many for loops to manipulate a matrix.

In case I can't explain myself well, here is what I was going to do with some packaged functions:

eg. 1

Fortran

mat(:,1) = 0. !set the 1st column of matrix mat to all 0s (1-index)

C

vector_set_zero(vecttemp);//get a zeros vector

matrix_set_col(mat, 0, vecttemp);//set the 1st column of matrix mat to all 0s (0-index)

eg. 2

Fortran

vecttemp=matmul(mat,vect); !built-in function of fortran, matrix multiplication

C

blas_dgemv (CblasNoTrans, 1.0, mat, vect, 0.0, vecttemp);//using cblas

so is there something for C/C++ to make it as easy to use as fortran or matlab, yet still have high performance? or some higher level APIs for MKL?

zlzlzlz7 · ‎03-07-2012

Quoting Sergey Kostrov

Quoting zlzlzlz7
...The program involves much matrix/multidimensional array multiplication. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C...

It depends on an algorithm(s) you're going to select.Classic algorithms formatrix multiplication, addition,
...

Also thanks a lot for the reply!

The optimization in the aspect of algorithms is also our concern. The serial program for 1 dimention physical model takes only several seconds, but for 3 diemention it takes many hours. As the 3D model isn't finished yet, I'm not sure if the matrix size will be very large. The direction on data size and algorithm is very helpful!

jimdempseyatthecove · ‎03-07-2012

>>but for 3 dimension it takes many hours.

With RAM abundant, consider maintaining the portion of the 3D model that is use in matrix multiplication as three seperate copies: one with X as the minor index, one with Y as the minor index, one with Z as the minor index (yes three instances of those data).

What you will be trading off is 3 writes/cell against the product of the size of each dimension. e.g. 3 writes against 1000x1000x1000 reads per (output)cell(for a cubewith side of1000).

With the rotated arrays you can now perform DOT products of sequential data (memory access friendly).

Jim Dempsey

TimP · ‎03-07-2012

You may say there is plenty of address space, if you use 64-bit OS, but you may find that Fortran MATMUL incurs significantly more cache misses than BLAS dgemv, due to allocating an additional array for the result of MATMUL. gfortran will avoid making a temporary when the MATMUL result is assigned directly to an array; other compilers may not.
I don't know what you consider easy to do in C or C++, but you can zero out a single column of a matrix stored as BLAS expects by a single memset() or fill(), or by an equivalent for(). It's up to you if you want to hide it under another name so as to avoid the standard library function names. I'm not one to advocate the idea that C or C++ can be made as easy to use as Fortran, when taking into account performance, but I don't favor implicit style rules which make one or the other more difficult.

SergeyKostrov · ‎03-07-2012

Quoting zlzlzlz7

...
As the 3D model isn't finished yet, I'm not sure if the matrix size will be very large.
The direction on data size and algorithm is very helpful!
...

What is the matrix size?
What precision do you need, that is, a single or adouble?

Also, take into account limitations of IEEE 754 Standard. In case of matrix multiplication there are some
accuracy issues even with small matrices. I could give an example of a rounding problem whentwo 8x8 matrices are multiplied.

SergeyKostrov · ‎03-08-2012

It is not related to the subject of your post. This is simply an example of a rounding problem
in case of a single-precision 'float'data type.

// Matrix A - 8x8( 'float' type ):

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0
901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0
1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0
2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0
3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0
4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0
4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0
5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix B - 8x8( 'float' type ):

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0
901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0
1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0
2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0
3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0
4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0
4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0
5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix C = Matrix A * Matrix B( 8x8 - 'float' type ):

13826808.0 14187608.0 14548408.0 14909208.0 15270008.0 15630808.0 15991608.0 16352408.0
32393208.0 33394008.0 34394808.0 35395608.0 36396408.0 37397208.0 38398008.0 39398808.0
50959604.0 52600404.0 54241204.0 55882004.0 57522804.0 59163604.0 60804404.0 62445204.0
69526008.0 71806808.0 74087608.0 76368408.0 78649208.0 80930008.0 83210808.0 85491608.0
88092408.0 91013208.0 93934008.0 96854808.0 99775608.0 102696408.0 105617208.0 108538008.0
106658808.0 110219608.0 113780408.0 117341208.0 120902008.0 124462808.0 128023608.0 131584408.0
125225208.0 129426008.0 133626808.0 137827616.0 142028400.0 146229216.0 150430000.0 154630816.0
143791600.0 148632416.0 153473200.0 158314016.0 163154800.0 167995616.0 172836416.0 177677200.0

Iunderlined all incorrect values, like'50959604.0' and due to a rounding it is not '50959608.0'.

In all cases last two digits of any value in the Matrix C must be '...08'. It can't be '...00', or '...04',or '...16'.

The test case is reproducible on many platforms when compiled with different C/C++ compilers and
a default Floating-Point Unit settings.

jimdempseyatthecove · ‎03-08-2012

Sergey,

float (32-bit) has a 23-bit fraction with 24 bits of precision. Resulting in slightly more than 7 digits of precision. The multiplications you are preforming requires slightly more than 8 digits of precision (IOW the mantissa would require another 4-bits of precision (27-bit fraction with implied 1).

The results of the 8x8 multiplication is acceptible assuming that you can accept ~7.1 digits of accuracy. If not, then you must use a floating point format that uses more bits in the mantissa (64-bit format uses 52-bit mantissa providing 53-bits of precision).

Exponent overflow/underflow is a seperate issue.

Jim Dempsey

SergeyKostrov · ‎03-08-2012

Jim,

All these details are not new for me. That test case I providedcaused some "chaos" on a project back in
September 2011. That is,a complete code review, re-testing, etc, ofa matrix multiplication subsystem based
on a Strassen algorithm. As soon as the problem was understood aset set of changes / recommendations,
like use'double' type instead of 'float' type to improve accuracy of computations, was made.

Unfortunately, as soon as 'double' type is used twice more memory is needed for a matrix. The situation
escalates on32-bit platforms when 1024x1024, orgreaterdimensions,matrices need to be multiplied.

Thanks for the feedback.

Best regards,
Sergey

jimdempseyatthecove · ‎03-09-2012

Round-off error is almost a rule excepting for specialized input data.

1024x1024 requires 1024 additions along each DOT product. Assuming even distribution of (integer) numbers, the average product will expand 1024x. IOW 10-bits (on average) will be required of the result over the average bits of the products going into the DOT product. Should 32 bits be used as unsigned int, then statistically the numbers in the cells prior to the multiplication, can be expressed using 11 bits. These are average numbers.

Assuming double with 53 bits of precision (52+1), then (on average), 1024x1024 requires

10-bits for accumulation
leaving 42-bits for products (on average)
and limiting terms for product of ~21 bits.
Or just over 6 digits of accuracy in the input datafor multiplications containing worst case (valid) data.

Even doubles may not maintain the accuracy you desire.
i.e. do not expect 15 digits of precision using doubles for a 1024x1024 mat mul.
Best cases may approach that precision, worst case may be on the order of 6 digits.

Jim Dempsey

SergeyKostrov · ‎03-09-2012

Quoting zlzlzlz7

...I really don't wish to write many for loops to manipulate a matrix...

[SergeyK]In case of a Classic matrix multiplicationthree loops are needed.
In case of an Element-by-element addition orsubstraction one loop is needed.

...so is there something for C/C++ to make it as easy to use as fortran or matlab, yet still have high performance? or some higher level APIs
for MKL?...

[SergeyK] Another option to consideris IPP library.The libraryhas Matrix Processing andVector
Math domains. All functions in IPP are highly optimized and it is really hard to outperform them
withregular C/C++ functions.

Best regards,
Sergey

parallel computing & array multiplication problem, any library?