- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi there, I'm a newbie in parallel computing, and met a project that requires using C/C++ to rewrite a fortran program and finally runs on a multi-core Intel CPU + GPU server node or even a cluster.

The program involves much matrix/multidimensional array multiplication. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc., but not so simple for C - it needs to be enhanced to have functions like fortran or matlab.

So I read a bit about CUDA, OpenCL about GPU and OpenMP, MPI about parallel machines. There are only low level APIs, and don't have the functions above. If I make some libraries for C myself, I'm afraid they won't be reliable or efficient enough. So I wonder if there're any opensource libraries. I saw nvidia CUDA topics recommended a library called arrayfire, but is commercial. And I tried gsl, and also heard about mtl, octave, I'd like to ask which is the best option? or Intel mkl?

Thanks!

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

For the operations you are talking about, Matlab is simply another user interface for MKL library (when run on the appropriate platforms).

For CUDA, a usual tactic for matrix multiply is to call into the cudablas library in similar fashion. To some extent, this avoids spending excessive effort on non-portable code.

Other libraries you mention evidently have devoted users, but not simply on account of basic matrix algebra.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

*...The program involves much*

**matrix/multidimensional array multiplication**. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc.,**but not so simple for C**...

It depends on an algorithm(s) you're going to select.**Classic** algorithms formatrix multiplication, addition,

etc are very simple when implementedin**C** or **C++**. Performance of these algorithms goes up as soon

asloops are unrolled, or**SSE** is used, or calculations done in parallel, or a pirority of the process boosted to

a realtime. Once again, speaking about implementation complexity, they are simple.

Everything changes when a size ofa matrix isgreater than**1024x1024**. In that case **Strassen**-like

algorithms **must be used** and performance gains are significant becausetime complexity of the**Strassen**-like algorithmsis better than a **Calssic** one. They multiply matrices significantly faster but it

comes at a price, that is, implementation complexity of the **Strassen**-like algorithmsin **C** or **C++**is higher.

Hereis someinformation regarding **Time Complexity** for Matrix Multiplication Algorithms:

Virginia Vassilevska Williams...O( n^**2.3727** )

Coppersmith-Winograd..............O( n^**2.3760** )

Strassen........................................O( n^**2.8070** )

Strassen-Winograd.....................O( n^**2.8070** )

Classic...........................................O( n^**3.0000** )

Let me know if you would like to see performance numbers, **Classic** vs. **Strassen** ( Recursive Heap-Based Complete).

Best regards,

Sergey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

to TimP (Intel),

Thank you very much for the reply!

I'm sure Intel MKL is a powerful tool, and with blas as a base, the performance must be very good. But as you said, there're user interfaces. I noticed that many math libraries requires user to define a matrix struct or something. My project is on physics, and I'm not professional, so I really don't wish to write many for loops to manipulate a matrix.

In case I can't explain myself well, here is what I was going to do with some packaged functions:

eg. 1

Fortran

mat(:,1) = 0. !set the 1st column of matrix mat to all 0s (1-index)

C

vector_set_zero(vecttemp);//get a zeros vector

matrix_set_col(mat, 0, vecttemp);//set the 1st column of matrix mat to all 0s (0-index)

eg. 2

Fortran

vecttemp=matmul(mat,vect); !built-in function of fortran, matrix multiplication

C

blas_dgemv (CblasNoTrans, 1.0, mat, vect, 0.0, vecttemp);//using cblas

so is there something for C/C++ to make it as easy to use as fortran or matlab, yet still have high performance? or some higher level APIs for MKL?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

*Quoting zlzlzlz7*

*...The program involves much*

**matrix/multidimensional array multiplication**. In fortran it easy to do row/column assignment, matrix addition, subtraction, multiplication etc.,**but not so simple for C**...

It depends on an algorithm(s) you're going to select.**Classic** algorithms formatrix multiplication, addition,

...

Also thanks a lot for the reply!

The optimization in the aspect of algorithms is also our concern. The serial program for 1 dimention physical model takes only several seconds, but for 3 diemention it takes many hours. As the 3D model isn't finished yet, I'm not sure if the matrix size will be very large. The direction on data size and algorithm is very helpful!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

With RAM abundant, consider maintaining the portion of the 3D model that is use in matrix multiplication as three seperate copies: one with X as the minor index, one with Y as the minor index, one with Z as the minor index (yes three instances of those data).

What you will be trading off is 3 writes/cell against the product of the size of each dimension. e.g. 3 writes against 1000x1000x1000 reads per (output)cell(for a cubewith side of1000).

With the rotated arrays you can now perform DOT products of sequential data (memory access friendly).

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I don't know what you consider easy to do in C or C++, but you can zero out a single column of a matrix stored as BLAS expects by a single memset() or fill(), or by an equivalent for(). It's up to you if you want to hide it under another name so as to avoid the standard library function names. I'm not one to advocate the idea that C or C++ can be made as easy to use as Fortran, when taking into account performance, but I don't favor implicit style rules which make one or the other more difficult.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

*...*

As the 3D model isn't finished yet, I'm not sure if

The direction on data size and algorithm is very helpful!

...

As the 3D model isn't finished yet, I'm not sure if

**the matrix size will be very large**.

The direction on data size and algorithm is very helpful!

...

What is the matrix size?

What precision do you need, that is, a **single** or a**double**?

Also, take into account limitations of **IEEE 754 Standard**. In case of matrix multiplication there are some

accuracy issues even with small matrices. I could give an example of a rounding problem whentwo **8x8** matrices are multiplied.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

It is not related to the subject of your post. This is simply an example of a rounding problem

in case of a single-precision 'float'data type.

// Matrix A - 8x8( 'float' type ):

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0

901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0

1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0

2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0

3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0

4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0

4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0

5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix B - 8x8( 'float' type ):

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0

901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0

1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0

2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0

3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0

4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0

4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0

5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix C = Matrix A * Matrix B( 8x8 - 'float' type ):

13826808.0 14187608.0 14548408.0 14909208.0 15270008.0 15630808.0 15991608.0 16352408.0

32393208.0 33394008.0 34394808.0 35395608.0 36396408.0 37397208.0 38398008.0 39398808.0

50959604.0 52600404.0 54241204.0 55882004.0 57522804.0 59163604.0 60804404.0 62445204.0

69526008.0 71806808.0 74087608.0 76368408.0 78649208.0 80930008.0 83210808.0 85491608.0

88092408.0 91013208.0 93934008.0 96854808.0 99775608.0 102696408.0 105617208.0 108538008.0

106658808.0 110219608.0 113780408.0 117341208.0 120902008.0 124462808.0 128023608.0 131584408.0

125225208.0 129426008.0 133626808.0 137827616.0 142028400.0 146229216.0 150430000.0 154630816.0

143791600.0 148632416.0 153473200.0 158314016.0 163154800.0 167995616.0 172836416.0 177677200.0

Iunderlined all incorrect values, like'50959604.0' and due to a rounding it is not '50959608.0'.

In all cases last two digits of any value in the Matrix C must be '...**08**'. It can't be '...**00**', or '...**04**',or '...**16**'.

The test case is reproducible on many platforms when compiled with different C/C++ compilers and

a default Floating-Point Unit settings.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

float (32-bit) has a 23-bit fraction with 24 bits of precision. Resulting in slightly more than 7 digits of precision. The multiplications you are preforming requires slightly more than 8 digits of precision (IOW the mantissa would require another 4-bits of precision (27-bit fraction with implied 1).

The results of the 8x8 multiplication is acceptible assuming that you can accept ~7.1 digits of accuracy. If not, then you must use a floating point format that uses more bits in the mantissa (64-bit format uses 52-bit mantissa providing 53-bits of precision).

Exponent overflow/underflow is a seperate issue.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

All these details are not new for me. That test case I providedcaused some "chaos" on a project back in

September 2011. That is,a complete code review, re-testing, etc, ofa matrix multiplication subsystem based

on a

**Strassen**algorithm. As soon as the problem was understood aset set of changes / recommendations,

like use'

**double**' type instead of '

**float**' type to improve accuracy of computations, was made.

Unfortunately, as soon as '

**double**' type is used twice more memory is needed for a matrix. The situation

escalates on

**32-bit**platforms when

**1024x1024**, orgreaterdimensions,matrices need to be multiplied.

Thanks for the feedback.

Best regards,

Sergey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

1024x1024 requires 1024 additions along each DOT product. Assuming even distribution of (integer) numbers, the average product will expand 1024x. IOW 10-bits (on average) will be required of the result over the average bits of the products going into the DOT product. Should 32 bits be used as unsigned int, then statistically the numbers in the cells prior to the multiplication, can be expressed using 11 bits. These are average numbers.

Assuming double with 53 bits of precision (52+1), then (on average), 1024x1024 requires

10-bits for accumulation

leaving 42-bits for products (on average)

and limiting terms for product of ~21 bits.

Or just over 6 digits of accuracy in the input datafor multiplications containing worst case (valid) data.

Even doubles may not maintain the accuracy you desire.

i.e. do not expect 15 digits of precision using doubles for a 1024x1024 mat mul.

Best cases may approach that precision, worst case may be on the order of 6 digits.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

*...I really don't wish to write many for loops to manipulate a matrix...*

[

**SergeyK**]In case of a Classic matrix multiplicationthree loops are needed.

In case of an Element-by-element addition orsubstraction one loop is needed.

*...so is there something for C/C++ to make it as easy to use as fortran or matlab, yet still have high performance? or some higher level APIs*

for MKL?...

[

for MKL?...

**SergeyK**] Another option to consideris

**IPP**library.The libraryhas

**Matrix Processing**and

**Vector**

Mathdomains. All functions in

Math

**IPP**are highly optimized and it is really hard to outperform them

withregular

**C/C++**functions.

Best regards,

Sergey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page