In ifort, all arrays large

mohanmuthu · ‎10-17-2012

Related to one of my earlier post on MATMUL, I am looking for some guidance to see how can I speed-up the multiplication of 2 matrices (A,B). A is in order of size 1,000,000 x 100 and B is of size 100 x 10. I tried testing couple of combinations like MATMUL(A,B), MATMUL(B',A') tranposed before and suprised to see the difference in speed. Adding to that, some compiler options /fast, /Qopt-matmul, /Qipo (one each time) lead to even increased time. I see MKL libraries available in my package as well.

Is there any recommendation, that I should have matrices A & B in certain format for better performance or alternatives like OpenMP format might do better job? Please comment.

Thanks, Mohan.

TimP · ‎10-17-2012

I believe opt-matmul (implying a call to mkl_dgemm) invokes OpenMP threading. Not knowing whether it applies to your cases, I've seen cases where opt-matmul implies allocation of a temporary matrix to contain the result (even in the case where the result is assigned directly to a matrix which you have declared. In such cases, you would want to minimize the size of that matrix, or call ?GEMM directly so as to avoid the temporary local allocation. Note that ?GEMM has explicit arguments to deal with transposing, without actually creating a transposed copy. If your time is spent mostly in MKL (whether by an explicit call or by opt-matmul), aggressive compiler optimizations aren't relevant. In the case where MATMUL is expanded in line, specifying an architecture (e.g. /QxHost) and /O3 will be important (but /O3 implies opt-matmul, if you don't over-ride).

John_Campbell · ‎10-17-2012

If you are proposing to produce the matrix product x, then if is stored in the transposed form ', then the computation should be performed more efficiently. This would require a variant of MATMUL, say MatMulT that recognises the first array is transposed. Otherwise the use of a temporary row storage can improve performance, by reducing the extended memory addressing associated with a large matrix. I think these options have been demonstrated before. Utilising this approach assumes that you are able to store in the transposed form, without an adverse impact on performance in other parts of the code. I'm not sure if an optimised MATMULT using OpenMP is readily available, but it should be. John

jimdempseyatthecove · ‎10-18-2012

>> A is in order of size 1,000,000 x 100 and B is of size 100 x 10. Consider transposing B to B' once to seperate storage (as opposed transposing to temporary). Then using B' perform the DOT products to produce the 10 x 100 output (tile). If you are walking B' over A (e.g. filter or pattern search), then you can parallelize the tile walk. IOW parallel-outer - vector inner. If you can, consider making the dimensions multiples of your target platform small vector size: SSE has 4 floats or 2 doubles, AVX has 8 floats, 4 doubles per small vector. N.B. Although MKL will likely internally transpose one of the matricies (hopefully the smaller B) to produce a nice fast DGEMM, you should note that should you be performing a tile walk of your larger array A, that taking the simple coding route would force MKL to make 100,000 transpositions of B. This will be larger should you overlap your tiles. Jim Dempsey

John_Campbell · ‎10-18-2012

Jim, . My understanding is that it is that must be transposed, so that the key dot_product is of dimension 100. (Does MKL assume Fortran storage and not C ?) For MATMUL, MKL might generate a temporary row of to apply as a dot_product with the columns of . . ( A long time ago I wrote a subroutine MulATB, which recognised A was supplied as A', which in those days minimised paging delays.) As in this case A is in order of size 1,000,000 x 100, this is either 400mb or 800mb, so this might have some memory/paging issues on a 32-bit PC. . John

jimdempseyatthecove · ‎10-18-2012

John, You can transpose either of the matricies. The output write order of the DOT products would vary depending on which input was transposed. In the above case, you would have two vectors of lengths 100, memory ordered for SIMD, resulting in one scalar write. Depending on how you juxtapose the inputs this may result in: 10*100*(100/w) reads, + 10*100 writes or 10*(100/w)*100 reads, +10*100/w writes per tile, where w == SIMD vector width. I haven't run the code to see if transposing B uses the greator or lessor writes. However... Added to this is the transposition overhead. Transposing B costs no greater than 10*100 reads, 100/4*10 writes (once) ... and uses small temporary (that likely will fit into L1 cache) Transposing A costs no greater than 1,000,000*100 reads, 100/4*1,000,000 writes (once) ... and uses large temporary (that likely will NOT fit, and flush L3 cache). The acid test though, is to try it both ways. Jim Dempsey

John_Campbell · ‎10-18-2012

I have tried to outline some coding alternatives available for matrix multiplication. The options I have identified are: Transposing A so that dot_product can be used; this would suit AVX vector instructions. Changing the loop order so that A is processed sequentially; this might suit an OpenMP approach for factored vector addition. Using a temporary row of A also provides for dot_product and AVX efficiency. I did not identify any significant benefit from transposing B, especially given it's small size. The code example is listed below. Someone might identify other possible improvements. PS: I have now attached an updated test example. I'd be interested to see how easy and effective it could be to parallelise the vector addition approach. subroutine matmul_test (A,B,C, l,m,n) ! integer*4 l,m,n, i,j,k real*8 A(l,m), B(m,n), C(l,n), row(m), s ! A(1000000,100), 800 mb ! B(100,10), 8 kb ! C(1000000,10) 80 mb ! ! Conventional Matrix Multiplication, using accumulator "s" do i = 1,l do j = 1,n s = 0 do k = 1,m s = s + A(i,k) * B(k,j) end do C(i,j) = s end do end do ! ! If s is not used - more general ( loop order swap for addressing A sequentially ) ! ( no need to transpose B ) C = 0 do k = 1,m do j = 1,n ! do i = 1,l ! C(i,j) = C(i,j) + A(i,k) * B(k,j) ! end do C(:,j) = C(:,j) + A(:,k) * B(k,j) end do end do ! ! If row of A is used - sequential dot product, suits vector instruction ! do i = 1,l row(1:m) = A(i,1:m) do j = 1,n ! do k = 1,m ! s = s + A(i,k) * B(k,j) ! end do C(i,j) = Dot_Product (row, B(:,j)) end do end do ! ! If A is transposed - sequential dot product, suits vector instruction ! C = 0 do i = 1,l do j = 1,n ! do k = 1,m ! C(i,j) = C(i,j) + A(k,i) * B(k,j) ! end do C(i,j) = Dot_Product (A(:,i), B(:,j)) end do end do ! ! If B is transposed - vector addition ! C = 0 do k = 1,m do j = 1,n ! do i = 1,l C(:,j) = C(:,j) + A(:,k) * B(j,k) ! end do end do end do ! end (It's a shame I lost all my indenting in the code I wrote !!)

mohanmuthu · ‎10-19-2012

Thank you Jim and John. Based on your recommendations/examples, I tried few more options. DOT_PRODUCT put on FORALL loop seems best without any compiler option, but with /fast option, MATMUL performs best. However, the change done is to transpose the matrices (as MATMUL(B',A')). Though it works well in test case, when I put on the main code along with several other variables, it started throwing stack overflow error #170. I am looking at three options, 1) increse stack, 2) re-visit the active variables, 3) explore MKL functions. Will keep you posted my findings.

jimdempseyatthecove · ‎10-19-2012

Don't discount MKL, you would be hard pressed to best it for a single instance of MATMUL. This said, one must be aware that the sum of the best is not necessarily the best for the overall applicaiton. MKL has a high degree of functionality, it may have precisely what you are looking for. Look at the activity in your application _outside_ where you are performing the MATMUL. If (when) one of these matricies is invariant for numerous iteratons then that matrix is the one you should consider transposing. When both matricies vary for each MATMUL then use the MKL function for MATMUL (?GEMM). I am not an MKL user, someone else here can pipe in, I would guess there is an MKL call for MATMUL(A,B) MATMUL(A',B) MATMUL(A,B') MATMUL(A,B)' MATMUL(A',B)' MATMUL(A,B')' Meaning ?GEMM has flags for output transposition as well as indicators as to if either or both inputs are to be interpreted as transposed. Jim Dempsey

John_Campbell · ‎10-21-2012

I have continued to review the coding options I presented and also change the compilation options to include /Qparallel: set options=/source:%1.f95 /free /O2 /QxHost /Qparallel . I also: - changed the timing routine to report both CPU time and elapsed time. -included a FORALL option for A' and some other vector subroutines. The results are presented for both ifort Ver 11.1 and Ver 12.1 . These results show that - /Qparallel does achieve significant improvement in this case where there can be clearly identified parallel computation, for this size of matrix. - Ver 12 offers some changes from Ver 11, including for my subroutine approach I have discussed previously. - FORALL did not achieve the best results, although this code structure was designed to facilitate parallel computing options. - the intrinsic MATMUL does not improve with /Qparallel. - the comparison of CPU and elapse times are also interesting to understand. . It is pleasing to see that with Ver 12.1, my preferred coding technique (VecSum and A transposed) is still effective. . The matrix computation C = A x B is ideally suited to /Qparallel, although the parallel loops are small (m=100 itterations in each inner loop) From this it appears that /Qparallel is of benefit for well suited computation. For this itteration size (m=100) and row size (l=1000000) the temorary ROW is effective without /Qparallel, but not suited to /Qparallel, while the full matrix transpose is very effective in both cases. . Summary best elapse times are: Ver 11.1 without /Qparallel 0.577 ( MatMul intrinsic 1.48 ) Ver 12.1 without /Qparallel 0.531 ( MatMul intrinsic 2.14 ) Ver 11.1 with /Qparallel 0.156 ( MatMul intrinsic 1.48 ) Ver 12.1 with /Qparallel 0.140 ( MatMul intrinsic 2.14 ) . ( the intrinsic MATMUL performs very poorly, as I have used it. Not sure if this is my mistake or poorly chosen compiler options.) . I am attaching the latest test program and the resulting trace of the run times on my Xeon 4 processor PC. I am yet to use MKL effectively, so that is for the next test. Note: These results for A(1000000,100) could be considered unusual, as 1000000 imposes a significant memory footprint for a non-transposed case. For cases where the footprint is much smaller, say A(100,100), the alternative approach presented here would not be ass effective. . John

TimP · ‎10-22-2012

I'm curious whether your remarks about FORALL are meant to verify that FORALL isn't generally amenable to parallel optimizations. Did DO CONCURRENT suit you better? Note that VECTOR ALIGNED and the like, as well as /Qparallel, can be used with DO CONCURRENT. The need for directives detracts from the main advantage of DO CONCURRENT, that it can faciliate cleaner syntax.

John_Campbell · ‎10-22-2012

Tim, . Thanks for your advice. . As all array dimensions are a multiple of 4, I think that all vector operations are aligned, although I think my Xeon W3520 does not support AVX instructions. ( I was hoping to transfer the test to my i5 to confirm; see below) . My comment regarding FORALL is that when it was introduced in Fortran95, it was supposedly to identify code that was suitable for parallel computing. To me, it has never really achieved anything that a DO loop does not provide. Certainly the performance results in the example I have provided show a poor comparison, so I continue to use DO structures. All other compilers I have used also show no benefit from coding FORALL. . I am not familiar with DO CONCURRENT, so would welcome you providing an example in the code I have provided. (Shouldn’t FORALL have this capability, as this could be argued is the basis of it’s definition?) . My example consists of source code and a batch file (xx.bat(.txt!!)) which I use to document the compiler options and the trace of the compile, link and run. This runs in a cmd.exe box, with the appropriate SET and PATH variables. You might note that when it runs, the compiler response does not go to standard output, while the linker does. Is there a way of sending the compiler reports to >>%1.tce? . The performance of MATMUL is disappointing. It appears to be worse than the conventional approach. Are there any improvements implemented in MATMUL or is this A(1000000,100) an unusual case? . Another related problem to the use of /Qparallel; I was previously (with ifort Ver 11.1) able to transfer .exe files to another PC. This might no longer be the case. Is there a minimum install for other PC’s that do not have access to the licence, so that the .exe can be run, including /Qparallel requirement? Possibly the reason it did not run is I also included /Qxhost. In my case ifort is installed on a Xeon and the other PC is an i5, which I expected was upward compatible with /Qxhost. . John

TimP · ‎10-22-2012

In ifort, all arrays large enough to be eligible for vectorization will be at least 16-byte aligned. Beginning with ifort 13.0. there is the option to default to higher alignments, e.g. in support of AVX. Even without AVX, 32-byte alignment sometimes helps performance. When the arrays are passed as dummy arguments, alignment isn't known unless via interprocedural analysis or by assertions (vector aligned directive or assume_aligned). My understanding of FORALL is that it was introduced (along with WHERE...) for compatibility with HPF. However, the more fully thought out syntax requirements of DO CONCURRENT, loosely similar to DO ACROSS from the HPF era, don't require IVDEP directive as FORALL does, and allow for parallelization of a block of statements. As it takes nearly a page of text for expert authors to explain the requirements of each, you really should refer to textbook references. Even DO CONCURRENT tends to need VECTOR ALIGNED directives and the like (it doesn't necessarily add any optimization you don't get with DO loops). The syntax is so close that you can make a literal text substitution between FORALL and DO CONCURRENT in the case where you satisfy the requirements of both. I've already mentioned the difficulties of controlling temporary matrix allocations and handling the automatic opt-matmul aspect of MATMUL together with -O3 in ifort. I don't think you've made it clear whether that is where you're seeing difficulty. It might be easier if opt-matmul didn't come in by default at -O3.

John_Campbell · ‎10-22-2012

Tim, . I tried to implement DO CONCURRENT, but was not sucessfull. . The following gave incorrect results for a variety of compilation options (attached) do j = 1,n do concurrent (i = 1:l) C(i,j) = dot_product (A(:,i), B(:,j)) end do end do . While the following reported : catastrophic error: **Internal compiler error: internal abort** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error. compilation aborted for matmul_test3.f95 (code 1) do concurrent (i = 1:l, j=1:n) C(i,j) = dot_product (A(:,i), B(:,j)) end do . Any ideas? attached is the latest code, batch file and some run traces for variants of the first option. . One +ve; the use of PURE improved with /O3 . John . ps: can the list of possible upload file name extensions be updated to include typical files used in Fortran.

John_Campbell · ‎10-22-2012

I investigated my attempted use of DO CONCURRENT and concluded I must have the syntax or understanding wrong, as array chk is being corrupted. Attached is a change to the program, identifying that the values to chk are changed. The program finally crashes. I'm not sure what I have done wrong. . John

TimP · ‎10-23-2012

As the echo inidcates, any internal error from the compiler is a reportable bug, even if provoked by some other error. It would be better to check the latest 12.1 or 13.0 compiler; I don't remember whether you are using a current one. It might not be surprising to find out that the combination of do concurrent inside a do loop with /Qparallel hadn't been tested. It might be more idiomatic to put the 2 levels of loops in the do concurrent rather than using the outer DO. Evidently, if you want better control of parallelism you will use OpenMP and will avoid f95 and later iterative constructs. File suffixes such as .f95 are supported by the /extfor option of ifort Windows. Many people would argue with your "commonly used" characterization, as you might see even on these forum sections.

Steven_L_Intel1 · ‎10-23-2012

We've asked the forum owners to expand the list of attachments to include Fortran file types .f90, ./f and .for. .f95 is not a recognized Fortran file type (and its use indicates a misunderstanding of how these file types are interpreted.) The source you attached compiles without error with the current compiler.

John_Campbell · ‎10-23-2012

Steve and Tim, FORALL was introduced with Fortran 95. Should I change the extension to .F08 to indicate when DO CONCURRENT was introduced to ifort ? I don't think the file extension name is the source of the problem, and there is a problem with DO CONCURRENT. The compiler I am using to test this is: Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.5.344 Build 20120612. I have tried a variety of DO CONCURRENT constructs, but the only one which compiles is: do j = 1,n do concurrent (i = 1:l) C(i,j) = dot_product (A(:,i), B(:,j)) end do end do . Those that failed include: do concurrent (i = 1:l) do j = 1,n . do concurrent (i = 1:l) do concurrent (j = 1:n) . do concurrent (i = 1:l; j = 1,n) . (I would have expected the first of these to work.) Is it possible to suggest a DO CONCURRENT structure that suits the aim of this project ? I also tried a variety of compilation options: set options=/Tf %1.f08 /free /O2 /QxHost set options=/Tf %1.f08 /free /O2 /QxHost /Qparallel set options=/Tf %1.f08 /free /O3 /QxHost /Qparallel set options=/Tf %1.f08 /free /O3 /QxHost These did not change the either compiler error for other concurrent structures or the run error when compiled successfully. . The run error is that the contents of array chk(:,1) is changed from 100 to 1000 when DO CONCURRENT runs. This is demonstrated by the array scan test in the latest version. Is it possible to confirm if this is the case with your Ver 12 or Ver 13 ? . The performance of DO CONCURRENT also does not appear to be good in comparison to the other options coded for A’. I have tried to provide good documentation of what I am doing so that it might be reproduced, being: xx.bat is the batch file which I run in a cmd.exe box, identifying the compiler options I am using. matmul_test3.f08 is the source code of the latest working version. matmul_test3.tce is the standard output trace of all recent runs, providing the run times and linker reports. The compiler options are also echoed. Matmul_test3.log is the error output trace; to provide the compiler version and compiler error reports. ( Why can’t this go to standard output ?) The purpose of this review has been to identify how I can achieve improved performance, while limiting the use of non-standard approaches. Some of the claims of ifort Ver 12 (2011) was an improved support for parallel computing, with the suggestion that it is easier. It has been disappointing that most of the comments I have received on my posts for understanding how parallel works start with the implication that I should know all the features of ifort. As a new user to ifort, I don’t. I do not know OpenMP or have a history of use of $DEC directives. I have been hoping to learn what could be achieved by using /Qparallel, and I’m sure there are other ifort users in a similar position to myself. . As an overview of this problem and based on what FORALL was originally proposed to achieve with Fortran 95, the combination of /Qparallel and FORALL should be sufficient to imply a near optimal performance from ifort : forall (i=1:l, j=1:n) C(i,j) = dot_product (A(:,i), B(:,j)) end forall At the moment, /O3 /Qparallel and DO loops with an intrinsic Dot_Product provide the best performance I have achieved (0.109 seconds) while FORALL is 0.399 seconds. (24 hours ago it was a PURE Function that gave the best performance.) . I have also been trying to transfer the .exe file to another PC which does not have an ifort license installed. Can you indicate what needs to be installed on this other PC for a .exe that uses /Qparallel? I am assuming there are some environment variables to be set and appropriate .dll’s required. . In summary there are a number of points I was hoping for feedback on: Can the ifort compiler report be directed to standard output? What DO CONCURRENT coding structures are suited to 2 level Do loops ? Can you reproduce the error I am reporting with DO CONCURRENT using ifort Version 12.1.5.344 Build 20120612 ? If OpenMP is to be used, could you suggest a coding of the DO loops to perform this ? What do I need as a minimal install on other PC’s ? . I look forward to what information you could provide. . John

Steven_L_Intel1 · ‎10-24-2012

Do not change the source file type - this does not indicate standards level. The file type is used only to distinguish fixed-form from free-form.

John_Campbell · ‎10-24-2012

I should add, that the results obtained so far with and without /Qparallel I find very interesting. If you compare the CPU times and the run times for the different options, with and without /Qparallel, there are some interesting outcomes. Elapsed time is most important, but comparing CPU to elapse with /Qparallel indicates the number of cpu's in use. . Using /Qparallel /O3 (best .109s, as A .483s, matmul .702s) If you can store A in a transposed form, then the use of Dot_Product or an equivalent PURE function produce the best results with /Qparallel. If you can't then the conventional approach of an inner loop accumulator, or Dot_Product using array sections works well (see 6) in latest attached), (surprisingly these are better than a temporary row array, which probably limits /Qparallel ?) /Qparallel is very effective and easy to use in this case. The times are on my Xeon which supports 4 CPU's. . without /Qparallel, all this changes: (best .421s, as A .671s, matmul 2.247s) If you can store A in a transposed form, the same preferred result occurs, however if stored as A, the temporary row is now best, while the "s" accumulator performs much worse. These computations still benefit from the available vector instruction set. . In all cases (as tested); FORALL, DO CONCURRENT and MATMUL do not achieve a best result. The available results also change with a different processor and probably with different background processes. . The DO CONCURRENT might not be a fair test as it does NOT work in the example provided. It could be good to know why. . I tried to provide results by minimising the amount of other libraries. Transferring the .exe to another PC identifies this limitation for some of the alternatives that have been tested or suggested. I now know more about performance of a "PURE" computation that is well suited to /Qparallel. . The table below tracks the changed run time for each option for differing compiler options. It is interesting to see what coding techniques respond to different options. O3 appears to support the old DO loop approach which might imply it is a good fit for old F77 code. Clearly MATMUL is not suited/tuned to this matrix size A. The three coding approaches that show poor improvement with /Qp change for the transpose case. . I hope these comparitive results are useful to others. . John Table of comparison (might look better if imported to excel or improved forum !!) /O2 /O3 /O2 /Qp /O3 /Qp cpu/Elapse Description Elapse Elapse Elapse Elapse Ratio 0) classic tripple loops 2.184 0.921 2.152 0.843 1.5 1) Matmul Intrinsic 2.168 2.247 2.121 0.702 3.9 2) conventional loops 1.872 1.856 0.483 0.483 4.0 3) vector addition 2.090 2.262 1.420 1.389 4.0 4) row dot_product 0.670 0.671 0.671 0.687 1.2 5) vector addition call 2.090 2.184 2.293 2.060 1.0 6) dot_product section 1.856 1.872 0.702 0.484 4.0 Transpose t0) transpose tripple loop 0.421 0.421 0.124 0.125 4.0 t1) transpose dot_product 0.436 0.437 0.172 0.109 4.0 t2) transpose FORALL ij 0.951 0.998 0.624 0.390 3.5 t3) transpose Vec_Sum 0.468 0.468 0.156 0.125 4.0 t4) transpose Pure_Sum 0.452 0.437 0.219 0.109 4.0 t5) transpose d_p loop order 1.014 1.014 0.452 0.390 4.0 t6) transpose FORALL ji 0.811 0.905 0.624 0.390 3.9 t7) DO CONCURRENT 0.889 0.998 0.702 0.577 4.0

John_Campbell · ‎10-24-2012

My success with /Qparallel appears to be short lived. I converted the earlier program of providing the calculation loops to call subroutines which then perform the loops, eg: ! 4) Pure vector_sum ! C = 0 call time_step (ts) do i = 1,l do j = 1,n C(i,j) = Pure_Sum (A(1,i), B(1,j), m) end do end do call time_step (te) ; t(:,4) = te - ts write (*,*) t(:,4), 't4) transpose Pure_Sum ', err_max (c, chk, l,n) . Run: 0.4992032 0.1250000 t4) transpose Pure_Sum 2.4868996E-14 . is changed to . ! 4) Pure vector_sum (test repeated k times) ! C = 0 call time_step (ts) do i = 1,k call puresum_loop_t (A,B,C, l,m,n) end do call time_step (te) ; t(:,4) = te - ts write (*,*) t(:,4), 't4) transpose Pure_Sum ', err_max (c, chk, l,n) . Run: 3.291621 2.808000 t4) transpose Pure_Sum 2.4868996E-14 (for k=5) . where the routine looks like: . subroutine puresum_loop_t (A,B,C, l,m,n) integer*4 l,m,n, i,j real*8, intent(IN) :: A(m,l), B(m,n) real*8, intent(OUT) :: C(l,n) ! INTERFACE PURE real*8 function pure_Sum (U, V, n) integer*4, intent(IN) :: n real*8, intent(IN) :: U(n), V(n) end function pure_sum END INTERFACE ! do i = 1,l do j = 1,n C(i,j) = Pure_Sum (A(1,i), B(1,j), m) end do end do end . The net result is that the parallel performance is now significantly deminished, both for total CPU time and for the effective ratio of CPU/elapse. It appears that by introducing further levels of call, /Qparallel no longer identifies it is safe to parallel the calculations. I included identifying a PURE function in the inner loop and INTENT for the arguments, which I thought was sufficient for /Qparallel. . Does this imply that /Qparallel can not be used in a program with significant levels of calls for the arrays being used? If this is the case, I presume that I will have to implement an OpenMP approach, which I was trying to avoid. . This would mean that the performance improvement approach I previously identified is not easily transferred from the trivial test example. Is this a limitation of /Qparallel or do I need to better identify that parallel runing is possible? . John

MATMUL - Best Practices & Alternatives