do k = 1 , 3 do j = 1 , 3 d(1) = c(1,1)*qm(1,j,k) + c(1,2)*qm(2,j,k) + c(1,3)*qm(3,j,k) d(2) = c(2,1)*qm(1,j,k) + c(2,2)*qm(2,j,k) + c(2,3)*qm(3,j,k) d(3) = c(3,1)*qm(1,j,k) + c(3,2)*qm(2,j,k) + c(3,3)*qm(3,j,k) do i =1,j kth(i,j)=kth(i,j)+qm(1,i,k)*d(1)+qm(2,i,k)*d(2)+qm(3,i,k)*d(3) kth(j,i)=kth(i,j) end do end do end do
This has got to be a matrix operation - doe it not?
do k = 1 , 3 do j = 1 , 3 d(1) = c(1,1)*qm(1,j,k) + c(1,2)*qm(2,j,k) + c(1,3)*qm(3,j,k) d(2) = c(2,1)*qm(1,j,k) + c(2,2)*qm(2,j,k) + c(2,3)*qm(3,j,k) d(3) = c(3,1)*qm(1,j,k) + c(3,2)*qm(2,j,k) + c(3,3)*qm(3,j,k) end do end do !The above is below BUT! ! the "kth" array gets overwriten on each K loop... ! I guess there is some other code missing? do k = 1 , 3 d3X3 = matmul( C, qm(:,:,k) ) ! d3X3(3,3) enddo
Original code is on my other computer at home I will put it up tonight.
I am just getting to the printout the answers stage on this subroutine so I will try your thoughts. I am slowly tracking through it to understand the math --
Your rework is missing the do i= loop.
Given the original code, and assuming this is a hot spot, I would experiment with:
cTranspose = transpose(c) do k = 1 , 3 do j = 1 , 3 d(1) = dot_product(cTranspose(1:3,1),qm(1:3,j,k)) d(2) = dot_product(cTranspose(1:3,2),qm(1:3,j,k)) d(3) = dot_product(cTranspose(1:3,3),qm(1:3,j,k)) select case(j) case(1) kth(1,j)=kth(1,j)+qm(1,1,k)*d(1)+qm(2,1,k)*d(2)+qm(3,1,k)*d(3) kth(j,1)=kth(1,j) case(2) kth(1:2,j)=kth(1:2,j)+qm(1,1:2,k)*d(1)+qm(2,1:2,k)*d(2)+qm(3,1:2,k)*d(3) kth(j,1:2)=kth(1:2,j) case(3) kth(1:3,j)=kth(1:3,j)+qm(1,1:3,k)*d(1)+qm(2,1:3,k)*d(2)+qm(3,1:3,k)*d(3) kth(j,1:3)=kth(1:3,j) end select end do end do
I would try array syntax or some basic vector routines, such as the following and see if they have any effect. The chances are that ifort would already do as well as has been suggested, but sometimes we still can try. I find daxpy, when vectorised should work well for matrix operations, but not sure of small vector performance. Does AVX 256 work well with a real(8) d(3) ?
! initial code do k = 1 , 3 do j = 1 , 3 d(1) = c(1,1)*qm(1,j,k) + c(1,2)*qm(2,j,k) + c(1,3)*qm(3,j,k) d(2) = c(2,1)*qm(1,j,k) + c(2,2)*qm(2,j,k) + c(2,3)*qm(3,j,k) d(3) = c(3,1)*qm(1,j,k) + c(3,2)*qm(2,j,k) + c(3,3)*qm(3,j,k) do i =1,j kth(i,j)=kth(i,j)+qm(1,i,k)*d(1)+qm(2,i,k)*d(2)+qm(3,i,k)*d(3) kth(j,i)=kth(i,j) end do end do end do ! array syntax, assuming d(3) and c(3,3) do k = 1 , 3 do j = 1 , 3 d(:) = c(:,1)*qm(1,j,k) + c(:,2)*qm(2,j,k) + c(:,3)*qm(3,j,k) do i = 1,j kth(i,j) = kth(i,j) + dot_product ( qm(:,i,k), d ) kth(j,i) = kth(i,j) end do end do end do ! or vector routines do k = 1 , 3 do j = 1 , 3 d = 0 do i = 1,3 call daxpy ( d, c(1,i), qm(i,j,k), 3 ) end do do i = 1,j kth(i,j) = kth(i,j) + dot_product ( qm(:,i,k), d ) kth(j,i) = kth(i,j) end do end do end do ! or use matmul do k = 1 , 3 do j = 1 , 3 d = matmul ( c, qm(:,j,k) ) do i = 1,j kth(i,j) = kth(i,j) + dot_product ( qm(:,i,k), d ) kth(j,i) = kth(i,j) end do end do end do
Thank you for the replies.
The code is from an 18 degree of freedom triangular FEM element developed by Carlos Felippa.
He originally developed the code in Fortran 77, which he ported to Mathematica in the '90s so his students would not have to learn Fortran programming -- give me a break.
His TA/RA in the early 2000's ported it to Mathematica for the same reason - one has access to a PDF file with the Mathematic Code but not the Fortran code.
Sorry hit the wrong button.
I enclose a copy of the Golmon version of the program.
I got the model coded over the weekend and started to run some tests on it. I have two interesting problems, the first is to make sure global answers are somewhat correct and the second is to make the code a bit more succinct. I started playing with matmul and dot product, but a lot of this stuff could be made more abstract with some subroutines instead of repeating the same commands many times. I also have to check that the changes do not change the answers. Which is not to hard as you can mirror the results before you make the change.
It is so many years since I sat in a Linear Algebra lecture and now I am back in the middle of it -- in those days in Pure Math we used to ask what was the purpose - now I See
Interesting article. I think your best approach is to perform the conversion to Fortran in three steps.
1) Without regard to implications on vectorization or parallelization make the conversion such that the results agree. Then
2) rework the Fortran code to be amenable to vectoization This generally means converting from AOS to SOA, assure results equivalent. Then
3) If necessary, rework the code to be amenable to parallelization. This generally means partitioning the data, handling special cases for interior and boundary conditions, and handling force propagations across boundaries. This is usually not as easy as it looks.
Performing the work in phases will easing debugging should (when) errors occur. Consistency of results after step 1 will prove correctness in conversion. Step 2 may produce equivalent results though not necessarily exactly the same results. It may take some analysis to assert the equivalent results are correct. Step 3 will permit you to model larger numbers of elements. Though this too may result in equivalent but inexact results.
I agree - step 1 is complete, and Professor Felippa - code author just sent me the original Fortran - so I can move to Step 2.
No-one every believes FEM code.
INTEGER, PARAMETER :: dp = selected_real_kind(15, 307) INTEGER, PARAMETER :: mn_3 = 3,mn_9 = 9, mn_2 = 2, mn_10 = 10 Real (kind = dp) em,nu,h, c, db(mn_3,mn_3) c = em*h**3/(12.d0*(1.-nu**2.0))
Slowly working my way through my code and Felippa's original not many differences:
Equation for c I got wrong - was out by 144 - so I had the 12 in the wrong spot.
Question : Is dp the same as calling double precision?
He has a neat way of comparing two arrays - he adds them up and compares the difference - not exact but a good start.
db(1,1) 3.00000015646220 REAL(8)
Next question - on comparing the values for Db(1,1) I get this value from Felippa's code and exactly the same value from my code. Why are the errors at the end of 15646220 not random? I would expect some form of round off error by now - given the calculation range, but the numbers are exact to 13 places --
call SM3SHMEMBB (xlp, ylp, dm, 1.5d0,1.0d0, le, sm, nd, status) write(*,*)sm call SM3SHMEMBH (xlp, ylp, dm, betab, le, sm, nd, status) call SM3SHBENDB (xlp, ylp, db,1.0d0,clr,cqr, lb, sm, nd, status) call SM3SHBENDH (xlp, ylp, db,1.0d+0, lb, sm, nd, status) C
Thinking ahead - there is a lot of time spent in these routines - each one generates an independent section of the the stiffness matrix sm. Is this is the sort of thing we send parallel?
The first level of optimization you should consider is vectorization. This may require a complete rework of your array layouts. And as a consequence may significantly change how you parallelize the code.
Note, the article you posted concerned itself with one node of your surface. The real art in optimizing this code vectorization as well as parallelization, is in performing the (correct) calculations of the entire surface. The vectorization should be incorporated when you consider the entire surface.
I have a lot to learn. Wonder if there is enough time left.
I am loath to put up Professor Felippa's original code without his permission, but once I have mine correct I will put up a correct version, which is derived from a published version.
I have finished two of the 4 routines - the first one was exact to 15 significant figures - but the second started to disagree at the 8 and 9th place.
Looking at the code, Felippa determined the length of the sides using sqrt whereas I left them as length squared and used the L squared in the next calculation and he resquared the result -- do you think this will make a difference at the 8th figure?
Unfortunately (my opinion), the default is set to sacrifice precision over speed. My position is for the default to provide the precision, then have the programmer (incrementally) request the additional optimizations.
It took about 6 hours to debug my code. Not many mistakes, just annoyances, interestingly - 1 was mine I left out a block of code and 2 equations incorrect in the Mathematica Copy of the original Fortran X23 instead of X32 things - so the number was negative instead of positive, One was hard to track down as it only affected one number of the 324 and then only in a small way
EXAM DIFF is useful for looking at two printouts of matrices and telling the differences.
Weirdly at the finish the two files were identical to 15 places on all numbers - the Fortran God is strange or maybe Steve's weekly sacrifice worked this week - that is a joke.
Prof. Felippa included an Eigen solver - with another element program - this comes from a 1975 book and appears to have been written in Algol - I think from a note in the files. I used it to look at the eigenvalues for this element.
Now the good part making it work with multiple elements, Pardiso and FEAST.
The version I created from the published Mathematica Code is attached -- Prof. Felippa actually made less use of variables and more use of arrays in his code - the Mathematica was altered code in interesting ways.
Thought you might like a look - before I start butchering it further.
Quite nifty actually the code and the idea -- humans are surprisingly inventive.
No - the Mathematica code was correct - it was my copying that was wrong.
It took a while to decipher the Transpose(I)AI multiplication on the 3rd page, but now I have it - nifty way to create a thicker element with different properties. Quite smart.
"He originally developed the code in Fortran 77, which he ported to Mathematica in the '90s "
He originally developed the element in 1960's so it predates F77. There are certainly earlier versions available than a Mathematica version.
I am fairly sure that the QTSHEL thin shell routine in SAP4 from UCB is based on Felippa's shell element and so is an early publication and use of his shell element. It is a quadrilateral element using 4 triangles with a central node that is reduced out during the formulation. It basically combines a 3-node membrane triangle element and a 3-node bending triangle element to produce a 5 dof per node performance. Mid side rotation freedoms could also be included, but were not very practical.
I have used this element for many years. While it does not have the quadratic displacement field of the 8-node shells, this element is a very robust element and manages many of the areas where higher order elements can fail. It provides a predictable accuracy for reasonable mesh sizes.
The code is interesting in the use of many local variables, which avoids the matrix notation solution and minimises operations count. It is quite complex and a code I have never attempted to clean-up. However for most finite element calculations, the generation of element stiffness, stress and load matrices is a small proportion of (< 5%) of the total run time so a more concise matrix notation is preferred and much easier to audit.
Try and find the error in this almost original routine, obtained in 1978. ( I think I used a screen editor to improve the layout, which was more difficult with a card deck!)
C C CALLED BY: QTSHEL C C THIS SUBROUTINE FORMS THE PLATE BENDING STIFFNESS AND/OR THE C CONSISTENT LOAD VECTOR OF A LINEAR CURVATURE COMPATIBLE TRIANGLE C (LCCT) WITH 6, 5, 4 OR 3 NODAL POINTS C C C * * * * * * * * * * * * * * * INPUTS * * * * * * * * * * * * * * * * C C M NUMBER OF MIDPOINT DEGREES OF FREEDOM (M =3,2,1,0). C NOTE.. MIDPOINTS 4-5-6 (IF INCLUDED) ARE LOCATED ON C SIDES 2-3, 3-1 AND 1-2, RESPECTIVELY. C C KKK OPERATION FLAG C KKK LE 0 = FORM STIFFNESS MATRIX AND LOAD VECTOR. C KKK GT 0 = FORM LOAD VECTOR ONLY. C C A(I),B(I) I=1...3 PROJECTIONS OF SIDES 2-3, 3-1 AND 1-2 ONTO C X AND -Y, RESPECTIVELY. C C C(I,J) I=1...3, J=1...3 PLANE STRESS MATERIAL MATRIX. C C H(I) I=1...3 CORNER THICKNESSES (LINEAR VARIATION ASSUMED). C C PT(I) I=1...3 CORNER VALUES OF LATERAL DISTRIBUTED LOAD C (LINEAR VARIATION ASSUMED). C C BMT(I,J) I=1...3, J=1...3 INITIAL BENDING MOMENT COMPONENTS C MOM-XX (J=1), MOM-YY (J=2) AND MOM-XY (J=3) AT THE C CORNERS I=1...3 (LINEAR VARIATION ASSUMED). C C C * * * * * * * * * * * * * * * OUTPUTS * * * * * * * * * * * * * * * * C C ST(I,J) I=1..NDF, J=1..NDF WITH NDF (NUMBER OF DOF) = 9+M, IS C THE ELEMENT STIFFNESS MATRIX ASSOCIATED WITH THE NODAL C DISPLACEMENT ORDERING C W(1),RX(1),RY(1),W(2), .... RY(3),RM(1), ... RM(M) C WHERE RM(1), ... RM(M), IF M GT 0, ARE MIDPOINT C DEVIATIONS FROM NORMAL SLOPE LINEARITY C C FT(I) I=1..NDF CONSISTENT NODAL FORCE VECTOR ASSOCIATED C WITH THE NODAL DISPLACEMENT ORDERING DESCRIBED ABOVE. C C SUBROUTINE SLCCT (M,KKK) IMPLICIT REAL*8(A-H,O-Z) COMMON /TRIARG/ A(3),B(3), HMT(3), H(3), C(3,3), SMT(3,3), 1 BMT(3,3), FT(12), PX(3),PY(3),PT(3),RM(3), ST(12,12) DIMENSION P(21,12), G(21), Q(3,6), QB(3,6), T(3), U(3), HT(3), 1 TX(3), TY(3), IPERM(3), XM(3,3), XM0(3) EQUIVALENCE (CM11,C(1)),(CM12,C(2)),(CM13,C(3)),(CM22,C(5)), 1 (CM23,C(6)),(CM33,C(9)) LOGICAL NOS, FLAT DATA IPERM/2,3,1/ H0 = (H(1)+H(2)+H(3))/3. IF (H0.LE.0.) GO TO 1000 NDF = 9 + M NOS = KKK.GT.0 FLAT = (H(1).EQ.H(2)).AND.(H(2).EQ.H(3)) AREA = A(3)*B(2)-A(2)*B(3) FAC = H0**3*AREA/864. PTF = AREA/6480. T(3) = 1. DO 150 I = 1,3 J = IPERM(I) K = IPERM(J) X = A(I)**2+B(I)**2 U(I) = -(A(I)*A(J)+B(I)*B(J))/X X =DSQRT(X) Y = 2.*AREA/X HT(I) = 2.*Y TX(I) = Y*A(I)/X TY(I) = -Y*B(I)/X A1 = A(I)/AREA A2 = A(J)/AREA B1 = B(I)/AREA B2 = B(J)/AREA Q(1,I) = B1*B1 Q(2,I) = A1*A1 Q(3,I) = 2.*A1*B1 Q(1,I+3) = 2.*B1*B2 Q(2,I+3) = 2.*A1*A2 Q(3,I+3) = 2.*(A1*B2+A2*B1) DO 120 N = 1,3 120 XM(N,I) = BMT(N,I)*AREA/72. IF (FLAT) GO TO 150 DO 140 N = 1,3 L = IPERM(N) T(1) = H(N)/H0 T(2) = H(L)/H0 IF (T(1).GT.0.) XM(N,I) = XM(N,I)/T(1)**3 C1 = T(I) C2 = T(J) C3 = T(K) C4 = C2+C3 C11 = C1*C1 C23 = C2*C3 C5 = C4*(3.*C1+C4) + 6.*C11 - 2.*C23 C6 = C5 + 3.*C4*C4 - 4.*(C11+C23) QB(N,I ) = (C1*(10.*C11-3.*C23)+C4*C5)/17.5 - 2.0 140 QB(N,I+3) = (C1*(C11-2.*C23)+C4*C6)/35.0 - 1.0 150 CONTINUE DO 200 I = 1,3 J = IPERM(I) K = IPERM(J) II = 3*I JJ = 3*J KK = 3*K A1 = A(I) A2 = A(J) A3 = A(K) B1 = B(I) B2 = B(J) B3 = B(K) U1 = U(I) U2 = U(J) U3 = U(K) W1 = 1.-U1 W2 = 1.-U2 W3 = 1.-U3 B1D = B1 + B1 B2D = B2 + B2 B3D = B3 + B3 A1D = A1 + A1 A2D = A2 + A2 A3D = A3 + A3 C21 = B1-B3*U3 + TX(K) C22 = -B1D+B2*W2+B3*U3 + TX(J)-TX(K) C31 = A1-A3*U3 + TY(K) C22 = -B1D+B2*W2+B3*U3 + TX(J)-TX(K) C31 = A1-A3*U3 + TY(K) C32 = -A1D+A2*W2+A3*U3 + TY(J)-TY(K) C51 = B3*W3-B2 + TX(K) C52 = B2D-B3*W3-B1*U1 + TX(I)-TX(K) C61 = A3*W3-A2 + TY(K) C62 = A2D-A3*W3-A1*U1 + TY(I)-TY(K) C81 = B3-B2D-B2*U2 + TX(J) C82 = B1D-B3+B1*W1 + TX(I) C91 = A3-A2D-A2*U2 + TY(J) C92 = A1D-A3+A1*W1 + TY(I) P1 = PT(I)*PTF P2 = PT(J)*PTF P3 = PT(K)*PTF U37 = 7.*U3 W27 = 7.*W2 W24 = 4.*W2 U34 = 4.*U3 C1 = 54.+W27 C2 = 54.+U37 C3 = 15.+W24 C4 = 39.+U37 C5 = 39.+W27 C6 = 15.+U34 TXS = TX(J)+TX(K) TYS = TY(J)+TY(K) FT(II-2) = 6.*((90.+U37+W27)*P1+(36.+U37+W24)*P2+(36.+U34+W27)*P3) FT(II-1) = (C1*B2-C2*B3+7.*TXS)*P1 + (C3*B2-C4*B3+4.*TXS+ 1 3.*TX(K))*P2 + (C5*B2-C6*B3+4.*TXS+3.*TX(J))*P3 FT(II) = (C1*A2-C2*A3+7.*TYS)*P1 + (C3*A2-C4*A3+4.*TYS+ 1 3.*TY(K))*P2 + (C5*A2-C6*A3+4.*TYS+3.*TY(J))*P3 FT(K+9) = (7.*(P1+P2)+4.*P3)*HT(K) XM0(I) = (XM(1,I)+XM(2,I)+XM(3,I))/3. DO 200 N = 1,3 L = 6*(I-1) + N Q11 = Q(N,I) Q22 = Q(N,J) Q33 = Q(N,K) Q12 = Q(N,I+3) Q23 = Q(N,J+3) Q31 = Q(N,K+3) Q2333 = Q23-Q33 Q3133 = Q31-Q33 P(L ,II-2) = 6.*(-Q11+W2*Q33+U3*Q2333) P(L ,II-1) = C21*Q23+C22*Q33-B3D*Q12+B2D*Q31 P(L ,II ) = C31*Q23+C32*Q33-A3D*Q12+A2D*Q31 P(L ,JJ-2) = 6.*(Q22+W3*Q2333) P(L ,JJ-1) = C51*Q2333+B3D*Q22 P(L ,JJ ) = C61*Q2333+A3D*Q22 P(L ,KK-2) = 6.*(1.+U2)*Q33 P(L ,KK-1) = C81*Q33 P(L ,KK ) = C91*Q33 P(L ,I+9 ) = 0. P(L ,J+9 ) = HT(J)*Q33 P(L ,K+9 ) = HT(K)*Q2333 P(L+3 ,II-2) = 6.*(Q11+U3*Q3133) P(L+3 ,II-1) = C21*Q3133-B3D*Q11 P(L+3 ,II ) = C31*Q3133-A3D*Q11 P(L+3 ,JJ-2) = 6.*(-Q22+U1*Q33+W3*Q3133) P(L+3 ,JJ-1) = C51*Q31+C52*Q33+B3D*Q12-B1D*Q23 P(L+3 ,JJ ) = C61*Q31+C62*Q33+A3D*Q12-A1D*Q23 P(L+3 ,KK-2) = 6.*(1.+W1)*Q33 P(L+3 ,KK-1) = C82*Q33 P(L+3 ,KK ) = C92*Q33 P(L+3 ,I+9 ) = HT(I)*Q33 P(L+3 ,J+9 ) = 0. P(L+3 ,K+9 ) = HT(K)*Q3133 P(N+18,II-2) = 2.*(Q11+U3*Q12+W2*Q31) P(N+18,KK-1) = ((B1D-B2D)*Q33+C82*Q23+C81*Q31)/3. P(N+18,KK ) = ((A1D-A2D)*Q33+C92*Q23+C91*Q31)/3. 200 P(N+18,K+9 ) = HT(K)*Q12/3. 300 DO 400 J = 1,NDF DO 340 L = 1,3 II = L KK = L + 18 P3 = P(KK,J) G(KK) = 0. DO 340 N = 1,3 I = IPERM(N) JJ = II + 3 P1 = P(II,J) P2 = P(JJ,J) SUM = P1 + P2 + P3 G1 = SUM + P1 G2 = SUM + P2 G3 = SUM + P3 IF (FLAT) GO TO 320 G1 = G1 + QB(N,1)*P1 + QB(N,6)*P2 + QB(N,5)*P3 G2 = G2 + QB(N,6)*P1 + QB(N,2)*P2 + QB(N,4)*P3 G3 = G3 + QB(N,5)*P1 + QB(N,4)*P2 + QB(N,3)*P3 320 G(II) = G1 G(JJ) = G2 G(KK) = G3 + G(KK) II = II + 6 340 FT(J) = FT(J) - XM(N,L)*G1 - XM(I,L)*G2 - XM0(L)*G3 IF (NOS) GO TO 400 DO 360 N = 1,19,3 G1 = G(N) G2 = G(N+1) G3 = G(N+2) G(N) = CM11*G1 + CM12*G2 + CM13*G3 G(N+1) = CM12*G1 + CM22*G2 + CM23*G3 360 G(N+2) = CM13*G1 + CM23*G2 + CM33*G3 DO 390 I = 1,J X = 0. DO 380 N = 1,21 380 X = X + G(N)*P(N,I) X = X*FAC ST(I,J) = X 390 ST(J,I) = X 400 CONTINUE 1000 RETURN END
Very little code developed at UCB in 70's adopted Fortran 77 syntax. The following is the title block for the version of SAP4 I have. Even today, Feap shows this history and still uses the pre F90 memory management approaches. I remain very impressed by what these codes have achieved and still use an adaptation of the skyline solver from these codes and Wilson's shifted subspace eigen-solver that produces reliable results for the larger problems attempted today.
C ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** * C C SAP4 C A STRUCTURAL ANALYSIS PROGRAM C FOR STATIC AND DYNAMIC RESPONSE OF LINEAR SYSTEMS C C K.J. BATHE , E.L. WILSON , F.E. PETERSON C UNIVERSITY OF CALIFORNIA , BERKELEY C C IBM CONVERSION BY UNIVERSITY OF SOUTHERN CALIFORNIA C AUGUST, 1973 C REVISED JULY, 1974 C C PRIME CONVERSION FROM IBM C BY C LARRY J. FEESER AND WILLIAM E. BOND C DEPARTMENT OF CIVIL ENGINEERING C RENSSELAER POLYTECHNIC INSTITUTE C TROY, NEW YORK 12181 C JULY, 1978 C C ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** *