Matrix Operation - Page 2

JohnNichols · ‎03-31-2016

do k = 1 , 3
        do j = 1 , 3
            d(1) = c(1,1)*qm(1,j,k) + c(1,2)*qm(2,j,k) + c(1,3)*qm(3,j,k)
            d(2) = c(2,1)*qm(1,j,k) + c(2,2)*qm(2,j,k) + c(2,3)*qm(3,j,k)
            d(3) = c(3,1)*qm(1,j,k) + c(3,2)*qm(2,j,k) + c(3,3)*qm(3,j,k)
            do i =1,j
                kth(i,j)=kth(i,j)+qm(1,i,k)*d(1)+qm(2,i,k)*d(2)+qm(3,i,k)*d(3)
                kth(j,i)=kth(i,j)
            end do
        end do
    end do

This has got to be a matrix operation - doe it not?

jimdempseyatthecove · ‎04-06-2016

John,

The code posted to date appears to model a single node. Wouldn't a proper simulation have 100's, 1000's, (more) nodes?

If so, then any approach to vectorizing and parallelizing this code should view this from the perspective of the entire simulation model.

Parallelization should be performed at/near the outer most layer, while depending on code, vectorization can be performed either at the inner most layer .OR. at a higher layer.

An example of a higher layer vectorization: If (when) your inner most code layer is scalar (nearly/mostly/entirely) you can use the technique of consolidating multiple nodes into vectors such that 2, 4, 8, 16 nodes can be computed at the same time in the same instructions (same thread). This will require a different layout and code changes to remove and/or relocate conditional branches (e.g. error tests).

Jim Dempsey

JohnNichols · ‎04-06-2016

John:

The group of Berkeley under Powell were light years ahead of anybody else and in some areas they still are. My first boss, Geoff Rees, did his Structural Masters at UCB in late70's and brought back a huge deck of punch cards with a lot of programs. I grabbed them and got them onto a Control Data computer and used them for serious analysis of heavy mining structures in the Hunter Valley Coalfields - 4000 tonne bins etc. The programs allowed us to reduce the mass of the standard designs at the time, by a lot.

I went to a FEM conference in 1982? at Sydney Uni with my boss and he wanted to know why I kept quiet, I looked at him and said - we are making money why share it.

ULARC is still the finest structural program ever written, not because it could do everything it cannot - but what it does it does better than anything else - including ABACUS etc. I tracked the author down many years ago only to learn that he died young, I wrote to his lovely wife and told her that I considered him one of the finest engineers that ever lived. She wrote back a very nice thank you letter.

AXISHELL is another great program from that group, the author now works for Halliburton in Houston, he recently gave me a copy of his thesis. I made a lot of money from that program in the 80's.

I finally ported the programs to MICROSOFT Fortran and then got sidetracked. For reasons I do not understand PowerStation had a huge problem with these programs and in the end I just stopped - I got sick of looking for obscure errors in the code.

I now have two engineers providing very high quality experimental data and I need programs that can get into the data. We are talking about simply supported beams and thermal loading - as simple as it gets and the analysis really challenging, but fun.

So with this new data it was a toss up between ULARC and Harrison's 73 code as to where to start, I considered Drain3DX - but the effort involved is not worth it - better to start simple. I enclose a sample of the sort of stuff you can do with ULARC -- very neat - my problem is I started with the best and then had to look at the others. I do use STRAND7 - but it is a pain with the tables and one is limited on making changes. I used ABACUS for a simple problem a few years ago and the access issues are a killer - if it is not simple I do not use it.

I looked at FEAP, but prefer to use Harrison, Pardiso and FEAST and then suck the goodies out of ULARC. One does not need a lot of GUI keeping you out from the good stuff.

Benedetti in Italy is checking this stuff with Strand7, so it is well understood.

Harrison does not include a plate element, so I needed to add one to allow me to look at some recent bridge data. In order to attach it to the Harrison beams, I need 18 dof freedom in triangular elements. There are two published sources that I could see, I asked UMN for their code and it is not available, the Mathematica code looked interesting so I tried it, took a while to understand it - but it was worth the effort. Felippa supplied the working code so I could check mine and understand his underlying theory - it will take a while to simplify it - but it is good enough for the moment - I am looking at very few structures and most have simple configuration.

Now to get the plate element into the Harrison code. I will download your element and look at it - not sure but I will try.

The interesting observation is that one person did not develop these elements - a whole tribe did - a bit like it takes a village to educate a child.

Of course we are looking to take these elements into the nonlinear range - ie E varies linearly

But it is fun and frustrating and fun

John

John Campbell wrote:

John,

"He originally developed the code in Fortran 77, which he ported to Mathematica in the '90s "

He originally developed the element in 1960's so it predates F77. There are certainly earlier versions available than a Mathematica version.

I am fairly sure that the QTSHEL thin shell routine in SAP4 from UCB is based on Felippa's shell element and so is an early publication and use of his shell element. It is a quadrilateral element using 4 triangles with a central node that is reduced out during the formulation. It basically combines a 3-node membrane triangle element and a 3-node bending triangle element to produce a 5 dof per node performance. Mid side rotation freedoms could also be included, but were not very practical.

I have used this element for many years. While it does not have the quadratic displacement field of the 8-node shells, this element is a very robust element and manages many of the areas where higher order elements can fail. It provides a predictable accuracy for reasonable mesh sizes.

The code is interesting in the use of many local variables, which avoids the matrix notation solution and minimises operations count. It is quite complex and a code I have never attempted to clean-up. However for most finite element calculations, the generation of element stiffness, stress and load matrices is a small proportion of (< 5%) of the total run time so a more concise matrix notation is preferred and much easier to audit.

Try and find the error in this almost original routine, obtained in 1978. ( I think I used a screen editor to improve the layout, which was more difficult with a card deck!)

C
C     CALLED BY:  QTSHEL
C
C     THIS SUBROUTINE FORMS THE PLATE BENDING STIFFNESS AND/OR THE
C     CONSISTENT LOAD VECTOR OF A LINEAR CURVATURE COMPATIBLE TRIANGLE
C     (LCCT) WITH 6, 5, 4 OR 3 NODAL POINTS
C
C
C * * * * * * * * * * * * * * * INPUTS  * * * * * * * * * * * * * * * *
C
C      M       NUMBER OF MIDPOINT DEGREES OF FREEDOM (M =3,2,1,0).
C              NOTE.. MIDPOINTS 4-5-6 (IF INCLUDED) ARE LOCATED ON
C              SIDES 2-3, 3-1 AND 1-2, RESPECTIVELY.
C
C     KKK      OPERATION FLAG
C                KKK LE 0 = FORM STIFFNESS MATRIX AND LOAD VECTOR.
C                KKK GT 0 = FORM LOAD VECTOR ONLY.
C
C  A(I),B(I)   I=1...3   PROJECTIONS OF SIDES 2-3, 3-1 AND 1-2 ONTO
C              X AND -Y, RESPECTIVELY.
C
C   C(I,J)     I=1...3, J=1...3   PLANE STRESS MATERIAL MATRIX.
C
C    H(I)      I=1...3   CORNER THICKNESSES (LINEAR VARIATION ASSUMED).
C
C    PT(I)     I=1...3   CORNER VALUES OF LATERAL DISTRIBUTED LOAD
C              (LINEAR VARIATION ASSUMED).
C
C   BMT(I,J)   I=1...3, J=1...3   INITIAL BENDING MOMENT COMPONENTS
C              MOM-XX (J=1), MOM-YY (J=2) AND MOM-XY (J=3) AT THE
C              CORNERS I=1...3  (LINEAR VARIATION ASSUMED).
C
C
C * * * * * * * * * * * * * * * OUTPUTS * * * * * * * * * * * * * * * *
C
C   ST(I,J)    I=1..NDF, J=1..NDF  WITH NDF (NUMBER OF DOF) = 9+M, IS
C              THE ELEMENT STIFFNESS MATRIX ASSOCIATED WITH THE NODAL
C              DISPLACEMENT ORDERING
C                W(1),RX(1),RY(1),W(2), .... RY(3),RM(1), ... RM(M)
C              WHERE RM(1), ... RM(M), IF M GT 0, ARE MIDPOINT
C              DEVIATIONS FROM NORMAL SLOPE LINEARITY
C
C    FT(I)     I=1..NDF   CONSISTENT NODAL FORCE VECTOR ASSOCIATED
C              WITH THE NODAL DISPLACEMENT ORDERING DESCRIBED ABOVE.
C
C
      SUBROUTINE SLCCT (M,KKK)
      IMPLICIT REAL*8(A-H,O-Z)
      COMMON /TRIARG/ A(3),B(3), HMT(3), H(3),  C(3,3), SMT(3,3),
     1 BMT(3,3), FT(12),    PX(3),PY(3),PT(3),RM(3), ST(12,12)
      DIMENSION  P(21,12), G(21), Q(3,6), QB(3,6), T(3), U(3), HT(3),
     1 TX(3), TY(3), IPERM(3), XM(3,3), XM0(3)
      EQUIVALENCE (CM11,C(1)),(CM12,C(2)),(CM13,C(3)),(CM22,C(5)),
     1 (CM23,C(6)),(CM33,C(9))
      LOGICAL NOS, FLAT
      DATA IPERM/2,3,1/
      H0 = (H(1)+H(2)+H(3))/3.
      IF (H0.LE.0.)  GO TO 1000
      NDF = 9 + M
      NOS = KKK.GT.0
      FLAT = (H(1).EQ.H(2)).AND.(H(2).EQ.H(3))
      AREA = A(3)*B(2)-A(2)*B(3)
      FAC = H0**3*AREA/864.
      PTF = AREA/6480.
      T(3) = 1.
      DO 150  I = 1,3
      J = IPERM(I)
      K = IPERM(J)
      X = A(I)**2+B(I)**2
      U(I) = -(A(I)*A(J)+B(I)*B(J))/X
      X =DSQRT(X)
      Y = 2.*AREA/X
      HT(I) =  2.*Y
      TX(I) =  Y*A(I)/X
      TY(I) = -Y*B(I)/X
      A1 = A(I)/AREA
      A2 = A(J)/AREA
      B1 = B(I)/AREA
      B2 = B(J)/AREA
      Q(1,I)   = B1*B1
      Q(2,I)   = A1*A1
      Q(3,I)   = 2.*A1*B1
      Q(1,I+3) = 2.*B1*B2
      Q(2,I+3) = 2.*A1*A2
      Q(3,I+3) = 2.*(A1*B2+A2*B1)
      DO 120  N = 1,3
  120 XM(N,I) = BMT(N,I)*AREA/72.
      IF (FLAT)  GO TO 150
      DO 140  N = 1,3
      L = IPERM(N)
      T(1) = H(N)/H0
      T(2) = H(L)/H0
      IF (T(1).GT.0.)  XM(N,I) = XM(N,I)/T(1)**3
      C1 = T(I)
      C2 = T(J)
      C3 = T(K)
      C4 = C2+C3
      C11 = C1*C1
      C23 = C2*C3
      C5 = C4*(3.*C1+C4) + 6.*C11 - 2.*C23
      C6 = C5 + 3.*C4*C4 - 4.*(C11+C23)
      QB(N,I  ) = (C1*(10.*C11-3.*C23)+C4*C5)/17.5 - 2.0
  140 QB(N,I+3) = (C1*(C11-2.*C23)+C4*C6)/35.0 - 1.0
  150 CONTINUE
      DO 200  I = 1,3
      J = IPERM(I)
      K = IPERM(J)
      II = 3*I
      JJ = 3*J
      KK = 3*K
      A1 = A(I)
      A2 = A(J)
      A3 = A(K)
      B1 = B(I)
      B2 = B(J)
      B3 = B(K)
      U1 = U(I)
      U2 = U(J)
      U3 = U(K)
      W1 = 1.-U1
      W2 = 1.-U2
      W3 = 1.-U3
      B1D = B1 + B1
      B2D = B2 + B2
      B3D = B3 + B3
      A1D = A1 + A1
      A2D = A2 + A2
      A3D = A3 + A3
      C21 = B1-B3*U3         + TX(K)
      C22 = -B1D+B2*W2+B3*U3 + TX(J)-TX(K)
      C31 = A1-A3*U3         + TY(K)
      C22 = -B1D+B2*W2+B3*U3 + TX(J)-TX(K)
      C31 = A1-A3*U3         + TY(K)
      C32 = -A1D+A2*W2+A3*U3 + TY(J)-TY(K)
      C51 = B3*W3-B2         + TX(K)
      C52 = B2D-B3*W3-B1*U1  + TX(I)-TX(K)
      C61 = A3*W3-A2         + TY(K)
      C62 = A2D-A3*W3-A1*U1  + TY(I)-TY(K)
      C81 = B3-B2D-B2*U2     + TX(J)
      C82 = B1D-B3+B1*W1     + TX(I)
      C91 = A3-A2D-A2*U2     + TY(J)
      C92 = A1D-A3+A1*W1     + TY(I)
      P1 = PT(I)*PTF
      P2 = PT(J)*PTF
      P3 = PT(K)*PTF
      U37 = 7.*U3
      W27 = 7.*W2
      W24 = 4.*W2
      U34 = 4.*U3
      C1 = 54.+W27
      C2 = 54.+U37
      C3 = 15.+W24
      C4 = 39.+U37
      C5 = 39.+W27
      C6 = 15.+U34
      TXS = TX(J)+TX(K)
      TYS = TY(J)+TY(K)
      FT(II-2) = 6.*((90.+U37+W27)*P1+(36.+U37+W24)*P2+(36.+U34+W27)*P3)
      FT(II-1) = (C1*B2-C2*B3+7.*TXS)*P1 + (C3*B2-C4*B3+4.*TXS+
     1 3.*TX(K))*P2 + (C5*B2-C6*B3+4.*TXS+3.*TX(J))*P3
      FT(II)   = (C1*A2-C2*A3+7.*TYS)*P1 + (C3*A2-C4*A3+4.*TYS+
     1 3.*TY(K))*P2 + (C5*A2-C6*A3+4.*TYS+3.*TY(J))*P3
      FT(K+9)  = (7.*(P1+P2)+4.*P3)*HT(K)
      XM0(I) = (XM(1,I)+XM(2,I)+XM(3,I))/3.
      DO 200  N = 1,3
      L = 6*(I-1) + N
      Q11 = Q(N,I)
      Q22 = Q(N,J)
      Q33 = Q(N,K)
      Q12 = Q(N,I+3)
      Q23 = Q(N,J+3)
      Q31 = Q(N,K+3)
      Q2333 = Q23-Q33
      Q3133 = Q31-Q33
      P(L   ,II-2) = 6.*(-Q11+W2*Q33+U3*Q2333)
      P(L   ,II-1) = C21*Q23+C22*Q33-B3D*Q12+B2D*Q31
      P(L   ,II  ) = C31*Q23+C32*Q33-A3D*Q12+A2D*Q31
      P(L   ,JJ-2) = 6.*(Q22+W3*Q2333)
      P(L   ,JJ-1) = C51*Q2333+B3D*Q22
      P(L   ,JJ  ) = C61*Q2333+A3D*Q22
      P(L   ,KK-2) = 6.*(1.+U2)*Q33
      P(L   ,KK-1) = C81*Q33
      P(L   ,KK  ) = C91*Q33
      P(L   ,I+9 ) = 0.
      P(L   ,J+9 ) = HT(J)*Q33
      P(L   ,K+9 ) = HT(K)*Q2333
      P(L+3 ,II-2) = 6.*(Q11+U3*Q3133)
      P(L+3 ,II-1) = C21*Q3133-B3D*Q11
      P(L+3 ,II  ) = C31*Q3133-A3D*Q11
      P(L+3 ,JJ-2) = 6.*(-Q22+U1*Q33+W3*Q3133)
      P(L+3 ,JJ-1) = C51*Q31+C52*Q33+B3D*Q12-B1D*Q23
      P(L+3 ,JJ  ) = C61*Q31+C62*Q33+A3D*Q12-A1D*Q23
      P(L+3 ,KK-2) = 6.*(1.+W1)*Q33
      P(L+3 ,KK-1) = C82*Q33
      P(L+3 ,KK  ) = C92*Q33
      P(L+3 ,I+9 ) = HT(I)*Q33
      P(L+3 ,J+9 ) = 0.
      P(L+3 ,K+9 ) = HT(K)*Q3133
      P(N+18,II-2) = 2.*(Q11+U3*Q12+W2*Q31)
      P(N+18,KK-1) = ((B1D-B2D)*Q33+C82*Q23+C81*Q31)/3.
      P(N+18,KK  ) = ((A1D-A2D)*Q33+C92*Q23+C91*Q31)/3.
  200 P(N+18,K+9 ) = HT(K)*Q12/3.
  300 DO 400  J = 1,NDF
      DO 340  L = 1,3
      II = L
      KK = L + 18
      P3 = P(KK,J)
      G(KK) = 0.
      DO 340  N = 1,3
      I = IPERM(N)
      JJ = II + 3
      P1 = P(II,J)
      P2 = P(JJ,J)
      SUM = P1 + P2 + P3
      G1 = SUM + P1
      G2 = SUM + P2
      G3 = SUM + P3
      IF (FLAT)  GO TO 320
      G1 = G1 + QB(N,1)*P1 + QB(N,6)*P2 + QB(N,5)*P3
      G2 = G2 + QB(N,6)*P1 + QB(N,2)*P2 + QB(N,4)*P3
      G3 = G3 + QB(N,5)*P1 + QB(N,4)*P2 + QB(N,3)*P3
  320 G(II) = G1
      G(JJ) = G2
      G(KK) = G3 + G(KK)
      II = II + 6
  340 FT(J) = FT(J) - XM(N,L)*G1 - XM(I,L)*G2 - XM0(L)*G3
      IF (NOS)  GO TO 400
      DO 360 N = 1,19,3
      G1 = G(N)
      G2 = G(N+1)
      G3 = G(N+2)
      G(N)   = CM11*G1 + CM12*G2 + CM13*G3
      G(N+1) = CM12*G1 + CM22*G2 + CM23*G3
  360 G(N+2) = CM13*G1 + CM23*G2 + CM33*G3
      DO 390  I = 1,J
      X = 0.
      DO 380  N = 1,21
  380 X = X + G(N)*P(N,I)
      X = X*FAC
      ST(I,J) = X
  390 ST(J,I) = X
  400 CONTINUE
 1000 RETURN
      END

Very little code developed at UCB in 70's adopted Fortran 77 syntax. The following is the title block for the version of SAP4 I have. Even today, Feap shows this history and still uses the pre F90 memory management approaches. I remain very impressed by what these codes have achieved and still use an adaptation of the skyline solver from these codes and Wilson's shifted subspace eigen-solver that produces reliable results for the larger problems attempted today.

C  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  *
C
C                                 SAP4
C                     A STRUCTURAL ANALYSIS PROGRAM
C           FOR STATIC AND DYNAMIC RESPONSE OF LINEAR SYSTEMS
C
C                K.J. BATHE , E.L. WILSON , F.E. PETERSON
C                  UNIVERSITY OF CALIFORNIA , BERKELEY
C
C          IBM CONVERSION BY UNIVERSITY OF SOUTHERN CALIFORNIA
C                            AUGUST, 1973
C                         REVISED JULY, 1974
C
C                     PRIME CONVERSION FROM IBM
C                                 BY
C                LARRY J. FEESER AND WILLIAM E. BOND
C                  DEPARTMENT OF CIVIL ENGINEERING
C                  RENSSELAER POLYTECHNIC INSTITUTE
C                       TROY, NEW YORK  12181
C                              JULY, 1978
C
C  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  **  *

JohnNichols · ‎04-06-2016

John

Line 057 comes straight from the Felippa code - statistically someone borrowed it - to coincidental otherwise

John

JohnNichols · ‎04-06-2016

/
      H0 = (H(1)+H(2)+H(3))/3.
      IF (H0.LE.0.)  GO TO 1000

I am not sure what you mean by errors -- there are two types of errors and then else 3.

1. logic errors

2. coding errors

3. just plain dumb stuff

If H comes in all zero you get an immediate return without a message - dumb stuff -- so can you give me some idea of the common data that gets passed in by way of an example and then we can try to find the 1 or 2 or both

John

TimP · ‎04-06-2016

Was the code was written so as to run only with compile switches such as /Qsave /Qzero? Do those make a difference by restricting the possible values for undefined variables? It appears it may have been written for a specific dialect of IBM Fortran 66. It surely wouldn't have run with the Fortran compilers I had back then. Don't know which of your 3 error types that would be.

JohnNichols · ‎04-06-2016

Jim:

Yes this will have about 100 nodes is my guess based on the size of the problem. This is just the routine to create the stiffness matrix, I needed a special one to use with Harrison's code without introducing a lot of complications - I am trying to see if I can do a "simple" analysis.

Now that I have an element that works I can include 100's in the program - that is next.

John

jimdempseyatthecove wrote:

John,

The code posted to date appears to model a single node. Wouldn't a proper simulation have 100's, 1000's, (more) nodes?

If so, then any approach to vectorizing and parallelizing this code should view this from the perspective of the entire simulation model.

Parallelization should be performed at/near the outer most layer, while depending on code, vectorization can be performed either at the inner most layer .OR. at a higher layer.

An example of a higher layer vectorization: If (when) your inner most code layer is scalar (nearly/mostly/entirely) you can use the technique of consolidating multiple nodes into vectors such that 2, 4, 8, 16 nodes can be computed at the same time in the same instructions (same thread). This will require a different layout and code changes to remove and/or relocate conditional branches (e.g. error tests).

Jim Dempsey

JohnNichols · ‎04-06-2016

Tim P. wrote:

Was the code was written so as to run only with compile switches such as /Qsave /Qzero? Do those make a difference by restricting the possible values for undefined variables? It appears it may have been written for a specific dialect of IBM Fortran 66. It surely wouldn't have run with the Fortran compilers I had back then. Don't know which of your 3 error types that would be.

Logic errors :: if a+b = c and c+d = f then a+b+d should equal f - if not logic error

dumb errors - anything that is not a logic error

John

JohnNichols · ‎04-06-2016

The code compiled on the Intel Complier without error - I am assuming we are looking for logic errors or coding errors that pass the compiler but give incorrect answers.

John

TimP · ‎04-06-2016

The compiler won't try to check for default save or zero issues unless you so specify.

John_Campbell · ‎04-06-2016

I apologise for the poorly worded comment: what I was trying to say when I wrote "Try and find the error in this almost original routine", I was alluding to the problem when writing this style of code and you need to find an error. The risk of a typing error with so many similar variable names must be high. I wonder how many attempts there were in punching all those original cards correctly. Cleaning out this code looks too hard and it works!

It appears that most of the matrix multiplication has been written out explicitly. I much prefer a cleaner layout by defining the matrix definitions then do the '.. calculation, which can take account of the zeros in . In most cases, this approach results in a less than 1% increase in total run time and allows for easier changes to the element definition.

There are some strange constants used in this code that I wonder should have been converted, eg 17.5 on line 102. Anywhere where a multiple of 32 (ft/sec^2) is involved can be a problem and there are others. John, 4000 would be a psi E value and I always found the imperial/US unit system a nightmare, especially for mass units; even psi for E is a mess. What is a poundal and do you use them with kips !

Tim, I think this code history was CDC Fortran > IBM Fortran > Prime Fortran, although there probably was something before the CDC. I always considered default zero to be a coding error and save was identified at F77, so I would have removed any of these requirements from the listing I have when I got it in 1978. Typically with these codes, zero was not an issue and if save was required, the variable would have gone into (another!) named common. When converting from CDC, the biggest hassles were no generic intrinsics and how many characters were stored in a real or integer. There still exist some strange attempts at switching between integer and double precision.

John

John_Campbell · ‎04-06-2016

Jim,

This old style of code is difficult to put into an OMP region, as there is "extensive" use of COMMON for transferring information between routines. COMMON is also used to store local variables when local variables were allocated statically. While the array ST is generated for a single triangle, there is another COMMON /EM/ which stores the element information accumulated for a single element. This common is shared throughout the program for all element types. There are also some parameters for system accumulation mixed in these shared COMMON.

While it may be possible to make this COMMON private, the implications of managing COMMONs for what is local and what is global, makes the application of OMP a challenge and could involve reworking of code that is best left as-is. As this section of the code typically does not contribute a lot to run time, I have given up on the elements and applied OMP to other areas. I converted the shifted subspace eigen solver to OMP in about 5 minutes, by applying !$OMP to solving for the multiple subspace load vectors. Unfortunately the skyline solver took about 4 months and I have stalled on the time step solver, which only uses a single load vector, updated for many solution steps.

In general I have found there is some computation that is very suited to !$OMP but most others where there is too much effort or restructuring required. However, using vector instructions at the inner loop level is much more general and easier to apply, often requiring no more than selecting a compiler option.

John

jimdempseyatthecove · ‎04-07-2016

John,

Several years ago I converted an F77 simulation program (13 projects, 750 files, over 500,000 lines) to use OpenMP. It too was built with data passed in shared commons. To implement this, I used a combination of two techniques:

a) Thread Private data for temporary data that reside in the context of a thread
b) A context pointer to the various objects. (this could be an index, but I had requirements for using a pointer)

From what I've seen of your code, it is principally dealing with one node (triad of points). I would not place !$OMP regions within this code, rather you would make it OMP-safe and instead use OpenMP to iterate you 100 or so nodes (in a thread-safe manner).

Where you have to be careful (thread safeness) is at the juncture of nodes. You cannot accumulate forces or movements into a point in a manner such that more than one thread can be doing this at the same time. While you can protect these occurrences with !$OMP CRITICAL or !$OMP ATOMIC, you want to keep these uses to a minimal (if not removed entirely). This is done by partitioning the work or by using reduction syntax in OpenMP, or by accumulating the point forces (etc) into thread independent areas which are reduced at a later phase in the state machine.

Jim Dempsey

JohnNichols · ‎04-07-2016

Jim:

I started to add in the looping so I could enter multiple elements and do some real analysis -- I am sticking with the weird Felippa element data until I am sure I have not stuffed anything. I am checking I have th 2nd element correct tonight -

I put each element data into a TYPE I created called triangle -- it duplicates node data - but aside from that it simplifies the read and assembly. also makes it easy to pull an element out and not stuff with a whole lot of node and element data separately

I can assemble each individual stiffness matrix 18 by 18 for each element and stick it in the type as well and then store the lot - before I create the global stiffness matrix -- if I have the skills I should be able to go from the small ones to the PARDISO format in a single largish step.

I am using decimal for the equivalent binary to add restraints at the nodes 0 is not restrained 63 is fully restrained - AutoCAD does similar thing.

Is this amenable to parallel or am I wasting my time

John

jimdempseyatthecove · ‎04-08-2016

John,

Looks like you are on your way. I do have a few comments:

Remove nt (number of triangles)
Move Triangles to module Triangle
Read in Triangls from your input data file
Change sTrishells from fixed size array to allocatable array
Allocate sTrishells to newly found Triangles

Place the body of the read in loop into a subroutine inside Triangle.
modify your read in loop to use subroutine

do I = 1,Triangles
  call read_triangle(sTriShells(I))
end do

Note, the read_triangles has dummy argument TriShell of type(Triangle)

You can add error handling (either in subroutine or pass out status from read_triangle)

Place the body of your triangle process loop inside a subroutine and place into Triangle module.
Change your process loop to

do k=1,Triangles
  call process_triangle(sTriShells(k))
end do

The above essentially removes the pointer TriShell and replaces it with dummy argument TriShell. The bulk of your code remains unchanged (placement changes).

You want to future-proof the code. You do not want to be tied down to a fixed number of points. You have allocatables now, put them to use.

After you get the above working, you can then investigate how you can parallelize process loop. This will not be as trivial as it first appears. We can address this in later postings.

Jim Dempsey

JohnNichols · ‎04-08-2016

Jim:

Been at it all day - just finished the last changes - I think I have it correct --

very dirty code - I need to house clean a bit

also added some printout to make sure it is all staying together.

I added a SM Save routine as I start to work out the construct the array to go into PARDISO step - lot of fun.

Thanks for the help,

John

JohnNichols · ‎04-08-2016

jim

You can add error handling (either in subroutine or pass out status from read_triangle)

Got to still do this and make sure I check the input

John

jimdempseyatthecove · ‎04-09-2016

John,

That looks much cleaner now.

I do not fully understand the code, nor do I need to. It looks as if the routines to date effectively compute the stiffness. What I did not see in the data structures (which I may have overlooked), are the variables holding external force values, and potentially velocity and acceleration (and mass at each of the three points). These would be necessary if you were to simulate the model over time. The would be other properties to add as you develop the model. Do you intend on integrating this over time?

Jim Dempsey

JohnNichols · ‎04-09-2016

Jim:

We use a very sensitive accelerometer and measure the FFT data to 1000 Hz - so at the moment I am trying to match the modal elements and not the amplitude. We also use thermal loading as our preferred live load. We prefer quantity at high resolution with low amplitude than short burst of large amplitude for our work. I want this plate element to use with the Harrison Program, which does bridge beams very well in the modal sense, but I had a problem as soon as the experimental team moved onto connected beams with a slab. I am trying to avoid getting tied into a program like Stand7 because you cannot play that well with the internals. So my goal at the moment is static analysis -- eigenvalues and then later in the summer - time series. If you cannot get the modes correct then the time series is a waste of time. Add in the IP issues and it gets interesting.

So at the moment I am setting up the full stiffness array to send to Pardiso - the loading is a simple vector and the mass is not hard. The real issue is an element where we can vary the stiffness matrix slowly. And get the modes accurately to 1000 Hz and then match the real data.

So the short answer is no - the longer answer is yes, but I am just going in little steps. The experimental team throw in new data all the time so it gets quite interesting.

John

JohnNichols · ‎04-09-2016

Jim:

I got the stiffness matrix to assemble, now I need to check I did not stuff up the assemble.

sk(a1:b1,a2:b2) = sk(a1:b1,a2:b2) + asm(1:6,1:6,counter(h1))

One question - is the + operator overloaded for matrix addition or do I have to do it the hard way. It is reasonable easy to break the 18 by 18 dof matrixes into 6 by 6 blocks and then assemble the needed blocks.

I can now put the PARDISO solver to it and see if it will solve.

If we have Ax=b and we reduce A for all the x's that are zeros for the boundary conditions, one ends up with a lot of wasted time moving from A to the reduced A - it would be lot easier to just take a full size A - invert it and set the right x's to zero, before solving for the unknown x's. I will have to get out my pure math book and see if that would work, the density of edge points is so small it is a huge effort to take out the zeros for the problem compared to just inverting a slightly bigger array. One can then stay with simpler operations -

There are 64 different variations on the restraint permutations from 000000 to 111111 and allowing for each one is tedious - with the problems the team faces most are either 000000, 111000 or 1111111, and so far I have coded the first and last.

Happy coding

John

jimdempseyatthecove · ‎04-10-2016

The matrix + operator will either

a) add a scalar to all elements
b) add two matrices of same shape and size

For your above statement to work, b1 would have to be a1+5, and b2 would have to be a2+5, thus specifying (6,6) arrays. The asm component would then reference the counter(h1)'th 6x6 array inside asm.

As for reducing unnecessary moves, consider making a subroutine out of your inner code (post move), then call with the appropriate subsection of the array by reference. Placing the reference on stack and making call may be faster than copy-in, code, copy-out.

As for reducing the number of permutations, I'd have to see the code.

Jim Dempsey

jimdempseyatthecove · ‎04-10-2016

John,

It would be helpful if you were to include the input data files. With these files, the suggestions we make could be tested prior to posting back here.

In SM.F90, subroutine SM3MHFF, line 2105, you have a do 2500 loop that has no exit. Did you miss a line e.g.:

IF(ConvergedExpression) exit

Also, that same loop is using j as an index but j is uninitialized. It would seem that the

do 2500

should be

do 2500 j=1,3

Jim Dempsey