proformance issue for writing back to thread local matrix

fenglai · ‎06-14-2017

Hello Everyone,

I have a scientific program that computes integrals, then combined the integral with the input matrix to form the result and finally write the result to the output matrix. I found that the performance downgrades significantly (slow down for 8-10 times) when the result is writing back to the output matrix. However, the output matrix is private for the given thread, so it should not be false sharing?? I use the Intel® Xeon Phi™ coprocessors 5110P.

Here is example of a piece of code:

aKetExRho is the output matrix, it's memory also alligned to 64 bit and it's created inside the same thread in the upper level. The function of getPtr is to get the proper pointer to write the result back
abcd is the result raw integral, aBraDenPhi contains the input data to combined with the raw integral to form the result.

__attribute__((aligned(64))) Double abcd[36];
__attribute__((aligned(64))) Double aBraDenPhi[6];

   Double* aKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,aKetAtomBlockExRho);
   for(UInt j=0; j<6; j++) {
       const Double* abcd_ptr = &abcd[j*6];
       Double result = ZERO;
       result += aBraDenPhi[0]*abcd_ptr[0];
       result += aBraDenPhi[1]*abcd_ptr[1];
       result += aBraDenPhi[2]*abcd_ptr[2];
       result += aBraDenPhi[3]*abcd_ptr[3];
       result += aBraDenPhi[4]*abcd_ptr[4];
       result += aBraDenPhi[5]*abcd_ptr[5];
       aKetExRho += -1.0E0*result;
     }

I found if I comment out the line of " aKetExRho += -1.0E0*result;" the performance increases significantly. However, the output matrix is also private to the thread. How can I solve this problem?

Thank you,

Phoenix

Gregg_S_Intel · ‎06-14-2017

When you comment out that line, the compiler flags the whole loop as dead code and skips it.

fenglai · ‎06-14-2017

Hi Gregg,

Yeah I understand your point. I do not know what's the reason that why it slow down the code. Because the output matrix is created inside the thread so it should be thread local. Why when it's passed into the function then it becomes the trouble maker to significantly slow down the performance?

Is there any possibility that this is the false sharing? This matrix is created locally inside the thread, however the memory corresponding to the matrix is allocated from heap; so there's still a chance to be false sharing?

Thank you for your suggestion!

Phoenix

Gregg_S_Intel · ‎06-14-2017

I really suspect this is a spurious result from dead code elimination. It's easy to fall into this trap. It means the code is not actually faster, it did not get executed at all.

fenglai · ‎06-14-2017

Hi Gregg,

I just posted the performance related codes. Actually before this piece there's large portion of codes. let me pull out the whole code for you to have a look. I do not know whether the compiler will skip "the whole function" because it does not write the result back?

    //
    // declare the variables as result of VRR process
    //
    __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_G_S[15];
    __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_F_S[10];
    __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_D_S[6];

// declare the result and temp density block array
// cartShellBlockDen stores the shell pair block density matrix for Coulomb
// cartShellBlockJMtrx stores the shell pair block result for Coulomb matrix
// aBraDenPhi etc. stores the bra side shell block denPhi value per each grid
// because we use static array, we need to declare it even we may not use it
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double abcd[36];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double cartShellBlockDen[36];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double cartShellBlockJMtrx[36];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double aBraDenPhi[6];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double aKetDenPhi[6];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double bBraDenPhi[6];
__attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double bKetDenPhi[6];

// generate the distance square between the two centers
const Double* cenA = atomshell_phi::getCenter(rowAtomShell);
const Double* cenB = atomshell_phi::getCenter(colAtomShell);
Double cenABX      = cenA[0]-cenB[0];
Double cenABY      = cenA[1]-cenB[1];
Double cenABZ      = cenA[2]-cenB[2];
Double AB2         = cenABX*cenABX+cenABY*cenABY+cenABZ*cenABZ;

// now let's loop over the shell pairs belong to the same angular momentum code
for(UInt iSP=0; iSP<nSigSP; iSP++) {
    const CompactSigShellPairInfor& sigSPInfor = compactSigSPVec[sigSPOffset+iSP];
    UInt iLocShell = compactsigshellpairinfor::getRowShellIndex(sigSPInfor);
    UInt jLocShell = compactsigshellpairinfor::getColShellIndex(sigSPInfor);
    UInt iGloShell = atomshell_phi::getGlobalShellOffset(rowAtomShell,iLocShell);
    UInt jGloShell = atomshell_phi::getGlobalShellOffset(colAtomShell,jLocShell);

    // now get the input data for esp
    // the input data will be inversed if the inverseshell is true
    // so that to make the information match the esp integral row/col situation
    UInt rowLocBasOffset = atomshell_phi::getLocCarBasOffset(rowAtomShell,iLocShell);
    UInt colLocBasOffset = atomshell_phi::getLocCarBasOffset(colAtomShell,jLocShell);
    UInt inp             = atomshell_phi::getNPrim(rowAtomShell,iLocShell);
    UInt jnp             = atomshell_phi::getNPrim(colAtomShell,jLocShell);
    const Double* ic     = atomshell_phi::getCoe(rowAtomShell,iLocShell);
    const Double* jc     = atomshell_phi::getCoe(colAtomShell,jLocShell);
    const Double* ie     = atomshell_phi::getExp(rowAtomShell,iLocShell);
    const Double* je     = atomshell_phi::getExp(colAtomShell,jLocShell);
    const Double* A      = atomshell_phi::getCenter(rowAtomShell);
    const Double* B      = atomshell_phi::getCenter(colAtomShell);

    // if we do Coulomb part, now we need to get the block density matrix out
    // we need to consider the case that whether the two shells are inversed
    // also set the pMax value for Coulomb so that to determine the significance
    Double jPMax0 = ONE;
    if (integraljobs::doJ(intJob)) {

      // clear the cartShellBlockJMtrx
      for(UInt j=0; j<6; j++) {
        Double* cartShellBlockJMtrx_ptr = &cartShellBlockJMtrx[j*6];
        cartShellBlockJMtrx_ptr[0] = 0.0E0;
        cartShellBlockJMtrx_ptr[1] = 0.0E0;
        cartShellBlockJMtrx_ptr[2] = 0.0E0;
        cartShellBlockJMtrx_ptr[3] = 0.0E0;
        cartShellBlockJMtrx_ptr[4] = 0.0E0;
        cartShellBlockJMtrx_ptr[5] = 0.0E0;
      }

      // read in density matrix block for Coulomb
      for(UInt j=0; j<6; j++) {
        const Double* cartAtomBlockDen_ptr = matrix_phi::getPtr(rowLocBasOffset,colLocBasOffset+j,cartAtomBlockDen);
        Double* cartShellBlockDen_ptr = &cartShellBlockDen[j*6];
        cartShellBlockDen_ptr[0] = cartAtomBlockDen_ptr[0];
        cartShellBlockDen_ptr[1] = cartAtomBlockDen_ptr[1];
        cartShellBlockDen_ptr[2] = cartAtomBlockDen_ptr[2];
        cartShellBlockDen_ptr[3] = cartAtomBlockDen_ptr[3];
        cartShellBlockDen_ptr[4] = cartAtomBlockDen_ptr[4];
        cartShellBlockDen_ptr[5] = cartAtomBlockDen_ptr[5];
      }

      // do the screening test to get the pMax value for coulomb
      for(UInt i=0; i<36; i++) {
        Double v = fabs(cartShellBlockDen);
        if (v>jPMax0) jPMax0 = v;
      }
    }

    // if we are doing exchange, we may also need to switch the input and output matrix
    const Mtrx_Phi* aBraAtomBlockDenPhi_ptr = &aBraAtomBlockDenPhi;
    const Mtrx_Phi* bBraAtomBlockDenPhi_ptr = &bBraAtomBlockDenPhi;
    const Mtrx_Phi* aKetAtomBlockDenPhi_ptr = &aKetAtomBlockDenPhi;
    const Mtrx_Phi* bKetAtomBlockDenPhi_ptr = &bKetAtomBlockDenPhi;
    Mtrx_Phi* aBraAtomBlockExRho_ptr = &aBraAtomBlockExRho;
    Mtrx_Phi* bBraAtomBlockExRho_ptr = &bBraAtomBlockExRho;
    Mtrx_Phi* aKetAtomBlockExRho_ptr = &aKetAtomBlockExRho;
    Mtrx_Phi* bKetAtomBlockExRho_ptr = &bKetAtomBlockExRho;

// loop over grid points
for(UInt iGrid=0; iGrid<nGrids; iGrid++) {

      SQ_ESP_G_S[0] = 0.0E0;
      SQ_ESP_G_S[1] = 0.0E0;
      SQ_ESP_G_S[2] = 0.0E0;
      SQ_ESP_G_S[3] = 0.0E0;
      SQ_ESP_G_S[4] = 0.0E0;
      SQ_ESP_G_S[5] = 0.0E0;
      SQ_ESP_G_S[6] = 0.0E0;
      SQ_ESP_G_S[7] = 0.0E0;
      SQ_ESP_G_S[8] = 0.0E0;
      SQ_ESP_G_S[9] = 0.0E0;
      SQ_ESP_G_S[10] = 0.0E0;
      SQ_ESP_G_S[11] = 0.0E0;
      SQ_ESP_G_S[12] = 0.0E0;
      SQ_ESP_G_S[13] = 0.0E0;
      SQ_ESP_G_S[14] = 0.0E0;
      SQ_ESP_F_S[0] = 0.0E0;
      SQ_ESP_F_S[1] = 0.0E0;
      SQ_ESP_F_S[2] = 0.0E0;
      SQ_ESP_F_S[3] = 0.0E0;
      SQ_ESP_F_S[4] = 0.0E0;
      SQ_ESP_F_S[5] = 0.0E0;
      SQ_ESP_F_S[6] = 0.0E0;
      SQ_ESP_F_S[7] = 0.0E0;
      SQ_ESP_F_S[8] = 0.0E0;
      SQ_ESP_F_S[9] = 0.0E0;
      SQ_ESP_D_S[0] = 0.0E0;
      SQ_ESP_D_S[1] = 0.0E0;
      SQ_ESP_D_S[2] = 0.0E0;
      SQ_ESP_D_S[3] = 0.0E0;
      SQ_ESP_D_S[4] = 0.0E0;
      SQ_ESP_D_S[5] = 0.0E0;
      // set the final pMax for Coulomb matrix
      Double jPMax = jPMax0;
      if (integraljobs::doJ(intJob)) {
        Double wRhoVal = fabs(wRho[iGrid]);
        if (wRhoVal>jPMax) jPMax = wRhoVal;
      }

      // set the exchange density block for this grid
      Double kPMax = ONE;
      if (integraljobs::doK(intJob)) {

        // get the bra/ket side density block per the given grid point
        const Double* aBraAtomBlockDenPhi_ptr_pgrid = matrix_phi::getPtr(rowLocBasOffset,iGrid,*aBraAtomBlockDenPhi_ptr);
        aBraDenPhi[0] = aBraAtomBlockDenPhi_ptr_pgrid[0];
        aBraDenPhi[1] = aBraAtomBlockDenPhi_ptr_pgrid[1];
        aBraDenPhi[2] = aBraAtomBlockDenPhi_ptr_pgrid[2];
        aBraDenPhi[3] = aBraAtomBlockDenPhi_ptr_pgrid[3];
        aBraDenPhi[4] = aBraAtomBlockDenPhi_ptr_pgrid[4];
        aBraDenPhi[5] = aBraAtomBlockDenPhi_ptr_pgrid[5];

        // get the screening Pmax for the density matrix block
        for(UInt i=0; i<6; i++) {
          Double v = fabs(aBraDenPhi);
          if (v>kPMax) kPMax = v;
        }

if (iGloShell != jGloShell) {

          // get the bra/ket side density block per the given grid point
          const Double* aKetAtomBlockDenPhi_ptr_pgrid = matrix_phi::getPtr(colLocBasOffset,iGrid,*aKetAtomBlockDenPhi_ptr);
          aKetDenPhi[0] = aKetAtomBlockDenPhi_ptr_pgrid[0];
          aKetDenPhi[1] = aKetAtomBlockDenPhi_ptr_pgrid[1];
          aKetDenPhi[2] = aKetAtomBlockDenPhi_ptr_pgrid[2];
          aKetDenPhi[3] = aKetAtomBlockDenPhi_ptr_pgrid[3];
          aKetDenPhi[4] = aKetAtomBlockDenPhi_ptr_pgrid[4];
          aKetDenPhi[5] = aKetAtomBlockDenPhi_ptr_pgrid[5];

          // get the screening Pmax for the density matrix block
          for(UInt i=0; i<6; i++) {
            Double v = fabs(aKetDenPhi);
            if (v>kPMax) kPMax = v;
          }
        }

if (nSpin == 2) {

          // get the bra/ket side density block per the given grid point
          const Double* bBraAtomBlockDenPhi_ptr_pgrid = matrix_phi::getPtr(rowLocBasOffset,iGrid,*bBraAtomBlockDenPhi_ptr);
          bBraDenPhi[0] = bBraAtomBlockDenPhi_ptr_pgrid[0];
          bBraDenPhi[1] = bBraAtomBlockDenPhi_ptr_pgrid[1];
          bBraDenPhi[2] = bBraAtomBlockDenPhi_ptr_pgrid[2];
          bBraDenPhi[3] = bBraAtomBlockDenPhi_ptr_pgrid[3];
          bBraDenPhi[4] = bBraAtomBlockDenPhi_ptr_pgrid[4];
          bBraDenPhi[5] = bBraAtomBlockDenPhi_ptr_pgrid[5];

          // get the screening Pmax for the density matrix block
          for(UInt i=0; i<6; i++) {
            Double v = fabs(bBraDenPhi);
            if (v>kPMax) kPMax = v;
          }
          if (iGloShell != jGloShell) {

            // get the bra/ket side density block per the given grid point
            const Double* bKetAtomBlockDenPhi_ptr_pgrid = matrix_phi::getPtr(colLocBasOffset,iGrid,*bKetAtomBlockDenPhi_ptr);
            bKetDenPhi[0] = bKetAtomBlockDenPhi_ptr_pgrid[0];
            bKetDenPhi[1] = bKetAtomBlockDenPhi_ptr_pgrid[1];
            bKetDenPhi[2] = bKetAtomBlockDenPhi_ptr_pgrid[2];
            bKetDenPhi[3] = bKetAtomBlockDenPhi_ptr_pgrid[3];
            bKetDenPhi[4] = bKetAtomBlockDenPhi_ptr_pgrid[4];
            bKetDenPhi[5] = bKetAtomBlockDenPhi_ptr_pgrid[5];

            // get the screening Pmax for the density matrix block
            for(UInt i=0; i<6; i++) {
              Double v = fabs(bKetDenPhi);
              if (v>kPMax) kPMax = v;
            }
          }
        }

}

      // finally before the loop over primitive pairs, we need to see the
      // screening result of pMax
      Double pMax = kPMax;
      if (jPMax>pMax) pMax = jPMax;

      // initialize the significance check for integral
      // this will determine that whether we skip the following part
      bool isSignificant = false;

      UInt inp2 = inp*jnp;
      for(UInt ip2=0; ip2<inp2; ip2++) {
        UInt ip      = ip2%inp;
        UInt jp      = (ip2-ip)/inp;
        Double ia    = ie[ip];
        Double ja    = je[jp];
        Double onedz = 1.0E0/(ia+ja);
        Double ab    = -ia*ja*onedz;
        Double fbra = exp(ab*AB2)*pow(PI*onedz,1.5E0);
        Double adab = ia*onedz;
        Double bdab = ja*onedz;
        Double PX    = A[0]*adab + B[0]*bdab;
        Double PY    = A[1]*adab + B[1]*bdab;
        Double PZ    = A[2]*adab + B[2]*bdab;
        Double rho   = ia+ja;
        Double sqrho = sqrt(rho);
        Double ic2   = ic[ip]*jc[jp];
        Double oned2z= 0.5E0*onedz;
        Double PAX   = PX - A[0];
        Double PAY   = PY - A[1];
        Double PAZ   = PZ - A[2];
        Double PRX   = PX - R[3*iGrid ];
        Double PRY   = PY - R[3*iGrid+1];
        Double PRZ   = PZ - R[3*iGrid+2];
        Double PR2   = PRX*PRX+PRY*PRY+PRZ*PRZ;
        Double u     = rho*PR2;
        Double squ   = sqrt(u);
        Double pref      = fbra;
        Double prefactor = ic2*pref;

      //
      // here below the code is performing significance test for integrals on
      // primitive integrals. Here we use the overlap integrals to roughly
      // estimate the order of the result integrals
      // the threshold value should be for primitive function quartet, we compare
      // the value against machine precision for significance test
      //
      Double I_ESP_S_S_vrr_IntegralTest = pref;
      if (fabs(ic2)>1.0E0) {
        I_ESP_S_S_vrr_IntegralTest = prefactor;
      }

      // test the integrals with the pMax, which is the maximum value
      // of the corresponding density matrix block(or it may be maximum
      // value pair of the corresponding density matrix block)
      if(fabs(I_ESP_S_S_vrr_IntegralTest*pMax)<intThresh) continue;
      isSignificant = true;

      //
      //
      // now here for maxM>0 to compute the infamous incomplete Gamma function f_{m}(u)
      // the implementation is divided in two situations:
      // 1 if u <=1.8; use power series to get f_{Mmax}(u), then use down recursive
      //    relation to get the rest of incomplete Gamma functions;
      // 2 for u >1.8 and M <= 10 we calculate erf(u), then use up recursive
      //    relation to calculate the rest of results
      // 3 for u> 1.8 and M > 10 we calculate f_{Mmax}(u) then use down
      //    recursive relation to get rest of incomplete Gamma functions
      // The above procedure is tested for u between 0 to 40 with step length 1.0E-6
      // (or 1.0E-5 for float double data), for up recursive relation it shows the error
      // within 1.0E-12 (for M_limit = 12 or error within 1.0E-6 for float type of data
      // For the polynomial expansion and down recursive procedure the error is within
      // 1.0E-14. All of the testing details please refer to the fmt_test folder
      //
      // There's one thing need to note for up recursive process. We found that the up
      // recursive procedure is only stable for maxM<=10 and u>1.8 with double
      // precision data, single precision data will lose accuracy quickly so the result
      // for single precision calculation is not doable. Therefore if the "WITH_SINGLE_PRECISION"
      // is defined, then for erf function calculation as well as up recursive
      // process we will use the double type of data
      //
      //

      Double I_ESP_S_S_vrr = 0.0E0;
      Double I_ESP_S_S_M1_vrr = 0.0E0;
      Double I_ESP_S_S_M2_vrr = 0.0E0;
      Double I_ESP_S_S_M3_vrr = 0.0E0;
      Double I_ESP_S_S_M4_vrr = 0.0E0;

      // declare double type of results to store the accuracy
#ifdef WITH_SINGLE_PRECISION
      double I_ESP_S_S_vrr_d = 0.0E0;
      double I_ESP_S_S_M1_vrr_d = 0.0E0;
      double I_ESP_S_S_M2_vrr_d = 0.0E0;
      double I_ESP_S_S_M3_vrr_d = 0.0E0;
      double I_ESP_S_S_M4_vrr_d = 0.0E0;
#endif

if (u<=1.8E0) {

        // calculate (SS|SS)^{Mmax}
        // use 18 terms power series to expand the (SS|SS)^{Mmax}
        Double u2 = 2.0E0*u;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER43;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER41*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER39*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER37*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER35*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER33*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER31*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER29*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER27*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER25*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER23*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER21*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER19*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER17*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER15*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER13*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = 1.0E0+u2*ONEOVER11*I_ESP_S_S_M4_vrr;
        I_ESP_S_S_M4_vrr = ONEOVER9*I_ESP_S_S_M4_vrr;
        Double eu = exp(-u);
        Double f = TWOOVERSQRTPI*prefactor*sqrho*eu;
        I_ESP_S_S_M4_vrr = f*I_ESP_S_S_M4_vrr;

        // now use down recursive relation to get
        // rest of (SS|SS)^{m}
        I_ESP_S_S_M3_vrr = ONEOVER7*(u2*I_ESP_S_S_M4_vrr+f);
        I_ESP_S_S_M2_vrr = ONEOVER5*(u2*I_ESP_S_S_M3_vrr+f);
        I_ESP_S_S_M1_vrr = ONEOVER3*(u2*I_ESP_S_S_M2_vrr+f);
        I_ESP_S_S_vrr = ONEOVER1*(u2*I_ESP_S_S_M1_vrr+f);

}else{
#ifdef WITH_SINGLE_PRECISION

        // recompute the variable in terms of double accuracy
        double u_d     = u;
        double rho_d   = rho;
        double fac_d   = prefactor;
        double sqrho_d = sqrt(rho_d);
        double squ_d   = sqrt(u_d);

        // use erf function to get (SS|SS)^{0}
        if (fabs(u_d)<THRESHOLD_MATH) {
          I_ESP_S_S_vrr_d = fac_d*sqrho_d*TWOOVERSQRTPI;
        }else{
          I_ESP_S_S_vrr_d = (fac_d*sqrho_d/squ_d)*erf(squ_d);
        }

        // now use up recursive relation to get
        // rest of (SS|SS)^{m}
        double oneO2u = 0.5E0/u_d;
        double eu      = exp(-u_d);
        double f       = TWOOVERSQRTPI*fac_d*sqrho_d*eu;
        I_ESP_S_S_M1_vrr_d = oneO2u*(1.0E0*I_ESP_S_S_vrr_d-f);
        I_ESP_S_S_M2_vrr_d = oneO2u*(3.0E0*I_ESP_S_S_M1_vrr_d-f);
        I_ESP_S_S_M3_vrr_d = oneO2u*(5.0E0*I_ESP_S_S_M2_vrr_d-f);
        I_ESP_S_S_M4_vrr_d = oneO2u*(7.0E0*I_ESP_S_S_M3_vrr_d-f);

        // write the double result back to the float var
        I_ESP_S_S_vrr = static_cast<Double>(I_ESP_S_S_vrr_d);
        I_ESP_S_S_M1_vrr = static_cast<Double>(I_ESP_S_S_M1_vrr_d);
        I_ESP_S_S_M2_vrr = static_cast<Double>(I_ESP_S_S_M2_vrr_d);
        I_ESP_S_S_M3_vrr = static_cast<Double>(I_ESP_S_S_M3_vrr_d);
        I_ESP_S_S_M4_vrr = static_cast<Double>(I_ESP_S_S_M4_vrr_d);

#else

        // use erf function to get (SS|SS)^{0}
        if (fabs(u)<THRESHOLD_MATH) {
          I_ESP_S_S_vrr = prefactor*sqrho*TWOOVERSQRTPI;
        }else{
          I_ESP_S_S_vrr = (prefactor*sqrho/squ)*erf(squ);
        }

        // now use up recursive relation to get
        // rest of (SS|SS)^{m}
        Double oneO2u = 0.5E0/u;
        Double eu     = exp(-u);
        Double f      = TWOOVERSQRTPI*prefactor*sqrho*eu;
        I_ESP_S_S_M1_vrr = oneO2u*(1.0E0*I_ESP_S_S_vrr-f);
        I_ESP_S_S_M2_vrr = oneO2u*(3.0E0*I_ESP_S_S_M1_vrr-f);
        I_ESP_S_S_M3_vrr = oneO2u*(5.0E0*I_ESP_S_S_M2_vrr-f);
        I_ESP_S_S_M4_vrr = oneO2u*(7.0E0*I_ESP_S_S_M3_vrr-f);

#endif

}

        /************************************************************
         * shell quartet name: SQ_ESP_P_S_M3
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_S_S_M3
         * RHS shell quartet name: SQ_ESP_S_S_M4
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_P_S_M3_vrr_array[3];
        SQ_ESP_P_S_M3_vrr_array[0] = PAX*I_ESP_S_S_M3_vrr-PRX*I_ESP_S_S_M4_vrr;
        SQ_ESP_P_S_M3_vrr_array[1] = PAY*I_ESP_S_S_M3_vrr-PRY*I_ESP_S_S_M4_vrr;
        SQ_ESP_P_S_M3_vrr_array[2] = PAZ*I_ESP_S_S_M3_vrr-PRZ*I_ESP_S_S_M4_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_P_S_M2
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_S_S_M2
         * RHS shell quartet name: SQ_ESP_S_S_M3
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_P_S_M2_vrr_array[3];
        SQ_ESP_P_S_M2_vrr_array[0] = PAX*I_ESP_S_S_M2_vrr-PRX*I_ESP_S_S_M3_vrr;
        SQ_ESP_P_S_M2_vrr_array[1] = PAY*I_ESP_S_S_M2_vrr-PRY*I_ESP_S_S_M3_vrr;
        SQ_ESP_P_S_M2_vrr_array[2] = PAZ*I_ESP_S_S_M2_vrr-PRZ*I_ESP_S_S_M3_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_D_S_M2
         * expanding position: BRA1
         * code section is: VRR
         * totally 3 integrals are omitted
         * RHS shell quartet name: SQ_ESP_P_S_M2
         * RHS shell quartet name: SQ_ESP_P_S_M3
         * RHS shell quartet name: SQ_ESP_S_S_M2
         * RHS shell quartet name: SQ_ESP_S_S_M3
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_D_S_M2_vrr_array[3];
        SQ_ESP_D_S_M2_vrr_array[0] = PAX*SQ_ESP_P_S_M2_vrr_array[0]-PRX*SQ_ESP_P_S_M3_vrr_array[0]+oned2z*I_ESP_S_S_M2_vrr-oned2z*I_ESP_S_S_M3_vrr;
        SQ_ESP_D_S_M2_vrr_array[1] = PAY*SQ_ESP_P_S_M2_vrr_array[1]-PRY*SQ_ESP_P_S_M3_vrr_array[1]+oned2z*I_ESP_S_S_M2_vrr-oned2z*I_ESP_S_S_M3_vrr;
        SQ_ESP_D_S_M2_vrr_array[2] = PAZ*SQ_ESP_P_S_M2_vrr_array[2]-PRZ*SQ_ESP_P_S_M3_vrr_array[2]+oned2z*I_ESP_S_S_M2_vrr-oned2z*I_ESP_S_S_M3_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_P_S_M1
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_S_S_M1
         * RHS shell quartet name: SQ_ESP_S_S_M2
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_P_S_M1_vrr_array[3];
        SQ_ESP_P_S_M1_vrr_array[0] = PAX*I_ESP_S_S_M1_vrr-PRX*I_ESP_S_S_M2_vrr;
        SQ_ESP_P_S_M1_vrr_array[1] = PAY*I_ESP_S_S_M1_vrr-PRY*I_ESP_S_S_M2_vrr;
        SQ_ESP_P_S_M1_vrr_array[2] = PAZ*I_ESP_S_S_M1_vrr-PRZ*I_ESP_S_S_M2_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_D_S_M1
         * expanding position: BRA1
         * code section is: VRR
         * totally 2 integrals are omitted
         * RHS shell quartet name: SQ_ESP_P_S_M1
         * RHS shell quartet name: SQ_ESP_P_S_M2
         * RHS shell quartet name: SQ_ESP_S_S_M1
         * RHS shell quartet name: SQ_ESP_S_S_M2
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_D_S_M1_vrr_array[4];
        SQ_ESP_D_S_M1_vrr_array[0] = PAX*SQ_ESP_P_S_M1_vrr_array[0]-PRX*SQ_ESP_P_S_M2_vrr_array[0]+oned2z*I_ESP_S_S_M1_vrr-oned2z*I_ESP_S_S_M2_vrr;
        SQ_ESP_D_S_M1_vrr_array[1] = PAY*SQ_ESP_P_S_M1_vrr_array[0]-PRY*SQ_ESP_P_S_M2_vrr_array[0];
        SQ_ESP_D_S_M1_vrr_array[2] = PAY*SQ_ESP_P_S_M1_vrr_array[1]-PRY*SQ_ESP_P_S_M2_vrr_array[1]+oned2z*I_ESP_S_S_M1_vrr-oned2z*I_ESP_S_S_M2_vrr;
        SQ_ESP_D_S_M1_vrr_array[3] = PAZ*SQ_ESP_P_S_M1_vrr_array[2]-PRZ*SQ_ESP_P_S_M2_vrr_array[2]+oned2z*I_ESP_S_S_M1_vrr-oned2z*I_ESP_S_S_M2_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_F_S_M1
         * expanding position: BRA1
         * code section is: VRR
         * totally 2 integrals are omitted
         * RHS shell quartet name: SQ_ESP_D_S_M1
         * RHS shell quartet name: SQ_ESP_D_S_M2
         * RHS shell quartet name: SQ_ESP_P_S_M1
         * RHS shell quartet name: SQ_ESP_P_S_M2
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_F_S_M1_vrr_array[8];
        SQ_ESP_F_S_M1_vrr_array[0] = PAX*SQ_ESP_D_S_M1_vrr_array[0]-PRX*SQ_ESP_D_S_M2_vrr_array[0]+2*oned2z*SQ_ESP_P_S_M1_vrr_array[0]-2*oned2z*SQ_ESP_P_S_M2_vrr_array[0];
        SQ_ESP_F_S_M1_vrr_array[1] = PAY*SQ_ESP_D_S_M1_vrr_array[0]-PRY*SQ_ESP_D_S_M2_vrr_array[0];
        SQ_ESP_F_S_M1_vrr_array[2] = PAZ*SQ_ESP_D_S_M1_vrr_array[0]-PRZ*SQ_ESP_D_S_M2_vrr_array[0];
        SQ_ESP_F_S_M1_vrr_array[3] = PAX*SQ_ESP_D_S_M1_vrr_array[2]-PRX*SQ_ESP_D_S_M2_vrr_array[1];
        SQ_ESP_F_S_M1_vrr_array[4] = PAX*SQ_ESP_D_S_M1_vrr_array[3]-PRX*SQ_ESP_D_S_M2_vrr_array[2];
        SQ_ESP_F_S_M1_vrr_array[5] = PAY*SQ_ESP_D_S_M1_vrr_array[2]-PRY*SQ_ESP_D_S_M2_vrr_array[1]+2*oned2z*SQ_ESP_P_S_M1_vrr_array[1]-2*oned2z*SQ_ESP_P_S_M2_vrr_array[1];
        SQ_ESP_F_S_M1_vrr_array[6] = PAZ*SQ_ESP_D_S_M1_vrr_array[2]-PRZ*SQ_ESP_D_S_M2_vrr_array[1];
        SQ_ESP_F_S_M1_vrr_array[7] = PAZ*SQ_ESP_D_S_M1_vrr_array[3]-PRZ*SQ_ESP_D_S_M2_vrr_array[2]+2*oned2z*SQ_ESP_P_S_M1_vrr_array[2]-2*oned2z*SQ_ESP_P_S_M2_vrr_array[2];

        /************************************************************
         * shell quartet name: SQ_ESP_P_S
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_S_S
         * RHS shell quartet name: SQ_ESP_S_S_M1
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_P_S_vrr_array[3];
        SQ_ESP_P_S_vrr_array[0] = PAX*I_ESP_S_S_vrr-PRX*I_ESP_S_S_M1_vrr;
        SQ_ESP_P_S_vrr_array[1] = PAY*I_ESP_S_S_vrr-PRY*I_ESP_S_S_M1_vrr;
        SQ_ESP_P_S_vrr_array[2] = PAZ*I_ESP_S_S_vrr-PRZ*I_ESP_S_S_M1_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_D_S
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_P_S
         * RHS shell quartet name: SQ_ESP_P_S_M1
         * RHS shell quartet name: SQ_ESP_S_S
         * RHS shell quartet name: SQ_ESP_S_S_M1
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_D_S_vrr_array[6];
        SQ_ESP_D_S_vrr_array[0] = PAX*SQ_ESP_P_S_vrr_array[0]-PRX*SQ_ESP_P_S_M1_vrr_array[0]+oned2z*I_ESP_S_S_vrr-oned2z*I_ESP_S_S_M1_vrr;
        SQ_ESP_D_S_vrr_array[1] = PAY*SQ_ESP_P_S_vrr_array[0]-PRY*SQ_ESP_P_S_M1_vrr_array[0];
        SQ_ESP_D_S_vrr_array[2] = PAZ*SQ_ESP_P_S_vrr_array[0]-PRZ*SQ_ESP_P_S_M1_vrr_array[0];
        SQ_ESP_D_S_vrr_array[3] = PAY*SQ_ESP_P_S_vrr_array[1]-PRY*SQ_ESP_P_S_M1_vrr_array[1]+oned2z*I_ESP_S_S_vrr-oned2z*I_ESP_S_S_M1_vrr;
        SQ_ESP_D_S_vrr_array[4] = PAZ*SQ_ESP_P_S_vrr_array[1]-PRZ*SQ_ESP_P_S_M1_vrr_array[1];
        SQ_ESP_D_S_vrr_array[5] = PAZ*SQ_ESP_P_S_vrr_array[2]-PRZ*SQ_ESP_P_S_M1_vrr_array[2]+oned2z*I_ESP_S_S_vrr-oned2z*I_ESP_S_S_M1_vrr;

        /************************************************************
         * shell quartet name: SQ_ESP_F_S
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_D_S
         * RHS shell quartet name: SQ_ESP_D_S_M1
         * RHS shell quartet name: SQ_ESP_P_S
         * RHS shell quartet name: SQ_ESP_P_S_M1
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_F_S_vrr_array[10];
        SQ_ESP_F_S_vrr_array[0] = PAX*SQ_ESP_D_S_vrr_array[0]-PRX*SQ_ESP_D_S_M1_vrr_array[0]+2*oned2z*SQ_ESP_P_S_vrr_array[0]-2*oned2z*SQ_ESP_P_S_M1_vrr_array[0];
        SQ_ESP_F_S_vrr_array[1] = PAY*SQ_ESP_D_S_vrr_array[0]-PRY*SQ_ESP_D_S_M1_vrr_array[0];
        SQ_ESP_F_S_vrr_array[2] = PAZ*SQ_ESP_D_S_vrr_array[0]-PRZ*SQ_ESP_D_S_M1_vrr_array[0];
        SQ_ESP_F_S_vrr_array[3] = PAX*SQ_ESP_D_S_vrr_array[3]-PRX*SQ_ESP_D_S_M1_vrr_array[2];
        SQ_ESP_F_S_vrr_array[4] = PAZ*SQ_ESP_D_S_vrr_array[1]-PRZ*SQ_ESP_D_S_M1_vrr_array[1];
        SQ_ESP_F_S_vrr_array[5] = PAX*SQ_ESP_D_S_vrr_array[5]-PRX*SQ_ESP_D_S_M1_vrr_array[3];
        SQ_ESP_F_S_vrr_array[6] = PAY*SQ_ESP_D_S_vrr_array[3]-PRY*SQ_ESP_D_S_M1_vrr_array[2]+2*oned2z*SQ_ESP_P_S_vrr_array[1]-2*oned2z*SQ_ESP_P_S_M1_vrr_array[1];
        SQ_ESP_F_S_vrr_array[7] = PAZ*SQ_ESP_D_S_vrr_array[3]-PRZ*SQ_ESP_D_S_M1_vrr_array[2];
        SQ_ESP_F_S_vrr_array[8] = PAY*SQ_ESP_D_S_vrr_array[5]-PRY*SQ_ESP_D_S_M1_vrr_array[3];
        SQ_ESP_F_S_vrr_array[9] = PAZ*SQ_ESP_D_S_vrr_array[5]-PRZ*SQ_ESP_D_S_M1_vrr_array[3]+2*oned2z*SQ_ESP_P_S_vrr_array[2]-2*oned2z*SQ_ESP_P_S_M1_vrr_array[2];

        /************************************************************
         * shell quartet name: SQ_ESP_G_S
         * expanding position: BRA1
         * code section is: VRR
         * totally 0 integrals are omitted
         * RHS shell quartet name: SQ_ESP_F_S
         * RHS shell quartet name: SQ_ESP_F_S_M1
         * RHS shell quartet name: SQ_ESP_D_S
         * RHS shell quartet name: SQ_ESP_D_S_M1
         ************************************************************/
        __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_G_S_vrr_array[15];
        SQ_ESP_G_S_vrr_array[0] = PAX*SQ_ESP_F_S_vrr_array[0]-PRX*SQ_ESP_F_S_M1_vrr_array[0]+3*oned2z*SQ_ESP_D_S_vrr_array[0]-3*oned2z*SQ_ESP_D_S_M1_vrr_array[0];
        SQ_ESP_G_S_vrr_array[1] = PAY*SQ_ESP_F_S_vrr_array[0]-PRY*SQ_ESP_F_S_M1_vrr_array[0];
        SQ_ESP_G_S_vrr_array[2] = PAZ*SQ_ESP_F_S_vrr_array[0]-PRZ*SQ_ESP_F_S_M1_vrr_array[0];
        SQ_ESP_G_S_vrr_array[3] = PAY*SQ_ESP_F_S_vrr_array[1]-PRY*SQ_ESP_F_S_M1_vrr_array[1]+oned2z*SQ_ESP_D_S_vrr_array[0]-oned2z*SQ_ESP_D_S_M1_vrr_array[0];
        SQ_ESP_G_S_vrr_array[4] = PAZ*SQ_ESP_F_S_vrr_array[1]-PRZ*SQ_ESP_F_S_M1_vrr_array[1];
        SQ_ESP_G_S_vrr_array[5] = PAZ*SQ_ESP_F_S_vrr_array[2]-PRZ*SQ_ESP_F_S_M1_vrr_array[2]+oned2z*SQ_ESP_D_S_vrr_array[0]-oned2z*SQ_ESP_D_S_M1_vrr_array[0];
        SQ_ESP_G_S_vrr_array[6] = PAX*SQ_ESP_F_S_vrr_array[6]-PRX*SQ_ESP_F_S_M1_vrr_array[5];
        SQ_ESP_G_S_vrr_array[7] = PAZ*SQ_ESP_F_S_vrr_array[3]-PRZ*SQ_ESP_F_S_M1_vrr_array[3];
        SQ_ESP_G_S_vrr_array[8] = PAY*SQ_ESP_F_S_vrr_array[5]-PRY*SQ_ESP_F_S_M1_vrr_array[4];
        SQ_ESP_G_S_vrr_array[9] = PAX*SQ_ESP_F_S_vrr_array[9]-PRX*SQ_ESP_F_S_M1_vrr_array[7];
        SQ_ESP_G_S_vrr_array[10] = PAY*SQ_ESP_F_S_vrr_array[6]-PRY*SQ_ESP_F_S_M1_vrr_array[5]+3*oned2z*SQ_ESP_D_S_vrr_array[3]-3*oned2z*SQ_ESP_D_S_M1_vrr_array[2];
        SQ_ESP_G_S_vrr_array[11] = PAZ*SQ_ESP_F_S_vrr_array[6]-PRZ*SQ_ESP_F_S_M1_vrr_array[5];
        SQ_ESP_G_S_vrr_array[12] = PAZ*SQ_ESP_F_S_vrr_array[7]-PRZ*SQ_ESP_F_S_M1_vrr_array[6]+oned2z*SQ_ESP_D_S_vrr_array[3]-oned2z*SQ_ESP_D_S_M1_vrr_array[2];
        SQ_ESP_G_S_vrr_array[13] = PAY*SQ_ESP_F_S_vrr_array[9]-PRY*SQ_ESP_F_S_M1_vrr_array[7];
        SQ_ESP_G_S_vrr_array[14] = PAZ*SQ_ESP_F_S_vrr_array[9]-PRZ*SQ_ESP_F_S_M1_vrr_array[7]+3*oned2z*SQ_ESP_D_S_vrr_array[5]-3*oned2z*SQ_ESP_D_S_M1_vrr_array[3];

      /************************************************************
       * shell quartet name: SQ_ESP_G_S
       * doing contraction work for VRR part
       * totally 0 integrals are omitted
       ************************************************************/
      SQ_ESP_G_S[0] += SQ_ESP_G_S_vrr_array[0];
      SQ_ESP_G_S[1] += SQ_ESP_G_S_vrr_array[1];
      SQ_ESP_G_S[2] += SQ_ESP_G_S_vrr_array[2];
      SQ_ESP_G_S[3] += SQ_ESP_G_S_vrr_array[3];
      SQ_ESP_G_S[4] += SQ_ESP_G_S_vrr_array[4];
      SQ_ESP_G_S[5] += SQ_ESP_G_S_vrr_array[5];
      SQ_ESP_G_S[6] += SQ_ESP_G_S_vrr_array[6];
      SQ_ESP_G_S[7] += SQ_ESP_G_S_vrr_array[7];
      SQ_ESP_G_S[8] += SQ_ESP_G_S_vrr_array[8];
      SQ_ESP_G_S[9] += SQ_ESP_G_S_vrr_array[9];
      SQ_ESP_G_S[10] += SQ_ESP_G_S_vrr_array[10];
      SQ_ESP_G_S[11] += SQ_ESP_G_S_vrr_array[11];
      SQ_ESP_G_S[12] += SQ_ESP_G_S_vrr_array[12];
      SQ_ESP_G_S[13] += SQ_ESP_G_S_vrr_array[13];
      SQ_ESP_G_S[14] += SQ_ESP_G_S_vrr_array[14];

      /************************************************************
       * shell quartet name: SQ_ESP_F_S
       * doing contraction work for VRR part
       * totally 0 integrals are omitted
       ************************************************************/
      SQ_ESP_F_S[0] += SQ_ESP_F_S_vrr_array[0];
      SQ_ESP_F_S[1] += SQ_ESP_F_S_vrr_array[1];
      SQ_ESP_F_S[2] += SQ_ESP_F_S_vrr_array[2];
      SQ_ESP_F_S[3] += SQ_ESP_F_S_vrr_array[3];
      SQ_ESP_F_S[4] += SQ_ESP_F_S_vrr_array[4];
      SQ_ESP_F_S[5] += SQ_ESP_F_S_vrr_array[5];
      SQ_ESP_F_S[6] += SQ_ESP_F_S_vrr_array[6];
      SQ_ESP_F_S[7] += SQ_ESP_F_S_vrr_array[7];
      SQ_ESP_F_S[8] += SQ_ESP_F_S_vrr_array[8];
      SQ_ESP_F_S[9] += SQ_ESP_F_S_vrr_array[9];

      /************************************************************
       * shell quartet name: SQ_ESP_D_S
       * doing contraction work for VRR part
       * totally 0 integrals are omitted
       ************************************************************/
      SQ_ESP_D_S[0] += SQ_ESP_D_S_vrr_array[0];
      SQ_ESP_D_S[1] += SQ_ESP_D_S_vrr_array[1];
      SQ_ESP_D_S[2] += SQ_ESP_D_S_vrr_array[2];
      SQ_ESP_D_S[3] += SQ_ESP_D_S_vrr_array[3];
      SQ_ESP_D_S[4] += SQ_ESP_D_S_vrr_array[4];
      SQ_ESP_D_S[5] += SQ_ESP_D_S_vrr_array[5];
    }

    /************************************************************
     * declare the HRR1 result shell quartets in array form
     ************************************************************/

      /************************************************************
       * initilize the HRR steps : build the AB/CD variables
       ************************************************************/
      if (! isSignificant) continue;
      Double ABX = A[0] - B[0];
      Double ABY = A[1] - B[1];
      Double ABZ = A[2] - B[2];

      /************************************************************
       * shell quartet name: SQ_ESP_D_P
       * expanding position: BRA2
       * code section is: HRR
       * totally 0 integrals are omitted
       * RHS shell quartet name: SQ_ESP_F_S
       * RHS shell quartet name: SQ_ESP_D_S
       ************************************************************/
      __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_D_P[18];
      SQ_ESP_D_P[0] = SQ_ESP_F_S[0]+ABX*SQ_ESP_D_S[0];
      SQ_ESP_D_P[1] = SQ_ESP_F_S[1]+ABX*SQ_ESP_D_S[1];
      SQ_ESP_D_P[2] = SQ_ESP_F_S[2]+ABX*SQ_ESP_D_S[2];
      SQ_ESP_D_P[3] = SQ_ESP_F_S[3]+ABX*SQ_ESP_D_S[3];
      SQ_ESP_D_P[4] = SQ_ESP_F_S[4]+ABX*SQ_ESP_D_S[4];
      SQ_ESP_D_P[5] = SQ_ESP_F_S[5]+ABX*SQ_ESP_D_S[5];
      SQ_ESP_D_P[6] = SQ_ESP_F_S[1]+ABY*SQ_ESP_D_S[0];
      SQ_ESP_D_P[7] = SQ_ESP_F_S[3]+ABY*SQ_ESP_D_S[1];
      SQ_ESP_D_P[8] = SQ_ESP_F_S[4]+ABY*SQ_ESP_D_S[2];
      SQ_ESP_D_P[9] = SQ_ESP_F_S[6]+ABY*SQ_ESP_D_S[3];
      SQ_ESP_D_P[10] = SQ_ESP_F_S[7]+ABY*SQ_ESP_D_S[4];
      SQ_ESP_D_P[11] = SQ_ESP_F_S[8]+ABY*SQ_ESP_D_S[5];
      SQ_ESP_D_P[12] = SQ_ESP_F_S[2]+ABZ*SQ_ESP_D_S[0];
      SQ_ESP_D_P[13] = SQ_ESP_F_S[4]+ABZ*SQ_ESP_D_S[1];
      SQ_ESP_D_P[14] = SQ_ESP_F_S[5]+ABZ*SQ_ESP_D_S[2];
      SQ_ESP_D_P[15] = SQ_ESP_F_S[7]+ABZ*SQ_ESP_D_S[3];
      SQ_ESP_D_P[16] = SQ_ESP_F_S[8]+ABZ*SQ_ESP_D_S[4];
      SQ_ESP_D_P[17] = SQ_ESP_F_S[9]+ABZ*SQ_ESP_D_S[5];

      /************************************************************
       * shell quartet name: SQ_ESP_F_P
       * expanding position: BRA2
       * code section is: HRR
       * totally 5 integrals are omitted
       * RHS shell quartet name: SQ_ESP_G_S
       * RHS shell quartet name: SQ_ESP_F_S
       ************************************************************/
      __attribute__((aligned(PHI_COPROCESSOR_ALIGNMENT))) Double SQ_ESP_F_P[25];
      SQ_ESP_F_P[0] = SQ_ESP_G_S[0]+ABX*SQ_ESP_F_S[0];
      SQ_ESP_F_P[1] = SQ_ESP_G_S[1]+ABX*SQ_ESP_F_S[1];
      SQ_ESP_F_P[2] = SQ_ESP_G_S[2]+ABX*SQ_ESP_F_S[2];
      SQ_ESP_F_P[3] = SQ_ESP_G_S[3]+ABX*SQ_ESP_F_S[3];
      SQ_ESP_F_P[4] = SQ_ESP_G_S[4]+ABX*SQ_ESP_F_S[4];
      SQ_ESP_F_P[5] = SQ_ESP_G_S[5]+ABX*SQ_ESP_F_S[5];
      SQ_ESP_F_P[6] = SQ_ESP_G_S[6]+ABX*SQ_ESP_F_S[6];
      SQ_ESP_F_P[7] = SQ_ESP_G_S[7]+ABX*SQ_ESP_F_S[7];
      SQ_ESP_F_P[8] = SQ_ESP_G_S[8]+ABX*SQ_ESP_F_S[8];
      SQ_ESP_F_P[9] = SQ_ESP_G_S[9]+ABX*SQ_ESP_F_S[9];
      SQ_ESP_F_P[10] = SQ_ESP_G_S[3]+ABY*SQ_ESP_F_S[1];
      SQ_ESP_F_P[11] = SQ_ESP_G_S[4]+ABY*SQ_ESP_F_S[2];
      SQ_ESP_F_P[12] = SQ_ESP_G_S[6]+ABY*SQ_ESP_F_S[3];
      SQ_ESP_F_P[13] = SQ_ESP_G_S[7]+ABY*SQ_ESP_F_S[4];
      SQ_ESP_F_P[14] = SQ_ESP_G_S[8]+ABY*SQ_ESP_F_S[5];
      SQ_ESP_F_P[15] = SQ_ESP_G_S[10]+ABY*SQ_ESP_F_S[6];
      SQ_ESP_F_P[16] = SQ_ESP_G_S[11]+ABY*SQ_ESP_F_S[7];
      SQ_ESP_F_P[17] = SQ_ESP_G_S[12]+ABY*SQ_ESP_F_S[8];
      SQ_ESP_F_P[18] = SQ_ESP_G_S[13]+ABY*SQ_ESP_F_S[9];
      SQ_ESP_F_P[19] = SQ_ESP_G_S[5]+ABZ*SQ_ESP_F_S[2];
      SQ_ESP_F_P[20] = SQ_ESP_G_S[8]+ABZ*SQ_ESP_F_S[4];
      SQ_ESP_F_P[21] = SQ_ESP_G_S[9]+ABZ*SQ_ESP_F_S[5];
      SQ_ESP_F_P[22] = SQ_ESP_G_S[12]+ABZ*SQ_ESP_F_S[7];
      SQ_ESP_F_P[23] = SQ_ESP_G_S[13]+ABZ*SQ_ESP_F_S[8];
      SQ_ESP_F_P[24] = SQ_ESP_G_S[14]+ABZ*SQ_ESP_F_S[9];

      /************************************************************
       * shell quartet name: SQ_ESP_D_D
       * expanding position: BRA2
       * code section is: HRR
       * totally 0 integrals are omitted
       * RHS shell quartet name: SQ_ESP_F_P
       * RHS shell quartet name: SQ_ESP_D_P
       ************************************************************/
      abcd[0] = SQ_ESP_F_P[0]+ABX*SQ_ESP_D_P[0];
      abcd[1] = SQ_ESP_F_P[1]+ABX*SQ_ESP_D_P[1];
      abcd[2] = SQ_ESP_F_P[2]+ABX*SQ_ESP_D_P[2];
      abcd[3] = SQ_ESP_F_P[3]+ABX*SQ_ESP_D_P[3];
      abcd[4] = SQ_ESP_F_P[4]+ABX*SQ_ESP_D_P[4];
      abcd[5] = SQ_ESP_F_P[5]+ABX*SQ_ESP_D_P[5];
      abcd[6] = SQ_ESP_F_P[1]+ABY*SQ_ESP_D_P[0];
      abcd[7] = SQ_ESP_F_P[3]+ABY*SQ_ESP_D_P[1];
      abcd[8] = SQ_ESP_F_P[4]+ABY*SQ_ESP_D_P[2];
      abcd[9] = SQ_ESP_F_P[6]+ABY*SQ_ESP_D_P[3];
      abcd[10] = SQ_ESP_F_P[7]+ABY*SQ_ESP_D_P[4];
      abcd[11] = SQ_ESP_F_P[8]+ABY*SQ_ESP_D_P[5];
      abcd[12] = SQ_ESP_F_P[2]+ABZ*SQ_ESP_D_P[0];
      abcd[13] = SQ_ESP_F_P[4]+ABZ*SQ_ESP_D_P[1];
      abcd[14] = SQ_ESP_F_P[5]+ABZ*SQ_ESP_D_P[2];
      abcd[15] = SQ_ESP_F_P[7]+ABZ*SQ_ESP_D_P[3];
      abcd[16] = SQ_ESP_F_P[8]+ABZ*SQ_ESP_D_P[4];
      abcd[17] = SQ_ESP_F_P[9]+ABZ*SQ_ESP_D_P[5];
      abcd[18] = SQ_ESP_F_P[10]+ABY*SQ_ESP_D_P[6];
      abcd[19] = SQ_ESP_F_P[12]+ABY*SQ_ESP_D_P[7];
      abcd[20] = SQ_ESP_F_P[13]+ABY*SQ_ESP_D_P[8];
      abcd[21] = SQ_ESP_F_P[15]+ABY*SQ_ESP_D_P[9];
      abcd[22] = SQ_ESP_F_P[16]+ABY*SQ_ESP_D_P[10];
      abcd[23] = SQ_ESP_F_P[17]+ABY*SQ_ESP_D_P[11];
      abcd[24] = SQ_ESP_F_P[11]+ABZ*SQ_ESP_D_P[6];
      abcd[25] = SQ_ESP_F_P[13]+ABZ*SQ_ESP_D_P[7];
      abcd[26] = SQ_ESP_F_P[14]+ABZ*SQ_ESP_D_P[8];
      abcd[27] = SQ_ESP_F_P[16]+ABZ*SQ_ESP_D_P[9];
      abcd[28] = SQ_ESP_F_P[17]+ABZ*SQ_ESP_D_P[10];
      abcd[29] = SQ_ESP_F_P[18]+ABZ*SQ_ESP_D_P[11];
      abcd[30] = SQ_ESP_F_P[19]+ABZ*SQ_ESP_D_P[12];
      abcd[31] = SQ_ESP_F_P[20]+ABZ*SQ_ESP_D_P[13];
      abcd[32] = SQ_ESP_F_P[21]+ABZ*SQ_ESP_D_P[14];
      abcd[33] = SQ_ESP_F_P[22]+ABZ*SQ_ESP_D_P[15];
      abcd[34] = SQ_ESP_F_P[23]+ABZ*SQ_ESP_D_P[16];
      abcd[35] = SQ_ESP_F_P[24]+ABZ*SQ_ESP_D_P[17];

      //
      // here at the end of the raw integral code we combined the digestion
      // with the raw integral together

if (integraljobs::doK(intJob)) {

        // do bra side K digestion
        // all of denPhi and exRho matrix are atom blocked
        // therefore we need the local basis set offset to index them
        Double*       aKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,aKetAtomBlockExRho);
        for(UInt j=0; j<6; j++) {
          const Double* abcd_ptr = &abcd[j*6];
          Double result = ZERO;
          result += aBraDenPhi[0]*abcd_ptr[0];
          result += aBraDenPhi[1]*abcd_ptr[1];
          result += aBraDenPhi[2]*abcd_ptr[2];
          result += aBraDenPhi[3]*abcd_ptr[3];
          result += aBraDenPhi[4]*abcd_ptr[4];
          result += aBraDenPhi[5]*abcd_ptr[5];
        }

if (iGloShell != jGloShell) {

          // do ket side K digestion
          Double* aBraExRho = matrix_phi::getPtr(rowLocBasOffset,iGrid,aBraAtomBlockExRho);
          for(UInt i=0; i<6; i++) {
            Double result = ZERO;
            result += aKetDenPhi[0]*abcd[0+i];
            result += aKetDenPhi[1]*abcd[6+i];
            result += aKetDenPhi[2]*abcd[12+i];
            result += aKetDenPhi[3]*abcd[18+i];
            result += aKetDenPhi[4]*abcd[24+i];
            result += aKetDenPhi[5]*abcd[30+i];
          }

}

if (nSpin == 2) {

          // do bra side K digestion
          // all of denPhi and exRho matrix are atom blocked
          // therefore we need the local basis set offset to index them
          Double*       bKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,bKetAtomBlockExRho);
          for(UInt j=0; j<6; j++) {
            const Double* abcd_ptr = &abcd[j*6];
            Double result = ZERO;
            result += bBraDenPhi[0]*abcd_ptr[0];
            result += bBraDenPhi[1]*abcd_ptr[1];
            result += bBraDenPhi[2]*abcd_ptr[2];
            result += bBraDenPhi[3]*abcd_ptr[3];
            result += bBraDenPhi[4]*abcd_ptr[4];
            result += bBraDenPhi[5]*abcd_ptr[5];
          }

if (iGloShell != jGloShell) {

            // do ket side K digestion
            Double* bBraExRho = matrix_phi::getPtr(rowLocBasOffset,iGrid,bBraAtomBlockExRho);
            for(UInt i=0; i<6; i++) {
              Double result = ZERO;
              result += bKetDenPhi[0]*abcd[0+i];
              result += bKetDenPhi[1]*abcd[6+i];
              result += bKetDenPhi[2]*abcd[12+i];
              result += bKetDenPhi[3]*abcd[18+i];
              result += bKetDenPhi[4]*abcd[24+i];
              result += bKetDenPhi[5]*abcd[30+i];
            }

}

fenglai · ‎06-14-2017

Also the whole piece is inside a function to run inside a single thread, and it's only the final part

if (integraljobs::doK(intJob)) {

        // do bra side K digestion
        // all of denPhi and exRho matrix are atom blocked
        // therefore we need the local basis set offset to index them
        Double*       aKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,aKetAtomBlockExRho);
        for(UInt j=0; j<6; j++) {
          const Double* abcd_ptr = &abcd[j*6];
          Double result = ZERO;
          result += aBraDenPhi[0]*abcd_ptr[0];
          result += aBraDenPhi[1]*abcd_ptr[1];
          result += aBraDenPhi[2]*abcd_ptr[2];
          result += aBraDenPhi[3]*abcd_ptr[3];
          result += aBraDenPhi[4]*abcd_ptr[4];
          result += aBraDenPhi[5]*abcd_ptr[5];
        }

if (iGloShell != jGloShell) {

          // do ket side K digestion
          Double* aBraExRho = matrix_phi::getPtr(rowLocBasOffset,iGrid,aBraAtomBlockExRho);
          for(UInt i=0; i<6; i++) {
            Double result = ZERO;
            result += aKetDenPhi[0]*abcd[0+i];
            result += aKetDenPhi[1]*abcd[6+i];
            result += aKetDenPhi[2]*abcd[12+i];
            result += aKetDenPhi[3]*abcd[18+i];
            result += aKetDenPhi[4]*abcd[24+i];
            result += aKetDenPhi[5]*abcd[30+i];
          }

}

this piece code is the trouble maker. Right now I just commented out the writing back sentense.

Thank you!

phoenix

jimdempseyatthecove · ‎06-14-2017

Why are you manually unrolling the loops 6 times?

While this may have been optimal for a host CPU executing SSE code, it is not optimal for use on KNC. Although the compiler might be able to undo your "craftiness", I wouldn't rely on it.

Simplify your hand unrolled loops and remove the const Double* ... and use the arrays directly.

Jim Dempsey

fenglai · ‎06-14-2017

I examined the timing of several tests. It seems the integral calculation part is abnormally small. I think Gregg's suggestion is correct. After I commment out the line to write back the result the compiler probably signal the whole loop for raw integral calculation as "meaningless". Therefore to make the speed up.

This is very sad. It seems difficult to improve the code performance on the Phi. the integral calculation is done recursively therefore I have a lot of scalar variables and if else conditions. right now it seems the code does not benefit from the vector usage. Is it possible to revise this code to improve the efficiency?

Thank you for any suggestion,

phoenix

Gregg_S_Intel · ‎06-14-2017

The Intel Xeon Phi are HPC processors for running big parallel vector loops.