Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

function call to ?1memcpy cannot be vectorized

Page__Mike
Beginner

 

The following lines of code generate reports of failure to parallelize and to vectorize:

      allocate(particles(nrp+ngp))

      particles(    1:nrp) = rparticles

      particles(nrp+1:   ) = gparticles


declarations:

      use particle_mod   , only: particle_t

.
.
.

      type(particle_t), intent(in   ) :: rparticles(nrp), gparticles(ngp)

 

reports:

LOOP BEGIN at calc.f90(76,7)

   remark #17104: loop was not parallelized: existence of parallel dependence

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)

   remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized

LOOP END

 

LOOP BEGIN at calc.f90(77,7)

   remark #17104: loop was not parallelized: existence of parallel dependence

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)

   remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized

LOOP END

 

particle_t is defined as:

  public  particle_t

  public  print_particles

 

  type, bind(C)  :: particle_t

     real(rt)    :: pos(3)     !< Position -- fortran component  1

     real(rt)    :: radius     !< Radius   -- fortran component  4

     real(rt)    :: volume     !< Volume   -- fortran component  5

     real(rt)    :: mass       !< Mass     -- fortran component  6

     real(rt)    :: density    !< Density  -- fortran component  7

     real(rt)    :: omoi       !< One over momentum of inertia -- fortran component 8

     real(rt)    :: vel(3)     !< Linear velocity              -- fortran components 9,10,11

     real(rt)    :: omega(3)   !< Angular velocity             -- fortran components 12,13,14

     real(rt)    :: drag(3)    !< Drag                         -- fortran components 15,16,17

     integer(c_int)  :: id

     integer(c_int)  :: cpu

     integer(c_int)  :: phase

     integer(c_int)  :: state

  end type particle_t

 
 
6 Replies
Page__Mike
Beginner

I have continued to look at this issue and have found information on streaming stores and non-temporal writes:

https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control

Adding the directives:

!DIR$ vector nontemporal

!DIR$ simd

or adding the compile option -qopt-streaming-stores

!DIR$ simd vectorizes the loops, but the compiler reports:

scalar cost:  89

vector cost: 159.87
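For reference, here is a rough sketch of one way the directives can be applied. I have written it with explicit DO loops, since !DIR$ SIMD is documented for DO loops; whether the actual change used this form or placed the directives directly before the array assignments is an assumption on my part:

      integer :: i

!DIR$ vector nontemporal
!DIR$ simd
      do i = 1, nrp
         particles(i) = rparticles(i)          ! copy loop reported at calc.f90(76,7)
      end do

!DIR$ vector nontemporal
!DIR$ simd
      do i = 1, ngp
         particles(nrp+i) = gparticles(i)      ! copy loop reported at calc.f90(77,7)
      end do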

 

I will be testing the change soon.

TimP
Honored Contributor III

Is the "non-vectorized" compilation using a memset library function, as expected?  The library call may well be faster, assuming those vectors are long enough. If short, alignment may play a role, but inline optimization could depend also on length being known at compile time.  It this fragment is critical for you, you should at least check the opt-report for the vectorized version, preferably look at the generated code (possibly under Advisor or VTune)?

Isn't the legacy !dir$ simd deprecated?

This forum itself may be going in the direction of deprecation.  Intel announced a committee to make recommendations on the roles of the various forums; meanwhile, your question might have got more attention on one of the Fortran forum sections.

Page__Mike
Beginner

"Is the "non-vectorized" compilation using a memset library function, as expected?"

Not that I can see:

nm of the object file:

0000000000000000 T this_code
0000000000000000 d this_code$format_pack.0.1
                 U cfrelvel_module_mp_cfrelvel_
                 U discretelement_mp_des_coll_model_enum_
                 U discretelement_mp_des_crossprdct_
                 U discretelement_mp_des_etan_
                 U discretelement_mp_hert_kn_
                 U discretelement_mp_kn_
                 U discretelement_mp_mew_
                 U for_alloc_allocatable
                 U for_check_mult_overflow64
                 U for_dealloc_allocatable
                 U for_stop_core
                 U for_write_seq_fmt
                 U for_write_seq_fmt_xmit
0000000000000000 r __STRLITPACK_10
0000000000000048 r __STRLITPACK_12.0.1
0000000000000050 r __STRLITPACK_13.0.1
0000000000000000 r var$145.0.1

"Isn't the legacy !dir$ simd deprecated?"

Not sure.

I am working from a 2014 Intel article:

https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control

 

 

jimdempseyatthecove
Honored Contributor III

Try padding your particle structure so that it becomes a multiple of the vector width (16, 32, or 64 bytes, as the case may be).
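For example, a rough sketch of what that could look like, assuming rt is an 8-byte real: the type as listed is 17*8 + 4*4 = 152 bytes, so two extra integer(c_int) fields (a hypothetical pad component) would round it up to 160 bytes, a multiple of 32; a 64-byte multiple would need 192 bytes instead.

  type, bind(C)  :: particle_t
     ! ... real(rt) and integer(c_int) components exactly as listed above ...
     integer(c_int)  :: state
     integer(c_int)  :: pad(2)   ! hypothetical padding: 152 bytes -> 160 bytes (multiple of 32)
  end type particle_t

Since the type is bind(C), any matching C struct would need the same padding members.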

Jim Dempsey

Page__Mike
Beginner

Jim,

Thank you for your response. I have passed it along to a colleague who is actually working with the code.

I will do my best to post any resolution here, or I may continue with more questions.

Page__Mike
Beginner

Jim,

Your suggestion was reported as helpful, though I have not yet been able to test and measure the improvements myself.

Thanks Again.
