Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

function call to ?1memcpy cannot be vectorized

Page__Mike
Beginner

 

The following lines of code generate reports of failure to parallelize and to vectorize:

      allocate(particles(nrp+ngp))

      particles(    1:nrp) = rparticles

      particles(nrp+1:   ) = gparticles


declarations:

      use particle_mod   , only: particle_t

.
.
.

      type(particle_t), intent(in   ) :: rparticles(nrp), gparticles(ngp)

 

reports:

LOOP BEGIN at calc.f90(76,7)

   remark #17104: loop was not parallelized: existence of parallel dependence

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)

   remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized

LOOP END

 

LOOP BEGIN at calc.f90(77,7)

   remark #17104: loop was not parallelized: existence of parallel dependence

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)

   remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)

   remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized

LOOP END

 

particle_t is defined as:

  public  particle_t

  public  print_particles

 

  type, bind(C)  :: particle_t

     real(rt)    :: pos(3)     !< Position -- fortran component  1

     real(rt)    :: radius     !< Radius   -- fortran component  4

     real(rt)    :: volume     !< Volume   -- fortran component  5

     real(rt)    :: mass       !< Mass     -- fortran component  6

     real(rt)    :: density    !< Density  -- fortran component  7

     real(rt)    :: omoi       !< One over momentum of inertia -- fortran component 8

     real(rt)    :: vel(3)     !< Linear velocity              -- fortran components 9,10,11

     real(rt)    :: omega(3)   !< Angular velocity             -- fortran components 12,13,14

     real(rt)    :: drag(3)    !< Drag                         -- fortran components 15,16,17

     integer(c_int)  :: id

     integer(c_int)  :: cpu

     integer(c_int)  :: phase

     integer(c_int)  :: state

  end type particle_t

 
 
6 Replies
Page__Mike
Beginner

I have continued to look at this issue and have found information on streaming stores and non-temporal writes:

https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control

Adding the directives:

!DIR$ vector nontemporal

!DIR$ simd

or adding the compile option -qopt-streaming-stores

!DIR$ simd vectorizes the loops, but the compiler reports:

scalar cost:  89

vector cost: 159.87
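For reference, here is a rough sketch of one way the directives can be applied. I have written it with explicit DO loops, since !DIR$ SIMD is documented for DO loops; whether the actual change used this form or placed the directives directly before the array assignments is an assumption on my part:

      integer :: i

!DIR$ vector nontemporal
!DIR$ simd
      do i = 1, nrp
         particles(i) = rparticles(i)          ! copy loop reported at calc.f90(76,7)
      end do

!DIR$ vector nontemporal
!DIR$ simd
      do i = 1, ngp
         particles(nrp+i) = gparticles(i)      ! copy loop reported at calc.f90(77,7)
      end do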

 

I will be testing the change soon.

TimP
Honored Contributor III

Is the "non-vectorized" compilation using a memset library function, as expected?  The library call may well be faster, assuming those vectors are long enough. If short, alignment may play a role, but inline optimization could depend also on length being known at compile time.  It this fragment is critical for you, you should at least check the opt-report for the vectorized version, preferably look at the generated code (possibly under Advisor or VTune)?

Isn't the legacy !dir$ simd deprecated?

This forum itself may be going in the direction of deprecation.  Intel announced a committee to make recommendations on the roles of the various forums; meanwhile, your question might have got more attention on one of the Fortran forum sections.

Page__Mike
Beginner

"Is the "non-vectorized" compilation using a memset library function, as expected?"

Not that I can see:

nm of the object file:

0000000000000000 T this_code
0000000000000000 d this_code$format_pack.0.1
                 U cfrelvel_module_mp_cfrelvel_
                 U discretelement_mp_des_coll_model_enum_
                 U discretelement_mp_des_crossprdct_
                 U discretelement_mp_des_etan_
                 U discretelement_mp_hert_kn_
                 U discretelement_mp_kn_
                 U discretelement_mp_mew_
                 U for_alloc_allocatable
                 U for_check_mult_overflow64
                 U for_dealloc_allocatable
                 U for_stop_core
                 U for_write_seq_fmt
                 U for_write_seq_fmt_xmit
0000000000000000 r __STRLITPACK_10
0000000000000048 r __STRLITPACK_12.0.1
0000000000000050 r __STRLITPACK_13.0.1
0000000000000000 r var$145.0.1

"Isn't the legacy !dir$ simd deprecated?"

Not sure.

I am working from a 2014 Intel article:

https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control

 

 

jimdempseyatthecove
Honored Contributor III

Try padding your particle structure so that it becomes a multiple of the vector width (16, 32, or 64 bytes, as the case may be).
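For example, a rough sketch of what that could look like, assuming rt is an 8-byte real: the type as listed is 17*8 + 4*4 = 152 bytes, so two extra integer(c_int) fields (a hypothetical pad component) would round it up to 160 bytes, a multiple of 32; a 64-byte multiple would need 192 bytes instead.

  type, bind(C)  :: particle_t
     ! ... real(rt) and integer(c_int) components exactly as listed above ...
     integer(c_int)  :: state
     integer(c_int)  :: pad(2)   ! hypothetical padding: 152 bytes -> 160 bytes (multiple of 32)
  end type particle_t

Since the type is bind(C), any matching C struct would need the same padding members.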

Jim Dempsey

Page__Mike
Beginner

Jim,

Thank you for your response. I have passed it along to a colleague who is actually working with the code.

I will do my best to post any resolution here, or I may continue with more questions.

Page__Mike
Beginner

Jim,

Your suggestion was reported as helpful, though I have not yet been able to test and measure the improvements myself.

Thanks Again.
