The following lines of code generate compiler reports about failure to parallelize and to vectorize:
allocate(particles(nrp+ngp))
particles( 1:nrp) = rparticles
particles(nrp+1: ) = gparticles
declarations:
use particle_mod , only: particle_t
.
type(particle_t), intent(in ) :: rparticles(nrp), gparticles(ngp)
reports:
LOOP BEGIN at calc.f90(76,7)
remark #17104: loop was not parallelized: existence of parallel dependence
remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (76:7) and call:?1memcpy (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and PARTICLES(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and PARTICLES(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (76:7) and rparticles(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between rparticles(:) (76:7) and rparticles(:) (76:7)
remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized
LOOP END
LOOP BEGIN at calc.f90(77,7)
remark #17104: loop was not parallelized: existence of parallel dependence
remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between call:?1memcpy (77:7) and call:?1memcpy (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and PARTICLES(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and PARTICLES(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between PARTICLES(:) (77:7) and gparticles(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)
remark #17106: parallel dependence: assumed OUTPUT dependence between gparticles(:) (77:7) and gparticles(:) (77:7)
remark #15527: loop was not vectorized: function call to ?1memcpy cannot be vectorized
LOOP END
particle_t is defined as:
public particle_t
public print_particles
type, bind(C) :: particle_t
real(rt) :: pos(3) !< Position -- fortran component 1
real(rt) :: radius !< Radius -- fortran component 4
real(rt) :: volume !< Volume -- fortran component 5
real(rt) :: mass !< Mass -- fortran component 6
real(rt) :: density !< Density -- fortran component 7
real(rt) :: omoi !< One over momentum of inertia -- fortran component 8
real(rt) :: vel(3) !< Linear velocity -- fortran components 9,10,11
real(rt) :: omega(3) !< Angular velocity -- fortran components 12,13,14
real(rt) :: drag(3) !< Drag -- fortran components 15,16,17
integer(c_int) :: id
integer(c_int) :: cpu
integer(c_int) :: phase
integer(c_int) :: state
end type particle_t
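For reference, here is a minimal self-contained sketch of the pattern being compiled. The module layout, the rt kind, and the subroutine name merge_particles are assumptions made only so the fragment compiles; the two copy statements match the ones above.
module particle_mod
   use iso_c_binding, only: c_int, c_double
   implicit none
   integer, parameter :: rt = c_double   ! assumed kind for real(rt)
   type, bind(C) :: particle_t
      real(rt)       :: pos(3), radius, volume, mass, density, omoi
      real(rt)       :: vel(3), omega(3), drag(3)
      integer(c_int) :: id, cpu, phase, state
   end type particle_t
end module particle_mod

subroutine merge_particles(rparticles, nrp, gparticles, ngp)
   use particle_mod, only: particle_t
   implicit none
   integer,          intent(in) :: nrp, ngp
   type(particle_t), intent(in) :: rparticles(nrp), gparticles(ngp)
   type(particle_t), allocatable :: particles(:)
   ! The two copies below are the statements flagged in the opt-report.
   allocate(particles(nrp+ngp))
   particles( 1:nrp) = rparticles
   particles(nrp+1: ) = gparticles
end subroutine merge_particles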
I have continued to look at this issue and have encountered information on streaming stores and non-temporal writes:
https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control
adding directives:
!DIR$ vector nontemporal
!DIR$ simd
or adding the compile option -qopt-streaming-stores
!DIR$ simd vectorizes the loops, but the compiler reports:
scalar cost: 89
vector cost: 159.87
I will be testing the change soon.
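For context, a sketch of how the nontemporal directive could be attached to the copies, rewritten as explicit DO loops (the loop form and exact placement are assumptions, since the original calc.f90 is not shown; !DIR$ simd, or the newer !$omp simd, could be placed on the same loops instead):
   integer :: i

   !DIR$ vector nontemporal
   do i = 1, nrp
      particles(i) = rparticles(i)
   end do

   !DIR$ vector nontemporal
   do i = 1, ngp
      particles(nrp+i) = gparticles(i)
   end do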
Is the "non-vectorized" compilation using a memset library function, as expected? The library call may well be faster, assuming those vectors are long enough. If short, alignment may play a role, but inline optimization could depend also on length being known at compile time. It this fragment is critical for you, you should at least check the opt-report for the vectorized version, preferably look at the generated code (possibly under Advisor or VTune)?
Isn't the legacy !dir$ simd deprecated?
This forum itself may be going in the direction of deprecation. Intel announced a committee to make recommendations on the roles of the various forums; meanwhile, your question might have got more attention on one of the Fortran forum sections.
"Is the "non-vectorized" compilation using a memset library function, as expected?"
Not that I can see:
nm of the object file:
0000000000000000 T this_code
0000000000000000 d this_code$format_pack.0.1
U cfrelvel_module_mp_cfrelvel_
U discretelement_mp_des_coll_model_enum_
U discretelement_mp_des_crossprdct_
U discretelement_mp_des_etan_
U discretelement_mp_hert_kn_
U discretelement_mp_kn_
U discretelement_mp_mew_
U for_alloc_allocatable
U for_check_mult_overflow64
U for_dealloc_allocatable
U for_stop_core
U for_write_seq_fmt
U for_write_seq_fmt_xmit
0000000000000000 r __STRLITPACK_10
0000000000000048 r __STRLITPACK_12.0.1
0000000000000050 r __STRLITPACK_13.0.1
0000000000000000 r var$145.0.1
"Isn't the legacy !dir$ simd deprecated?"
Not sure.
I am working from a 2014 Intel article:
https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control
Try padding your particle structure such that it becomes a multiple of the vector width (16, 32, or 64 bytes, as the case may be).
Jim Dempsey
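A minimal sketch of what such padding could look like, assuming rt is an 8-byte real (c_double): the type above is then 17*8 + 4*4 = 152 bytes, so two extra c_int words bring it to 160 bytes (a multiple of 16 and 32), while five extra reals would bring it to 192 bytes (a multiple of 64). The pad field names are made up, and if the type is shared with C code the matching C struct would need the same padding:
type, bind(C) :: particle_t
   real(rt)       :: pos(3), radius, volume, mass, density, omoi
   real(rt)       :: vel(3), omega(3), drag(3)
   integer(c_int) :: id, cpu, phase, state
   integer(c_int) :: pad(2)     ! 152 -> 160 bytes: multiple of 16- and 32-byte vector widths
   ! real(rt)     :: pad64(5)   ! alternative: 152 -> 192 bytes for a 64-byte vector width
end type particle_t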
Jim,
Thank you for your response. I have passed it along to a colleague who is actually working with the code.
I will do my best to post any resolution here, or I may follow up with more questions.
Jim,
Your suggestion was reported as helpful, though I have not yet been able to test and measure the improvement myself.
Thanks Again.