__assume_aligned could not be resolved (Eclipse)

Hi,

I am using the Intel C++ compiler (2016 Update 3) inside Eclipse. After running the vectorization report, I found that I need to have my class aligned.

I used #include <aligned_new>, but I think __assume_aligned is also necessary, and Eclipse cannot resolve it. I don't know what is missing.

How can I align my class and the arrays inside it? The vec report says: "vectorization support: reference atom has unaligned access".

atom.hpp


#ifndef SRC_HEADERS_ATOM_HPP_
#define SRC_HEADERS_ATOM_HPP_

#include <string>
using namespace std;


class Atom {


    public:
        double *charge,*mass;
        //Constructor

        Atom(unsigned int size);
        ~Atom();

};
#endif


atom.cpp

#include <string>
#include <headers/atom.hpp>
#include <aligned_new>
using namespace std;

Atom::Atom(unsigned int size) {
    charge = new double[size];
    mass = new double[size];
}

Atom::~Atom() {
    delete[] charge;
    delete[] mass;
}

Thanks.

 

0 Kudos
13 Replies
Employee
179 Views

Hi Leandro,
I created a sample test case from your input and could vectorize without any alignment issue on Linux. Can you attach the full test case (just the header/cpp files) and the command line options you used as well so I can reproduce the issue?
Thanks,
Kittur

0 Kudos
179 Views

Hi Kittur,

Thanks for your response.

I attached the file that triggers the message; it uses the class I sent before.

 

Thanks

 

 

 

0 Kudos
Employee
179 Views

Hi Leandro,
Your attachment is incomplete and fails to compile. Can you attach the full test case and make sure it compiles? You can use the -P option to generate the .i preprocessed file and can attach that file. Also, indicate the command line options used as well.
Thanks,
Kittur

0 Kudos
179 Views

Hi Kittur,

The command I used to compile is  icpc  -qopt-report=5 -qopt-report-phase=vec -g -O2 -parallel -std=c++0x -axCORE-AVX2

I created a simple example based on my original (big) code that shows the same problem on my report.

The files are attached.

Thanks.

 

 

 

0 Kudos
Employee
179 Views

Thanks, Leandro. I'll try to reproduce the issue and, if I can, file it with the developers. I'll keep you updated; much appreciated.

Kittur

0 Kudos
Black Belt
179 Views

(Not using Eclipse.) The compiler reports aligned access in the vectorized loop but unaligned access in the remainder loop. This appears to mean that it checks alignment at run time and uses the unaligned code to handle any adjustment.

When I echo the preprocessed source for potential.cpp, there is no aligned_new, but I agree that it might be reasonable for the compiler to assume that new (in 64-bit mode) returns an aligned allocation even without it. If it doesn't observe your compile flag, the allocation might be only 16-byte aligned. For AVX2, you need 32-byte alignment for best performance. The posted example seems to imply that aligned_new gives 64-byte alignment.
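One way to see what alignment an allocation actually has is to test the pointer value at run time. The following is only a minimal sketch; is_aligned and alignment_of_pointer are hypothetical helper names, not part of the attached test case or of any Intel header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the thread): true when p sits on an
// `alignment`-byte boundary. `alignment` must be non-zero.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: largest power-of-two boundary (capped at 4096 bytes)
// that p sits on. Useful for printing what an allocator really returned.
inline std::size_t alignment_of_pointer(const void* p) {
    std::size_t a = 1;
    while (a < 4096 && is_aligned(p, a * 2)) {
        a *= 2;
    }
    return a;
}
```

Calling alignment_of_pointer(charge) right after charge = new double[size] shows whether the buffer happens to meet the 32-byte boundary mentioned above. The result depends on the allocator and on the flags in use, so it is a diagnostic, not a guarantee.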

I checked that adding #pragma vector aligned makes the remainder loop, as well as the main one, assume aligned access. I'd be surprised if you could measure any difference in performance.

If you are targeting 32-bit mode, it might make more sense to define iatom as int, not unsigned int.

0 Kudos
Employee
179 Views

Thanks Tim, yes you're correct.

Hi Leandro,
I tried it as well and, yes, because of heuristics and sometimes for compatibility reasons, the vectorizer needs an explicit hint from the user. In your case, #pragma vector aligned explicitly tells the compiler that the accesses are aligned, so it vectorizes the loop with aligned access. You should use that pragma in your code:

-------------
LOOP BEGIN at main.cpp(15,2)
<Multiversioned v1>
   remark #15388: vectorization support: reference atom has aligned access   [ main.cpp(16,3) ]
   remark #15388: vectorization support: reference atom has aligned access   [ main.cpp(17,3) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 0.083
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 14
   remark #15477: vector loop cost: 6.000
   remark #15478: estimated potential speedup: 2.330
   remark #15487: type converts: 2
   remark #15488: --- end vector loop cost summary ---
LOOP END
-------------------------
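For readers who cannot see the attachment, here is a rough sketch of how the hint might be placed. The function scale_atoms and its loop body are hypothetical, since main.cpp is not shown in the thread. The Intel-specific lines are guarded so that the sketch also builds with GCC or Clang, where __builtin_assume_aligned plays a similar role. Either the pragma or __assume_aligned should be enough on its own with the Intel compiler; both appear only for illustration. If Eclipse still flags __assume_aligned, that is probably its indexer not knowing the compiler built-in rather than a build error, though that is an assumption on my part.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: scales two arrays in place.
// Precondition: charge and mass are 32-byte aligned. If they are not, the
// alignment promise made below is false and the behavior is undefined.
void scale_atoms(double* charge, double* mass, std::size_t n, double factor) {
    // Debug-build check of the precondition.
    assert(reinterpret_cast<std::uintptr_t>(charge) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(mass) % 32 == 0);

#if defined(__INTEL_COMPILER)
    // Intel classic compiler: promise alignment per pointer ...
    __assume_aligned(charge, 32);
    __assume_aligned(mass, 32);
    // ... and/or for every array reference in the loop that follows.
    #pragma vector aligned
#elif defined(__GNUC__)
    // GCC/Clang counterpart: the returned pointer carries the promise.
    charge = static_cast<double*>(__builtin_assume_aligned(charge, 32));
    mass = static_cast<double*>(__builtin_assume_aligned(mass, 32));
#endif
    for (std::size_t i = 0; i < n; ++i) {
        charge[i] *= factor;
        mass[i] *= factor;
    }
}
```

The hint is a promise, not a request: it does not align anything, so the buffers must really be allocated on the stated boundary.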

Kittur

0 Kudos
179 Views

Hi guys,

Thanks for your response! I tested it using that pragma and it solved the problem. Also, thanks for the hint on AVX2 32-byte alignment, since I will be using Xeon Phi soon.

 


 

0 Kudos
Employee
179 Views

Great, thanks Leandro for letting us know. Again, appreciate the feedback and your patience throughout.

Cheers,
Kittur 

0 Kudos
179 Views

>>thanks for the hint on AVX2 32-byte alignment, since I will be using Xeon Phi soon

Xeon Phi KNC does not support AVX2; KNL should. Both support 512-bit vectors (with slightly different instruction sets). Therefore, target your alignment at 64 bytes.
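A minimal sketch of what 64-byte alignment could look like for the Atom class from the first post. It assumes a POSIX system (posix_memalign) and C++11; alloc_doubles_64 is a hypothetical helper name, and this is not what the aligned_new header does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX; assumed available, e.g. on Linux)

// Hypothetical helper: allocate n doubles on a 64-byte boundary.
inline double* alloc_doubles_64(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0) {
        throw std::bad_alloc();
    }
    return static_cast<double*>(p);
}

// Sketch of the thread's Atom class with 64-byte-aligned arrays.
class Atom {
public:
    double* charge;
    double* mass;

    explicit Atom(unsigned int size)
        : charge(alloc_doubles_64(size)), mass(nullptr) {
        try {
            mass = alloc_doubles_64(size);
        } catch (...) {
            free(charge);  // do not leak the first buffer if the second fails
            throw;
        }
    }

    ~Atom() {
        // Memory from posix_memalign is released with free, not delete[].
        free(charge);
        free(mass);
    }

    // Copying would free the same buffers twice, so it is disabled here.
    Atom(const Atom&) = delete;
    Atom& operator=(const Atom&) = delete;
};
```

With buffers allocated this way, an alignment hint of 64 in the loops is a true statement on both AVX2 and 512-bit targets, since 64 is a multiple of 32.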

Jim Dempsey

0 Kudos
Employee
179 Views

Good point, Jim on Xeon Phi.  

Leandro, the Xeon Phi coprocessor (KNC) has a vector unit that's 512 bits wide, with 32 registers per context. But it doesn't support MMX™ technology, Streaming SIMD Extensions (SSE), Intel® Advanced Vector Extensions (Intel AVX), and so on.

So the fundamental considerations should be: 1) Scaling: does your application use the highly parallel capabilities of the Xeon processor? Can it scale? 2) Vectorization: can it use the vector units heavily? 3) Memory usage: does the application need more local memory bandwidth than is available on the Xeon processor?

Regards,
Kittur

0 Kudos
179 Views

Thanks Jim for your comment.

Kittur, about your considerations:

1 - I am running a Monte Carlo simulation, so the calculated angles are independent, and VTune shows the Average and Target at the same number of threads (the maximum available).

2 - The functions that are called extensively are now vectorized (thanks for your previous response). I only need to use DSWP on the last one.

3 - I think, but will check next week, that the application does not need more local memory bandwidth than is available on the Xeon processor. It is not a memory-hungry application but a compute-intensive one.

Thank you, guys, for your help.

0 Kudos
Employee
179 Views

Got it, thanks, Leandro. Yes, if it's compute-intensive, is a good candidate to offload to the coprocessor, and can use those vector units, then offloading it while the rest of the code runs on the Xeon host is something to try as well.

Cheers,
Kittur

0 Kudos