mikeitexpert
New Contributor I
174 Views

How to define a memory aligned template pointer as a C++ class member? (intel/gcc)

I have a matrix class with a pointer member (a data pointer to the class template type), and I would like to declare it as a pointer to aligned memory, as a hint to the Intel compiler that it can use aligned accesses in auto-vectorized code. A short version of the code is below:

 

template <class V>
class matrix{
public:

    typedef __declspec(align(64)) V VA;
    VA * data;
    

    int nrows, ncols;
};

 

When I check the optimization report, I see that accesses to data are not treated as aligned unless I use __assume_aligned in the code block where I use matrix.data.

Please let me know if there is an alternative way to tell the Intel/GCC compiler that the data field is an aligned pointer, so that I don't have to use __assume_aligned everywhere the pointer is accessed.
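One option I can think of is to put the hint in a single accessor instead of at every use site. The following is only a minimal sketch under stated assumptions: assume_aligned_ptr, kAlign and the data() accessor are hypothetical names, std::aligned_alloc stands in for _mm_malloc (it needs a C++17 standard library that provides it), and whether the hint survives inlining in a given compiler would still have to be checked in the optimization report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kAlign = 64;  // assumed alignment in bytes (hypothetical name)

// Hypothetical helper: returns p, annotated as kAlign-byte aligned where the
// compiler offers a way to say so. Passing a misaligned pointer is undefined behavior.
template <class T>
inline T* assume_aligned_ptr(T* p) {
#if defined(__INTEL_COMPILER)
    __assume_aligned(p, 64);  // Intel classic compiler: statement-form hint
    return p;
#elif defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, kAlign));
#else
    return p;                 // no hint available on this compiler
#endif
}

template <class V>
class matrix {
public:
    matrix(int nr, int nc) : nrows(nr), ncols(nc) {
        std::size_t bytes = static_cast<std::size_t>(nr) * nc * sizeof(V);
        // std::aligned_alloc expects a size that is a multiple of the alignment.
        bytes = (bytes + kAlign - 1) / kAlign * kAlign;
        if (bytes == 0) bytes = kAlign;
        data_ = static_cast<V*>(std::aligned_alloc(kAlign, bytes));
        // Debug check: the hint in data() is only valid if this holds.
        assert(data_ && reinterpret_cast<std::uintptr_t>(data_) % kAlign == 0);
    }
    ~matrix() { std::free(data_); }
    matrix(const matrix&) = delete;
    matrix& operator=(const matrix&) = delete;

    // Every access goes through the accessor, so the hint is written once.
    V* data() { return assume_aligned_ptr(data_); }
    const V* data() const { return assume_aligned_ptr(data_); }
    int numel() const { return nrows * ncols; }

    int nrows, ncols;

private:
    V* data_;
};
```

A loop would then read A.data()[i] instead of A.data[i]; hoisting the pointer into a local before the loop (float* a = A.data();) is probably the safest way to keep the hint visible to the vectorizer.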

 

Regards

 

7 Replies
RahulV_intel
Moderator
155 Views

Hi,


I've used your code snippet to create a pointer to matrix.data, without making the pointer "__assume_aligned()". Here is the remark from the vectorization report of the created pointer:


 remark #15388: vectorization support: reference mat.data[i] has aligned access


To generate a vectorization report, make use of the below compilation flags:

-qopt-report=5 -qopt-report-phase=vec


If possible, please attach a small reproducible code sample(compilable) and vectorization reports for both the cases (with and without __assume_aligned()).



Thanks,

Rahul


mikeitexpert
New Contributor I
148 Views

Oh wonderful, could you attach the full sample code and the report, please?

I will prepare a small snippet where I see unaligned access.

Regards

Mike

 

 

mikeitexpert
New Contributor I
142 Views

Incorrect reply meant for Rahul was deleted.

mikeitexpert
New Contributor I
124 Views

Dear Rahul,

First, let me clarify what I mean by aligned access. When icl auto-vectorizes a loop, it typically splits it into three loops: a peeled loop, the main vectorized loop, and a remainder loop. However, the peeled loop is not generated if all array pointers in the loop are known to be aligned, simply because there is no need for it: every memory access is aligned from the start of the loop, so no iterations have to be peeled off to reach an aligned address. I am interested in communicating the latter scenario to the compiler.

First, let's look at the example below, which we briefly discussed earlier, to clarify the point:
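To make the peeling idea concrete, here is a small sketch of the arithmetic involved. It is my own illustration, not what the compiler literally emits: peel_iterations is a hypothetical helper, and it assumes the pointer is at least aligned to sizeof(T), so that an aligned address is reachable in whole elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of scalar iterations needed before p + i reaches an address that is
// a multiple of align_bytes (align_bytes must be a power of two).
template <class T>
std::size_t peel_iterations(const T* p, std::size_t align_bytes) {
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    // Bytes past the previous alignment boundary.
    const std::size_t misalign = static_cast<std::size_t>(addr & (align_bytes - 1));
    if (misalign == 0) return 0;  // already aligned: no peeled loop needed
    return (align_bytes - misalign) / sizeof(T);
}
```

For 16-byte vectors of float (vector length 4) the result is at most 3, and it is 0 when the pointer is already aligned, which is the case where the peeled loop can be dropped.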


#include <iostream>
#include <cstdlib>   // malloc

#define MEM_ALIGN (512/8)
#define USE_ALIGNED_MEM

template <class V>
class matrix{
public:

    // typedef __declspec(align(64)) V VA;
    typedef V VA;
    VA * data;   
    int nrows, ncols;

    matrix(int nr, int nc){
        // data = (VA*) _mm_malloc(nr*nc*sizeof(V), MEM_ALIGN);
        data = (VA*) malloc(nr*nc*sizeof(V));

        nrows = nr;
        ncols = nc;
    }

    inline int numel() const{ return nrows*ncols; }
};

#define printline	std::cout << " LINE: " << __LINE__ << std::endl;
int main(){
    printline
    matrix<float> Afm(10, 20);
    matrix<float> Bfm(10, 20);

    printline
    #pragma omp parallel for simd
    for(int i = 0; i < Afm.numel(); i++){
        Afm.data[i] = Afm.data[i] + Bfm.data[i];
    }
    printline
    return 0;
}

 

I use the command line below to compile and generate the optimization report:

icl -Qopt-report-phase:all -Qopt-report:5 test.cpp && test.exe

 

The vectorization part of the report then looks like this:

 

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Peeled loop for vectorization>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=3
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15388: vectorization support: reference Bfm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 1.167
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 4.370 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Alternate Alignment Vectorized Loop>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Remainder loop for vectorization>
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15389: vectorization support: reference Bfm.data[i] has unaligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 1.000
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 4.370 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
LOOP END

 

As I pointed out, a peeled loop is generated, as reported on lines 5 to 8 of the report.

Now let's uncomment line no. 17 and comment out line no. 18, so that _mm_malloc is used instead of malloc, and see what the report looks like:

 

 

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15388: vectorization support: reference Bfm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 5.330 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

 

 

As you can see, no peeled loop is reported. What is even more interesting to me is that I do not even need a declaration specifier to communicate the alignment to ICL, which brings me to the question: how can ICL tell that the access is aligned from the beginning of the loop?

One might say: just use _mm_malloc and you are on the safe side. However, I have a wrapper around both malloc and _mm_malloc, so I guess there must be a similar way (perhaps the one used in the declaration of _mm_malloc) to express the alignment in my wrapper function.
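For GCC and Clang, one way to express this on a wrapper is the assume_aligned function attribute, which states that the returned pointer has at least the given alignment. The sketch below is an assumption on my side, not a description of how _mm_malloc is declared: aligned_alloc64, aligned_free64 and RETURNS_ALIGNED_64 are hypothetical names, std::aligned_alloc stands in for _mm_malloc, and I have not verified whether icl on Windows accepts the attribute, so it is compiled out there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// GCC/Clang function attribute: the returned pointer is at least 64-byte aligned.
#if defined(__GNUC__) || defined(__clang__)
#define RETURNS_ALIGNED_64 __attribute__((assume_aligned(64)))
#else
#define RETURNS_ALIGNED_64  // no equivalent assumed on other compilers
#endif

// The attribute goes on the declaration that callers see.
RETURNS_ALIGNED_64 void* aligned_alloc64(std::size_t bytes);
void aligned_free64(void* p);

void* aligned_alloc64(std::size_t bytes) {
    // std::aligned_alloc expects a size that is a multiple of the alignment.
    std::size_t rounded = (bytes + 63) / 64 * 64;
    if (rounded == 0) rounded = 64;
    void* p = std::aligned_alloc(64, rounded);
    // Debug check: the attribute is a promise, so it must actually hold.
    assert(p == nullptr || reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
    return p;
}

void aligned_free64(void* p) { std::free(p); }
```

Note that the promise only holds if every code path in the wrapper returns aligned memory. A wrapper that sometimes falls back to plain malloc cannot carry this attribute, because malloc gives no 64-byte guarantee.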

Please let me know your thoughts on this. I would appreciate comments from someone well versed in code optimization.

The second dilemma I am facing is why object attributes can't be used in aligned clauses. Let's look at the example below to clarify what I mean:

#define printline	std::cout << " LINE: " << __LINE__ << std::endl;
int main(){
    printline
    matrix<float> Afm(10, 20);
    matrix<float> Bfm(10, 20);

    printline
    #pragma omp parallel for simd aligned(Afm.data:MEM_ALIGN,Bfm.data:MEM_ALIGN)
    for(int i = 0; i < Afm.numel(); i++){
        Afm.data[i] = Afm.data[i] + Bfm.data[i];
    }
    printline
    return 0;
}

 

Below you can see the compilation error message: 

C:\tmp>icl -Qopt-report-phase:all -Qopt-report:5 test.cpp && test.exe
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
test.cpp
test.cpp(34): error: Afm variable cannot be specified in this clause
      #pragma omp parallel for simd aligned(Afm.data:MEM_ALIGN,Bfm.data:MEM_ALIGN)
                                            ^

compilation aborted for test.cpp (code 2)

 

So I have to define a separate pointer (lines 6 and 7 below) for each object attribute to communicate this to the compiler:

 

int main(){
    printline
    matrix<float> Afm(10, 20);
    matrix<float> Bfm(10, 20);

    float *Afm_data = Afm.data;
    float *Bfm_data = Bfm.data;

    printline
    #pragma omp parallel for simd aligned(Afm_data:MEM_ALIGN,Bfm_data:MEM_ALIGN)
    for(int i = 0; i < Afm.numel(); i++){
        Afm_data[i] = Afm_data[i] + Bfm_data[i];
    }
    printline
    return 0;
}

 

Please let me know if there is an easier alternative.

I would much appreciate your comments.
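One variation I can think of is to move the loop into a small helper that takes plain pointers, since function parameters are ordinary variables and can be listed in the aligned clause. This is only a sketch: add_inplace_aligned is a hypothetical name, 64 is the assumed alignment, and I use omp simd here for brevity (the parallel for part could be added back).

```cpp
#include <cassert>
#include <cstdint>

// The parameters a and b are plain pointer variables, so they are valid
// list items for the OpenMP aligned clause.
inline void add_inplace_aligned(float* a, const float* b, int n) {
    // Debug checks: the aligned clause is a promise; breaking it is undefined behavior.
    assert(reinterpret_cast<std::uintptr_t>(a) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 64 == 0);

    #pragma omp simd aligned(a, b : 64)
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}
```

The call site would then shrink to add_inplace_aligned(Afm.data, Bfm.data, Afm.numel()); with no local pointer copies in main.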

Mike

 

 

mikeitexpert
New Contributor I
103 Views

Hi Rahul,

Please let me know if there is any update to this thread.

Regards

RahulV_intel
Moderator
88 Views

Hi,


Apologies for the late response.


>> Oh wonderful, Could you attach the full sample code and the report, please?

>>how ICL can find out the access is aligned since the beginning of the loop?


The code sample that I had used is more or less similar to your first scenario (but using __declspec(align(64))). However, even without the alignment declaration, I could still see the aligned access remark in the report, as you rightly pointed out. Despite the remark, I could see peeled loops in the report. I need to discuss this internally. We will get back to you on this.



>> I am facing is that why object attributes can't be used in aligned clauses?


I will check on this internally and let you know. Meanwhile, can you try using the #pragma vector aligned directive and see if it helps?


Please refer to this documentation for more information:

https://software.intel.com/content/www/us/en/develop/articles/data-alignment-to-assist-vectorization...
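For reference, a minimal sketch of how the directive would be placed. The function name add_inplace_vector_aligned is hypothetical, and the pragma is guarded so that other compilers simply skip it. The directive asserts that every reference in the loop is aligned, so it should only be used when both buffers come from an aligned allocation.

```cpp
#include <cassert>
#include <cstdint>

inline void add_inplace_vector_aligned(float* a, const float* b, int n) {
    // Debug checks: the directive below is only safe if these hold
    // (64 bytes is the alignment assumed throughout this thread).
    assert(reinterpret_cast<std::uintptr_t>(a) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 64 == 0);

#if defined(__INTEL_COMPILER)
    #pragma vector aligned  // Intel-specific: treat all references in the loop as aligned
#endif
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}
```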



Thanks,

Rahul


81 Views

Hi Mike,

Thank you for raising this concern. I will test the code snippet provided and get back to you soon.


Regards,

Subarna

