How to define a memory aligned template pointer as a C++ class member? (intel/gcc)

mikeitexpert · ‎04-20-2021

I have a matrix class template with a pointer member (a data pointer to the template's element type), and I would like to declare it as a pointer to aligned memory, as a hint to the Intel compiler that it can use aligned accesses in auto-vectorized code. A short version of the code is below:

template <class V>
class matrix{
public:

    typedef __declspec(align(64)) V VA;
    VA * data;
    

    int nrows, ncols;
};

When I check the optimization report, I see that access to data is not aligned unless I use __assume_aligned in the code block where I use matrix.data.

Please let me know if there is an alternative way to tell the Intel/gcc compiler that the data field is an aligned pointer, so that I don't have to use __assume_aligned everywhere the pointer is accessed.
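One direction that might work is to put the hint in a single accessor, so that loops take their pointer from there instead of repeating the annotation. The sketch below is only an illustration under assumptions: `assume_aligned_hint` and `aligned_data` are hypothetical names, the class is simplified to a non-owning view over storage that is already 64-byte aligned, and `__builtin_assume_aligned` is the GCC/Clang builtin (on icl, `__assume_aligned` would have to take its place). Whether a given compiler carries the hint through the inlined accessor still has to be checked in the optimization report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns p annotated as Align-byte aligned.
// Uses the GCC/Clang builtin; other compilers get the pointer back unchanged.
template <std::size_t Align, class T>
inline T* assume_aligned_hint(T* p) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, Align));
#else
    return p;  // no hint available: still correct, just not annotated
#endif
}

// Simplified, non-owning variant of the matrix class (an assumption of this
// sketch): the caller passes storage that is already 64-byte aligned.
template <class V>
class matrix {
public:
    static const std::size_t kAlign = 64;

    V* data;
    int nrows, ncols;

    matrix(V* storage, int nr, int nc) : data(storage), nrows(nr), ncols(nc) {
        // The hint is only valid if the storage really is aligned.
        assert(reinterpret_cast<std::uintptr_t>(storage) % kAlign == 0);
    }

    int numel() const { return nrows * ncols; }

    // Single place where the alignment promise is made.
    V* aligned_data() { return assume_aligned_hint<kAlign>(data); }
    const V* aligned_data() const { return assume_aligned_hint<kAlign>(data); }
};

// Loops take their pointers from aligned_data() once, outside the loop.
template <class V>
void add_in_place(matrix<V>& a, const matrix<V>& b) {
    V* pa = a.aligned_data();
    const V* pb = b.aligned_data();
    const int n = a.numel();
    for (int i = 0; i < n; ++i) {
        pa[i] = pa[i] + pb[i];
    }
}
```

The promise is unchecked in release builds, so passing unaligned storage would be undefined behavior.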

Regards

RahulV_intel · ‎04-22-2021

Hi,

I've used your code snippet to create a pointer to matrix.data, without applying __assume_aligned() to it. Here is the remark from the vectorization report for that pointer:

remark #15388: vectorization support: reference mat.data[i] has aligned access

To generate a vectorization report, make use of the below compilation flags:

-qopt-report=5 -qopt-report-phase=vec

If possible, please attach a small, compilable code sample that reproduces the issue, along with the vectorization reports for both cases (with and without __assume_aligned()).

Thanks,

Rahul

mikeitexpert · ‎04-22-2021

Oh wonderful, could you attach the full sample code and the report, please?

I will prepare a small snippet where I see unaligned access ...

Regards

Mike

mikeitexpert · ‎04-22-2021

An incorrect reply meant for Rahul was deleted.

mikeitexpert · ‎04-22-2021

Dear Rahul,

First, let me clarify what I mean by aligned access. When icl auto-vectorizes a loop, it splits it into three loops: a peeled loop, a main loop, and a remainder loop. However, no peeled loop is generated if all array pointers in the loop are known to be aligned, simply because there is no need for one. Put another way, all memory accesses are aligned from the start of the loop, so there is no need to peel off iterations until aligned access becomes possible. I am interested in communicating this latter scenario to the compiler.
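To make the peel / main / remainder split concrete, here is a minimal sketch of how the three trip counts could be derived from a pointer's address. It only illustrates the idea and is not what the compiler actually emits: `LoopSplit`, `split_for_alignment` and `add_split` are hypothetical names, only the alignment of the first pointer is considered, and a 16-byte vector holding 4 floats is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: trip counts of the three loops.
struct LoopSplit {
    std::size_t peel;       // scalar iterations until the pointer is aligned
    std::size_t main;       // iterations covered by full, aligned vectors
    std::size_t remainder;  // leftover iterations after the last full vector
};

// Splits n iterations over p for vectors of vec_bytes bytes.
template <class T>
LoopSplit split_for_alignment(const T* p, std::size_t n, std::size_t vec_bytes) {
    assert(vec_bytes != 0 && vec_bytes % sizeof(T) == 0);
    const std::size_t lanes = vec_bytes / sizeof(T);
    const std::size_t mis = reinterpret_cast<std::uintptr_t>(p) % vec_bytes;
    assert(mis % sizeof(T) == 0);  // p must at least be element-aligned

    // Elements to process one by one until p + peel is vec_bytes-aligned.
    std::size_t peel = (mis == 0) ? 0 : (vec_bytes - mis) / sizeof(T);
    if (peel > n) peel = n;

    const std::size_t rest = n - peel;
    return LoopSplit{peel, rest - rest % lanes, rest % lanes};
}

// The same element-wise add, written as three explicit loops.
inline void add_split(float* a, const float* b, std::size_t n) {
    const std::size_t lanes = 4;  // assumed: 16-byte vectors of float
    const LoopSplit s = split_for_alignment(a, n, lanes * sizeof(float));
    std::size_t i = 0;
    for (; i < s.peel; ++i) a[i] += b[i];            // peeled loop
    for (; i < s.peel + s.main; i += lanes)          // main loop, one vector per step
        for (std::size_t k = 0; k < lanes; ++k) a[i + k] += b[i + k];
    for (; i < n; ++i) a[i] += b[i];                 // remainder loop
}
```

When `a` is already aligned, `peel` is 0 and the first loop never runs, which is the case I would like the compiler to recognize at compile time.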

First, let's look at the example below, which we briefly discussed earlier, to clarify this further:


#include <iostream>

#define MEM_ALIGN (512/8)
#define USE_ALIGNED_MEM

template <class V>
class matrix{
public:

    // typedef __declspec(align(64)) V VA;
    typedef V VA;
    VA * data;   
    int nrows, ncols;

    matrix(int nr, int nc){
    // data = (VA*) _mm_malloc(nr*nc*sizeof(V), MEM_ALIGN);
    data = (VA*) malloc(nr*nc*sizeof(V));

	nrows = nr;
	ncols = nc;
    }

    inline int numel() const{ return nrows*ncols; }
};

#define printline	std::cout << " LINE: " << __LINE__ << std::endl;
int main(){
    printline
	matrix<float> Afm(10, 20);
	matrix<float> Bfm(10, 20);	

    printline
    #pragma omp parallel for simd 
	for(int i = 0; i < Afm.numel(); i++){
		Afm.data[i] = Afm.data[i] + Bfm.data[i];
	}
    printline
	return 0;
}

I use the command line below to compile the code and generate the optimization report:

icl -Qopt-report-phase:all -Qopt-report:5 test.cpp && test.exe

The vectorization part of the report then looks like this:

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Peeled loop for vectorization>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=3
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15388: vectorization support: reference Bfm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 1.167
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 4.370 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Alternate Alignment Vectorized Loop>
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
<Remainder loop for vectorization>
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15389: vectorization support: reference Bfm.data[i] has unaligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 1.000
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 4.370 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
LOOP END

As mentioned, a peeled loop is generated, as reported on lines 5 to 8 of the report.

Now let's uncomment line 17 and comment out line 18, so that _mm_malloc is used instead of malloc, and see what the report looks like:

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(35,2)
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,3) ]
   remark #15388: vectorization support: reference Afm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,17) ]
   remark #15388: vectorization support: reference Bfm.data[i] has aligned access   [ C:\Users\Mehdi-laptop\sico\ark_cpplab\tmp\test.cpp(36,31) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8 
   remark #15477: vector cost: 1.500 
   remark #15478: estimated potential speedup: 5.330 
   remark #15488: --- end vector cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=25
LOOP END

As you can see, no peeled loop is reported. What is even more interesting to me is that I do not even need a declaration specifier to communicate the alignment to ICL, which brings me to the question: how does ICL find out that the access is aligned from the beginning of the loop?

One might say that using _mm_malloc keeps me on the safe side. However, I have a wrapper around both malloc and _mm_malloc, so I guess there must be a similar way (perhaps the one used in the _mm_malloc declaration) to express this in my wrapper function.

Please let me know your thoughts on this. I would appreciate comments from a well-versed code optimization expert.

My second question is why object attributes can't be used in aligned clauses. Let's look at the example below to clarify what I mean:

#define printline	std::cout << " LINE: " << __LINE__ << std::endl;
int main(){
    printline
	matrix<float> Afm(10, 20);
	matrix<float> Bfm(10, 20);	

    printline
    #pragma omp parallel for simd aligned(Afm.data:MEM_ALIGN,Bfm.data:MEM_ALIGN)
	for(int i = 0; i < Afm.numel(); i++){
		Afm.data[i] = Afm.data[i] + Bfm.data[i];
	}
    printline
	return 0;
}

Below you can see the compilation error message:

C:\tmp>icl -Qopt-report-phase:all -Qopt-report:5 test.cpp && test.exe
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
test.cpp
test.cpp(34): error: Afm variable cannot be specified in this clause
      #pragma omp parallel for simd aligned(Afm.data:MEM_ALIGN,Bfm.data:MEM_ALIGN)
                                            ^

compilation aborted for test.cpp (code 2)

So I have to define a separate pointer for each object attribute (lines 6 and 7 below) to communicate this to the compiler:

int main(){
    printline
	matrix<float> Afm(10, 20);
	matrix<float> Bfm(10, 20);	

    float *Afm_data = Afm.data;
    float *Bfm_data = Bfm.data;

    printline
    #pragma omp parallel for simd aligned(Afm_data:MEM_ALIGN,Bfm_data:MEM_ALIGN)
	for(int i = 0; i < Afm.numel(); i++){
		Afm_data[i] = Afm_data[i] + Bfm_data[i];
	}
    printline
	return 0;
}

Please let me know if there is an easier alternative.
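One variation I can think of is to move the loop into a small helper that takes raw pointers, so that the aligned clause names function parameters and the extra locals are written only once. This is a sketch under assumptions: `matrix_view`, `add_in_place_aligned` and `add_in_place` are hypothetical names, and the view is a simplified, non-owning stand-in for the matrix class above.

```cpp
#include <cassert>
#include <cstddef>

#define MEM_ALIGN (512 / 8)

// Minimal stand-in for the matrix class (non-owning; an assumption of this sketch).
template <class V>
struct matrix_view {
    V* data;
    int nrows, ncols;
    int numel() const { return nrows * ncols; }
};

// Raw-pointer parameters can be listed in aligned(), unlike Afm.data.
// Precondition: a and b really are MEM_ALIGN-byte aligned; otherwise the
// behavior is undefined.
inline void add_in_place_aligned(float* a, const float* b, int n) {
    #pragma omp parallel for simd aligned(a, b : MEM_ALIGN)
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}

// Call sites keep working with the objects; the pointers are extracted here.
inline void add_in_place(matrix_view<float>& A, const matrix_view<float>& B) {
    assert(A.numel() == B.numel());
    add_in_place_aligned(A.data, B.data, A.numel());
}
```

Without OpenMP enabled the pragma is ignored and the loop still produces the same result, just without the alignment information.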

Much appreciate your comments.

Mike

mikeitexpert · ‎04-27-2021

Hi Rahul,

Please let me know if there is any update to this thread.

Regards

RahulV_intel · ‎04-28-2021

Hi,

Apologies for the late response.

>> Oh wonderful, Could you attach the full sample code and the report, please?

>>how ICL can find out the access is aligned since the beginning of the loop?

The code sample that I had used is more or less similar to your first scenario (but using __declspec(align(64))). However, even without the alignment declaration, I could still see the aligned access remark in the report, as you rightly pointed out. Despite the remark, I could see peeled loops in the report. I need to discuss this internally. We will get back to you on this.

>> I am facing is that why object attributes can't be used in aligned clauses?

I will check on this internally and let you know. Meanwhile, can you try using the #pragma vector aligned directive and see if it helps?

Please refer to this documentation for more information:

https://software.intel.com/content/www/us/en/develop/articles/data-alignment-to-assist-vectorization.html
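As a rough illustration of the directive, a loop annotated with it might look like the sketch below. The function name `add_vector_aligned` is made up for this example, and the guard on `__INTEL_COMPILER` is an assumption so that other compilers simply skip the Intel-specific pragma. The directive asserts that every reference in the loop is aligned, so it should only be used when that is really the case.

```cpp
#include <cassert>
#include <cstddef>

// Precondition: a and b are aligned to the vector width the compiler will use;
// the pragma is a promise to the compiler, not a check.
inline void add_vector_aligned(float* a, const float* b, int n) {
    assert(n >= 0);
#if defined(__INTEL_COMPILER)
    #pragma vector aligned
#endif
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];
    }
}
```

With the class from your example, the call would be along the lines of `add_vector_aligned(Afm.data, Bfm.data, Afm.numel());`.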

Thanks,

Rahul

Subarnarek_G_Intel · ‎04-28-2021

Hi Mike,

Thank you for raising this concern. I will test the code snippet provided and get back to you soon.

Regards,

Subarna

Subarnarek_G_Intel · ‎05-11-2021

Hi Mike,

Did the pragma help you in any way?

Regards,

Subarna

Subarnarek_G_Intel · ‎05-13-2021

Hi Mike,

I hope I understand what you are trying to explain. Before answering your questions, let me clarify a few concepts.

_mm_malloc gives the compiler a hint that the memory access is aligned. Alignment ensures that no peeled loop is generated, which is what happens in the second example.

Using only malloc lowers the chances of the loop being aligned. In this case, the loop is not aligned, which results in peeled loops.

You have three questions here:

>> What is even more interesting to me is that I do not even need to use declaration specification to communicate alignment with ICL, which brings me to the question: how ICL can find out the access is aligned since the beginning of the loop?

It is not always necessary to give hints to the compiler. The Intel compiler is smart enough to do certain vectorizations on its own, but whenever it sees a slight chance of performance degradation it stops there. Moreover, _mm_malloc acted as a hint to the compiler here, suggesting that the user wants the loop to be aligned.

>> I have a wrapper around both malloc and _mm_malloc so I guess there must be a similar way (as perhaps used in _mm_malloc declaration) to reflect that in my wrapper function.

I didn't understand what you meant by a wrapper here. I don't see any user-defined wrapper class in the code. Can you explain what you meant by this?

>> So I have to define a separate pointer (line 6 & 7 below) for each object attribute like below to communicate this with compiler. Please let me know if there is an easier alternative.

If you look at https://www.openmp.org/spec-html/5.1/openmpsu49.html in the OpenMP specification, you will see that for C++ it clearly states that array, pointer, reference to array, and reference to pointer are the only supported types for the aligned clause. That is why you get the error when you try to pass an object member. I think creating local pointers, as you did, is a pretty easy alternative.

Regards,

Subarna

mikeitexpert · ‎05-15-2021

Hello Subarna,

I use the wrapper because we are using an internal heap for our application. The memory for the heap is allocated at the beginning as the program starts (of course using _mm_malloc) and later I use my internal heap API / ("wrapper functions") to allocate memory from the internal heap.

I was wondering how I can communicate the alignment of memory allocated by the API/wrapper. I am hoping there is a pragma or something similar with which I can tell the compiler that the API has the same effect as _mm_malloc, so that I don't have to do it for every loop using the alignment clause.

I guess I am just looking for a better way to tell the compiler that the void* returned by my API/wrapper is an aligned memory address.

The other alternative I am not sure about is whether switching to a oneAPI C++ compiler such as icx or dpcpp would be more helpful than icl. I would highly appreciate your comments, if any.

Regards

Subarnarek_G_Intel · ‎05-26-2021

Hi Mike,

Is it possible to share the wrapper class with me so that I get a better understanding?

Regards,

Subarna

mikeitexpert · ‎05-29-2021

Yes sure ...

void *malloc_wrapper(size_t sz, size_t alignment){
    return _mm_malloc(sz, alignment);
}

void free_wrapper (void * mem_addr){
    _mm_free(mem_addr);
}
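What I have in mind is something like an attribute on the wrapper's declaration. GCC and Clang provide `__attribute__((alloc_align(N)))`, which states that the returned pointer is aligned to the value of the N-th argument; whether icl on Windows honors an equivalent is not something I have confirmed. In the sketch below the macro `RETURNS_ALIGNED_TO_ARG` is a hypothetical name, and the body over-allocates with `std::malloc` as a portable stand-in for `_mm_malloc` and for the internal heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical macro: expands to the GCC/Clang attribute, or to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#  define RETURNS_ALIGNED_TO_ARG(pos) __attribute__((alloc_align(pos)))
#else
#  define RETURNS_ALIGNED_TO_ARG(pos)
#endif

// The attribute tells the compiler that the result is aligned to argument 2.
RETURNS_ALIGNED_TO_ARG(2)
void* malloc_wrapper(std::size_t sz, std::size_t alignment) {
    // alignment must be a power of two and at least pointer-sized
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);

    // Over-allocate: alignment slack plus one slot for the original pointer.
    void* raw = std::malloc(sz + alignment + sizeof(void*));
    if (!raw) return nullptr;

    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;

    // Remember the pointer that std::free expects, just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

void free_wrapper(void* mem_addr) {
    if (mem_addr) std::free(static_cast<void**>(mem_addr)[-1]);
}
```

Even with the attribute, the alignment is only known at the call site; once the pointer is stored in a class member and read back in another function, the compiler may no longer see it, so the report still needs checking.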

Subarnarek_G_Intel · ‎06-07-2021

Hi Mike,

Please refer to this open-source implementation of _mm_malloc (https://github.com/adamcroissant/mm_malloc/blob/master/mm.c). I found it through a Google search; it is pretty easy to find. Let me know if you are looking for something else.

Regards,

Subarna

Subarnarek_G_Intel · ‎08-18-2021

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.