Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vectorization - memory alignment

Pascal1
Beginner
1,039 Views

Hello

I am programming on an i7-2600K, which should support AVX and so allow me to align data to 32 bytes.

I have a class like this:

#ifdef __INTEL_COMPILER
    class __declspec(align(32)) Color
#else
    class Color
#endif
{
  public:
    float r;
    float g;
    float b;
    float a;

    Color(){}
    Color(float _r, float _g, float _b, float _a) : r(_r), g(_g), b(_b), a(_a){}
  ~Color(){}
};

and then I have:

#ifdef __INTEL_COMPILER
    __assume_aligned(newImg.data,32);
    __assume_aligned(img.data,32);
#endif
    #pragma omp parallel num_threads(16)
    {
        #pragma omp for
        #pragma ivdep
        for (int y=newImg.getExtents(2); y<newImg.getExtents(3); y++)
            for (int x=newImg.getExtents(0); x<newImg.getExtents(1); x++)
            {
                int indexDst = (y - newImg.getExtents(2))*widthDst + (x - newImg.getExtents(0));
                int indexSrc = (y -    img.getExtents(2))*widthSrc + (x -    img.getExtents(0));
                
                newImg.data[indexDst].r = img.data[indexSrc].r;
                newImg.data[indexDst].g = img.data[indexSrc].g;
                newImg.data[indexDst].b = img.data[indexSrc].b;
                newImg.data[indexDst].a = img.data[indexSrc].a;
            }
    }

where newImg.data has been declared as Color *data;

If I do not use any alignment, the code is faster than when I do! Is there anything that I am missing? Am I not declaring the class to be aligned properly, or is __assume_aligned not being used properly?

 

I am compiling with: -xAVX -ipo -O3 -openmp -std=c++11

Am I missing something here?

thanks

 

8 Replies
TimP
Honored Contributor III

You may wish to check the compiler's diagnostics on alignment assumption (and the effect of assume_aligned), by adding -qopt-report4 (for 15.0 compiler, spelling has changed with each major version).

KitturGanesh
Employee

I agree with Tim: the diagnostics should give some insight into what's going on with memory access (unaligned access, unaligned unit stride, and so on). BTW, the assume_aligned directive should be used as a hint when the compiler cannot easily work out the alignment on its own; it describes the data pointed to, not the pointer variable itself. That said, if the data is actually unaligned, there could be an overhead. If not, it could be a bug that needs to be looked at. If you can attach a small reproducer, I can take a look at it, thanks.
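One overhead worth ruling out, offered here as an assumption rather than a diagnosis confirmed in this thread: forcing 32-byte alignment on a class that holds only four floats pads every element from 16 to 32 bytes, so an array of such elements occupies twice the memory. The sketch below uses the standard C++11 alignas specifier in place of __declspec(align(32)); the names PlainColor, PaddedColor and arrayBytes are hypothetical stand-ins, not code from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the Color class in the question.
struct PlainColor { float r, g, b, a; };               // natural alignment of float
struct alignas(32) PaddedColor { float r, g, b, a; };  // forced 32-byte alignment

// Bytes occupied by an array of n elements of type T.
template <typename T>
constexpr std::size_t arrayBytes(std::size_t n) { return n * sizeof(T); }

// sizeof is always a multiple of the alignment, so the 16 bytes of
// member data are followed by 16 bytes of padding in PaddedColor.
static_assert(sizeof(PlainColor) == 16, "four packed floats");
static_assert(sizeof(PaddedColor) == 32, "padded up to the alignment");
static_assert(alignof(PaddedColor) == 32, "alignment as requested");
```

If the same holds for the class in the question, a copy loop over Color elements would move twice as many bytes per pixel, which may matter more than the alignment gain. Aligning the start of the array (for example with _mm_malloc) while leaving the element type at its natural size is one alternative to consider.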

_Kittur

Pascal1
Beginner

Thanks, using "-qopt-report4" proved to be quite useful. I also wrote a test code:

#include "work.h"
void doWork(int x)
{
	Timer clock;
	const int ALIGN = 32;

	float *a, *b, *c;

	a = (float *)_mm_malloc( x * sizeof(float), ALIGN);
	b = (float *)_mm_malloc( x * sizeof(float), ALIGN);
	c = (float *)_mm_malloc( x * sizeof(float), ALIGN);

	clock.start();

	#pragma omp parallel
	#pragma omp for simd

	//#pragma simd
	for (int i=0; i<x; i++){
		c[i] = a[i] + b[i];
	}

	clock.stop();

	std::cout << "Duration: " << clock.getDuration() << std::endl;
	std::cout << "[100]: " << a[100] << ", " << b[100] << ", " << c[100] << std::endl;
}

My compilation line in my makefile is:

CC = icc -O3 -axAVX -openmp -qopt_report4 -std=c++11

I am running with x = 400000000; the time taken, in seconds, is shown on the Duration line of each result.

and these are the results I get - with #pragma simd enabled and both #pragma omp parallel & #pragma omp for simd disabled:

#pragma simd
    for (int i=0; i<x; i++){
        c[i] = a[i] + b[i];
    }

LOOP BEGIN at work.cpp(46,2)
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(47,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(47,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(47,3) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 9 
   remark #15477: vector loop cost: 1.250 
   remark #15478: estimated potential speedup: 9.370 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(46,2)
<Remainder>
   remark #15389: vectorization support: reference c has unaligned access   [ work.cpp(47,3) ]
   remark #15389: vectorization support: reference a has unaligned access   [ work.cpp(47,3) ]
   remark #15389: vectorization support: reference b has unaligned access   [ work.cpp(47,3) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ work.cpp(47,3) ]
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END


Duration: ~ 0.72

I do not understand how array a,b and c are initially found to be aligned but then suddenly unaligned!

But now if I compile using: CC = icc -openmp -qopt_report4 -std=c++11

and disable all the #pragma lines in my code, this is what I get:

    //#pragma omp parallel
    //#pragma omp for simd

    //#pragma simd
    for (int i=0; i<x; i++){
        c[i] = a[i] + b[i];
    }

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at work.cpp(45,2)
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(46,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(46,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(46,3) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11 
   remark #15477: vector loop cost: 2.500 
   remark #15478: estimated potential speedup: 7.970 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(45,2)
<Remainder>
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(46,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(46,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(46,3) ]
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at work.cpp(45,2)
<Remainder>
LOOP END

Duration: ~ 0.65

The code is marginally faster and all memory is aligned!!! I really do not understand that!!!

 

If I disable #pragma simd but enable both #pragma omp parallel & #pragma omp for simd

    #pragma omp parallel
    #pragma omp for simd

    //#pragma simd
    for (int i=0; i<x; i++){
        c[i] = a[i] + b[i];
    }


OpenMP Construct at work.cpp(42,2)
   remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at work.cpp(46,2)
<Peeled>
LOOP END

LOOP BEGIN at work.cpp(46,2)
   remark #15389: vectorization support: reference c has unaligned access   [ work.cpp(47,3) ]
   remark #15389: vectorization support: reference a has unaligned access   [ work.cpp(47,3) ]
   remark #15389: vectorization support: reference b has unaligned access   [ work.cpp(47,3) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ work.cpp(47,3) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15450: unmasked unaligned unit stride loads: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 9 
   remark #15477: vector loop cost: 3.000 
   remark #15478: estimated potential speedup: 5.430 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(46,2)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at work.cpp(46,2)
<Remainder>
LOOP END

 

Duration: 0.152983

All the cores are being used and the code is faster but this time all the memory is completely unaligned!

But if the code is like this:

	//#pragma omp parallel
	//#pragma omp for

	//#pragma simd

	#pragma omp parallel num_threads(16)
	{

		#pragma omp for
		for (int i=0; i<x; i++){


			c[i] = a[i] + b[i];
		}
	}

This is the result:

OpenMP Construct at work.cpp(46,2)
   remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at work.cpp(50,3)
<Peeled>
LOOP END

LOOP BEGIN at work.cpp(50,3)
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(53,4) ]
   remark #15389: vectorization support: reference a has unaligned access   [ work.cpp(53,4) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(53,4) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ work.cpp(53,4) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15450: unmasked unaligned unit stride loads: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11 
   remark #15477: vector loop cost: 3.000 
   remark #15478: estimated potential speedup: 6.530 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(50,3)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at work.cpp(50,3)
<Remainder>
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(53,4) ]
   remark #15389: vectorization support: reference a has unaligned access   [ work.cpp(53,4) ]
   remark #15389: vectorization support: reference b has unaligned access   [ work.cpp(53,4) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ work.cpp(53,4) ]
LOOP END

Duration: ~0.15

The code is faster and there is some alignment and I have removed all optimizations; my CC line is CC = icc -openmp -qopt_report4 -std=c++11

I really do not understand what is happening.

My main.cpp

#include "work.h"

int main(int argc, char* argv[])
{   
    //
    // Read in arguments
    long x = 1000000;    // default is 16 but it can change
    if (argc > 1)
	   x = atoi(argv[1]);
	
    doWork(x);

    return 0;
}

work.h

#include "../utilities.h"
#include "../timer.h"

#include <stdlib.h>
#include <unistd.h>
#include <malloc.h>
#include <iostream>
#include <omp.h>

void doWork(int x);

 

timer.h

#include <chrono>

class Timer{
  	std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
    std::chrono::duration<double> elapsed_seconds;

  public:

	Timer();
	~Timer();

	void start(){ startTime = std::chrono::system_clock::now(); }
	void stop(){ endTime = std::chrono::system_clock::now(); elapsed_seconds = endTime - startTime; }
	double getDuration(){ return elapsed_seconds.count(); }	// time in seconds
};

inline Timer::Timer(){}
inline Timer::~Timer(){}

My makefile:

SOURCES = main.cpp work.cpp
OBJECTS = $(SOURCES:.cpp=.o)
EXE = vecTest

CC = icc -openmp -qopt_report4 -std=c++11

CFLAGS = -c
FLAGS =

all: $(SOURCES) $(EXE)

$(EXE): $(OBJECTS)
	$(CC) $(OBJECTS) $(FLAGS) -o $@

.cpp.o:
	$(CC) $(CFLAGS) $< -o $@

 

TimP
Honored Contributor III

It's a little strange that option -axAVX appears to result in ignoring the alignments you specified.  The diagnostics indicate that the compiler thinks a remainder loop may be needed to adjust for alignment, but it recognizes that all the operands are aligned consistently.  I don't like to use that option if it's not necessary. Did you try -xAVX?  We've seen cases where -xAVX didn't show any advantage, but -xCORE-AVX2 did.

When you build for OpenMP parallel, the compiler has to provide for unaligned access, unless you add alignment pragmas and take responsibility for making the array size a multiple of vector width (32 bytes) times number of threads, so that individual chunks come out aligned.
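The sizing rule in the previous paragraph can be sketched as follows. This is an illustration under stated assumptions: the iteration space is split into equal contiguous chunks, one per thread (roughly what a static schedule does when the count divides evenly), and the array itself starts on a 32-byte boundary. The helper names paddedCount, chunkByteOffset and allChunksAligned are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kVectorBytes = 32;                                // AVX vector width
constexpr std::size_t kFloatsPerVector = kVectorBytes / sizeof(float);  // floats per vector

// Round n up to a multiple of (floats per vector * number of threads).
inline std::size_t paddedCount(std::size_t n, std::size_t threads) {
    const std::size_t block = kFloatsPerVector * threads;
    return (n + block - 1) / block * block;
}

// Byte offset, from the start of the array, of the chunk given to thread t,
// assuming equal contiguous chunks of count / threads elements.
inline std::size_t chunkByteOffset(std::size_t count, std::size_t threads, std::size_t t) {
    return (count / threads) * t * sizeof(float);
}

// True if every thread's chunk starts on a 32-byte boundary
// (given that the array itself does).
inline bool allChunksAligned(std::size_t count, std::size_t threads) {
    for (std::size_t t = 0; t < threads; ++t) {
        if (chunkByteOffset(count, threads, t) % kVectorBytes != 0) return false;
    }
    return true;
}
```

For example, paddedCount(1000, 16) rounds 1000 elements up to 1024, after which each of the 16 chunks holds 64 floats (256 bytes) and starts on a 32-byte boundary. How a particular OpenMP runtime actually splits the loop is up to the implementation, so the opt-report remains the way to confirm what the compiler assumed.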

KitturGanesh
Employee

Tim makes a very valid point that the -axAVX option seems to be ignoring the specified alignment and needs to be looked at. I'll try to reproduce and file the issue with our developers accordingly.  Additionally, you'll need to take care of alignment on the array when you're building with OpenMP parallel as well which Tim has already noted in his remarks (thanks Tim).

_Kittur

Pascal1
Beginner

So I changed the pragma

	//#pragma omp parallel
	//#pragma omp for simd 

	#pragma omp parallel for simd aligned(a:ALIGN,b:ALIGN,c:ALIGN)

and now it seems that the memory is reported as being aligned

OpenMP Construct at work.cpp(267,2)
   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at work.cpp(268,2)
<Peeled>
LOOP END

LOOP BEGIN at work.cpp(268,2)
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(269,3) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 9 
   remark #15477: vector loop cost: 1.250 
   remark #15478: estimated potential speedup: 9.060 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(268,2)
<Remainder>
   remark #15389: vectorization support: reference c has unaligned access   [ work.cpp(269,3) ]
   remark #15389: vectorization support: reference a has unaligned access   [ work.cpp(269,3) ]
   remark #15389: vectorization support: reference b has unaligned access   [ work.cpp(269,3) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ work.cpp(269,3) ]
LOOP END

    Report from: Code generation optimizations [cg]

remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (8, 0), and destination (alignment, offset): (8, 0)
remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (8, 0), and destination (alignment, offset): (8, 0)

 

Duration: ~ 0.16

and this is compiling with: CC = icc -O3 -xAVX -openmp -qopt_report4 -std=c++11
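For reference, a self-contained version of the test loop with the aligned clause might look like the sketch below. It is not the exact work.cpp from this thread: it swaps _mm_malloc for the standard C++17 std::aligned_alloc, initializes the inputs so the result can be checked, and frees the buffers. The names isAligned, allocFloats and addArrays are hypothetical, and n is assumed to be positive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kAlign = 32;  // bytes, matches ALIGN in the test code

// True if p sits on a kAlign-byte boundary.
inline bool isAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Allocate n floats on a 32-byte boundary. std::aligned_alloc expects the
// size to be a multiple of the alignment, so round the byte count up.
inline float* allocFloats(std::size_t n) {
    std::size_t bytes = n * sizeof(float);
    bytes = (bytes + kAlign - 1) / kAlign * kAlign;
    return static_cast<float*>(std::aligned_alloc(kAlign, bytes));
}

// Computes c[i] = a[i] + b[i] for i in [0, n) and returns c[n - 1].
inline float addArrays(int n) {
    float* a = allocFloats(n);
    float* b = allocFloats(n);
    float* c = allocFloats(n);
    assert(a && b && c);
    assert(isAligned(a) && isAligned(b) && isAligned(c));

    for (int i = 0; i < n; i++) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }

    // The aligned clause is a promise to the compiler about the base pointers.
    // The pragma is ignored when OpenMP is not enabled at compile time.
    #pragma omp parallel for simd aligned(a, b, c : 32)
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    const float last = c[n - 1];
    std::free(a);
    std::free(b);
    std::free(c);
    return last;
}
```

The runtime check with isAligned only confirms where the buffers start; whether each thread's portion is treated as aligned still depends on the array length and how the loop is divided among threads.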

With CC = icc -openmp -qopt_report4 -std=c++11

OpenMP Construct at work.cpp(267,2)
   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at work.cpp(268,2)
<Peeled>
LOOP END

LOOP BEGIN at work.cpp(268,2)
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(269,3) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11 
   remark #15477: vector loop cost: 2.500 
   remark #15478: estimated potential speedup: 7.780 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at work.cpp(268,2)
<Remainder>
   remark #15388: vectorization support: reference c has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference a has aligned access   [ work.cpp(269,3) ]
   remark #15388: vectorization support: reference b has aligned access   [ work.cpp(269,3) ]
LOOP END

    Report from: Code generation optimizations [cg]

remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (8, 0), and destination (alignment, offset): (8, 0)
remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (8, 0), and destination (alignment, offset): (8, 0)

I have not commented out the pragma. Interestingly, even if I remove -O3 I get aligned access and the same run time. I have not removed the pragma, so maybe icc tries to optimize things on its own. This is the icc I have:
% icc -v
icc version 15.0.0 (gcc version 4.8.0 compatibility)

 

As for -xCORE-AVX2, this is what I get when I run it:
Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.

That's because I have the i7-2600K, which has AVX but not AVX2.

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid

 

 

KitturGanesh
Employee

Thanks for the info and good to note that you get aligned access.  Of course, even if you remove -O3, the default optimization for ICC is still -O2 and hence the compiler does look for optimizations (vectorization is enabled at O2) accordingly. 

_Kittur 

 

KitturGanesh
Employee

Pascal, I've filed the issue (memory alignment with -axAVX) with the developers and will update this post as soon as I have any information, thanks.

_Kittur
