Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vectorization problem: dereference too complex

fullung
Beginner
673 Views
Hello all,

I'm trying to compile the following function:

static void DOUBLE_minus(char** args, const int* dimensions, const int* steps) {
register int i;
int is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
char *i1=args[0], *i2=args[1], *op=args[2];
for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
    *((double *)op)=*((double *)i1) - *((double *)i2);
}
}

This function essentially steps over two arrays A and B, performs some element-wise operation, and stores the result in a third array C. This kind of function is at the core of the NumPy library's broadcasting functionality. MATLAB does something similar, called bsxfun. As a simple example, this kind of function allows you to subtract a vector from a matrix.
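To make the calling convention concrete, here is a small hypothetical driver (this is not how NumPy itself invokes the loop, just an illustration) that subtracts a single value from every element of an array by passing a byte step of 0 for the second operand, which broadcasts it:

#include <stdio.h>

/* assumes DOUBLE_minus from above is defined in the same file */
int main(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b = 0.5;                      /* operand to broadcast */
    double c[4];
    char* args[3] = { (char*)a, (char*)&b, (char*)c };
    int dimensions[1] = { 4 };
    int steps[3] = { sizeof(double), 0, sizeof(double) };  /* byte steps; 0 keeps b fixed */
    DOUBLE_minus(args, dimensions, steps);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);       /* prints 0.5 1.5 2.5 3.5 */
    return 0;
}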

The function as it is written there seems to suffer from flow and anti-dependence problems. It can be simplified a bit by requiring that the steps parameters refer to "items" and not bytes, allowing one to get away from arithmetic on char pointers.

I rewrote it as follows:

#ifdef __RESTRICT
#define RESTRICT restrict
#else
#define RESTRICT
#endif
static void DOUBLE_minus(char** args, const int* dimensions, const int* steps) {
register int i;
int is1=steps[0];
int is2=steps[1];
int os=steps[2];
int n=dimensions[0];
const double* RESTRICT i1=(const double*)args[0];
const double* RESTRICT i2=(const double*)args[1];
double* RESTRICT op=(double*)args[2];
for(i=0; i<n; i++) {
    op[i*os] = i1[i*is1] - i2[i*is2];
}
}

The problem is that on 64-bit Linux with version 10.1.018 of the Intel compiler, I still get this:

remark: loop was not vectorized: dereference too complex.

Is there any way to change this loop so that it can successfully be vectorized?

Cheers,

Albert

P.S. I haven't checked again, but I'm pretty sure that version 9.1 of the Intel compiler successfully vectorized this loop when I tried it about a year ago.
8 Replies
TimP
Honored Contributor III
You leave several questions open:
1. When your example doesn't work as presented, is this really a sample of what you want?
2. How can you expect effective vectorization when each array has a different stride?
3. Do you expect restrict to have an effect when not used in a function prototype? The compiler can see that your local pointers are derived from unrestricted ones (a sketch of a restrict-qualified prototype follows this list). Is your complaint that icc doesn't perform stricter syntax checking, as gcc does, and reject the example you give here?
4. What compiler options do you use? If you use -ansi-alias, do you expect it to work when your code obviously is not "legal" C?
5. Why not write your code so as to remove the integer multiply from the loop?
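For reference, a minimal sketch of what a restrict-qualified prototype looks like (illustrative names, not your original function); compile with -std=c99, or with -restrict on icc:

/* restrict on the parameters themselves promises the compiler that
   the three arrays never overlap, so it is free to vectorize */
void vsub(double* restrict op, const double* restrict i1,
          const double* restrict i2, int n)
{
    int i;
    for (i = 0; i < n; i++)
        op[i] = i1[i] - i2[i];
}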
fullung
Beginner
Hello,

1. This really is a sample of what I want.

2. I was mostly hoping for automatic parallelization. Maybe explicitly using OpenMP is the way to go?

3. I don't know much about restrict. It seems I have some more reading to do.

4. I used the following options: -fast -parallel -vec-report2. I was hoping for a loop that would execute in parallel. I should have mentioned that when this code is compiled for NumPy, it uses GCC's -fno-strict-aliasing option. This option is required to compile any Python C extension. So I would imagine that compiling with -ansi-alias is going to break things.

5. Not quite sure what you mean here. The reason for the integer multiply was my attempt to make the loop iterations independent. However, this still yields a dereferencing operation that cannot be vectorized for some reason.

Thanks!

Cheers,

Albert


TimP
Honored Contributor III
2. You should be able to promote automatic parallelization by lowering par_threshold. As the code is written, the compiler assumes the loop is not long enough to be worth auto-parallelizing. OpenMP would work, if the loop is big enough or the parallelization can be done in an outer loop (an OpenMP sketch follows this list). However, you asked originally about vectorization; you are correct in implying that vectorization should be explored before considering parallelization.
3. I see that gcc has no trouble with the code, when I remember the correct options (-std=c99 enables restrict for both gcc and icc). A colleague confirmed that restrict will have no effect in a context such as the one you show.
4. If the source code is intentionally written in violation of standard C, so that -ansi-alias is not permitted, it may be difficult to optimize. Declaring local pointers to simplify the code, as you have begun to do, may help, but you should avoid introducing additional violations of the C standard. However, there really isn't much possibility of further optimization, unless you add a separate path to handle the case where the strides are 1.
5. If the code were standard C, without impediments to optimization, it would be reasonable to expect a compiler to perform strength reduction, taking advantage of the fact that each pointer increments by a known amount on each loop iteration, so that no multiplication is required inside the loop. When you have such impediments, it is reasonable to write the optimization in yourself, to see if it will help. For example,

for(i=0; i<n; i++)
    op[i*os] = i1[i*is1] - i2[i*is2];

should come out as

for(i=0, i1_inc=0, i2_inc=0, op_inc=0; i<n;
    i++, i1_inc+=is1, i2_inc+=is2, op_inc+=os)
    op[op_inc] = i1[i1_inc] - i2[i2_inc];

As it stands, icc is more reluctant than other compilers to perform strength reduction, and hardware trends are such as to make it less frequently necessary.
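To illustrate point 2, here is a minimal OpenMP sketch of the rewritten item-stride loop (DOUBLE_minus_omp is a hypothetical name; it assumes the three arrays don't overlap, os is nonzero, and n is large enough to pay for the threading overhead). Compile with -openmp on icc or -fopenmp on gcc:

static void DOUBLE_minus_omp(char** args, const int* dimensions, const int* steps)
{
    int is1 = steps[0], is2 = steps[1], os = steps[2], n = dimensions[0];
    const double* i1 = (const double*)args[0];
    const double* i2 = (const double*)args[1];
    double* op = (double*)args[2];
    int i;
    /* each iteration writes a distinct element of op, so the iterations
       are independent and can be divided among threads */
#pragma omp parallel for
    for (i = 0; i < n; i++)
        op[i*os] = i1[i*is1] - i2[i*is2];
}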
John_O_Intel
Employee

I'd suggest writing your loop in a more transparent way that will be easier for the compiler to analyze. I didn't completely follow what you're doing, but the following example will vectorize:

void DOUBLE_minus(double* a, double *b, double *c, const int N) {
int i;
#pragma parallel always
for(i=0; i<N; i++) {
    c[i] = b[i] - a[i];
}
}

Adding #pragma parallel always tells the compiler you think it's worthwhile to auto-parallelize this loop. You should always determine whether this improves performance or not; it's possible (probable?) the compiler is correct and there is not enough work to make the overhead of threading the loop worthwhile.

#pragma parallel always
for(i=0; i<N; i++)

Hope this helps,
john

fullung
Beginner
Hello,

Your example is unfortunately too simple.

The problem is that I see no way of simplifying the more general loop I posted in my original message so that the compiler can successfully parallelize it.

Regards,

Albert
John_O_Intel
Employee

Sorry that example wasn't of use directly, but fundamentally I think you want to subtract 2 arrays & store in the 3rd array, so if you could re-organize your data structures it would enable vectorization & possibly improve cache usage.

As Tim pointed out above, having different strides across the different arrays causes many problems. Could you explain how your loop logic compares to my simple example, i.e. how do the op, i1, & i2 pointers increment for each iteration of the loop? We recommend using array notation instead of pointer notation to enable vectorization (shown in my example), as the compiler is better at analyzing the array notation. You want to write code that is as transparent as possible to get it to vectorize - it would be better to avoid declaring i1, i2 & op as char*, then casting to double*.

Can you send the commands to compile & the output including the compiler build #, as I don't see the vectorization diagnostics you mention.

best regards,

_|ohnO

jimdempseyatthecove
Honored Contributor III

Albert,

If your DOUBLE_minus is heavily used, perhaps something like the following will be worth the effort of the extra code:

if(os==1)
{
    if(is1==1)
    {
        if(is2==1)
        {
            for(i=0; i<n; i++)
                op[i] = i1[i] - i2[i];
        }
        else
        {
            for(i=0; i<n; i++)
                op[i] = i1[i] - i2[i*is2];
        }
    }
    else
    {
        if(is2==1)
        {
            for(i=0; i<n; i++)
                op[i] = i1[i*is1] - i2[i];
        }
        else
        {
            for(i=0; i<n; i++)
                op[i] = i1[i*is1] - i2[i*is2];
        }
    }
}
else
{
    if(is1==1)
    {
        if(is2==1)
        {
            for(i=0; i<n; i++)
                op[i*os] = i1[i] - i2[i];
        }
        else
        {
            for(i=0; i<n; i++)
                op[i*os] = i1[i] - i2[i*is2];
        }
    }
    else
    {
        if(is2==1)
        {
            for(i=0; i<n; i++)
                op[i*os] = i1[i*is1] - i2[i];
        }
        else
        {
            for(i=0; i<n; i++)
                op[i*os] = i1[i*is1] - i2[i*is2];
        }
    }
}

Jim Dempsey

fullung
Beginner
Good idea, thanks!