Solved: Code optimization fails

Jan_D_ · ‎03-18-2017

Hi everyone,

I recently ran into an issue I don't understand. I wrote an iterative flow solver that carries out some calculations in each iteration, then calculates a residual and starts over if the residual is still too large. I moved the calculation part into a function that is called in each iteration. It looks like this:

for (int j = 1; j < (ny - 1); j++)
	{
		for (int i = boundLeft; i < (bruttoLength-boundRight); i++)
		{
			i0 = i + j * bruttoLength;

			utilde[i0] = 0;

			// differences in x-direction

			if (i == 1) {
				utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 3];

				utilde[i0] = utilde[i0] - (v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * u[i0 + 2];

				utilde[i0] = utilde[i0] - (-3 * v / (2 * a*dx) + 1 / (2 * pow(dx, 2))) * u[i0 + 1];

				utilde[i0] = utilde[i0] - (v / (4 * a*dx) + 11 / (12 * pow(dx, 2))) * utilde[i0 - 1];

				own_share = own_share + (5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
			}
			else
			{
				if (i == (bruttoLength - 2)) {

					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 3];

					utilde[i0] = utilde[i0] - (-v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * utilde[i0 - 2];

					utilde[i0] = utilde[i0] - (3 * v / (2 * a * dx) + 1 / (2 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (4 * a * dx) + 11 / (12 * pow(dx, 2))) * u[i0 + 1];

					own_share = own_share + (-5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
				}
				else
				{
					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 2];

					utilde[i0] = utilde[i0] - (-2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * u[i0 + 1];

					utilde[i0] = utilde[i0] - (2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 2];

					own_share = own_share + (-5 / (2 * pow(dx, 2)));
				}
			}

			// repeat equivalent code for y-direction. It has the same structure as above

			// SOR-Share
			utilde[i0] = utilde[i0] / own_share * w + (1 - w) * u[i0];
			own_share = 0;
		}
	}

As long as the code is executed as a function call (arguments are 3 integers and 2 double pointers) the performance is really bad. As soon as I copy the code directly into my loop there is a massive speed up.

I tried both versions with and without compiler optimization enabled (/O2) and measured the average execution time of the code snippet above. It looks like the version with the function call gets only minor optimization: with /O2 its execution time improved by about 3x, compared to about 12x for the version without the function call.

I'm not sure if this is the root of the problem though. Can anybody give me some advice? Of course I could leave the whole calculation part inside my while-loop, but that looks very confusing. It would be much clearer to move it into a separate function.

I'm using the compiler that comes with Intel Parallel Studio XE 2017.

Best regards.

Yuan_C_Intel · ‎03-22-2017

Hi, Jan

If you pass pointers as function parameters, the compiler has to assume that they may alias each other. That assumed data dependence prevents some optimizations from being applied. You can mark those pointers with the restrict keyword to eliminate the dependence, e.g.:

void fun(double *restrict a, double *restrict b)

If you are compiling a C file, please specify /Qstd=c99 or -std=c99 so that the restrict keyword is recognized. Alternatively, you can use the option /Qrestrict or -restrict.
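To illustrate the idea, here is a minimal sketch in C99. The function relax_restrict and its arguments are hypothetical and much simpler than the solver in the question; it only shows where the restrict qualifiers go. Note that restrict is a promise from the programmer: it is only valid if the two arrays really do not overlap.

```c
#include <stddef.h>

/* Hypothetical, simplified kernel (NOT the solver from the question):
 * one weighted, SOR-style update per interior element.
 * The restrict qualifiers promise the compiler that out and in never
 * overlap, so it does not have to assume a dependence between the
 * stores to out[] and the loads from in[]. */
void relax_restrict(size_t n, double *restrict out,
                    const double *restrict in, double w)
{
    for (size_t i = 1; i + 1 < n; i++) {
        double avg = 0.5 * (in[i - 1] + in[i + 1]);  /* neighbour average */
        out[i] = w * avg + (1.0 - w) * in[i];        /* relaxation step   */
    }
}
```

Calling it with two separate arrays, e.g. relax_restrict(n, out, in, w), is fine. Passing the same array for both arguments would break the promise and give undefined behaviour.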

You can also try the option /Qopt-report:5 to generate an optimization report and examine which optimizations were applied to your code.

Hope this helps.

Thanks.

SergeyKostrov · ‎03-22-2017

>>...As long as the code is executed as a function call (arguments are 3 integers and 2 double pointers) the performance is really bad. As soon as I copy the code directly into my loop there is a massive speed up.

Your for loop is a complex one; however, it looks like it was vectorized in the second case. Take a look at the optimization reports to understand what went wrong in the first case.

Jan_D_ · ‎03-29-2017

Thank you very much guys! You helped me a lot!

Code optimization fails