Simkin__Dan
Beginner
502 Views

Optimizing Performance on Xeon Servers


I am trying to use the "Intel Processor Microarchitecture-Specific Optimization" setting that's available in Visual Studio when using the Intel C++ compiler (in properties under C/C++ >> Code Generation [Intel C++]). This is the same as the command-line /tune option. Right now I am using /tune:skylake to try to improve performance on a Xeon Server (a Xeon Platinum 8168, running a VM), but I am not seeing a performance improvement. What /tune option should I be using for this server?

GouthamK_Intel
Moderator
464 Views

Hi Dan,

Thanks for reaching out to us.

Could you please provide more details about the workload you are working on, along with details of your environment such as the Visual Studio version and OS version?

Please also provide the commands or steps you are following so that we can investigate further.

 

Regards

Goutham

Simkin__Dan
Beginner
464 Views

Hello Goutham,

I am using Visual Studio 2017. I have been testing on two machines: my local machine, which is running Windows 10 Professional, and a Xeon server, which is running Windows Server 2016 Datacenter. The server with the Xeon Platinum 8168 processor is the machine where performance is important.

The "workload" is an application that does numerical integration, using a linear finite element model, where the results for each cell depend on the state of its neighbors. The application is multi-threaded, and breaks the workload up into blocks of neighboring cells. The application alternates between doing numerical integration for even cell partitions and odd cell partitions.

In Visual Studio project properties, I have set two properties:

C/C++ >> Code Generation [Intel C++] >> Intel(R) processor or microarchitecture code name Skylake (/tune:skylake)

C/C++ >> Code Generation >> Intel(R) Advanced Vector Extensions 2 (/arch:CORE-AVX2)

I actually think I see a small performance improvement now, but it is hard to be sure because the performance varies from run to run.

Is "skylake" the correct microarchitecture to target?

Dan Simkin

 

jimdempseyatthecove
Black Belt
464 Views

What is your local machine's processor?

Is your code written to take advantage of vectorization?

If your code is written to take advantage of vectorization and your local machine is without AVX512, are you willing to build two different configurations? (one for local machine using AVX2 and one for server using AVX512).

Jim Dempsey

Simkin__Dan
Beginner
464 Views

Local machine is Core i7-6700K, but I do not especially care about performance here.
Code was not written specifically to benefit from vectorization, but showed modest improvement in a previous test using SSE2 and VC++.
I will try AVX512. Thanks for the suggestion.

Dan Simkin

jimdempseyatthecove
Black Belt
464 Views

If your code is mostly scalar (IOW the compiler cannot generate a vectorized version of your scalar-looking loops), then AVX512 will be of limited use. The cases that vectorize readily are simple array operations:

Array = Scalar
Array = Array * Scalar
Array = Array + OtherArray
...
and simple equivalent DO loops

That said, if your data structures are of OOP design, consider all the money you spent on that Xeon Platinum 8168 as only partially utilized. AVX512 vectorized code can process 8 doubles or 16 floats per instruction, as opposed to 1 of either per scalar instruction. This is not to say you will see an 8x or 16x performance boost, as memory bandwidth and cache utilization come into play, but it is not unusual to observe a significant multiplier.
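As a minimal sketch of the simple loop forms listed above, the following C++ functions have unit-stride access and no dependence between iterations, which is the shape a vectorizing compiler usually handles well. The function names scale and add are made up for illustration, and whether a given compiler actually vectorizes them should be confirmed in its optimization report.

```cpp
#include <cstddef>
#include <vector>

// Array = Array * Scalar: unit-stride access, iterations are independent.
void scale(std::vector<float>& a, float s)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= s;
}

// Array = Array + OtherArray: unit-stride access on both operands.
void add(std::vector<float>& a, const std::vector<float>& b)
{
    // Only the common prefix is updated if the sizes differ.
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```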

Jim Dempsey

Simkin__Dan
Beginner
464 Views

The DLL I am working on is written in C++, with lots of classes, but the data structures I am using were mostly allocated by calling malloc.
Here is an example of the numerical integration code:
        for ( j = 0; j < n; j++ )
        {
          *xFast[j] = ryn + ( hf6 ) * ( rk1 + ( *xFastDot[j] ) );
        }
(I am not the original author of this code.)

jimdempseyatthecove
Black Belt
464 Views

The *xFast and *xFastDot will drag down vectorization.

You have arrays of pointers. While it is true that the Xeon Platinum 8168 is capable of vectorizing the loop using gather and scatter, these operations (can) consume a large portion of memory bandwidth. If performance is truly an imperative I would look at the data placement to see if the loop can be performed more efficiently.

Often the interacting data points are, or can be made, contiguous, and the indirect array referencing above is merely a means of simplifying the coding of the boundary conditions (wrap-around of data interactions). In these circumstances it is better to handle the first and last elements separately from the interior.
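To illustrate handling the first and last elements separately, here is a small sketch. It assumes a hypothetical 1D update in which each output is the sum of its two neighbours with wrap-around; the real integration kernel in this thread is different, so treat neighbourSum as a stand-in only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical update with wrap-around neighbours: out[i] = in[i-1] + in[i+1].
void neighbourSum(const std::vector<float>& in, std::vector<float>& out)
{
    const std::size_t n = in.size();
    out.resize(n);
    if (n < 3) {                             // no interior: use the modulo form directly
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[(i + n - 1) % n] + in[(i + 1) % n];
        return;
    }
    out[0] = in[n - 1] + in[1];              // first element wraps to the end
    for (std::size_t i = 1; i + 1 < n; ++i)  // interior: contiguous, no indirection
        out[i] = in[i - 1] + in[i + 1];
    out[n - 1] = in[n - 2] + in[0];          // last element wraps to the start
}
```

The interior loop has no pointer table and no modulo, so it is the part a compiler is most likely to vectorize with plain loads and stores.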

Additional note: the SIMD gather and scatter instructions (on those CPUs that support them) take their indexes from the lanes of a SIMD vector, and each index offsets a base (array) address supplied to the instruction. The indexes are signed and are scaled by the operand size in bytes. They typically have the same number of bits as the operand (32 for float, 64 for double), although the instruction set also provides mixed-width forms. With float operands and 32-bit indexes on x64, you therefore cannot use an array base address of 0x0000000000000000 and place the pointers themselves into your indexing array (e.g. xFast), because a 64-bit pointer does not fit in a 32-bit index.

The preferred route is to place indexes into the arrays, as opposed to pointers. For example, should xFast be an array of pointers into an array named Fast and the array xFastDot be an array of pointers into an array named FastDot, then change these arrays from float* or double* to int32 or int64 respectively. Then your loop becomes:

        for ( j = 0; j < n; j++ )
        {
          Fast[xFast[j]] = ryn + ( hf6 ) * ( rk1 + ( FastDot[xFastDot[j]] ) );
        }
This way, if the target is capable of using scatter/gather, the compiler may be able to vectorize this code.
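A self-contained version of the indexed loop might look like the sketch below. The function name integrateIndexed and the parameter types are assumptions, since the thread does not show the declarations of ryn, hf6 or rk1; they are treated here as plain floats.

```cpp
#include <cstddef>
#include <cstdint>

// Index form of the update: xFast[j] and xFastDot[j] are 32-bit offsets
// into the float arrays Fast and FastDot rather than float* values.
void integrateIndexed(float* Fast, const float* FastDot,
                      const std::int32_t* xFast, const std::int32_t* xFastDot,
                      std::size_t n, float ryn, float hf6, float rk1)
{
    for (std::size_t j = 0; j < n; ++j)
        Fast[xFast[j]] = ryn + hf6 * (rk1 + FastDot[xFastDot[j]]);
}
```

The read of FastDot maps onto a gather and the write to Fast onto a scatter. A compiler may still decline to vectorize the store if it cannot establish that the entries of xFast are distinct, so the vectorization report is worth checking.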

Jim Dempsey

Simkin__Dan
Beginner
464 Views

The values referenced by the pointers in xFast and xFastDot are float, but unfortunately they are not in an array. They are member variables in individual class instances. Basically, each cell is represented by two class instances, and the member variables for each cell are referenced by neighboring entries in these arrays. I did not write this code, so I cannot be sure of the author's motivation, but I think this organizational scheme was intended to ensure that numerical integration for different kinds of physical quantities was handled by the same method, in a uniform way. 

jimdempseyatthecove
Black Belt
464 Views

The indexes in the gather/scatter instruction would then be signed 32-bit values, one per lane. If you have an array of these individual class instances, you can construct a float* to the member variable of interest in the first instance and use that as the base of the "array". Then, for each of the other class instances, construct a float* to the same member variable and take the difference between that pointer and the base. This yields the integer index of the other instance's member variable, as if it were an element of an array of floats. You could do this with arbitrary new'd class objects, provided that the resulting pointer-difference "index" does not exceed 2G when the base is the lowest-addressed object, or approximately 4G in total span when the base is a midpoint:

float* lowest = getLowestAddressOfFastObjectFloatOfInterest();
float* highest = getHighestAddressOfFastObjectFloatOfInterest();
int64_t range = highest - lowest; // span counted in floats, not bytes
// With a midpoint base the indexes lie in [-range/2, range - range/2],
// so this bound keeps every index within a signed 32-bit lane
ASSERT(range <= 2ll * INT32_MAX);
float* pseudoArray = &(lowest[range / 2]);
// xFast can be filled with something like (for the j-th object)
xFast[j] = (int32_t)(&ObjectA.FloatMember - pseudoArray);

Do the same for xFastDot
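The index construction could be sketched as follows. Cell, floatIndex and buildFastIndexes are hypothetical names, not from the code base under discussion. The sketch computes the offset through integer addresses, because subtracting pointers into unrelated objects is not defined behaviour in standard C++; it works on the flat address space of x64 but should be treated as an assumption about the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell { float fast; float fastDot; }; // hypothetical stand-in for the real classes

// Offset from base to p, counted in float-sized units.
std::int32_t floatIndex(const float* base, const float* p)
{
    const std::intptr_t bytes = reinterpret_cast<std::intptr_t>(p)
                              - reinterpret_cast<std::intptr_t>(base);
    // Both addresses must share the same alignment phase.
    assert(bytes % static_cast<std::intptr_t>(sizeof(float)) == 0);
    const std::intptr_t idx = bytes / static_cast<std::intptr_t>(sizeof(float));
    // The index must fit a signed 32-bit gather/scatter lane.
    assert(idx >= INT32_MIN && idx <= INT32_MAX);
    return static_cast<std::int32_t>(idx);
}

// One index per cell for the member `fast`, relative to base.
std::vector<std::int32_t> buildFastIndexes(const std::vector<Cell*>& cells,
                                           const float* base)
{
    std::vector<std::int32_t> idx;
    idx.reserve(cells.size());
    for (const Cell* c : cells)
        idx.push_back(floatIndex(base, &c->fast));
    return idx;
}
```

Allocating the cells from one contiguous block keeps the indexes small and makes the range check trivial to satisfy.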

Jim Dempsey

Simkin__Dan
Beginner
464 Views

Clever solution. I will try this if I run out of other options. Some of the other developers I work with are not especially mathematical, and they already have concerns about the readability of the codebase I am working on.

Just to be clear, the /tune:skylake compiler option still has value, right?

jimdempseyatthecove
Black Belt
477 Views

You could use Skylake and AVX2 on your desktop (and that build will also run on the Xeon Platinum 8168 processor), but the Xeon Platinum 8168 processor could benefit from /QxCORE-AVX512.

While both are Skylake series, your desktop Core i7-6700K has 2 memory channels, whereas the Xeon Platinum 8168 has 6 (as well as AVX512).

At this point in development (optimization), I would recommend first working on your desktop, targeting AVX2, and implementing the indexing method as suggested in post #10. However, keep the older pointer code in a conditional #if section for use in testing.

You will need to check the benefit or lack thereof for each ISA (AVX2 on desktop and server, and AVX512 on server).

The selection for use of the pointer method or index method should be relatively easy to integrate using #if defined(...) #else ... #endif

If your loops are compatible with scatter/gather, then more of the code can be vectorized without going through code contortions to get there, presumably yielding significant performance improvements.
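One way to keep both methods selectable at compile time is sketched below. USE_INDEX_METHOD, FastRef and deref are hypothetical names chosen for illustration; the thread only specifies that #if defined(...) be used.

```cpp
#include <cstddef>
#include <cstdint>

// Define USE_INDEX_METHOD (e.g. /DUSE_INDEX_METHOD) to select the index form;
// leave it undefined to keep the original pointer form.
#if defined(USE_INDEX_METHOD)
using FastRef = std::int32_t;                                   // offset into a float array
inline float& deref(float* base, FastRef r) { return base[r]; }
#else
using FastRef = float*;                                         // original: one pointer per cell
inline float& deref(float* /*base*/, FastRef r) { return *r; }
#endif

// The loop body is written once and works with either representation.
void integrate(float* Fast, float* FastDot,
               const FastRef* xFast, const FastRef* xFastDot,
               std::size_t n, float ryn, float hf6, float rk1)
{
    for (std::size_t j = 0; j < n; ++j)
        deref(Fast, xFast[j]) = ryn + hf6 * (rk1 + deref(FastDot, xFastDot[j]));
}
```

Building the same source twice, once with and once without the macro, gives two binaries whose timings can be compared for each ISA.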

Jim Dempsey

 


jimdempseyatthecove
Black Belt
464 Views

I think I should mention something I did not state in post #10 (index method). This is an assumed standard method that should be obvious but might not be to an inexperienced programmer.

The float member variables to be indexed, one per object, should all be aligned on a natural boundary. For float, that means an address that is a multiple of 4. This may require you to pad the class/struct and to ensure that the member's offset from the base of the class/struct is also naturally aligned. If the members are not naturally aligned (not recommended), then all such indexed pointers must resolve to the same offset from the natural alignment. IOW the resolved address is:

  baseAddress + SignExtended(index*4) // for float

Note that even when the offset from the base of the class/struct to the indexed float is a multiple of 4, unless you take the necessary programming steps you are not assured that each object in an array of this class/struct is aligned at a multiple of 4 (this is for float), nor are you assured of it if the class is derived, e.g. class foo : classWithYourFloatNeedingAlignment { ... }.

Even when using pointers, it is not a good practice to de-reference a pointer to an unaligned float.
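The alignment requirements can be checked mechanically. The sketch below assumes a hypothetical struct CellState with the indexed member named fast; the static_asserts fail at compile time if the layout ever stops being naturally aligned, and isFloatAligned is a run-time check for individual pointers.

```cpp
#include <cstddef>
#include <cstdint>

struct CellState {        // hypothetical layout
    std::int32_t id;
    float        fast;    // the member that will be indexed
};

// The member's offset within the struct must be a multiple of 4.
static_assert(offsetof(CellState, fast) % alignof(float) == 0,
              "fast is not naturally aligned within CellState");
// A struct size that is a multiple of 4 keeps the member aligned
// in every element of an array of CellState.
static_assert(sizeof(CellState) % alignof(float) == 0,
              "array elements would break float alignment");

// Run-time check for an arbitrary pointer.
inline bool isFloatAligned(const float* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(float) == 0;
}
```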

Jim Dempsey

Simkin__Dan
Beginner
464 Views

The classes are not derived or virtual. Also, I added a custom operator new for each one that makes the memory for each class instance 64 byte aligned, in an effort to keep them on separate cache lines. The class instances are kept in an array, but they are still allocated individually from heap. Since I have my own new operator, I could easily change this to allocate them contiguously from one large block of heap. I appreciate all of your help, but I probably won't try this anytime soon.

jimdempseyatthecove
Black Belt
464 Views

>>The class instances are kept in an array, but they are still allocated individually from heap.

That is a contradiction of terms. You either:

a) Allocate (new) them individually from the heap
b) Allocate (new[]) an array of them once from the heap.

While an individual object can be aligned when it is individually instantiated as a static, on the stack, or via new (assuming an aligned new), an array of said objects will be packed:

&object[0] is aligned as specified
&object[1] is at &object[0] + sizeof(object[0])
&object[2] is at &object[1] + sizeof(object[0])
...
Note, I cannot say for certain that newer C++ standards will place each element of an array at the specified boundary, as this may require sizeof(object) to be padded, if necessary, to a multiple of the alignment. Older standards of C/C++ did not. IOW it would be prudent to pad the object, possibly through use of a derived type that contains the pad (this eliminates an extraneous "paddedContainer." qualifier). An explicit pad will do the same, but this may not be suitable for all cases.
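For a type declared with the C++11 alignas specifier, sizeof is rounded up to a multiple of the alignment, so array elements do each land on the boundary. The sketch below uses a hypothetical PaddedCell type to show how this can be verified. It does not cover the case where alignment comes only from a custom operator new, because that leaves the type's own alignment and size unchanged.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical cell type, explicitly aligned to a 64-byte cache line.
struct alignas(64) PaddedCell {
    float fast;
    float fastDot;
};

// With alignas, the compiler pads the type so that consecutive
// array elements each start on a 64-byte boundary.
static_assert(alignof(PaddedCell) == 64, "unexpected alignment");
static_assert(sizeof(PaddedCell) % 64 == 0, "size is not a multiple of the alignment");

// Run-time confirmation that every element of an array is aligned.
inline bool elementsAligned(const PaddedCell* a, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (reinterpret_cast<std::uintptr_t>(&a[i]) % 64 != 0)
            return false;
    return true;
}
```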

Jim Dempsey

GouthamK_Intel
Moderator
464 Views

Simkin, Dan wrote:

Clever solution. I will try this if I run out of other options. Some of the other developers I work with are not especially mathematical, and they already have concerns about the readability of the codebase I am working on.

Just to be clear, the /tune:skylake compiler option still has value, right?

Hi Dan,

/tune is a supported option for the compiler. Please see the link below for more information about the /tune options.

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-mtune-tune

Please let us know if your issue is resolved. 

 

Thanks

Goutham

Simkin__Dan
Beginner
464 Views

Jim,

Sorry I was not clear, I am using new, not new[]. The array I mentioned is an array of pointers to class instances, not an array of the class instances themselves.

Goutham,

Is there any advantage to using both /tune:skylake-avx512 and /QxCORE-AVX512, or is this equivalent to using /tune:skylake and /QxCORE-AVX512?

Dan Simkin

jimdempseyatthecove
Black Belt
464 Views

I do not know. My AVX512 system is a Knights Landing (KNL 5210) not skylake.

/tune:skylake-avx512 and /QxCORE-AVX512 may be the same. You could generate two copies, one with each option, with the additional option to produce assembler source code. Then run WinDiff on the two .asm files. This could confirm any differences.

Jim Dempsey

GouthamK_Intel
Moderator
464 Views

Simkin, Dan wrote:

Jim,

Sorry I was not clear, I am using new, not new[]. The array I mentioned is an array of pointers to class instances, not an array of the class instances themselves.

Goutham,

Is there any advantage to using both /tune:skylake-avx512 and /QxCORE-AVX512, or is this equivalent to using /tune:skylake and /QxCORE-AVX512?

Dan Simkin

 

Hi Dan, 

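/tune only tunes the generated code for the named microarchitecture and does not enable additional instruction sets, so the binary remains backward compatible with older supported processors. /Qx is different: it lets the compiler generate instructions specific to the named target (for example AVX-512 with /QxCORE-AVX512), so the binary is meant to run only on processors that support them.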

 

Thanks

Goutham

GouthamK_Intel
Moderator
464 Views

Hi Dan,

Please let us know if your issue is resolved.

 

Thanks

Goutham

GouthamK_Intel
Moderator
203 Views

Hi Dan,

Could you please let us know if your issue is resolved?

 

Thanks

Goutham
