Beginner

Memory allocation overhead on MIC in offload mode

I'm observing a strange behavior and would like to know if it is Intel Xeon Phi related or not.

I observed that the memory transfer times from the host to the MIC are extremely high. I have tracked the overhead of the memory transfers down to the memory allocation time on the accelerator, which scales linearly with the amount of requested memory (huge pages are used). The memory allocations take about twice the expected time for sizes from 2 MB to 48 MB; beyond that point the allocation overhead goes up to almost 7x.

I'm observing this behavior with LEO, OpenMP 4.0, and OpenCL. Is this a configuration problem with my MICs, or is it somehow related to the OS running on the MIC? Is the overhead of allocating pages on the MIC really that high?

If this is only my problem, how can I work around it? Are there environment variables to accelerate allocation on the MIC?

 

Thanks in advance! 

New Contributor III

Perhaps you're bitten by the same thing that I am on one of my Phis:

  https://software.intel.com/en-us/comment/1805129

Normally memory offloading should happen at ~ 6 GB/s.

 

New Contributor III

Can you post your code? I've got 5110Ps as well and am curious whether I will see the same thing. Also, what happens if you run the sample code from my thread?

 


Michael,

Your allocation test program, #5, is not a valid test.

The reason is that the code will be running on a system (Linux, or possibly Windows on the host), and these systems employ virtual memory. While the process containing the above application may be given the full virtual address range, the addresses within this range are not associated with anything until the process (a thread in your application) touches the page (write or read) holding the address of interest.

The heap manager is a function called by your application. Putting aside NUMA considerations for the moment, the heap manager typically uses an organization of nodes (either linked together or organized into buckets by size). Prior to the first allocation, the heap typically has one large node. The point to observe is that although the node header references a humongous block of RAM, only the page(s) in which the header resides have been mapped to physical RAM and/or backed by a page-file page. This is the situation before you touch the memory. The first touch of an unmapped address causes a page fault, which invisibly traps to the OS; if the address is valid for your process, the OS assigns a page or group of pages to RAM and/or the page file, and optionally wipes the page to zeros.

In your above test of allocating vectors, the only "first touches" happening in the loop are: a) the heap manager managing the node header, and b) the vector's ctor initializing its member variables. The entire contents of the vector are never touched.

Note that, depending on the heap manager, it may do one of several generalized things when the memory is returned (by the vector's dtor when it goes out of scope):

a) place the node into a pool of similar sized allocations (fast for re-allocation of similar sized nodes)
b) place the node into a queue of to be garbage collected nodes (protects against buggy code that re-uses buffers after deletes)
c) place the node directly into the heap (front or back end)
d) place the node in the proper place of the heap (simple heap) without stitching adjacent nodes together
e) place the node in the proper place of the heap (simple heap) with stitching adjacent nodes together

If the heap manager you link with uses e), then your program will run very fast, because the heap node header and vector member data will keep allocating the memory at the same location, first-touching only the 1 or possibly 2 pages on the first iteration.

For all the other types, a)-d) and others not listed, your code above would likely perform a "first touch" of only the heap node header and vector member data, and not of any of the allocated memory for the vector, apart from the node header of the allocated node and the node header following it in memory when an allocation causes a node to split.

Also, regardless of not touching the data portion of the vector, a better program for evaluating heap managers similar to a)-d) would be:

#include <iostream>
#include <cstdlib>
#include <string>
#include <stdio.h>
#include <vector>
#include <sys/time.h>

double elapsedTime(void) {
  struct timeval t;
  gettimeofday(&t, 0);
  return ((double)t.tv_sec + ((double)t.tv_usec / 1000000.0));
}

int main(int argc, char *argv[]) {
  // Two runs: the second run reuses whatever the heap manager retained.
  for (int run = 1; run < 3; ++run) {
    for (int i = 1; i <= 1024; i++) {
      double start = elapsedTime();
      std::vector<float> test(256 * 1024 * i);
      double end = elapsedTime();
      double duration = end - start;
      std::cout << "Run " << run << ": " << duration << " Seconds" << std::endl;
    }
  }
  exit(0);
}

Jim Dempsey

Employee

Jim,

>>While the process containing the above application may provide for the full Virtual Address range, the addresses within this range are not associated with anything until the process (a thread in your application) touches the page (write or read) holding the address of interest.

In Linux, reading is not sufficient. The kernel has a single "zero page" which is mapped with the copy-on-write property behind an initial allocation. An actual physical page is only allocated when the page is written to. Reading does not cause allocation of a physical page; it simply returns a zero value from the zero page. Therefore, to force the allocation of pages you must write to them (writing a zero is fine, of course).

 


So your vector ctor is designed to wipe on allocation. While this might be expected in a debug build, it is not expected in a release build (unless, of course, you have a #define to require wiping). Your code does not indicate whether your allocator performs aligned allocations.

Note, in the wipe case, the length of time for first touch wipe will depend on two factors: page size, and O/S overhead to perform the mapping.

The second iteration's time should eliminate the O/S time; however, you state equal times for the allocations, which would indicate zero overhead time??

This leads me to believe that vector stores were not in fact used. Were you able to run VTune to see where the time was spent?

Do not rely on the vector report to tell you if the vector generated code was actually used or not.

Another potential issue, not discernible from the code listed above, is if the iterator resolved to an int64 type or uint64 type (or other). There have been some reports that uint64 or uint32 types for loop control variables on Phi have performance issues.

See if std::vector<float> test(256 * 1024 * (__int64)i); does anything for you.

Jim Dempsey

Beginner

>>I parallelized my initialization loop with OpenMP and get an improvement...

You would likely need to add a test for whether you are currently in a parallel region, or whether the size is too small, and in those cases use one thread for the initialization.

The other route is, other than for a DEBUG build, to NOT initialize the array, at least not in the ctor. You could add member functions initialize_to(x) and parallel_initialize_to(x).

Jim Dempsey
