How long does it take to allocate memory on the MIC?

Mingqing_W_
Beginner

I wrote some code to test how fast memory can be allocated on the MIC. I find that allocating 4 GB of memory on the MIC card
takes almost 14 seconds. Is this a normal speed?

The test program is as follows:

  1 #include<stdio.h>
  2 #include<stdlib.h>
  3 void main(){
  4         int size=1024*1024*1024;
  5         __attribute__((target(mic:0)))float *a;
  6         a=(float*)malloc(size*sizeof(float));
  7 #pragma offload target(mic:0)nocopy(a:length(size) alloc_if(1) free_if(0))
  8  {}
  9 #pragma offload target(mic:0)nocopy(a:length(size) alloc_if(0) free_if(1))
 10  {}
 11 free(a);
 12 }

Then I got the OFFLOAD report:

[Offload] [MIC 0] [File]            test.c
[Offload] [MIC 0] [Line]            7
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        13.984358(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        0.000158(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   8 (bytes)

[Offload] [MIC 0] [File]            test.c
[Offload] [MIC 0] [Line]            9
[Offload] [MIC 0] [Tag]             Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        0.003295(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   16 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        0.000047(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   0 (bytes)


Is this test valid? Does allocating 4 GB of memory really take 14 seconds?

jimdempseyatthecove
Honored Contributor III

Be aware that under most operating systems a process is given a virtual address range (a box) to use when it starts. This address range is divided into pages, and those pages are not initially assigned (mapped) to physical RAM and/or the page file (on systems with page files). The first time a page is referenced, it therefore takes a page fault, which traps to the O/S. The O/S checks that the virtual address is valid and, if it is, performs the mapping. Depending on the O/S, the first-time mapping may also wipe the page to zero.

Try repeating the allocation a few times, discarding the first timing and averaging the remainder:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int size = 1024*1024*1024;
        __attribute__((target(mic:0))) float *a;
        a = (float*)malloc(size*sizeof(float));
        for (int i = 0; i < 3; ++i) // repeat a few times, discard first, average remainder
        {
#pragma offload target(mic:0) nocopy(a:length(size) alloc_if(1) free_if(0))
                {}
#pragma offload target(mic:0) nocopy(a:length(size) alloc_if(0) free_if(1))
                {}
        }
        free(a);
        return 0;
}

Jim Dempsey
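
The first-touch cost described above can be observed on an ordinary host, without a coprocessor. The following is a minimal sketch, not the offload runtime's own behavior: it writes one byte per page of a fresh malloc buffer twice and times each pass. The function names and the page_size parameter are illustrative assumptions (4096 bytes is typical on x86 Linux).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Write one byte per page so that every page of the buffer is referenced.
   page_size is a caller-supplied assumption, not queried from the O/S. */
static void touch_pages(volatile unsigned char *buf, size_t bytes, size_t page_size)
{
    assert(page_size > 0);  /* a zero stride would never terminate */
    for (size_t off = 0; off < bytes; off += page_size)
        buf[off] = 1;
}

/* CPU seconds used by one pass of touch_pages, measured with clock(). */
static double time_touch(unsigned char *buf, size_t bytes, size_t page_size)
{
    clock_t start = clock();
    touch_pages(buf, bytes, page_size);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Allocate a buffer, touch it twice, and report both timings.
   Returns 0 on success, -1 if malloc fails. */
static int first_touch_demo(size_t bytes, size_t page_size,
                            double *first, double *second)
{
    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;
    *first  = time_touch(buf, bytes, page_size);  /* pages may be mapped here */
    *second = time_touch(buf, bytes, page_size);  /* pages are already mapped */
    free(buf);
    return 0;
}
```

Calling first_touch_demo from main with a large size (for example 1 GB) and printing both values would typically show the first pass costing more than the second, though the exact ratio depends on the O/S and the allocator.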

Andrey_Vladimirov
New Contributor III

The first offload is always slow because the driver has to initialize. You can make the application perform all initialization at launch, instead of at the first offload, by setting the environment variable OFFLOAD_INIT=on_start (see https://software.intel.com/en-us/node/522775 ).

If you do that, or if you time the second and subsequent offloads, you should get allocation speed around 2 GB/s with MPSS 3.4 (that was our experience). The best practice, as you have correctly adopted in your code, is to use memory buffer retention between offloads with alloc_if/free_if.
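
For comparison with the first offload report in this thread (13.984358 s for a 4 GB buffer), the arithmetic can be sketched as follows. This assumes "GB" means 2^30 bytes, which matches the 1024*1024*1024 floats in the test program. The function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

#define GIB (1024.0 * 1024.0 * 1024.0)

/* Effective allocation rate in GiB/s for `bytes` handled in `seconds`. */
static double gib_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / GIB / seconds;
}

/* Time in seconds that `bytes` would take at `rate_gib_s` GiB/s. */
static double seconds_at_rate(double bytes, double rate_gib_s)
{
    assert(rate_gib_s > 0.0);
    return bytes / GIB / rate_gib_s;
}

/* Print the rate implied by the first offload report and the time a
   2 GiB/s allocation would need for the same 4 GiB buffer. */
static void print_comparison(void)
{
    double bytes = 4.0 * GIB;
    printf("observed: %.2f GiB/s\n", gib_per_second(bytes, 13.984358));
    printf("expected at 2 GiB/s: %.1f s\n", seconds_at_rate(bytes, 2.0));
}
```

Under those assumptions the reported time corresponds to roughly 0.29 GiB/s, about seven times slower than the 2 GB/s figure.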

Mingqing_W_
Beginner

Thanks for your response!

I have set the environment variable OFFLOAD_INIT=on_start as you suggested. Then I allocated "2GB+2GB" of memory using two offloads. The offload report is:

[Offload] [MIC 0] [File]            test.c

[Offload] [MIC 0] [Line]            9
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        6.643577(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        0.000152(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   8 (bytes)

[Offload] [MIC 0] [File]            test.c
[Offload] [MIC 0] [Line]            11
[Offload] [MIC 0] [Tag]             Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        6.609005(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        0.000038(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   8 (bytes)

You can see that the second offload also takes about 6.6 seconds, which is much slower than 2 GB/s. I wonder whether something else is causing this result. The MIC card information is:

 System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-358.el6.x86_64
                Driver Version          : 3.4-1
                MPSS Version            : 3.4
                Host Physical Memory    : 132098 MB

 Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.4
                Device Serial Number     : ADKC31601341

Again, thanks for your response.

Ravi_N_Intel
Employee

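Also enable the use of 2 MB pages by setting the environment variable MIC_USE_2MB_BUFFERS=2M. The value is a size threshold: offload buffers larger than it are allocated using 2 MB pages.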
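
To see why page size matters for a buffer of this size, it helps to count the pages that must be mapped. This is a rough sketch assuming a 4 KB default page size on the coprocessor (an assumption, not stated in this thread). The helper name is illustrative.

```c
#include <assert.h>

/* Number of pages needed to back `bytes`, rounding up to a whole page. */
static unsigned long long pages_needed(unsigned long long bytes,
                                       unsigned long long page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With these inputs, a 4 GB buffer needs 1048576 pages of 4 KB but only 2048 pages of 2 MB, so far fewer first-touch mappings are required.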
