I wrote some code to test the speed of allocating memory on the MIC. I find that allocating 4 GB of memory on the MIC card takes almost 14 seconds. Is this a normal speed?
The test program is as follows:
1 #include <stdio.h>
2 #include <stdlib.h>
3 int main(void){
4 int size=1024*1024*1024;
5 __attribute__((target(mic:0))) float *a;
6 a=(float*)malloc(size*sizeof(float));
7 #pragma offload target(mic:0) nocopy(a:length(size) alloc_if(1) free_if(0))
8 {}
9 #pragma offload target(mic:0) nocopy(a:length(size) alloc_if(0) free_if(1))
10 {}
11 free(a);
12 }
Then I got the OFFLOAD report:
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 7
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 13.984358(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000158(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 8 (bytes)
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 9
[Offload] [MIC 0] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 0.003295(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data] 16 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time] 0.000047(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data] 0 (bytes)
I wonder if this test is right? And does allocating 4 GB of memory really take 14 seconds?
Be aware that under most operating systems a process is given a virtual address range at start-up, but that range is divided into pages which are not yet mapped to physical RAM or to the page file (on systems with page files). The first time a page is referenced it therefore takes a page fault, which traps to the O/S; the O/S checks the address and, if it is a valid virtual address, performs the mapping. Depending on the O/S, the first-time mapping may also zero the page.
#include <stdio.h>
#include <stdlib.h>
int main(void){
  int size=1024*1024*1024;
  __attribute__((target(mic:0))) float *a;
  a=(float*)malloc(size*sizeof(float));
  for(int i=0; i < 3; ++i) // repeat a few times, discard first, average remainder
  {
    #pragma offload target(mic:0) nocopy(a:length(size) alloc_if(1) free_if(0))
    {}
    #pragma offload target(mic:0) nocopy(a:length(size) alloc_if(0) free_if(1))
    {}
  }
  free(a);
}
Jim Dempsey
The first offload is always slow because the driver has to initialize. You can make the application perform all initialization at launch, instead of at the first offload, by setting the environment variable OFFLOAD_INIT=on_start (see https://software.intel.com/en-us/node/522775 ).
If you do that, or if you time the second and subsequent offloads, you should get allocation speed around 2 GB/s with MPSS 3.4 (that was our experience). The best practice, as you have correctly adopted in your code, is to use memory buffer retention between offloads with alloc_if/free_if.
Thanks for your response!
I have set the environment variable OFFLOAD_INIT=on_start as you suggested. Then I allocated "2GB+2GB" of memory using two offloads. The OFFLOAD report is:
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 9
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 6.643577(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000152(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 8 (bytes)
[Offload] [MIC 0] [File] test.c
[Offload] [MIC 0] [Line] 11
[Offload] [MIC 0] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 6.609005(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time] 0.000038(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data] 8 (bytes)
You can see that the second offload also takes about 6.6 seconds, which is much slower than 2 GB/s. I wonder whether something else is causing this result. The MIC card information is:
System Info
HOST OS : Linux
OS Version : 2.6.32-358.el6.x86_64
Driver Version : 3.4-1
MPSS Version : 3.4
Host Physical Memory : 132098 MB
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4
Device Serial Number : ADKC31601341
Again, thanks for your response.
Also enable the use of 2 MB pages by setting the environment variable MIC_USE_2MB_BUFFERS=2M.
