I am writing a portion of parallel CPU code to assist GPGPU CUDA code, using the Intel C++ Compiler 16.0 and the Intel IPP multi-threaded static library.
Initial testing of the parallel CPU code with OpenMP, computing two hashing functions and reductions on their results, took about 6 ms per operation for 8 million elements. When I switched the memory for those same operations to cudaMallocHost, the timings changed to 9 ms and 13 ms respectively.
Looking at the vectorization report when using cudaMallocHost, it states "vector dependence" for the loop.
Below is a runnable code sample that shows the issue.
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <string.h>
#include <windows.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <malloc.h>
#include <omp.h>
#include <ipps.h>
#include <ippcore.h>
#include <ippvm.h>

struct timer_ms {
    double PCFreq;
    __int64 CounterStart;
};

void StartCounter_ms(timer_ms &t) {
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        printf("QueryPerformanceFrequency failed!\n");
    t.PCFreq = double(li.QuadPart)/1000.0;
    QueryPerformanceCounter(&li);
    t.CounterStart = li.QuadPart;
}

double GetCounter_ms(timer_ms &t) {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return double(li.QuadPart - t.CounterStart)/t.PCFreq;
}

Ipp8u hash_GPUsActiveList(float3 seg, int x_range, int z_range) {
    if(seg.x < (x_range*0.5f)) {
        if(seg.z < (z_range*0.5f)) {
            return 0; // Top Left
        } else {
            return 3; // Bottom Left
        }
    } else {
        if(seg.z < (z_range*0.5f)) {
            return 1; // Top Right
        } else {
            return 2; // Bottom Right
        }
    }
}

void histoActive(Ipp8u * GPUsActiveList, int count, int *activeHistogram) {
    const int num_buckets = 4;
    #pragma omp parallel
    {
        int i;
        int activeCountsGPU_private[num_buckets] = {0};
        #pragma omp for nowait
        for(i = 0; i < count; i++) {
            activeCountsGPU_private[GPUsActiveList[i]]++;
            //printf("Cnt %d Thread %d ", i, omp_get_thread_num());
        }
        #pragma omp critical
        {
            for(i = 0; i < num_buckets; i++)
                activeHistogram[i] += activeCountsGPU_private[i];
        }
    }
}

void hashAndCountActiveSeg(Ipp8u * GPUActiveList, int count, int *activeCntGPU, float3 * locHost, int xRange, int zRange) {
    #pragma omp parallel for
    for(int h = 0; h < count; h++)
        GPUActiveList[h] = hash_GPUsActiveList(locHost[h], xRange, zRange);
    for(int i = 0; i < 4; i++)
        activeCntGPU[i] = 0;
    histoActive(GPUActiveList, count, activeCntGPU);
}

int main() {
    timer_ms timerTwo;
    timerTwo.CounterStart = 0;
    timerTwo.PCFreq = 0;
    double timerActive = 0.0;

    static const int maxThreads = omp_get_max_threads();
    static const int count = 8000000;
    int bufferSize = 0;
    int x_range = 5000;
    int z_range = 5000;

    size_t locDataSize = sizeof(float3)*count;
    float3 * locations_Host;
    cudaMallocHost((void**)&locations_Host, locDataSize);
    //locations_Host = (float3*)malloc(locDataSize);

    for(int h = 0; h < count; h++) {
        locations_Host[h].x = rand() % x_range;
        locations_Host[h].y = rand() % 2;
        locations_Host[h].z = rand() % z_range;
    }

    Ipp8u * GPUsActiveList = ippsMalloc_8u(sizeof(Ipp8u)*count);

    omp_set_num_threads(maxThreads);
    double loopCount = 30;
    int activeCountsGPU[4];

    for(int t = 0; t < loopCount + 10; t++) {
        StartCounter_ms(timerTwo);
        #pragma omp parallel for
        for(int h = 0; h < count; h++)
            GPUsActiveList[h] = hash_GPUsActiveList(locations_Host[h], x_range, z_range);
        for(int i = 0; i < 4; i++)
            activeCountsGPU[i] = 0;
        histoActive(GPUsActiveList, count, activeCountsGPU);
        timerActive += GetCounter_ms(timerTwo);
        if(t == 9) {            // discard the first 10 warm-up iterations
            timerActive = 0.0;
        }
    }

    printf("Average for GPUsActiveList:\t%5.2f ms\n", (float)timerActive/loopCount);

    //free(locations_Host);
    cudaFreeHost(locations_Host); // pinned memory from cudaMallocHost is released with cudaFreeHost
    ippsFree(GPUsActiveList);
    cudaDeviceReset();
    _getch();
    return 0;
}
Using Malloc:
LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(110,4)
remark #15388: vectorization support: reference GPUsActiveList has aligned access [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.090
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 36
remark #15477: vector loop cost: 25.810
remark #15478: estimated potential speedup: 1.390
remark #15488: --- end vector loop cost summary ---
LOOP END
Using cudaMallocHost:
LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(110,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between locations_Host line 111 and GPUsActiveList line 111
remark #15346: vector dependence: assumed FLOW dependence between GPUsActiveList line 111 and locations_Host line 111
LOOP END
I tried directives such as #pragma ivdep and #pragma vector {aligned | always}, but I was unsuccessful.
Thanks,
Greg
What are your compiler options? And what is (are) the Host system(s)?
Vectors of 4 floats are used with the SSE series of instructions, 8 floats with AVX (256-bit), and 16 floats with AVX-512.
Jim Dempsey
The compiler doesn't know the alignment of the pointer returned by cudaMallocHost. If you believe it is aligned, you could use __assume_aligned to assert the alignment of your array. Granted, if that works, it makes the dependence message look misleading.
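Something along these lines, as a sketch only (the 64-byte figure is an assumption about what cudaMallocHost actually returns, so verify it before relying on it):

// Assert to the Intel compiler that the pinned pointer is 64-byte aligned.
// 64 is an assumption here; check ((uintptr_t)locations_Host % 64) == 0 first.
__assume_aligned(locations_Host, 64);
for(int h = 0; h < count; h++)
    GPUsActiveList[h] = hash_GPUsActiveList(locations_Host[h], x_range, z_range);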
Not pertinent to your question...
Why are you not using:
Ipp8u hash_GPUsActiveList(float3 seg, float x_half_range, float z_half_range){
    if(seg.x < x_half_range){
        if(seg.z < z_half_range){
            return 0; // Top Left
        }else{
            return 3; // Bottom Left
        }
    }else{
        if(seg.z < z_half_range){
            return 1; // Top Right
        }else{
            return 2; // Bottom Right
        }
    }
}
...
float x_half_range = x_range * 0.5f;
float z_half_range = z_range * 0.5f;
#pragma omp parallel for
for(int h = 0; h < count; h++)
    GPUsActiveList[h] = hash_GPUsActiveList(locations_Host[h], x_half_range, z_half_range);
...
Jim Dempsey
A quick experiment shows the following. The behavior is the same even if you use the unpinned equivalent of cudaMallocHost, which is cudaMalloc.
1. using malloc:
LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
remark #15388: vectorization support: reference GPUsActiveList has aligned access [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.090
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 36
remark #15477: vector loop cost: 25.810
remark #15478: estimated potential speedup: 1.390
remark #15488: --- end vector loop cost summary ---
LOOP END
2. using cudaMalloc:
LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
remark #15389: vectorization support: reference GPUsActiveList has unaligned access [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.115
remark #15451: unmasked unaligned unit stride stores: 1
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 39
remark #15477: vector loop cost: 43.500
remark #15478: estimated potential speedup: 0.890
remark #15487: type converts: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
3. using cudaMallocHost:
LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
remark #15389: vectorization support: reference GPUsActiveList has unaligned access [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15328: vectorization support: gather was emulated for the variable locations_Host: strided by 3 [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.115
remark #15451: unmasked unaligned unit stride stores: 1
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 39
remark #15477: vector loop cost: 43.500
remark #15478: estimated potential speedup: 0.890
remark #15487: type converts: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
Thanks and Regards
Anoop
Thanks for the responses,
Tim: I tried adding __assume_aligned but my project did not recognize it. I'm using Windows, if that makes a difference. However, alignment doesn't seem to be the issue at the moment.
Jim: Thanks, that helped a bit too. I was more focused on other issues so I didn't see that.
Anoop: I'm unsure how you tested vectorization with cudaMalloc; cudaMalloc'ed memory isn't accessible from the host.
Looking at the statements in Anoop's comment, the vectorization report only mentions GPUsActiveList being aligned or unaligned, but that allocation is never changed in the code variations being tested.
It is always Ipp8u * GPUsActiveList = ippsMalloc_8u(sizeof(Ipp8u)*count);
So does the data in locations_Host affect the alignment of GPUsActiveList?
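For what it's worth, a quick way to check both would be to print the pointer alignments, something like the sketch below (needs <stdint.h> for uintptr_t):

// 0 means the pointer is 64-byte aligned
printf("locations_Host mod 64 = %u, GPUsActiveList mod 64 = %u\n",
       (unsigned)((uintptr_t)locations_Host % 64),
       (unsigned)((uintptr_t)GPUsActiveList % 64));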
I was able to force the loop to vectorize; however, it did so in a slightly different way than it normally does with malloc'ed memory.
I used:
#pragma omp parallel for
#pragma ivdep
#pragma vector aligned
for(int h = 0; h < count; h++){
    GPUsActiveList[h] = hash_GPUsActiveList(locations_Host[h], x_half_range, z_half_range);
}
Then I also specified the compiler option "/Qvec-threshold0".
However, setting that threshold for the entire project seems less than ideal.
LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(117,4)
remark #15388: vectorization support: reference GPUsActiveList has aligned access [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
remark #15328: vectorization support: gather was emulated for the variable (unknown): strided by 3 [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
remark #15328: vectorization support: gather was emulated for the variable (unknown): strided by 3 [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.099
remark #15300: LOOP WAS VECTORIZED
remark #15451: unmasked unaligned unit stride stores: 1
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 38
remark #15477: vector loop cost: 42.750
remark #15478: estimated potential speedup: 0.880
remark #15487: type converts: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
When vectorization is forced, the loop gets a vector length of 4 instead of 16, and the compiler estimates a performance loss.
With vectorization forced, the performance is a bit less predictable (which could be partly because I'm on a laptop). The run time is still consistently around 6 ms, like the malloc'ed version on my machine, instead of about 9.5 ms for the non-vectorized case.
Is there a safe way to force only one loop to vectorize by setting its threshold in a #pragma?
I'm still reading through all the documentation on vectorization, but I thought I would ask here: is there any way to forcibly control "vectorization support: vector length"?
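For reference, the per-loop form I have in mind looks something like the sketch below (I have not confirmed these are the right directives for this, so treat it only as an illustration):

// Per-loop override instead of the project-wide /Qvec-threshold0 switch.
#pragma ivdep           // assert there is no loop-carried dependence in this loop
#pragma vector always   // ask the compiler to skip its profitability heuristic for this loop only
for(int h = 0; h < count; h++)
    GPUsActiveList[h] = hash_GPUsActiveList(locations_Host[h], x_half_range, z_half_range);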
The choice of shorter vector length is likely on account of allowing for unaligned data. I've got to stop replying if you're not interested. My metered data plan is on me.
Your interest in additional pragmas ought to be an incentive to open up the docs. For Intel-specific pragmas you have the HTML docs index installed with the compiler. For OpenMP SIMD you may try to understand the official docs at openmp.org.
I don't understand your statement "I've got to stop replying if you're not interested. My metered data plan is on me." I've never said I'm not interested; I only said I wasn't able to test the one thing you suggested. I have still been looking into how to get my project to recognize __assume_aligned.
I thought the point of the forums was to ask others for help or to be pointed in the right direction?
Do Intel's forums work differently?
I will try to solve it on my own, thanks.
What are your compiler options? And what is (are) the Host system(s)?
Vectors of 4 floats are used with the SSE series of instructions, 8 floats with AVX (256-bit), and 16 floats with AVX-512.
Jim Dempsey
Thank you Jim, that comment helped a lot.
The two processors that I'm testing with are:
Intel® Core™ i7-5700HQ Processor : Instruction Set Extensions SSE4.1/4.2 AVX 2.0
Intel® Core™ i7-5960X Processor Extreme Edition : Instruction Set Extensions SSE4.2, AVX 2.0, AES
I went into the Visual Studio properties and saw the instruction set was left at the default. I switched it to AVX and the vector lengths changed as follows:
With malloc VS default instructions: vector length 4 -> AVX: vector length 16
With cudaMallocHost VS default instructions: vector length 2 -> AVX: vector length 16
My code is now different from what I posted initially, but the basic logic is the same. When using cudaMallocHost, specifying AVX boosted performance 4x.
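(For the record, I believe the Visual Studio property just adds a compiler switch; I have not verified the exact spelling, but the command line should look roughly like one of these, with the CUDA include and library paths omitted:)

icl /O2 /Qopenmp /arch:AVX CPUManageFile.cpp
icl /O2 /Qopenmp /QxCORE-AVX2 CPUManageFile.cpp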
-----Other interesting things I noticed-----
The options/pragmas below with AVX cause the vector length to be 16 (an 11% performance gain over the next option set):
/Qvec-threshold0
#pragma omp parallel for
#pragma ivdep
#pragma vector aligned
The pragma below with AVX causes the vector length to be 8:
#pragma omp parallel for simd
The #pragma omp parallel for simd is likely partitioning the loop space based on a SIMD width built from floats (the input vector type/width).
Also, the "emulated gather" is mostly scalar instructions that pack data into a SIMD register for the subsequent SIMD operation. In other words, while the report may indicate the code is vectorized, it is not entirely vectorized. Also, when you eventually compile for a machine supporting gather, a gather load/store with a stride other than 1 is slower than a contiguous load (the speed varies with the stride).
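If you ever want the loads themselves to be unit stride, the usual fix is a structure-of-arrays layout for the positions. A sketch only (it would mean reworking your CUDA side to match, and the names below are made up for illustration):

// Split the float3 array-of-structures into separate x/z arrays so the
// vectorizer sees unit-stride loads instead of the strided-by-3 emulated gather.
float *loc_x, *loc_z;   // y is not used by the hash
cudaMallocHost((void**)&loc_x, sizeof(float)*count);
cudaMallocHost((void**)&loc_z, sizeof(float)*count);

#pragma omp parallel for
for(int h = 0; h < count; h++){
    // same quadrant mapping as hash_GPUsActiveList: 0=TL, 3=BL, 1=TR, 2=BR
    GPUsActiveList[h] = (loc_x[h] < x_half_range)
                      ? ((loc_z[h] < z_half_range) ? 0 : 3)
                      : ((loc_z[h] < z_half_range) ? 1 : 2);
}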
Jim Dempsey
Thank you for the detailed explanation.
When learning HPC I went straight into GPGPU, aside from one project with Intel MIC, so it's very interesting to learn ways to control the CPU at a lower level, as with a GPU.
