Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to get vector support for CUDA pinned memory?

Greg_G_
Beginner
1,498 Views

I am writing a portion of parallel CPU code to assist with GPGPU CUDA code. Using Intel C++ Compiler 16.0, as well as Intel IPP Multi-threaded static library.

Initial testing of the parallel CPU code using OpenMP, computing a series of two hashing functions and reduces on the results, had a timing of 6ms each for 8million elements each. But then when using cudaMallocHost for memory the code to the operation above, the timings changed to 9ms and 13ms for each of the operations.

Looking at the vector report when using cudaMallocHost it states "vector dependence" for the loop

Below is a sample of run-able code to show the issue.

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <string.h>
#include <windows.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <malloc.h>
#include <omp.h>

#include <ipps.h>
#include <ippcore.h>
#include <ippvm.h>

struct timer_ms {
    double PCFreq;
    __int64 CounterStart;
};
void StartCounter_ms(timer_ms &t)
{
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        printf("QueryPerformanceFrequency failed!\n");

    t.PCFreq = double(li.QuadPart)/1000.0;

    QueryPerformanceCounter(&li);
    t.CounterStart = li.QuadPart;
}
double GetCounter_ms(timer_ms &t)
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return double(li.QuadPart-t.CounterStart)/t.PCFreq;
}

Ipp8u hash_GPUsActiveList(float3 seg, int x_range, int z_range){
    if(seg.x < (x_range*0.5f)){
        if(seg.z < (z_range*0.5f)){
            return 0; // Top Left
        }else{
            return 3; // Bottom Left
        }
    }else{
        if(seg.z < (z_range*0.5f)){
            return 1; // Top Right
        }else{
            return 2; // Bottom Right
        }
    }
}
void histoActive(Ipp8u * GPUsActiveList,int count,int *activeHistogram){
        const int num_buckets = 4;
#pragma omp parallel
        {
            int i;
            int activeCountsGPU_private[num_buckets] = {0};
#pragma omp for nowait
            for(i=0; i<count; i++) { 
                activeCountsGPU_private[GPUsActiveList]++;
                //printf("Cnt %d  Thread %d ", i, omp_get_thread_num());
            }      
#pragma omp critical 
            {
                for(i=0; i<num_buckets; i++) activeHistogram += activeCountsGPU_private;
            }
        }
    }
void hashAndCountActiveSeg(Ipp8u * GPUActiveList,int count,int *activeCntGPU, float3 * locHost, int xRange, int zRange){
        #pragma omp parallel for
            for(int h = 0; h < count; h++)
                GPUActiveList = hash_GPUsActiveList(locHost, xRange, zRange);
            for(int i=0; i<4; i++) activeCntGPU = 0;
            histoActive(GPUActiveList, count, activeCntGPU);
    }

int main() {
        timer_ms timerTwo;
        timerTwo.CounterStart = 0;
        timerTwo.PCFreq = 0;
        double timerActive = 0.0;

        static const int maxThreads = omp_get_max_threads();
        static const int count = 8000000;
        int bufferSize = 0;
        int x_range = 5000; int z_range = 5000;

        size_t locDataSize = sizeof(float3)*count;

        float3 * locations_Host; 
        cudaMallocHost((void**)&locations_Host, locDataSize);
        //locations_Host = (float3*)malloc(locDataSize);
        for(int h = 0; h < count; h++){
            locations_Host.x = rand() % x_range;
            locations_Host.y = rand() % 2;
            locations_Host.z = rand() % z_range;
        }

        Ipp8u * GPUsActiveList = ippsMalloc_8u(sizeof(Ipp8u)*count);

        omp_set_num_threads(maxThreads);

        double loopCount = 30;
        int activeCountsGPU[4];
        for(int t = 0; t < loopCount+10; t++){ 
            StartCounter_ms(timerTwo);
#pragma omp parallel for
            for(int h = 0; h < count; h++)
                GPUsActiveList = hash_GPUsActiveList(locations_Host, x_range, z_range);
            for(int i=0; i<4; i++) activeCountsGPU = 0;
            histoActive(GPUsActiveList, count, activeCountsGPU);
            timerActive += GetCounter_ms(timerTwo);

            if(t == 9){
                timerActive = 0.0;
            }
        }
        printf("Average for GPUsActiveList:\t%5.2f ms\n", (float)timerActive/loopCount);
        //free(locations_Host);
        cudaFree(locations_Host);
        ippsFree(GPUsActiveList);
        cudaDeviceReset();
        _getch();
        return 0;
    }

Using Malloc:

LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(110,4)
   remark #15388: vectorization support: reference GPUsActiveList has aligned access   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(111,5) ]
   remark #15305: vectorization support: vector length 16
   remark #15309: vectorization support: normalized vectorization overhead 0.090
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15460: masked strided loads: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 36 
   remark #15477: vector loop cost: 25.810 
   remark #15478: estimated potential speedup: 1.390 
   remark #15488: --- end vector loop cost summary ---
LOOP END

Using cudaMallocHost:

LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(110,4)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between locations_Host line 111 and GPUsActiveList line 111
   remark #15346: vector dependence: assumed FLOW dependence between GPUsActiveList line 111 and locations_Host line 111
LOOP END

I tried using commands such as #pragma ivdep and #pragma vector {aligned | always} but I was unsuccessful. 

Thanks, 

Greg

0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
1,498 Views

What are your compiler options? And what is (are) the Host system(s)?

Vectors of 4 (floats) are used with SSE series instructions 8 (floats) with AVX256 and 16 floats AVX512.

Jim Dempsey

View solution in original post

0 Kudos
13 Replies
TimP
Honored Contributor III
1,498 Views

The compiler doesn't know the alignment of cudamalloc.  If you believe it is aligned, you could use __assume_aligned to assert alignment of your array.  Granted, if that works, it makes the dependency message look misleading.

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,498 Views

Not pertinent to your question...

Why are you not using:

Ipp8u hash_GPUsActiveList(float3 seg, float x_half_range, float z_half_range){
    if(seg.x < x_half_range){
        if(seg.z < z_half_range){
            return 0; // Top Left
        }else{
            return 3; // Bottom Left
        }
    }else{
        if(seg.z < z_half_range){
            return 1; // Top Right
        }else{
            return 2; // Bottom Right
        }
    }
}

...
float x_half_range = x_range * 0.5f;
float z_half_range = z_range * 0.5f;
#pragma omp parallel for
            for(int h = 0; h < count; h++)
                GPUsActiveList = hash_GPUsActiveList(locations_Host, x_half_range, z_half_range);
...

Jim Dempsey

0 Kudos
Anoop_M_Intel
Employee
1,498 Views

A quick experiment shows that the following. The behavior is same even if you use the unpinned equivalent of cudaMallocHost which is cudaMalloc. 

1. using malloc:

LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
   remark #15388: vectorization support: reference GPUsActiveList has aligned access   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15305: vectorization support: vector length 16
   remark #15309: vectorization support: normalized vectorization overhead 0.090
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15460: masked strided loads: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 36 
   remark #15477: vector loop cost: 25.810 
   remark #15478: estimated potential speedup: 1.390 
   remark #15488: --- end vector loop cost summary ---
LOOP END

2. using cudaMalloc:

LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
   remark #15389: vectorization support: reference GPUsActiveList has unaligned access   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.115
   remark #15451: unmasked unaligned unit stride stores: 1 
   remark #15460: masked strided loads: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 39 
   remark #15477: vector loop cost: 43.500 
   remark #15478: estimated potential speedup: 0.890 
   remark #15487: type converts: 1 
   remark #15488: --- end vector loop cost summary ---
LOOP END

3. using cudaMallocHost:

LOOP BEGIN at C:\Users\amadhuso\Downloads\testcuda.cpp(111,13)
   remark #15389: vectorization support: reference GPUsActiveList has unaligned access   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,17) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15328: vectorization support: gather was emulated for the variable locations_Host:  strided by 3   [ C:\Users\amadhuso\Downloads\testcuda.cpp(112,57) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.115
   remark #15451: unmasked unaligned unit stride stores: 1 
   remark #15460: masked strided loads: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 39 
   remark #15477: vector loop cost: 43.500 
   remark #15478: estimated potential speedup: 0.890 
   remark #15487: type converts: 1 
   remark #15488: --- end vector loop cost summary ---
LOOP END

Thanks and Regards
Anoop

0 Kudos
Greg_G_
Beginner
1,498 Views

Thanks for the responses,

Tim: I tried adding __assume_aligned but my project did not recognize that command. I'm using windows if that makes a difference. However alignment doesn't seem to be the issue currently. 

Jim: Thanks, that helped a bit too. I was more focused on other issues so I didn't see that. 

Anoop: I'm unsure how you tested vectorization with cudaMalloc, cudaMalloc'ed memory isn't accessible by the host.

0 Kudos
TimP
Honored Contributor III
1,498 Views
I wouldn't try assume aligned without looking it up. Not easy on cell
0 Kudos
Greg_G_
Beginner
1,498 Views

Looking at the statements in Anoop's comment the vector report only mentions GPUsActiveList being aligned or unaligned. But that data is never changed in the code variations being tested. 

It is always Ipp8u * GPUsActiveList = ippsMalloc_8u(sizeof(Ipp8u)*count);

So does the data in locations_Host affect the alignment of GPUsActiveList?

0 Kudos
Greg_G_
Beginner
1,498 Views

I was able to force the loop to vectorize however it did so in a slightly different way that normally occurs with malloc'ed memory. 

I used: 
 

#pragma omp parallel for
#pragma ivdep
#pragma vector aligned
			for(int h = 0; h < count; h++){
				GPUsActiveList = hash_GPUsActiveList(locations_Host, x_half_range, z_half_range);
			}

Then I also specified the command "/Qvec-threshold0" 

However this seems not ideal to specify the threshold above for an entire project. 

LOOP BEGIN at C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(117,4)
   remark #15388: vectorization support: reference GPUsActiveList has aligned access   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
   remark #15328: vectorization support: gather was emulated for the variable (unknown):  strided by 3   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
   remark #15328: vectorization support: gather was emulated for the variable (unknown):  strided by 3   [ C:\Users\one\Documents\Visual Studio 2012\Projects\CPUManageWithCuda_stackOverflow\CPUManageWithCuda\CPUManageFile.cpp(118,5) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.099
   remark #15300: LOOP WAS VECTORIZED
   remark #15451: unmasked unaligned unit stride stores: 1 
   remark #15460: masked strided loads: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 38 
   remark #15477: vector loop cost: 42.750 
   remark #15478: estimated potential speedup: 0.880 
   remark #15487: type converts: 1 
   remark #15488: --- end vector loop cost summary ---
LOOP END

When forcing vectorization it has a vector length of 4 instead of 16 and estimates there being a performance loss. 
Being forced the performance is a bit less predictable (could be partly because I'm on a laptop too). The run time is consistently around 6 ms though like the malloc'ed verson on my computer. Instead of about 9.5ms for non-vectorized. 

Is there a safe way to force only one loop to vectorize by setting its threshold in a #paragma

Still reading though all the documentation on vectorization, but I thought I would ask here. Is there any way to forcibly control "vectorization support: vector length"?

0 Kudos
TimP
Honored Contributor III
1,498 Views

The choice of shorter vector length is likely on account of allowing for unaligned data. I've got to stop replying if you're not interested. My metered data plan is on me.

Your interest in additional pragmas ought to be an incentive to open up the docs. For Intel specific pragma you have the HTML docs index installed with the compiler.  For openmp-simd you may try to understand the official docs at openmp.org.

0 Kudos
Greg_G_
Beginner
1,498 Views

I don't understand your statement "I've got to stop replying if you're not interested. My metered data plan is on me."  I've never said I'm not interested I only said I wasn't able to test the one thing you said. I have still been looking on how to get my project to recognize the command __assume_aligned.

I thought the use of the forms was to ask others for help or to be pointed in the right direction?  
Do Intel's forms work differently? 

I will try to solve it on my own, thanks. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,499 Views

What are your compiler options? And what is (are) the Host system(s)?

Vectors of 4 (floats) are used with SSE series instructions 8 (floats) with AVX256 and 16 floats AVX512.

Jim Dempsey

0 Kudos
Greg_G_
Beginner
1,498 Views

Thank you Jim that comment helped a lot. 

The two processors that I'm testing with are:
Intel® Core™ i7-5700HQ Processor : Instruction Set Extensions    SSE4.1/4.2 AVX 2.0

Intel® Core™ i7-5960X Processor Extreme Edition Instruction Set Extensions    SSE4.2, AVX 2.0, AES

I went into the VisualStudio properties and saw the instruction set was left to default. I switched to AVX and the vector lengths changed from:

With malloc VS default instructions: vector length 4 -> AVX: vector length 16

With cudaMallocHost VS default instructions: vector length 2 -> AVX: vector length 16

My code is now different from what I posted initially, but same basic logic. When using cudaMallocHost, specifying AVX boosted performace 4x. 

-----Other interesting things I noticed-----

The options/pragma below with AVX cause the vector length to be 16 (11% performance gain over the next options)

/Qvec-threshold0
#pragma omp parallel for
#pragma ivdep
#pragma vector aligned

The pragma below with AVX cause the vector length to be 8

#pragma omp parallel for simd

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,498 Views

The #pragma omp parallel for simd is likely partitioning the loop space based on a SIMD constructed of floats (the input vector type/width).

Also, "simulated gather" are mostly scalar instructions to pack into SIMD register for subsequent SIMD operation. IOW, while the report may indicate code is vectorized, the code is not entirely vectorized. Also, when you eventually compile for a machine supporting gather, a gather load/store with stride other than 1, is slower than a contiguous load (speed varies with stride).

Jim Dempsey

0 Kudos
Greg_G_1
Beginner
1,498 Views

Thank you for the detailed explanation. 

When learning HPC I went straight into GPGPU, aside from one project with Intel MIC. So its very interesting to learn the ways to control the CPU on a lower level like a GPU. 

0 Kudos
Reply