Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™
661 Discussions

DevCloud: OneAPI vector-add example parallel execution is much slower than the scalar one

GNL
Beginner
1,532 Views

Hello,

 

I tried the BaseKit-code-samples/DPC++Compiler/vector-add example as suggested in the DeveloperZone get started page:

 

https://devcloud.intel.com/oneapi/get-started/base-toolkit/

 

I measured the execution time of add_arrays_scalar() and add_arrays_parallel() functions with my RTM_START() and RTM_STOP() macros.

 

double getCPUFreq() { #define BUFLEN 110   FILE* sysinfo; char* ptr; char buf[BUFLEN]; char key[] = "cpu MHz"; int keylen = sizeof(key) - 1; double freq = -1;   sysinfo = fopen("/proc/cpuinfo", "r"); if (sysinfo != NULL) { while (fgets(buf, BUFLEN, sysinfo) != NULL) { if (!strncmp(buf, key, keylen)) { ptr = strstr(buf, ":"); freq = atof(ptr + 1) * 1000000; break; } } fclose(sysinfo); } fprintf(stderr, "Freq = %f GHz\n", freq / 1000000000); return freq; }   #define RTM_START() start = (double)_rdtsc()   #define RTM_STOP() stop = (double)_rdtsc(); \ secs = ((double)(stop - start)) / (double)getCPUFreq();

and my main function, where I took the measurements:

 

int main() { double start = 0.0, stop = 0.0, secs = 0.0;   IntArray addend_1, addend_2, sum_scalar, sum_parallel;   // Initialize arrays with values from 0 to array_size-1 initialize_array(addend_1); initialize_array(addend_2); initialize_array(sum_scalar); initialize_array(sum_parallel);   printf("CPU Freq = %10.2lf\n", (double)getCPUFreq());   start = (double)_rdtsc(); // Add arrays in scalar and in parallel add_arrays_scalar(sum_scalar, addend_1, addend_2); stop = (double)_rdtsc(); secs = ((double)(stop - start)) / (double)getCPUFreq(); printf("Scalar execution time = %2.6lf seconds\n", secs);   add_arrays_parallel(sum_parallel, addend_1, addend_2);   // Verify that the two sum arrays are equal for (size_t i = 0; i < sum_parallel.size(); i++) { if (sum_parallel[i] != sum_scalar[i]) { std::cout << "fail" << std::endl; return -1; } } std::cout << "success" << std::endl;   std::cout << "MGUNAL TEST 5 DONE" << std::endl; return 0; }

And finally my results:

 

######################################################################## # Date: Wed Mar 11 06:14:01 PDT 2020 # Job ID: 542630.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add CPU Freq = 954182000.00 Scalar execution time = 0.000085 seconds Device: Intel(R) Gen9 HD Graphics NEO Parallel execution time = 3.516770 seconds success MGUNAL TEST 5 DONE   ######################################################################## # End of output for job 542630.v-qsvr-1.aidevcloud # Date: Wed Mar 11 06:14:03 PDT 2020 ########################################################################

As seen above, parallel execution time is much more greater than the scaler execution time. So, I decided to use a SYCL host device instead of an accelerator.

 

/* // FPGA device selector: Emulator or Hardware #ifdef FPGA_EMULATOR intel::fpga_emulator_selector device_selector; #elif defined(FPGA) intel::fpga_selector device_selector; #else // Initializing the devices queue with the default selector // The device queue is used to enqueue the kernels and encapsulates // all the states needed for execution default_selector device_selector; #endif */   host_selector device_selector;

Parallel execution is still significantly slower than the scaler one..

 

######################################################################## # Date: Thu Mar 12 00:34:12 PDT 2020 # Job ID: 542971.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add CPU Freq = 1000065000.00 Scalar execution time = 0.000077 seconds Device: SYCL host device Parallel execution time = 0.318517 seconds success   ######################################################################## # End of output for job 542971.v-qsvr-1.aidevcloud # Date: Thu Mar 12 00:34:14 PDT 2020 ########################################################################

I have no idea why the parallel execution takes so long? Am I doing something wrong?

0 Kudos
8 Replies
GNL
Beginner
1,289 Views

Instead of measuring the add_arrays_parallel() function as a whole, I separated it to chunks as shown below:

 

void add_arrays_parallel(IntArray &sum, const IntArray &addend_1, const IntArray &addend_2) { double start = 0.0, stop = 0.0, secs = 0.0;   RTM_START(); std::unique_ptr<queue> q = initialize_device_queue(); RTM_STOP(); printf("Init device queue = %2.6lf seconds\n", secs);   RTM_START(); // The range of the arrays managed by the buffer range<1> num_items{ array_size };   // Buffers are used to tell DPC++ which data will be shared between the host // and the devices because they usually don't share physical memory // The pointer that's being passed as the first parameter transfers ownership // of the data to DPC++ at runtime. The destructor is called when the buffer // goes out of scope and the data is given back to the std::arrays. // The second parameter specifies the range given to the buffer. buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items); buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items); buffer<cl_int, 1> sum_buf(sum.data(), num_items); RTM_STOP(); printf("Buffer creation = %2.6lf seconds\n", secs);   RTM_START(); // queue::submit takes in a lambda that is passed in a command group handler // constructed at runtime. The lambda also contains a command group, which // contains the device-side operation and its dependencies q->submit([&](handler &h) { // Accessors are the only way to get access to the memory owned // by the buffers initialized above. The first get_access template parameter // specifies the access mode for the memory and the second template // parameter is the type of memory to access the data from; this parameter // has a default value auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h); auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);   // Note: Can use access::mode::discard_write instead of access::mode::write // because we're replacing the contents of the entire buffer. auto sum_accessor = sum_buf.template get_access<dp_write>(h);   // Use parallel_for to run array addition in parallel. This executes the // kernel. The first parameter is the number of work items to use and the // second is the kernel, a lambda that specifies what to do per work item. // The template parameter ArrayAdd is used to name the kernel at runtime. // The parameter passed to the lambda is the work item id of the current // item. // // To remove the requirement to specify the kernel name you can enable // unnamed lamdba kernels with the option: // dpcpp -fsycl-unnamed-lambda h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) { sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i]; }); }); RTM_STOP(); printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);   RTM_START(); // call wait_and_throw to catch async exception q->wait_and_throw(); RTM_STOP(); printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);   // DPC++ will enqueue and run the kernel. Recall that the buffer's data is // given back to the host at the end of the method's scope. }

When I build & run this for a SYCL host device, my results are:

 

######################################################################## # Date: Thu Mar 12 03:55:56 PDT 2020 # Job ID: 543019.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add CPU Freq = 939189000.00 Scalar execution time = 0.000086 seconds Init device queue = 0.253235 seconds Buffer creation = 0.000001 seconds Queue submission + Accessors + parallel execution time = 0.000085 seconds Q wait_and_throw() execution time = 0.000002 seconds success   ######################################################################## # End of output for job 543019.v-qsvr-1.aidevcloud # Date: Thu Mar 12 03:55:57 PDT 2020 ########################################################################

Then I found out that the initialize_device_queue() function is the slowest one. And the Queue submission + Accessors + parallel execution time is almost the same with the Scalar execution time...

 

Then I build & run this for a GPU device, and my results are:

 

######################################################################## # Date: Thu Mar 12 04:00:45 PDT 2020 # Job ID: 543025.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add CPU Freq = 1000062000.00 Scalar execution time = 0.000077 seconds Device: Intel(R) Gen9 HD Graphics NEO Init device queue = 0.422673 seconds Buffer creation = 0.000004 seconds Queue submission + Accessors + parallel execution time = 1.751712 seconds Q wait_and_throw() execution time = 0.001086 seconds success   ######################################################################## # End of output for job 543025.v-qsvr-1.aidevcloud # Date: Thu Mar 12 04:00:47 PDT 2020 ########################################################################

and the Queue submission + Accessors + parallel execution time is 1.75 seconds

 

What takes so long?

 

 

0 Kudos
AnilErinch_A_Intel
1,289 Views

Hi,

There is some overhead when using an accelerator, so you have make the array size much larger to see the advantage of parallel operation. 

In the file vector_add.cpp: Change the following, rebuild and rerun the example.

static const size_t ARRAY_SIZE = 10000;

to

static const size_t ARRAY_SIZE = 10000000;

and let us know the results

Regards

Anil

0 Kudos
GNL
Beginner
1,289 Views

Hi,

 

When I try to increase the array size as suggested, I get a seg-fault on the cloud and a stack overflow exception, on my computer using intel oneapi base toolkit (Windows) (FPGA Emulator)

// Problem size for this example //constexpr size_t array_size = 10000; constexpr size_t array_size = 100000000;

I could not find a way to increase the stack size and then decided to use a vector instead of an array.

// Define the ARRAY type for use in this example // typedef std::array<cl::sycl::cl_int, array_size> IntArray; typedef std::vector<cl::sycl::cl_int> IntArray;

and, create vectors using the v(size, val) constructor.

IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);

I tried for the same size, and checked my result vector is filled as expected. Then, I increased the array_size to 100M, and this is the result:

######################################################################## # Date: Mon Mar 30 06:10:14 PDT 2020 # Job ID: 560373.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add Vector Size = 100000000 CPU Freq (GHz)= 4.000 Scalar execution time = 0.063809 seconds Device: Intel(R) Gen9 HD Graphics NEO Parallel execution time = 1.711407 seconds success   ######################################################################## # End of output for job 560373.v-qsvr-1.aidevcloud # Date: Mon Mar 30 06:10:20 PDT 2020 ########################################################################

The code blocks I measured here:

RTM_START(); // Add arrays in scalar and in parallel add_arrays_scalar(sum_scalar, addend_1, addend_2); RTM_STOP(); printf("Scalar execution time = %2.6lf seconds\n", secs); RTM_START(); add_arrays_parallel(sum_parallel, addend_1, addend_2); RTM_STOP(); printf("Parallel execution time = %2.6lf seconds\n", secs);

Finally, I measured it again partially and this is the result:

######################################################################## # Date: Mon Mar 30 06:45:23 PDT 2020 # Job ID: 560414.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add Vector Size = 100000000 CPU Freq (GHz)= 4.000 Scalar execution time = 0.063832 seconds Device: Intel(R) Gen9 HD Graphics NEO Init device queue = 1.273701 seconds Buffer creation = 0.000002 seconds Queue submission + Accessors + parallel execution time = 0.504512 seconds Q wait_and_throw() execution time = 0.063734 seconds Parallel execution time = 3.893053 seconds success   ######################################################################## # End of output for job 560414.v-qsvr-1.aidevcloud # Date: Mon Mar 30 06:45:29 PDT 2020 ########################################################################

The blocks I measured above, inside the add_arrays_parallel() function:

RTM_START(); std::unique_ptr<queue> q = initialize_device_queue(); RTM_STOP(); printf("Init device queue = %2.6lf seconds\n", secs);   RTM_START(); // The range of the arrays managed by the buffer range<1> num_items{ array_size };   // Buffers are used to tell DPC++ which data will be shared between the host // and the devices because they usually don't share physical memory // The pointer that's being passed as the first parameter transfers ownership // of the data to DPC++ at runtime. The destructor is called when the buffer // goes out of scope and the data is given back to the std::arrays. // The second parameter specifies the range given to the buffer. buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items); buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items); buffer<cl_int, 1> sum_buf(sum.data(), num_items); RTM_STOP(); printf("Buffer creation = %2.6lf seconds\n", secs);   RTM_START(); // queue::submit takes in a lambda that is passed in a command group handler // constructed at runtime. The lambda also contains a command group, which // contains the device-side operation and its dependencies q->submit([&](handler &h) { // Accessors are the only way to get access to the memory owned // by the buffers initialized above. The first get_access template parameter // specifies the access mode for the memory and the second template // parameter is the type of memory to access the data from; this parameter // has a default value auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h); auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);   // Note: Can use access::mode::discard_write instead of access::mode::write // because we're replacing the contents of the entire buffer. auto sum_accessor = sum_buf.template get_access<dp_write>(h);   // Use parallel_for to run array addition in parallel. This executes the // kernel. The first parameter is the number of work items to use and the // second is the kernel, a lambda that specifies what to do per work item. // The template parameter ArrayAdd is used to name the kernel at runtime. // The parameter passed to the lambda is the work item id of the current // item. // // To remove the requirement to specify the kernel name you can enable // unnamed lamdba kernels with the option: // dpcpp -fsycl-unnamed-lambda h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) { sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i]; }); }); RTM_STOP(); printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);   RTM_START(); // call wait_and_throw to catch async exception q->wait_and_throw(); RTM_STOP(); printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);

There is still a big difference between Scalar execution time = 0.063832 seconds and Queue submission + Accessors + parallel execution time = 0.504512 seconds blocks.

 

Parallel exec seems 10 times slower than the scalar one. Did I miss something here?

 

 

0 Kudos
GNL
Beginner
1,289 Views

In addition to this,

 

I found this about the sycl profiling events,

 

https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling

 

https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145

 

void profile(event_list& eventList, const time_point_list& startTimeList) { if (startTimeList.size() != eventList.size()) { std::string errMsg = "The number of events do not match the number of starting time " "points."; throw std::runtime_error("Profiling Error:\n" + errMsg); }   T cgSubmissionTime = 0; T kernExecutionTime = 0; T realExecutionTime = 0; const auto eventCount = eventList.size(); for (size_t i = 0; i < eventCount; ++i) { auto curEvent = eventList.at(i); curEvent.wait(); auto curStartTime = startTimeList.at(i);   const auto end = wall_clock_t::now(); time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime; realExecutionTime += curRealExecutionTime.count();   const auto cgSubmissionTimePoint = curEvent.template get_profiling_info< cl::sycl::info::event_profiling::command_submit>(); const auto startKernExecutionTimePoint = curEvent.template get_profiling_info< cl::sycl::info::event_profiling::command_start>(); const auto endKernExecutionTimePoint = curEvent.template get_profiling_info< cl::sycl::info::event_profiling::command_end>();   cgSubmissionTime += to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint); kernExecutionTime += to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint); }   set_command_group_submission_time(cgSubmissionTime); set_kernel_execution_time(kernExecutionTime); set_real_execution_time(realExecutionTime); }

And applied the steps to my vector addition code:

void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC, queue deviceQueue) {   // print out the device information used for the kernel code std::cout << "Device: " << deviceQueue.get_device().get_info<info::device::name>() << std::endl;   // set up profiling data containers using wall_clock_t = std::chrono::high_resolution_clock; using time_point_t = std::chrono::time_point<wall_clock_t>;   int profiling_iters = 1;   std::vector<cl::sycl::event> eventList(profiling_iters); std::vector<time_point_t> startTimeList(profiling_iters);   // create the range object for the arrays managed by the buffer range<1> num_items{ array_size };   buffer<int, 1> bufferA(VA.data(), num_items); buffer<int, 1> bufferB(VB.data(), num_items); buffer<int, 1> bufferC(VC.data(), num_items);   // Submit a kernel to the queue, returns a SYCL event for (size_t i = 0; i < profiling_iters; ++i) { startTimeList.at(i) = wall_clock_t::now(); eventList.at(i) = deviceQueue.submit([&](handler &cgh) { auto accessorA = bufferA.get_access<dp_read>(cgh); auto accessorB = bufferB.get_access<dp_read>(cgh); auto accessorC = bufferC.get_access<dp_write>(cgh);   cgh.parallel_for(num_items, [=](id<1> j) { accessorC[j] = accessorA[j] + accessorB[j]; }); }); }   // exec profile example_profiler<double> my_profiler(eventList, startTimeList); std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time() << " msec" << std::endl; std::cout << "Cmd Group submission: " << my_profiler.get_command_group_submission_time() << " msec" << std::endl; std::cout << "Real exec: " << my_profiler.get_real_execution_time() << " msec" << std::endl; }

Finally, this is the result for a 10K array, on the cloud:

######################################################################## # Date: Wed Apr 1 08:26:52 PDT 2020 # Job ID: 561887.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add Device: Intel(R) Gen9 HD Graphics NEO Kernel exec: 0.150332 msec Cmd Group submission: 2.66203 msec Real exec: 167.013 msec VectorAddInDPCPP exec: 232.501 msec Scalar exec: 0.095 msec success   ######################################################################## # End of output for job 561887.v-qsvr-1.aidevcloud # Date: Wed Apr 1 08:26:57 PDT 2020 ########################################################################

As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long and is there a way to improve/get rid of this waiting time?

 

Melih

0 Kudos
AnilErinch_A_Intel
1,289 Views

Hi

Hope things are going fine there.

 

For processing the loop efficiently in the FPGA in parallel fashion , loop unrolling can be applied. This ensures that different execution units are created to handle the summation. Without using this the parallel and scalar operation performance will not differ much , and parallel will also involve setup and related time.

So to extract performance out of the FPGA please perform a loop unrolling

      #pragma unroll UNROLL_FACTOR

before the for loop exection.

You can try with different UNROLL_FACTORS , as twos multiples like 2 , 4 ,8 ,16 etc

and then compare the performance with the scalar version.

Please find an example below

cgh.single_task<class covariance>(

[=]()

{

           /* Accessor related code HERE */

           #pragma unroll UNROLL_FACTOR

           For (int j=0;j<num_items, j++)

   {

         accessorC[j] = accessorA[j] + accessorB[j];

   }

}

Thanks and Regards

Anil

0 Kudos
GNL
Beginner
1,289 Views

Hi,

Everything is fine for now, I hope it's the same there..

 

I have some experience on DSP programming and I am familiar with some of the low level optimization methods.. At the last example I gave in the prev. post you can see that the CPU exec time is about 1600+ times faster. In this case, I think our problem is not that low level. I think the main problem here is the JIT Compiler overhead.

 

dpcpp_compile.PNG

 

 

This picture is from the book: https://www.apress.com/us/data-parallel-c-advanced-chapters-just-released/17382670

 

 

On the cloud, I will try Ahead-Of-Time compiling, as suggested in here: https://software.intel.com/en-us/oneapi-dpcpp-compiler-dev-guide-and-reference-ahead-of-time-compilation

 

I will notify you with the results. And maybe we need to move this subject into another topic.

 

GNL

 

 

0 Kudos
GNL
Beginner
1,289 Views

Hi again, it seems like ahead-of-time compiling drastically reduced our run-time performance for the device. On the other hand, it increased the host execution time..

 

This is the result for AOT compiling:

######################################################################## # Date: Tue Apr 14 05:18:06 PDT 2020 # Job ID: 573459.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add Vector Size: 100000 ------------------------------------------- Device: Intel(R) Gen9 HD Graphics NEO Kernel exec: 0.088999 msec Cmd Group submission: 3.9869 msec Real exec: 9.85082 msec VectorAddInDPCPP exec: 17.543 msec Scalar exec: 1.785 msec success   ######################################################################## # End of output for job 573459.v-qsvr-1.aidevcloud # Date: Tue Apr 14 05:18:12 PDT 2020 ########################################################################

And this is the result for JIT compiling:

######################################################################## # Date: Tue Apr 14 05:36:10 PDT 2020 # Job ID: 573469.v-qsvr-1.aidevcloud # User: u38134 # Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00 ########################################################################   :: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force ./vector-add Vector Size: 100000 ------------------------------------------- Device: Intel(R) Gen9 HD Graphics NEO Kernel exec: 0.149666 msec Cmd Group submission: 3.13661 msec Real exec: 175.897 msec VectorAddInDPCPP exec: 243.523 msec Scalar exec: 0.099 msec success   ######################################################################## # End of output for job 573469.v-qsvr-1.aidevcloud # Date: Tue Apr 14 05:36:15 PDT 2020 ########################################################################

 

And finally this my makefile for AOT:

CXX = dpcpp #CXXFLAGS = -O2 -g #LDFLAGS = -lOpenCL -lsycl EXE_NAME = vector-add SOURCES = src/vector-add.cpp   all: main   main: $(CXX) -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend '-device skl' -o $(EXE_NAME) $(SOURCES)   run: ./$(EXE_NAME)   clean: rm -rf $(EXE_NAME)

 

Did I skip something here? We increased the device kernel exec performance but why the host performance (Scalar exec above) is suffering now?

 

GNL

0 Kudos
AnilErinch_A_Intel
1,289 Views

Hi

Hope you are staying safe.

Can you incorporate the loop unrolling as mentioned previously , along with the ahead of time compilation and check the results.

Thanks and Regards

Anil

0 Kudos
Reply