Hello,
I tried the BaseKit-code-samples/DPC++Compiler/vector-add example, as suggested on the Developer Zone Get Started page:
https://devcloud.intel.com/oneapi/get-started/base-toolkit/
I measured the execution time of the add_arrays_scalar() and add_arrays_parallel() functions with my own RTM_START() and RTM_STOP() macros.
// Needs <stdio.h>, <stdlib.h>, and <string.h>
double getCPUFreq() {
#define BUFLEN 110
    FILE* sysinfo;
    char* ptr;
    char buf[BUFLEN];
    char key[] = "cpu MHz";
    int keylen = sizeof(key) - 1;
    double freq = -1;
    sysinfo = fopen("/proc/cpuinfo", "r");
    if (sysinfo != NULL) {
        while (fgets(buf, BUFLEN, sysinfo) != NULL) {
            if (!strncmp(buf, key, keylen)) {
                ptr = strstr(buf, ":");
                freq = atof(ptr + 1) * 1000000;  // MHz -> Hz
                break;
            }
        }
        fclose(sysinfo);
    }
    fprintf(stderr, "Freq = %f GHz\n", freq / 1000000000);
    return freq;
}
// _rdtsc() is declared in <x86intrin.h> with GCC/Clang
#define RTM_START() start = (double)_rdtsc()
#define RTM_STOP() stop = (double)_rdtsc(); \
    secs = ((double)(stop - start)) / (double)getCPUFreq();
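For reference, the macros assume that start, stop, and secs are doubles already declared in the enclosing scope, and that <x86intrin.h> is included for _rdtsc(). A minimal usage sketch (compute() is a hypothetical function standing in for the region being timed):

#include <x86intrin.h>  // _rdtsc

double start = 0.0, stop = 0.0, secs = 0.0;
RTM_START();
compute();  // hypothetical work being timed
RTM_STOP();
printf("compute = %2.6lf seconds\n", secs);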
and my main function, where I took the measurements:
int main() {
    double start = 0.0, stop = 0.0, secs = 0.0;
    IntArray addend_1, addend_2, sum_scalar, sum_parallel;

    // Initialize arrays with values from 0 to array_size-1
    initialize_array(addend_1);
    initialize_array(addend_2);
    initialize_array(sum_scalar);
    initialize_array(sum_parallel);

    printf("CPU Freq = %10.2lf\n", (double)getCPUFreq());

    start = (double)_rdtsc();
    // Add arrays in scalar and in parallel
    add_arrays_scalar(sum_scalar, addend_1, addend_2);
    stop = (double)_rdtsc();
    secs = ((double)(stop - start)) / (double)getCPUFreq();
    printf("Scalar execution time = %2.6lf seconds\n", secs);

    add_arrays_parallel(sum_parallel, addend_1, addend_2);

    // Verify that the two sum arrays are equal
    for (size_t i = 0; i < sum_parallel.size(); i++) {
        if (sum_parallel[i] != sum_scalar[i]) {
            std::cout << "fail" << std::endl;
            return -1;
        }
    }
    std::cout << "success" << std::endl;
    std::cout << "MGUNAL TEST 5 DONE" << std::endl;
    return 0;
}
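As an aside: _rdtsc() ticks at the TSC base frequency, while the "cpu MHz" field in /proc/cpuinfo reports the current (possibly scaled-down) frequency, so the conversion above can drift when the governor changes clocks. A portable alternative would be std::chrono::steady_clock; a minimal sketch using the same function names as above:

#include <chrono>
#include <cstdio>

// steady_clock is monotonic and independent of CPU frequency scaling
auto t0 = std::chrono::steady_clock::now();
add_arrays_scalar(sum_scalar, addend_1, addend_2);
auto t1 = std::chrono::steady_clock::now();
double elapsed = std::chrono::duration<double>(t1 - t0).count();
printf("Scalar execution time = %2.6lf seconds\n", elapsed);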
And finally my results:
########################################################################
# Date: Wed Mar 11 06:14:01 PDT 2020
# Job ID: 542630.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 954182000.00
Scalar execution time = 0.000085 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 3.516770 seconds
success
MGUNAL TEST 5 DONE
########################################################################
# End of output for job 542630.v-qsvr-1.aidevcloud
# Date: Wed Mar 11 06:14:03 PDT 2020
########################################################################
As seen above, the parallel execution time is much greater than the scalar one, so I decided to use a SYCL host device instead of an accelerator:
/*
// FPGA device selector: Emulator or Hardware
#ifdef FPGA_EMULATOR
intel::fpga_emulator_selector device_selector;
#elif defined(FPGA)
intel::fpga_selector device_selector;
#else
// Initializing the devices queue with the default selector
// The device queue is used to enqueue the kernels and encapsulates
// all the states needed for execution
default_selector device_selector;
#endif
*/
host_selector device_selector;
Parallel execution is still significantly slower than the scalar one:
########################################################################
# Date: Thu Mar 12 00:34:12 PDT 2020
# Job ID: 542971.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 1000065000.00
Scalar execution time = 0.000077 seconds
Device: SYCL host device
Parallel execution time = 0.318517 seconds
success
########################################################################
# End of output for job 542971.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 00:34:14 PDT 2020
########################################################################
I have no idea why the parallel execution takes so long. Am I doing something wrong?
Instead of measuring the add_arrays_parallel() function as a whole, I split it into chunks as shown below:
void add_arrays_parallel(IntArray &sum, const IntArray &addend_1, const IntArray &addend_2) {
    double start = 0.0, stop = 0.0, secs = 0.0;

    RTM_START();
    std::unique_ptr<queue> q = initialize_device_queue();
    RTM_STOP();
    printf("Init device queue = %2.6lf seconds\n", secs);

    RTM_START();
    // The range of the arrays managed by the buffer
    range<1> num_items{ array_size };
    // Buffers are used to tell DPC++ which data will be shared between the host
    // and the devices because they usually don't share physical memory.
    // The pointer passed as the first parameter transfers ownership of the data
    // to DPC++ at runtime. The destructor is called when the buffer goes out of
    // scope and the data is given back to the std::arrays.
    // The second parameter specifies the range given to the buffer.
    buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
    buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
    buffer<cl_int, 1> sum_buf(sum.data(), num_items);
    RTM_STOP();
    printf("Buffer creation = %2.6lf seconds\n", secs);

    RTM_START();
    // queue::submit takes a lambda that is passed a command group handler
    // constructed at runtime. The lambda contains a command group, which
    // holds the device-side operation and its dependencies.
    q->submit([&](handler &h) {
        // Accessors are the only way to get access to the memory owned by the
        // buffers initialized above. The first get_access template parameter
        // specifies the access mode for the memory; the second is the type of
        // memory to access the data from and has a default value.
        auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
        auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
        // Note: access::mode::discard_write can be used instead of
        // access::mode::write because we're replacing the contents of the
        // entire buffer.
        auto sum_accessor = sum_buf.template get_access<dp_write>(h);
        // Use parallel_for to run array addition in parallel. The first
        // parameter is the number of work items to use and the second is the
        // kernel, a lambda that specifies what to do per work item. The
        // template parameter ArrayAdd is used to name the kernel at runtime.
        // The parameter passed to the lambda is the work-item id of the
        // current item.
        //
        // To remove the requirement to specify the kernel name you can enable
        // unnamed lambda kernels with the option: dpcpp -fsycl-unnamed-lambda
        h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
            sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
        });
    });
    RTM_STOP();
    printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);

    RTM_START();
    // Call wait_and_throw to catch async exceptions
    q->wait_and_throw();
    RTM_STOP();
    printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);

    // DPC++ will enqueue and run the kernel. Recall that the buffer's data is
    // given back to the host at the end of the method's scope.
}
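One block the timers above don't capture: the results are copied back to the host only when the buffers are destroyed at the end of the function (as the final comment says). A sketch of how that copy-back could be timed, by giving the buffers an inner scope (illustrative only, not the sample's code):

RTM_START();
{
    buffer<cl_int, 1> sum_buf(sum.data(), num_items);
    // ... create the other buffers, submit the kernel, wait ...
}   // buffer destructors run here and write the results back to sum
RTM_STOP();
printf("Kernel + copy-back = %2.6lf seconds\n", secs);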
When I build & run this for a SYCL host device, my results are:
########################################################################
# Date: Thu Mar 12 03:55:56 PDT 2020
# Job ID: 543019.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 939189000.00
Scalar execution time = 0.000086 seconds
Init device queue = 0.253235 seconds
Buffer creation = 0.000001 seconds
Queue submission + Accessors + parallel execution time = 0.000085 seconds
Q wait_and_throw() execution time = 0.000002 seconds
success
########################################################################
# End of output for job 543019.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 03:55:57 PDT 2020
########################################################################
Then I found out that the initialize_device_queue() function is the slowest part, and the "Queue submission + Accessors + parallel execution" time is almost the same as the scalar execution time.
Then I built and ran this for a GPU device, and my results are:
########################################################################
# Date: Thu Mar 12 04:00:45 PDT 2020
# Job ID: 543025.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 1000062000.00
Scalar execution time = 0.000077 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 0.422673 seconds
Buffer creation = 0.000004 seconds
Queue submission + Accessors + parallel execution time = 1.751712 seconds
Q wait_and_throw() execution time = 0.001086 seconds
success
########################################################################
# End of output for job 543025.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 04:00:47 PDT 2020
########################################################################
and the "Queue submission + Accessors + parallel execution" time alone is 1.75 seconds. What takes so long?
Hi,
There is some overhead when using an accelerator, so you have to make the array size much larger to see the advantage of parallel operation.
In the file vector_add.cpp: Change the following, rebuild and rerun the example.
static const size_t ARRAY_SIZE = 10000;
to
static const size_t ARRAY_SIZE = 10000000;
and let us know the results.
Regards
Anil
Hi,
When I try to increase the array size as suggested, I get a segfault on the cloud, and a stack-overflow exception on my computer using the Intel oneAPI Base Toolkit (Windows, FPGA emulator):
// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;
I could not find a way to increase the stack size, so I decided to use a vector instead of an array:
// Define the ARRAY type for use in this example
// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
typedef std::vector<cl::sycl::cl_int> IntArray;
and create the vectors using the v(size, val) constructor:
IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);
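For what it's worth, another way around the stack overflow while keeping std::array would be to heap-allocate the arrays; a sketch, assuming the original IntArray typedef and <memory> included:

// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
auto addend_1 = std::make_unique<IntArray>();  // zero-initialized, lives on the heap
auto addend_2 = std::make_unique<IntArray>();
initialize_array(*addend_1);
initialize_array(*addend_2);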
I tried the same size first and checked that the result vector is filled as expected. Then I increased array_size to 100M, and this is the result:
########################################################################
# Date: Mon Mar 30 06:10:14 PDT 2020
# Job ID: 560373.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################
These are the code blocks I measured:
RTM_START();
// Add arrays in scalar and in parallel
add_arrays_scalar(sum_scalar, addend_1, addend_2);
RTM_STOP();
printf("Scalar execution time = %2.6lf seconds\n", secs);
RTM_START();
add_arrays_parallel(sum_parallel, addend_1, addend_2);
RTM_STOP();
printf("Parallel execution time = %2.6lf seconds\n", secs);
Finally, I measured it again block by block, and this is the result:
########################################################################
# Date: Mon Mar 30 06:45:23 PDT 2020
# Job ID: 560414.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################
The blocks I measured are the same ones inside the add_arrays_parallel() function, as shown in my earlier post (Init device queue, Buffer creation, Queue submission + Accessors + parallel execution, and wait_and_throw).
There is still a big difference between the scalar execution time (0.063832 seconds) and the "Queue submission + Accessors + parallel execution" block (0.504512 seconds). The parallel execution seems roughly eight times slower than the scalar one. Did I miss something here?
In addition to this, I found these resources about SYCL profiling events:
https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling
https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145
void profile(event_list& eventList, const time_point_list& startTimeList) {
    if (startTimeList.size() != eventList.size()) {
        std::string errMsg =
            "The number of events do not match the number of starting time "
            "points.";
        throw std::runtime_error("Profiling Error:\n" + errMsg);
    }
    T cgSubmissionTime = 0;
    T kernExecutionTime = 0;
    T realExecutionTime = 0;
    const auto eventCount = eventList.size();
    for (size_t i = 0; i < eventCount; ++i) {
        auto curEvent = eventList.at(i);
        curEvent.wait();
        auto curStartTime = startTimeList.at(i);
        const auto end = wall_clock_t::now();
        time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime;
        realExecutionTime += curRealExecutionTime.count();
        const auto cgSubmissionTimePoint = curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_submit>();
        const auto startKernExecutionTimePoint =
            curEvent.template get_profiling_info<
                cl::sycl::info::event_profiling::command_start>();
        const auto endKernExecutionTimePoint =
            curEvent.template get_profiling_info<
                cl::sycl::info::event_profiling::command_end>();
        cgSubmissionTime +=
            to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
        kernExecutionTime +=
            to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
    }
    set_command_group_submission_time(cgSubmissionTime);
    set_kernel_execution_time(kernExecutionTime);
    set_real_execution_time(realExecutionTime);
}
I then applied the approach to my vector-addition code:
void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC, queue deviceQueue) {
    // Print out the device information used for the kernel code
    std::cout << "Device: " << deviceQueue.get_device().get_info<info::device::name>() << std::endl;

    // Set up profiling data containers
    using wall_clock_t = std::chrono::high_resolution_clock;
    using time_point_t = std::chrono::time_point<wall_clock_t>;
    const size_t profiling_iters = 1;
    std::vector<cl::sycl::event> eventList(profiling_iters);
    std::vector<time_point_t> startTimeList(profiling_iters);

    // Create the range object for the arrays managed by the buffer
    range<1> num_items{ array_size };
    buffer<int, 1> bufferA(VA.data(), num_items);
    buffer<int, 1> bufferB(VB.data(), num_items);
    buffer<int, 1> bufferC(VC.data(), num_items);

    // Submit a kernel to the queue; each submit returns a SYCL event
    for (size_t i = 0; i < profiling_iters; ++i) {
        startTimeList.at(i) = wall_clock_t::now();
        eventList.at(i) = deviceQueue.submit([&](handler &cgh) {
            auto accessorA = bufferA.get_access<dp_read>(cgh);
            auto accessorB = bufferB.get_access<dp_read>(cgh);
            auto accessorC = bufferC.get_access<dp_write>(cgh);
            cgh.parallel_for(num_items, [=](id<1> j) {
                accessorC[j] = accessorA[j] + accessorB[j];
            });
        });
    }

    // Run the profiler over the collected events and start times
    example_profiler<double> my_profiler(eventList, startTimeList);
    std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time() << " msec" << std::endl;
    std::cout << "Cmd Group submission: " << my_profiler.get_command_group_submission_time() << " msec" << std::endl;
    std::cout << "Real exec: " << my_profiler.get_real_execution_time() << " msec" << std::endl;
}
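One prerequisite the code above relies on: get_profiling_info() only returns timestamps if the queue was constructed with the enable_profiling property (otherwise it throws). A sketch of constructing such a queue:

// Queue that records command_submit/command_start/command_end timestamps
cl::sycl::queue deviceQueue{
    cl::sycl::gpu_selector{},
    cl::sycl::property::queue::enable_profiling{}};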
Finally, this is the result for a 10K array, on the cloud:
########################################################################
# Date: Wed Apr 1 08:26:52 PDT 2020
# Job ID: 561887.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.150332 msec
Cmd Group submission: 2.66203 msec
Real exec: 167.013 msec
VectorAddInDPCPP exec: 232.501 msec
Scalar exec: 0.095 msec
success
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr 1 08:26:57 PDT 2020
########################################################################
As you can see above, the command-group submission plus kernel execution time is much less than the real execution time. What is the reason for waiting that long, and is there a way to reduce or eliminate this waiting time?
Melih
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi
Hope things are going fine there.
To process the loop efficiently on the FPGA in a parallel fashion, loop unrolling can be applied. This ensures that separate execution units are created to handle the summation. Without it, the parallel and scalar performance will not differ much, and the parallel version also incurs setup and related overhead.
So, to extract performance out of the FPGA, please apply loop unrolling with
#pragma unroll UNROLL_FACTOR
before the for loop.
You can try different UNROLL_FACTORs, using powers of two such as 2, 4, 8, 16, and then compare the performance with the scalar version.
Please find an example below:
cgh.single_task<class covariance>([=]() {
    /* Accessor-related code HERE */
    #pragma unroll UNROLL_FACTOR
    for (int j = 0; j < num_items; j++) {
        accessorC[j] = accessorA[j] + accessorB[j];
    }
});
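(Note: for the pragma to take effect, UNROLL_FACTOR must be a compile-time integer constant, e.g. #define UNROLL_FACTOR 8.)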
Thanks and Regards
Anil
Hi,
Everything is fine for now; I hope it's the same there. I have some experience in DSP programming and am familiar with low-level optimization methods. In the last example from my previous post, the CPU execution time is more than 1600 times faster, so I don't think the problem is at that low a level. I think the main problem here is the JIT compiler overhead.
There is a figure illustrating this in the book: https://www.apress.com/us/data-parallel-c-advanced-chapters-just-released/17382670
On the cloud, I will try ahead-of-time compilation, as suggested here: https://software.intel.com/en-us/oneapi-dpcpp-compiler-dev-guide-and-reference-ahead-of-time-compilation
I will post the results here. Maybe we should also move this subject into another topic.
GNL
Hi again. It seems that ahead-of-time compilation drastically reduced the device run time. On the other hand, it increased the host execution time.
This is the result with AOT compilation:
########################################################################
# Date: Tue Apr 14 05:18:06 PDT 2020
# Job ID: 573459.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size: 100000
-------------------------------------------
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.088999 msec
Cmd Group submission: 3.9869 msec
Real exec: 9.85082 msec
VectorAddInDPCPP exec: 17.543 msec
Scalar exec: 1.785 msec
success
########################################################################
# End of output for job 573459.v-qsvr-1.aidevcloud
# Date: Tue Apr 14 05:18:12 PDT 2020
########################################################################
And this is the result with JIT compilation:
########################################################################
# Date: Tue Apr 14 05:36:10 PDT 2020
# Job ID: 573469.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size: 100000
-------------------------------------------
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.149666 msec
Cmd Group submission: 3.13661 msec
Real exec: 175.897 msec
VectorAddInDPCPP exec: 243.523 msec
Scalar exec: 0.099 msec
success
########################################################################
# End of output for job 573469.v-qsvr-1.aidevcloud
# Date: Tue Apr 14 05:36:15 PDT 2020
########################################################################
And finally, this is my makefile for AOT:
CXX = dpcpp
#CXXFLAGS = -O2 -g
#LDFLAGS = -lOpenCL -lsycl
EXE_NAME = vector-add
SOURCES = src/vector-add.cpp

all: main

main:
	$(CXX) -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend '-device skl' -o $(EXE_NAME) $(SOURCES)

run:
	./$(EXE_NAME)

clean:
	rm -rf $(EXE_NAME)
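One thing I notice in hindsight (an observation, not a confirmed diagnosis): CXXFLAGS is commented out and never referenced by the main rule anyway, so if dpcpp follows clang's default of -O0, the host code may be built without optimization. A hypothetical variant of the rule with optimization applied:

main:
	$(CXX) -O2 -fsycl-targets=spir64_gen-unknown-unknown-sycldevice \
		-Xsycl-target-backend '-device skl' -o $(EXE_NAME) $(SOURCES)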
Did I skip something here? We improved the device kernel execution time, but why is the host performance (Scalar exec above) suffering now?
GNL
Hi
Hope you are staying safe.
Can you incorporate the loop unrolling mentioned previously, along with ahead-of-time compilation, and check the results?
Thanks and Regards
Anil