My i7-12700H laptop is supposed to support memory bandwidth of nearly 64 GBytes/sec. When I run parallel fill(dpl::), I'm only getting about 16 GBytes/sec.
On a 48-core Intel Xeon 8275CL AWS node, which has 240 GBytes/sec of bandwidth, parallel fill(dpl::) gets to nearly 40 GBytes/sec.
Is it possible to get closer to the full bandwidth of system memory?
Thank you,
-Victor
Hi,
Thanks for posting on Intel communities.
Could you please share the following details with us:
1. A sample reproducer
2. How you are measuring bandwidth usage
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
We have not heard back from you. Could you please provide us with an update on your issue?
Thanks & Regards,
Vankudothu Vaishnavi.
Here is my benchmark implementation:
#define DPL_ALGORITHMS

#ifdef DPL_ALGORITHMS
// oneDPL headers should be included before standard headers
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#else
#include <algorithm>
#include <execution>
#include <iterator>
#endif

#include <chrono>   // high_resolution_clock, duration
#include <cstdio>   // printf
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include <immintrin.h>
#include <sycl/sycl.hpp>
#include "utils.hpp"   // project-local header, not included in this post

using namespace sycl;
using namespace std;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;
using std::milli;

// Prints the first and last element (as a sanity check) and the elapsed time in milliseconds.
void print_results(const char* const tag, const vector<int>& sorted,
                   high_resolution_clock::time_point startTime,
                   high_resolution_clock::time_point endTime)
{
    printf("%s: Lowest: %d Highest: %d Time: %fms\n", tag, sorted.front(), sorted.back(),
           duration_cast<duration<double, milli>>(endTime - startTime).count());
}

void fill_benchmark()
{
    // 100,000,000 32-bit integers = 400,000,000 bytes written per fill
    std::vector<int> data(100000000);

    auto startTime = high_resolution_clock::now();
    std::fill(std::execution::seq, data.begin(), data.end(), 42);
    auto endTime = high_resolution_clock::now();
    print_results("Serial std::fill", data, startTime, endTime);

    //startTime = high_resolution_clock::now();
    //std::fill(std::execution::unseq, data.begin(), data.end(), 42);
    //endTime = high_resolution_clock::now();
    //print_results("SIMD Fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(std::execution::par, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel std::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(std::execution::par_unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel SIMD std::fill", data, startTime, endTime);

#ifdef DPL_ALGORITHMS
    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::seq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Serial dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("SIMD dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par_unseq, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel SIMD dpl::fill", data, startTime, endTime);

    startTime = high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::dpcpp_default, data.begin(), data.end(), 42);
    endTime = high_resolution_clock::now();
    print_results("Parallel DPCPP_DEFAULT dpl::fill", data, startTime, endTime);
#endif
}

int main()
{
    fill_benchmark();
    return 0;
}
On a 48-core AWS instance (Xeon 8275CL), this runs at 28 GBytes/sec, out of over 200 GBytes/sec available.
On a 14-core i7-12700H laptop, the performance is nearly the same.
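Since each variant in the benchmark above is timed only once, a single slow run can skew the reported figure. A minimal sketch of a best-of-N timing helper follows; the name `best_time_seconds` and the use of `std::chrono::steady_clock` are assumptions for illustration and are not part of the posted benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` `repeats` times and returns the
// shortest observed wall-clock time in seconds.
template <typename F>
double best_time_seconds(F&& work, int repeats)
{
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start).count());
    }
    return best;
}
```

It could be called as `best_time_seconds([&] { std::fill(data.begin(), data.end(), 42); }, 5)`, with any of the execution policies from the benchmark added to the `std::fill` call.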
Hi,
Thank you for sharing your code with us. However, when we tried to run the code, we encountered a fatal error, which prevented us from proceeding.
We therefore commented out the problematic section and were then able to obtain some output.
Could you please provide us with the steps you followed to execute the code? Additionally, it would be helpful if you could share the specific command line you used to run the code.
Furthermore, we would like clarification on how you are measuring bandwidth usage, so that we can investigate the issue further on our end.
Thank you and regards,
Vankudothu Vaishnavi.
So glad you're able to compile and run the code. Nice! You compiled and ran from the command line.
I'm building and running in Visual Studio Community 2022, which uses the Intel compiler for this project.
Here is how I compute bandwidth for fill: the array is 100,000,000 32-bit integers = 100,000,000 integers * 4 bytes/integer = 400,000,000 bytes. Filling the array in your case took 47.7 milliseconds. The bandwidth is then 400,000,000 bytes / 0.0477 seconds = about 8.39 billion bytes/second.
I should have the code compute it.
-Victor
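The bandwidth arithmetic described above could be folded into the benchmark roughly as follows. This is a sketch only: `bandwidth_gb_per_sec` and `print_bandwidth` are hypothetical names, and "GBytes" is taken here to mean 10^9 bytes, matching the calculation in the post.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>

// Bytes written divided by elapsed seconds, in units of 1e9 bytes/second.
double bandwidth_gb_per_sec(std::size_t bytes, double seconds)
{
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Hypothetical companion to print_results: takes the same two time points
// and reports bandwidth instead of elapsed milliseconds.
void print_bandwidth(const char* tag, std::size_t bytes,
                     std::chrono::high_resolution_clock::time_point startTime,
                     std::chrono::high_resolution_clock::time_point endTime)
{
    const double seconds = std::chrono::duration<double>(endTime - startTime).count();
    std::printf("%s: %.3f GBytes/sec\n", tag, bandwidth_gb_per_sec(bytes, seconds));
}
```

For the numbers in this thread, `bandwidth_gb_per_sec(400000000, 0.0477)` works out to roughly 8.39. In the benchmark, the byte count would be `data.size() * sizeof(int)`.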
Hi,
Thanks for the explanation.
Could you please elaborate more on your issue?
Are you attempting to utilize the maximum available bandwidth for parallel fill(dpl::),
i.e., std::fill(oneapi::dpl::execution::par, data.begin(), data.end(), 42)?
Also, could you please explain your purpose or intention behind adding std::fill() in your code?
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
I'm working on other parallel algorithms that are limited by write speed to system memory. Benchmarking std::fill pointed out an issue, either with the code or with the Intel CPU architecture, where writes don't seem to use all of the available system memory bandwidth. I'm trying to get to the root cause of this performance issue. The code above is a benchmark showing that even std::fill with parallel execution can't seem to utilize all of the available bandwidth.
-Victor
Hi,
We are working on this issue internally. We will get back to you soon.
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
We apologize for the delay in our response. We wanted to inform you that we are actively working on your issue and will provide you with an update shortly.
Thank you for your patience and understanding.
Thanks & Regards,
Vankudothu Vaishnavi
Thank you for the update, Vankudothu. Looking forward to your findings.
-Victor
Hi,
Thanks for your patience and understanding.
The fill() operation fills a given memory destination with a constant value. It is more efficient to route this operation through compute units rather than the memory subsystem. Can you please explain how you are certain that this is a test for memory bandwidth?
Also, could you please try a copy operation (using std::copy()) with an array of the same size and check whether it is faster on the same machine?
Thanks & regards,
Vankudothu Vaishnavi.
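The suggested std::copy() comparison might look something like the sketch below. It uses plain serial `std::copy` from the standard library; `CopyResult` and `copy_benchmark` are hypothetical names, and counting both the bytes read and the bytes written toward bandwidth is an assumption about how the figure should be reported.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result record for one timed copy.
struct CopyResult {
    double seconds;     // elapsed wall-clock time
    double gb_per_sec;  // (bytes read + bytes written) / seconds / 1e9
    bool   verified;    // true if the destination matches the source
};

CopyResult copy_benchmark(std::size_t count)
{
    std::vector<int> src(count, 42);
    std::vector<int> dst(count, 0);  // zero-initialized, so its pages are touched before timing

    const auto start = std::chrono::steady_clock::now();
    std::copy(src.begin(), src.end(), dst.begin());
    const auto end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(end - start).count();
    // Assumption: count traffic in both directions (read src, write dst).
    const double bytes = 2.0 * static_cast<double>(count) * sizeof(int);
    const double gbps = seconds > 0.0 ? bytes / seconds / 1e9 : 0.0;
    return {seconds, gbps, dst == src};
}
```

Calling `copy_benchmark(100000000)` matches the array size used for the fill benchmark. An execution policy can be passed as the first argument to `std::copy` to compare against the parallel fill variants.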
Hi,
We have not heard back from you. Could you please provide us with the details we asked for?
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
We have not heard back from you, so we will go ahead and close this thread.
If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,
Vankudothu Vaishnavi.
